I wrote a streaming CSV parser yesterday because I couldn't find a CSV parsing module that does what I want (despite the plethora of available choices). The parsing rules are pretty simple:

1) At the start of a field, if you find a quote string, eat the quote string and go to the state that handles quoted strings. If you find a separator, add the current field (which is blank) to the line, and start over at this step. If you find an end of line string, add the current field to the current target line, push the target line onto the list of parsed lines, and start over. For anything else, go to the state that handles unquoted strings.

2) In the state that handles quoted strings, search the string sequentially for the first instance of the quote string. Append the text up to that point (not including the quote) to the current field. If what immediately follows is another quote string, start this step over. Otherwise, go to the unquoted string state. If no quote string is found, append the remainder of the string being processed to the current field; parsing of the next chunk of data will resume in this state.

3) In the state that handles unquoted strings, search the string sequentially for the first instance of either the separator string or the end of line string. If neither is found, append what's left of the string being processed to the current field, and note that the parser will resume in this state. Otherwise, append the part of the string up to, but not including, the separator or end of line that was found to the current field, then append the current field to the current target line. If the found string was the end of line string, append the current target line onto the list of parsed lines, too. Then, return to the initial state.

This set of rules happens to produce results that match how Microsoft's Excel handles CSV files. As you may have determined from my use of "separator string", "quote string", and "end of line string", each of these entities is a string (',', '"', and "\r\n", by default). They also happen to be parameters, so you can parse other simple text formats, too. For example, to parse a standard /etc/passwd file, you could use ":", "\0", and "\n" as the separator, quote, and end of line strings.
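To make the three states concrete, here is a minimal sketch of the rules in Python. It is not the module itself; the class name StreamingCSVParser and its methods feed() and close() are made up for illustration. It assumes that a doubled quote inside a quoted field stands for one literal quote (Excel's convention), that no separator, quote, or end of line string is split across two chunks, and that a final line without an end of line string should be flushed by an explicit close() call, which the rules above don't cover.

```python
START, QUOTED, UNQUOTED = range(3)


class StreamingCSVParser:
    """Illustrative sketch only; names and API are hypothetical."""

    def __init__(self, sep=",", quote='"', eol="\r\n"):
        self.sep, self.quote, self.eol = sep, quote, eol
        self.state = START
        self.field = ""   # text of the field being built
        self.line = []    # fields of the line being built
        self.lines = []   # completed lines

    def _end_field(self, end_line):
        # Add the current field to the target line; on end of line,
        # also push the target line onto the list of parsed lines.
        self.line.append(self.field)
        self.field = ""
        if end_line:
            self.lines.append(self.line)
            self.line = []
        self.state = START

    def feed(self, data):
        """Parse one chunk; the state carries over to the next call."""
        pos = 0
        while pos < len(data):
            if self.state == START:                      # rule 1
                if data.startswith(self.quote, pos):
                    pos += len(self.quote)
                    self.state = QUOTED
                elif data.startswith(self.sep, pos):
                    pos += len(self.sep)
                    self._end_field(False)               # blank field
                elif data.startswith(self.eol, pos):
                    pos += len(self.eol)
                    self._end_field(True)
                else:
                    self.state = UNQUOTED
            elif self.state == QUOTED:                   # rule 2
                end = data.find(self.quote, pos)
                if end == -1:
                    self.field += data[pos:]
                    return                               # resume in QUOTED
                self.field += data[pos:end]
                pos = end + len(self.quote)
                if data.startswith(self.quote, pos):
                    # Assumption: a doubled quote is one literal quote.
                    self.field += self.quote
                    pos += len(self.quote)
                else:
                    self.state = UNQUOTED
            else:                                        # rule 3
                hits = [i for i in (data.find(self.sep, pos),
                                    data.find(self.eol, pos)) if i != -1]
                if not hits:
                    self.field += data[pos:]
                    return                               # resume in UNQUOTED
                end = min(hits)
                self.field += data[pos:end]
                is_eol = data.startswith(self.eol, end)
                pos = end + len(self.eol if is_eol else self.sep)
                self._end_field(is_eol)

    def close(self):
        """Assumption: flush a final line that lacks an end of line string."""
        if self.state != START or self.line:
            self._end_field(True)
        return self.lines


# CSV defaults, with a quoted field split across two chunks.
csv = StreamingCSVParser()
csv.feed('a,"b ""q"" c",d\r\n1,"x,')
csv.feed('y",\r\n')
csv_lines = csv.close()

# /etc/passwd-style input with ":", "\0", and "\n".
pw = StreamingCSVParser(sep=":", quote="\0", eol="\n")
pw.feed("root:x:0:0:root:/root:/bin/sh\n")
pw_lines = pw.close()
```

With the input above, csv_lines holds two lines (the second ending in a blank field) and pw_lines holds one line of seven fields. A more robust version would hold back the tail of each chunk so that an end of line string like "\r\n" arriving in two pieces is still recognized.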

If anyone knows of a module on CPAN that does all this, please let me know. Otherwise, I'll upload my module sometime in the next week or two.

BTW, the name I'm currently using for this module is "CSV::Parse" - let me know if you have a specific suggestion for a name you like better.
