I wrote a streaming CSV parser yesterday because I couldn't find a CSV
parsing module that does what I want (despite the plethora of available
choices). The parsing rules are pretty simple:
1) At the start of a field, if you find a quote string, eat the quote
string and go to the state that handles quoted strings. If you find a
separator, add the current field (which is blank) to the current target
line, and start over at this step. If you find an end of line string,
add the current
field to the current target line, push the target line onto the list of
parsed lines, and start over. For anything else, go to the state that
handles unquoted strings.
2) In the state that handles quoted strings, search the string
sequentially for the first instance of the quote string. Append the text
up to that point (not including the quote) to the current field. If what
immediately follows is another quote string, start this step over.
Otherwise, go to the unquoted string state. In the event no quote
string is found, append the remainder of the string being processed to
the current field. Parsing of the next chunk of data will resume in
this state.
3) In the state that handles unquoted strings, search the string
sequentially for either the first instance of the separator string, or
the first instance of the end of line string. If neither is found,
append what's in the string being processed to the current field, and
note that the parser will resume in this state. Otherwise, append the
part of the string up to, but not including, the separator or end of
line that was found to the current field, then append the current field
to the current target line. If the found string was the end of line
string, append the current target line onto the list of parsed lines,
too. Then, return to the initial state.
This set of rules happens to produce results that match how Microsoft's
Excel handles CSV files. As you may have determined from my use of
"separator string", "quote string", and "end of line string", each of
these entities is a string (',', '"', and "\r\n", by default). They
also happen to be parameters, so you can parse other simple text
formats, too. For example, to parse a standard /etc/passwd file, you
could use ":", "\0", and "\n" as the separator, quote, and end of line
strings.
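
A minimal sketch of the three states described above, written in Python
purely for illustration (it is not the module's code). The class name,
method name, and attribute names are hypothetical. Two assumptions are
not spelled out in the rules: a doubled quote inside a quoted field is
kept as one literal quote (the usual Excel convention), and no
separator, quote, or end of line string is split across two chunks.

```python
class CSVStreamParser:
    """Hypothetical three-state streaming parser following the rules above."""
    START, QUOTED, UNQUOTED = range(3)

    def __init__(self, sep=",", quote='"', eol="\r\n"):
        # All three delimiters are non-empty strings, not single characters.
        self.sep, self.quote, self.eol = sep, quote, eol
        self.state = self.START
        self.field = ""   # field currently being built
        self.line = []    # current target line
        self.lines = []   # list of parsed lines

    def _end_field(self, at_eol):
        # Add the field to the target line; at end of line, push the line.
        self.line.append(self.field)
        self.field = ""
        if at_eol:
            self.lines.append(self.line)
            self.line = []
        self.state = self.START

    def feed(self, chunk):
        pos = 0
        while pos < len(chunk):
            if self.state == self.START:            # rule 1
                if chunk.startswith(self.quote, pos):
                    pos += len(self.quote)
                    self.state = self.QUOTED
                elif chunk.startswith(self.sep, pos):
                    pos += len(self.sep)
                    self._end_field(False)
                elif chunk.startswith(self.eol, pos):
                    pos += len(self.eol)
                    self._end_field(True)
                else:
                    self.state = self.UNQUOTED
            elif self.state == self.QUOTED:         # rule 2
                i = chunk.find(self.quote, pos)
                if i < 0:
                    # No quote in this chunk: resume in this state later.
                    self.field += chunk[pos:]
                    pos = len(chunk)
                else:
                    self.field += chunk[pos:i]
                    pos = i + len(self.quote)
                    if chunk.startswith(self.quote, pos):
                        # Assumption: doubled quote means one literal quote.
                        self.field += self.quote
                        pos += len(self.quote)
                    else:
                        self.state = self.UNQUOTED
            else:                                   # rule 3
                hits = [(chunk.find(d, pos), d) for d in (self.sep, self.eol)]
                hits = [h for h in hits if h[0] >= 0]
                if not hits:
                    # Neither delimiter found: resume in this state later.
                    self.field += chunk[pos:]
                    pos = len(chunk)
                else:
                    i, found = min(hits)            # earliest delimiter wins
                    self.field += chunk[pos:i]
                    pos = i + len(found)
                    self._end_field(found == self.eol)
        return self


# CSV with the default delimiters, fed in two chunks.
csv = CSVStreamParser()
csv.feed('x,"multi').feed('\r\nline",y\r\n')

# /etc/passwd style input with ":", "\0", and "\n".
pw = CSVStreamParser(":", "\0", "\n")
pw.feed("root:x:0:0:root:/root:/bin/sh\n")
```

In this sketch a final line with no trailing end of line string is not
pushed; it stays in the parser's field and line attributes until more
data arrives.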
If anyone knows of a module on CPAN that does all this, please let me
know. Otherwise, I'll upload my module sometime in the next week or two.
BTW, the name I'm currently using for this module is "CSV::Parse" - let
me know if you have a specific suggestion for a name you like better.
- YA CSV parser Jim Schneider