On Nov 25, 2007 10:59 AM, Jim Schneider <[EMAIL PROTECTED]> wrote:
> I wrote a streaming CSV parser yesterday because I couldn't find a CSV
> parsing module that does what I want (despite the plethora of available
> choices). The parsing rules are pretty simple:
>
> 1) At the start of a field, if you find a quote string, eat the quote
> string and go to the state that handles quoted strings. If you find a
> separator, add the current field (which is blank) to the current target
> line, and start over at this step. If you find an end of line string,
> add the current field to the current target line, push the target line
> onto the list of parsed lines, and start over. For anything else, go to
> the state that handles unquoted strings.
>
> 2) In the state that handles quoted strings, search the string
> sequentially for the first instance of the quote string. Add the string
> up to that point (not including the quote) to the current field. If what
> immediately follows is another quote string, start this step over.
> Otherwise, go to the unquoted string state. In the event no quote
> string is found, append the remainder of the string being processed to
> the current field. Parsing of the next chunk of data will resume in
> this state.
>
> 3) In the state that handles unquoted strings, search the string
> sequentially for either the first instance of the separator string, or
> the first instance of the end of line string. If neither is found,
> append what's in the string being processed to the current field, and
> note that the parser will resume in this state. Otherwise, append the
> part of the string up to, but not including, the separator or end of
> line that was found to the current field, then append the current field
> to the current target line. If the found string was the end of line
> string, append the current target line onto the list of parsed lines,
> too. Then, return to the initial state.
>
> This set of rules happens to produce results that match how Microsoft's
> Excel handles CSV files. As you may have determined from my use of
> "separator string", "quote string", and "end of line string", each of
> these entities is a string (',', '"', and "\r\n", by default). They
> also happen to be parameters, so you can parse other simple text
> formats, too. For example, to parse a standard /etc/passwd file, you
> could use ":", "\0", and "\n" as the separator, quote, and end of line
> strings.
>
> If anyone knows of a module on CPAN that does all this, please let me
> know. Otherwise, I'll upload my module sometime in the next week or two.

Didn't you just reinvent Text::CSV_XS? The only tweak required is
setting "binary" to enable the use of newlines inside quoted fields.

    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({
        binary      => 1,
        eol         => qq(\r\n),
        # defaults
        sep_char    => q(,),
        quote_char  => q("),
        escape_char => q("),
    });

Josh
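
A minimal, self-contained sketch of the Text::CSV_XS configuration suggested above, assuming the module is installed from CPAN. The sample data and the variable names ($data, @rows, $pw, @fields) are made up for illustration. The second parser only approximates the /etc/passwd case from the quoted message: it turns quoting off with quote_char => undef instead of using "\0" as the quote string, which is a substitution and not what the original poster described.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# CSV parser: binary => 1 allows newlines inside quoted fields.
my $csv = Text::CSV_XS->new({
    binary      => 1,
    eol         => "\r\n",
    sep_char    => ",",
    quote_char  => '"',
    escape_char => '"',
}) or die "Cannot create parser: " . Text::CSV_XS->error_diag;

# Hypothetical sample: two records; the first has a quoted field
# containing a newline, a separator, and doubled quotes.
my $data = qq{a,"two\nlines, ""quoted""",c\r\nd,,f\r\n};
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;            # each $row is an array ref of fields
}
$csv->eof or $csv->error_diag;   # report a parse error, if any
close $fh;

# passwd-style parser: ":" separator, quoting and escaping disabled
# (assumption: undef is used here in place of a "\0" quote string).
my $pw = Text::CSV_XS->new({
    sep_char    => ":",
    quote_char  => undef,
    escape_char => undef,
}) or die "Cannot create parser: " . Text::CSV_XS->error_diag;

$pw->parse("root:x:0:0:root:/root:/bin/sh")
    or die "Cannot parse passwd line";
my @fields = $pw->fields;

printf "%d CSV rows, %d passwd fields\n", scalar @rows, scalar @fields;
```

Reading with getline() rather than splitting the input on line endings first is what lets a quoted field span physical lines; the parser decides where each record ends.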