On Nov 25, 2007 10:59 AM, Jim Schneider <[EMAIL PROTECTED]> wrote:
> I wrote a streaming CSV parser yesterday because I couldn't find a CSV
> parsing module that does what I want (despite the plethora of available
> choices).  The parsing rules are pretty simple:
>
> 1)  At the start of a field, if you find a quote string, eat the quote
> string and go to the state that handles quoted strings.  If you find a
> separator, add the current field (which is blank) to the line, and start
> over at this step.  If you find an end of line string, add the current
> field to the current target line, push the target line onto the list of
> parsed lines, and start over.  For anything else, go to the state that
> handles unquoted strings.
>
> 2)  In the state that handles quoted strings, search the string
> sequentially for the first instance of the quote string.  Add the string
> up to that point (not including the quote) to the current field.  If what
> immediately follows is another quote string, start this step over.
> Otherwise, go to the unquoted string state.  In the event no quote
> string is found, append the remainder of the string being processed to
> the current field.  Parsing of the next chunk of data will resume in
> this state.
>
> 3)  In the state that handles unquoted strings, search the string
> sequentially for either the first instance of the separator string, or
> the first instance of the end of line string.  If neither is found,
> append what's in the string being processed to the current field, and
> note that the parser will resume in this state.  Otherwise, append the
> part of the string up to, but not including, the separator or end of
> line that was found to the current field, then, append the current field
> to the current target line.  If the found string was the end of line
> string, append the current target line onto the list of parsed lines,
> too.  Then, return to the initial state.
>
> This set of rules happens to produce results that match how Microsoft's
> Excel handles CSV files.  As you may have determined from my use of
> "separator string", "quote string", and "end of line string", each of
> these entities is a string (',', '"', and "\r\n", by default).  They
> also happen to be parameters, so you can parse other simple text
> formats, too.  For example, to parse a standard /etc/passwd file, you
> could use ":", "\0", and "\n" as the separator, quote, and end of line
> strings.
>
> If anyone knows of a module on CPAN that does all this, please let me
> know.  Otherwise, I'll upload my module sometime in the next week or two.
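
For reference, the three rules above could be sketched in plain Perl
roughly as follows. This is only my reading of the rules, not your
module: parse_csv_sketch is a made-up name, it takes one complete string
instead of chunks, it treats a doubled quote as one literal quote, and
it flushes a trailing partial line at end of input. Those last three
points are assumptions the rules don't spell out.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module discussed above): parses one
# complete string using the start / quoted / unquoted states.
sub parse_csv_sketch {
    my ($text, %opt) = @_;
    my $sep   = $opt{sep}   // ',';
    my $quote = $opt{quote} // '"';
    my $eol   = $opt{eol}   // "\r\n";

    my (@rows, @row);
    my $field = '';
    my $state = 'start';
    my $pos   = 0;
    my $len   = length $text;

    # True if $str occurs in $text at the current position.
    my $at = sub { my ($str) = @_; substr($text, $pos, length $str) eq $str };

    while ($pos < $len) {
        if ($state eq 'start') {
            if ($at->($quote)) {            # rule 1: eat the quote
                $pos  += length $quote;
                $state = 'quoted';
            }
            elsif ($at->($sep)) {           # rule 1: blank field
                push @row, $field;
                $pos += length $sep;
            }
            elsif ($at->($eol)) {           # rule 1: end of line
                push @row, $field;
                push @rows, [@row];
                @row = ();
                $pos += length $eol;
            }
            else {
                $state = 'unquoted';
            }
        }
        elsif ($state eq 'quoted') {
            my $i = index($text, $quote, $pos);
            if ($i < 0) {                   # rule 2: no closing quote found
                $field .= substr($text, $pos);
                $pos = $len;
            }
            else {
                $field .= substr($text, $pos, $i - $pos);
                $pos = $i + length $quote;
                if ($at->($quote)) {        # doubled quote: stay in this state
                    $field .= $quote;       # assumption: keep one literal quote
                    $pos   += length $quote;
                }
                else {
                    $state = 'unquoted';
                }
            }
        }
        else {                              # rule 3: unquoted
            my $s = index($text, $sep, $pos);
            my $e = index($text, $eol, $pos);
            if ($s < 0 && $e < 0) {         # neither found: take the rest
                $field .= substr($text, $pos);
                $pos = $len;
            }
            else {
                my $is_eol = $s < 0 || ($e >= 0 && $e < $s);
                my $i      = $is_eol ? $e : $s;
                $field .= substr($text, $pos, $i - $pos);
                push @row, $field;
                $field = '';
                if ($is_eol) { push @rows, [@row]; @row = (); }
                $pos   = $i + length($is_eol ? $eol : $sep);
                $state = 'start';
            }
        }
    }

    # Assumption: input that ends without an end of line string still
    # yields its last (partial) line.
    if ($state ne 'start' || @row) {
        push @row, $field;
        push @rows, [@row];
    }
    return \@rows;
}
```

With the default strings, parse_csv_sketch(qq{1,,"two\r\nlines"\r\n})
returns a single line of three fields, the last of which keeps its
embedded "\r\n". Passing sep => ":", quote => "\0", eol => "\n" covers
the /etc/passwd case.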

Didn't you just reinvent Text::CSV_XS? The only tweak required is
setting "binary => 1" to allow newlines inside quoted fields.

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    binary => 1,
    eol => qq(\r\n),    # not set by default

    # defaults
    sep_char => q(,),
    quote_char => q("),
    escape_char => q("),
});
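
A minimal sketch of reading records with it, in case it helps. The
sample data and the in-memory filehandle are only for illustration, and
I left eol unset here so the parser recognizes the line endings itself:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create parser: " . Text::CSV_XS->error_diag;

# Illustrative input: a doubled quote and a newline inside quoted fields.
my $data = qq{a,"say ""hi""",c\r\n1,"two\nlines",3\r\n};
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @rows;
while (my $row = $csv->getline($fh)) {    # one array ref per record
    push @rows, $row;
}
$csv->eof or $csv->error_diag;            # report a parse error, if any
close $fh;
```

getline reads one record at a time from the handle, so the same loop
works on a real file without slurping it.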

Josh
