Andrew Savige wrote:
It definitely runs a lot faster. However, [it screws up for] the original test data:
Mea culpa. I'm a little distracted just at the moment and I botched the upgrade to (?>...). Here's a solution that actually works for both data sets (and, I devoutly hope, for all others too!):
-----cut----------cut----------cut----------cut----------cut-----
use re 'eval';

our $quoted = qr/
      '  [^'\\]*  (?> (?> \\. [^'\\]* )* )  '    # Match 'str'
    | "  [^"\\]*  (?> (?> \\. [^"\\]* )* )  "    # Match "str"
/x;

our $element = qr/
    (?>
        (?: (?> [^'"{},]+ )     # Match non-special characters
          | \\.                 # Or match escaped anything
          | $quoted             # Or match quoted anything
          | (??{$nested})       # Or match {...,...,...}
        )+
    )                           # ...as many times as possible
/xs;

our $nested = qr/
    [{]                         # Match {
    (?> (?: $element , )* )     # Match list of subelements
    $element?                   # Match last subelement
    [}]                         # Match }
/x;

while (<DATA>) {
    @fields = m/\G ( $element ) ,? /gx;    # Capture elements repeatedly
    use Data::Dumper 'Dumper';
    print Dumper( @fields );
    print "=================\n";
}

__DATA__
abc, ',def' "\"ab'c,}" xyz , fred IN { 1, "x}y",3 } x, 'z'
{1}, hello
one two three four five six seven eight nine heat-death-of-the-universe
-----cut----------cut----------cut----------cut----------cut-----
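The fix above leans on (?>...), which matches its contents and then refuses to give any of them back when the rest of the pattern fails. The following is a minimal sketch of that behaviour, using a made-up test string that is not part of the original data. It also suggests why a misplaced (?>...) can make a previously working pattern stop matching.

```perl
use strict;
use warnings;

# Hypothetical string, chosen only to illustrate the difference.
my $text = 'aaab';

# Ordinary quantifier: a+ first grabs 'aaa', then backtracks to 'aa'
# so that the trailing 'ab' can still match.
my $plain = $text =~ /^a+ab$/ ? 'match' : 'no match';

# Atomic group: once (?>a+) has grabbed 'aaa' it never gives any back,
# so no 'a' is left over for the trailing 'ab'.
my $atomic = $text =~ /^(?>a+)ab$/ ? 'match' : 'no match';

print "plain:  $plain\n";     # plain:  match
print "atomic: $atomic\n";    # atomic: no match
```

In the patterns above the atomic groups are safe because whatever follows each one (a closing quote, a comma, a closing brace) can never be matched by the characters the group has already consumed.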
Will the new Perl 6 pattern matching be a vast improvement for this sort of parsing problem?
No. Because it still won't do our thinking for us. ;-)

However, syntactically it will be much cleaner, and it will probably feature much better debugging tools. Both of which *will* make a big difference.

Here's the same thing in Perl 6:

-----cut----------cut----------cut----------cut----------cut-----
grammar CSV::Nested {

    rule quoted {
          '  <-['\\]>*  [ \\. <-['\\]>* ]*:  '    # Match 'str'
        | "  <-["\\]>*  [ \\. <-["\\]>* ]*:  "    # Match "str"
    }

    rule element {
        [ <-['"{},]>+:          # Match non-special characters
        | \\.                   # Or match escaped anything
        | <quoted>              # Or match quoted anything
        | <nested>              # Or match {...,...,...}
        ]+:                     # ...as many times as possible
    }

    rule nested {
        \{                      # Match {
        [ <element> , ]*:       # Match list of subelements
        <element>?              # Match last subelement
        \}                      # Match }
    }
}

while <$*DATA> {
    @fields = m:ec/ ( <CSV::Nested.element> ) ,? /;    # Capture elements repeatedly
    use Data::Dumper 'Dumper';
    print Dumper( @fields );
    print "=================\n";
}

__DATA__
abc, ',def' "\"ab'c,}" xyz , fred IN { 1, "x}y",3 } x, 'z'
{1}, hello
one two three four five six seven eight nine heat-death-of-the-universe
-----cut----------cut----------cut----------cut----------cut-----

Damian