Liz told me to post here. In my work to port Text::CSV_XS from perl5 to perl6, I am close to feature complete now, but the performance is not what I hoped for:
Seconds to parse 10000 lines of CSV

perl5
    Text::CSV::Easy_XS         0.016
    Text::CSV::Easy_PP         0.017
    Text::CSV_XS w/ bindc      0.034
    Text::CSV_XS               0.039
    Text::CSV_PP               0.525
    Pegex::CSV                 1.340

perl6
    csv.pl                     7.270  state machine
    csv-ip5xs                 17.267  Inline::Perl5 with Text::CSV_XS
    csv-ip5xsio               17.243  Inline::Perl5 with Text::CSV_XS w/ IO
    csv-ip5pp                 18.218  Inline::Perl5 with Text::CSV_PP
    csv_gram.pl               14.226  A Grammar-based parser
    test.pl                   44.541  A reference parser (when I started)
    test-t.pl                 39.887  Current parser, all options implemented
    csv-parser.pl             25.712  Tony-o's parser
So, currently for this kind of work, perl6 is between 2780 and 5.2 times
slower than perl5 (worst vs best / best vs worst).
As Text::CSV allows all settings to be changed between any two calls,
a static grammar engine is out of the question. As I started working
alongside someone else, we decided that I would explore the regular
expression approach and he would explore the grammar approach. The
latter never really happened :(
Currently, the regular expression splits the line being parsed into
chunks of interest. I take advantage of the fact that the first
alternative takes precedence, so a quotation sequence that is equal
to part of the end-of-line sequence is still valid.
    my sub chunks (Str $str, Regex:D $re) {
        $str.defined or return ();
        $str eq "" and return ("");
        $str.split ($re, :all).flat.map: {
            if $_ ~~ Str {
                $_ if .chars;
            }
            else {
                .Str if .Bool;
            };
        };
    }
and then later
    my Regex $chx = $!eol.defined
        ?? rx{ $eol | $sep | $quo | $esc }
        !! rx{ \r\n | \r | \n | $sep | $quo | $esc };
    $buffer.defined and @ch.push: chunks ($buffer, $chx);
    @ch or return parse_error (2012);
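
To make the chunking step concrete, here is a minimal standalone sketch of what such a function returns for one line. It is not the module's code: the name chunks-sketch is hypothetical, it uses standard call syntax, and it assumes a Rakudo where Str.split takes the :v adverb (the code above uses the older :all).

```raku
# Standalone sketch of the chunking step; not the module's actual code.
# Assumption: a Rakudo where Str.split accepts :v instead of :all.
sub chunks-sketch (Str $str, Regex:D $re) {
    return ()    unless $str.defined;
    return ("",) if $str eq "";
    # :v interleaves the matched delimiters with the text between them;
    # stringify every piece and drop the empty ones.
    $str.split($re, :v).map(*.Str).grep(*.chars).list;
}

my ($sep, $quo, $esc) = ",", '"', '"';
my $re = rx{ \r\n | \r | \n | $sep | $quo | $esc };

say chunks-sketch('1,"a b",3', $re).join("|");    # 1|,|"|a b|"|,|3
```

Every separator and quote comes back as its own token, so the rest of the parser only has to walk the list and track whether it is inside a quoted field.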
As it stands, the chunks function could be rewritten to use a
grammar that only changes whenever any of $eol, $sep, $quo, or $esc
changes. None of the other options - in the current program flow -
would influence the parser, as long as chunks returns the
same list of "tokens".
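
The "only changes when one of the four strings changes" idea can be sketched with a small cache in front of the tokenizer. The class name TokenizerCache, its regex method and the builds counter are hypothetical and exist only for this illustration. The sketch shows the bookkeeping only; whether it saves any time depends on how the regex engine handles interpolated variables.

```raku
# Hypothetical helper, not part of Text::CSV: rebuild the tokenizer
# regex only when $eol, $sep, $quo or $esc differ from the last call.
class TokenizerCache {
    has Str   $!key;            # settings the cached regex was built for
    has Regex $!rx;             # the cached regex itself
    has Int   $.builds = 0;     # how often a regex was (re)built

    method regex (Str :$sep = ",", Str :$quo = '"', Str :$esc = '"', Str :$eol) {
        my $key = ($eol.defined, $sep, $quo, $esc, $eol // "").join("\0");
        unless $!key.defined && $!key eq $key {
            # An undefined $eol means any of \r\n, \r or \n.
            $!rx = $eol.defined
                ?? rx{ $eol | $sep | $quo | $esc }
                !! rx{ \r\n | \r | \n | $sep | $quo | $esc };
            $!key = $key;
            $!builds++;
        }
        $!rx;
    }
}

my $cache  = TokenizerCache.new;
my $first  = $cache.regex(sep => ";");
my $second = $cache.regex(sep => ";");     # same settings: reused
my $third  = $cache.regex(sep => "\t");    # separator changed: rebuilt
say $cache.builds;                         # 2
```

A grammar-based tokenizer could sit behind the same interface, as long as it yields the same list of tokens.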
Is it worthwhile to try to rewrite chunks to use a dynamic grammar,
or do I wait for the regex engine to become faster?
As a side note: currently none of these four parts is allowed to be a
regular expression. If I stick to regular expressions, that could be an
option for future enhancements. All four are to be considered fixed
strings, where an undefined $eol means either \r\n, or \n, or \r.
Code is available in the perl6 ecosystem http://modules.perl6.org/
GIT repo is at https://github.com/Tux/CSV
Documentation is https://github.com/Tux/CSV/blob/master/Text-CSV.pod
The csv *function* is still work in progress.
The style used is not a point of discussion.
--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.21 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
