Re: Outlook CSV Parser

Chas Owens Wed, 30 May 2007 09:15:54 -0700

On 5/30/07, Laxminarayan G Kamath A <[EMAIL PROTECTED]> wrote:
snip

Any ways of optimising it further?

snip


Premature optimization is the root of all evil.  Have you profiled the
code yet?  If not, here is some documentation that will point you
in the right direction:

http://www.perl.com/pub/a/2004/06/25/profiling.html
http://search.cpan.org/~nwclark/perl-5.8.8/utils/dprofpp.PL
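
Before reaching for a full profiler, a coarse wall-clock measurement is
often enough to tell whether a rewrite helped.  The sketch below is not
dprofpp; it only times a block of code with the core Time::HiRes module.
The helper name time_it and the stand-in workload are made up for
illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# time_it is a hypothetical helper: run a code ref once and
# return the elapsed wall-clock time in seconds
sub time_it {
    my ($code) = @_;
    my $start = [gettimeofday];
    $code->();
    return tv_interval($start);
}

# stand-in workload; replace the body with a call to the parser
# you actually want to measure
my $elapsed = time_it(sub {
    my $line = 'a,b,"c,d"';
    my $n    = 0;
    $n += () = $line =~ /,/g for 1 .. 10_000;
});

printf "%.3f seconds\n", $elapsed;
```

Timing the old and new code this way on the same input gives a quick
before/after number; a profiler is still the better tool for finding out
where the time goes.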

But while I am looking, let's see what is going on.

snip

        1. One line need not be one record. They may contain multiline
fields.
        2. A sigh of relief, but: only multi-line fields are wrapped in
double quotes.
        3. Commas are both inside and outside the fields. The ones in
the fields must not be treated as "separator" - again, fields with
commas are wrapped in double quotes.

snip
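
Points 1 and 2 together suggest a cheap test for record boundaries: keep
appending lines until the running count of double quotes is even.  Here
is a minimal sketch of just that idea, assuming every quote character in
the data belongs to a balanced pair.  The sub name split_records is
hypothetical.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# split_records is a hypothetical name: group physical lines into
# logical records by tracking whether the quote count is even
sub split_records {
    my (@lines) = @_;
    my @records;
    my $record = "";
    my $quotes = 0;
    for my $line (@lines) {
        $record .= $line;
        $quotes += () = $line =~ /"/g;
        # an even count means we are not inside a quoted field
        if ($quotes % 2 == 0) {
            push @records, $record;
            $record = "";
            $quotes = 0;
        }
    }
    return @records;
}

# three physical lines, the last two forming one record
my @recs = split_records(
    "a,b,c\n",
    "d,\"multi\n",
    "line\",f\n",
);
print scalar(@recs), " records\n";    # prints "2 records"
```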

The following code seems to speed up the parsing by roughly a factor
of 60 (2.214 seconds for the old code and 0.036 seconds for this
code on 100 records).  Also, there seems to be a bug in your original
code.  I set up a test file with 100 records of 30 fields each and
your code found:

found 33 fields in 1 records
found 34 fields in 1 records
found 36 fields in 3 records
found 37 fields in 5 records
found 38 fields in 10 records
found 39 fields in 9 records
found 40 fields in 12 records
found 41 fields in 17 records
found 42 fields in 15 records
found 43 fields in 13 records
found 44 fields in 7 records
found 45 fields in 5 records
found 46 fields in 1 records
found 48 fields in 1 records
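
A tally like the one above can be produced with a hash keyed on field
count.  This is a sketch, not the script that generated those numbers.
It assumes each record is a hash ref holding an array ref under the key
fields, and the helper name field_count_report is made up.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# field_count_report is a hypothetical helper: given records shaped
# like { fields => [ ... ] }, return one summary line per field count
sub field_count_report {
    my (@records) = @_;
    my %seen;
    $seen{ scalar @{ $_->{fields} } }++ for @records;
    return map  { "found $_ fields in $seen{$_} records" }
           sort { $a <=> $b } keys %seen;
}

# made-up sample data for illustration
my @sample = (
    { fields => [qw(a b c)] },
    { fields => [qw(a b c)] },
    { fields => [qw(a b c d)] },
);
print "$_\n" for field_count_report(@sample);
# prints:
# found 3 fields in 2 records
# found 4 fields in 1 records
```

With a correct parser and the 30-field test file, the report should
collapse to a single line.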

===code to generate test file===
#!/usr/bin/perl

use strict;
use warnings;

my $fields    = 30;
my $fieldlen  = 30;
my @fieldtype = qw(normal quoted comma);
my $records   = shift;

for my $rec (1 .. $records) {
       for my $field (1 .. $fields) {
               my $type = $fieldtype[int rand @fieldtype];
               if ($type eq 'normal') {
                       print 'n' x $fieldlen, ",";
               } elsif ($type eq 'quoted') {
                       print '"';
                       my $i = 0;
                       # loop until the field is full; "until ($i < $fieldlen)"
                       # would never run because $i starts at 0
                       until ($i >= $fieldlen) {
                               my $len = int rand $fieldlen;
                               print 'q' x $len, "\n";
                               $i += $len;
                       }
                       print '",';
               } elsif ($type eq 'comma') {
                       print '"';
                       my $i = 0;
                       until ($i == $fieldlen) {
                               my $len = int rand $fieldlen;
                               $len = $fieldlen - $i if $i+$len > $fieldlen;
                               print 'c' x ($len/2), ',', 'c' x ($len/2), "\n";
                               $i += $len;
                       }
                       print '",';
               }
       }
       print "\n";
}

===code to parse test file===
#!/usr/bin/perl

use strict;
use warnings;

my $record = "";
my $quotes = 0;
my @records;
while (defined (my $line = <>)) {
       next if $record eq "" and $line =~ /^\s*$/;

       $record .= $line;

       #count the number of quotes
       $quotes += () = $line =~ /"/g;

       #if $quotes is even then we have a full record
       if ($quotes % 2 == 0) {
               $quotes = 0;
               chomp $record;
               my @fields;
               my $unbalanced = 0;
               for my $field (split /,/, $record) {
                       my $count = $field =~ s/"//g;
                       if ($count % 2) {
                               if ($unbalanced) {
                                       $unbalanced = 0;
                                       $fields[-1] .= ",$field";
                                       next;
                               }
                               $unbalanced = 1;
                               push @fields, $field;
                               next;
                       }
                       if ($unbalanced) {
                               $fields[-1] .= ",$field";
                       } else {
                               push @fields, $field;
                       }
               }
               push @records, { whole => $record, fields => \@fields };
               $record = "";
       }

}

for my $rec (@records) {
       # parenthesize join so the separator line is not joined in as a field
       print join("|", @{$rec->{fields}}), "\n===\n";
}
