Re: Outlook CSV Parser

2007-08-10 Thread Laxminarayan G Kamath A
On Wed, 30 May 2007 12:14:52 -0400, Chas Owens [EMAIL PROTECTED]
wrote:

<snip/>
...
> http://www.perl.com/pub/a/2004/06/25/profiling.html
> http://search.cpan.org/~nwclark/perl-5.8.8/utils/dprofpp.PL

<snip/>

> The following code seems to speed up the parsing by two orders of
> magnitude (2.214 seconds

<snip/>

> found 38 fields in 10 records
> found 39 fields in 9 records
> found 40 fields in 12 records
> found

<snip/>

> ===code to parse test file===
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my $record = "";
> my $quotes = 0;
> my @records;
> while (defined (my $line = <>)) {
>     next if $record eq "" and $line =~ /^\s*$/;
>
>     $record .= $line;
>
>     #count the number of quotes

<snip/>

Did I not thank you for this? Shame on me. Thanks all the same. Your
code actually helped me in several ways:
 * It taught me to think about optimisation more seriously.
 * Your code did not time out my CGI :-P

Thanks a tonne!

-- 
Cheers,
Laxminarayan G Kamath A
e-mail: [EMAIL PROTECTED]
Work URL: http://deeproot.in

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/




Re: Outlook CSV Parser

2007-05-31 Thread Laxminarayan G Kamath A
On Wed, 30 May 2007 10:38:40 +0200, Dr.Ruud [EMAIL PROTECTED]
wrote:

> You forgot to supply a link to such a file. Or show a __DATA__ section
> for testing.

http://download.deeproot.in/~kamathln/outlook-encrtypted-sample.csv

-- 
Cheers,
Laxminarayan G Kamath A
e-mail: [EMAIL PROTECTED]
Work URL: http://deeproot.in





Re: Outlook CSV Parser

2007-05-31 Thread Mumia W.

On 05/31/2007 02:32 AM, Laxminarayan G Kamath A wrote:


> http://download.deeproot.in/~kamathln/outlook-encrtypted-sample.csv



Well I asked for it. :-)

It's impossible to tell where one record ends and another record begins 
with that file.









Re: Outlook CSV Parser

2007-05-31 Thread Dr.Ruud
Laxminarayan G Kamath A:
> Ruud:

>> You forgot to supply a link to such a file. Or show a __DATA__
>> section for testing.
>
> http://download.deeproot.in/~kamathln/outlook-encrtypted-sample.csv

OK, let's check how well-formed it is:

perl -we'
  local $/;
  $_ = <>;
  s/"[^"]*"//g;
  s/(?<=,)[^,]+(?=,)//g;
  s/^[^,]*,+//;
  print
' outlook-encrtypted-sample.csv


That prints:
--- 
3snji

[EMAIL PROTECTED]
--- 

so AFAICS there is a problem somewhere at the end. 

Maybe try a zipped version? 

-- 
Affijn, Ruud

Gewoon is een tijger.





Re: Outlook CSV Parser

2007-05-31 Thread Dr.Ruud
Mumia W. wrote:
> Laxminarayan G Kamath A:

>> http://download.deeproot.in/~kamathln/outlook-encrtypted-sample.csv
>
> Well I asked for it. :-)
>
> It's impossible to tell where one record ends and another record
> begins with that file.

Maybe not, because the rule was that it ends at newline, unless inside
quotes.
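
The rule above (a record ends at a newline unless that newline falls inside quotes) can be sketched by tracking quote parity while reading lines. This is only a minimal sketch, not code from the thread: `read_records` is a hypothetical helper name, the sample data is made up, and it assumes quotes inside a field are either absent or doubled (`""`), so that they never change the parity.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group physical lines into logical records. A record is complete at a
# newline only when the number of double quotes seen so far is even.
# read_records() is a hypothetical helper, not part of any module.
sub read_records {
    my ($fh) = @_;
    my @records;
    my $record = "";
    my $quotes = 0;
    while (defined(my $line = <$fh>)) {
        next if $record eq "" and $line =~ /^\s*$/;  # skip blank lines between records
        $record .= $line;
        $quotes += ($line =~ tr/"//);                # tr with no replacement only counts
        next if $quotes % 2;                         # odd: still inside a quoted field
        chomp $record;
        push @records, $record;
        ($record, $quotes) = ("", 0);
    }
    push @records, $record if $record ne "";         # trailing record with unbalanced quotes
    return @records;
}

# Made-up sample: two logical records, the first spanning two physical lines.
my $data = qq{a,"multi\nline",c\nd,e,f\n};
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my @records = read_records($fh);
print scalar(@records), " records\n";
```

If the file really does contain an unbalanced quote somewhere, everything after it ends up in one oversized trailing record, which is at least easy to spot.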

-- 
Affijn, Ruud

Gewoon is een tijger.






Re: Outlook CSV Parser

2007-05-30 Thread Mumia W.

On 05/30/2007 12:40 AM, Laxminarayan G Kamath A wrote:
> Hi PERLers,
> We here at DeepRoot Linux were trying to parse Outlook's csv so
> that I can add them to ldap addressbook.. [...]


The Perl module Text::CSV_XS would make your work much simpler, and it 
might execute a little faster.










Re: Outlook CSV Parser

2007-05-30 Thread Laxminarayan G Kamath A
On Wed, 30 May 2007 01:26:30 -0500, Mumia W. mumia.w.18.spam
[EMAIL PROTECTED] wrote:


> The Perl module Text::CSV_XS would make your work much simpler, and
> it might execute a little faster.

Thank you for pointing that out, but we have already tried it!
Unfortunately, it failed to separate the records correctly.
We have also tried several more modules from CPAN, and they were
not able to parse Outlook's CSV either.

If you read my mail again, you might find that I already mentioned that
we tried several modules before falling back to writing our own code.

What I am expecting is help with a variant of the regex I used as the
condition for the while loop. I am sure that if we modify that regexp a
little bit, then we can just use it on the record like this:

$_ = $record;
@fields = /regexp/g;

I tried a lot of different ways but just could not get the right
regexp :-(. 
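
One way the `@fields = /regexp/g` idea could look is sketched below. It is a sketch under assumptions, not a tested solution for the real Outlook export: `split_record` is a hypothetical helper name, the record is assumed to be already assembled into one string, and a literal quote inside a quoted field is assumed to be written as `""`. A `while` loop with `\G` is used instead of a bare list-context match so that only the capture group that actually matched is kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one complete record into fields.
# split_record() is a hypothetical helper, not part of any module.
sub split_record {
    my ($record) = @_;
    my @fields;
    my $text = "$record,";   # sentinel comma: every field is now comma-terminated
    # \G anchors each match where the previous one ended, so nothing is skipped.
    # Alternative 1: a quoted field, which may hold commas, newlines and "".
    # Alternative 2: an unquoted field (possibly empty).
    while ($text =~ /\G(?:"((?:[^"]|"")*)"|([^",]*)),/g) {
        my $field = defined $1 ? $1 : $2;
        $field =~ s/""/"/g if defined $1;   # collapse doubled quotes
        push @fields, $field;
    }
    # Note: a malformed field (for example text after a closing quote) makes
    # the match fail at \G, and the loop stops early without reporting it.
    return @fields;
}

my @fields = split_record(qq{plain,"has, comma","say ""hi""",,last});
print join("|", @fields), "\n";
```

Because every match consumes at least the trailing comma, the loop cannot get stuck on an empty match, which is a common trap with patterns of the form `[^,]*`.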

-- 
Cheers,
Laxminarayan G Kamath A
e-mail: [EMAIL PROTECTED]
Work URL: http://deeproot.in





Re: Outlook CSV Parser

2007-05-30 Thread Dr.Ruud
Laxminarayan G Kamath A wrote:

> The stumbling blocks: there are several types of problems in
> Outlook's CSV ..

You forgot to supply a link to such a file. Or show a __DATA__ section
for testing.

-- 
Affijn, Ruud

Gewoon is een tijger.






Re: Outlook CSV Parser

2007-05-30 Thread Mumia W.

On 05/30/2007 03:04 AM, Laxminarayan G Kamath A wrote:

> [...]
> I tried a lot of different ways but just could not get the right
> regexp :-(.



I reiterate what the eminent Dr. Ruud said. I need some data to play 
with before I play with the code you posted.









Re: Outlook CSV Parser

2007-05-30 Thread Ken Foskey
On Wed, 2007-05-30 at 13:34 +0530, Laxminarayan G Kamath A wrote:

> What I am expecting is help with a variant of the regex I used as the
> condition for the while loop. I am sure that if we modify that regexp a
> little bit, then we can just use it on the record like this:
>
> $_ = $record;
> @fields = /regexp/g;
>
> I tried a lot of different ways but just could not get the right
> regexp :-(.

CSV is a horrible format. It is far too unreliable: we have exported CSV
from Excel that imported differently back into Excel.

Is there another option, e.g. connecting to Outlook via a remote
connection?

Is there another format available?

I doubt a simple regex will do it if the CSV modules do not work.

What data do you have problems with?  Without samples there is nothing
we can do.


-- 
Ken Foskey
FOSS developer






Re: Outlook CSV Parser

2007-05-30 Thread Chas Owens

On 5/30/07, Laxminarayan G Kamath A [EMAIL PROTECTED] wrote:
snip

> Any ways of optimising it further?

snip

Premature optimization is the root of all evil.  Have you profiled the
code yet?  If not, then here is some documentation that will point you
in the right direction:

http://www.perl.com/pub/a/2004/06/25/profiling.html
http://search.cpan.org/~nwclark/perl-5.8.8/utils/dprofpp.PL
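
Once a profiler has pointed at a hot spot, the core `Benchmark` module can compare candidate replacements for it. The sketch below is an illustration only: it times two ways of counting double quotes in a line, the sample line is made up, and `count_with_match` and `count_with_tr` are hypothetical names. It is a micro-benchmark, not a profile of the parser from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample line: 30 quoted fields, i.e. 60 double quotes.
my $line = join(",", map { qq{"field $_"} } 1 .. 30) . "\n";

# Count quotes with a global match forced into list context.
sub count_with_match { my $n = () = $_[0] =~ /"/g; return $n }

# Count quotes with tr///, which only counts when no replacement is given.
sub count_with_tr { return $_[0] =~ tr/"// }

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    match => sub { count_with_match($line) },
    tr    => sub { count_with_tr($line) },
});
```

The absolute numbers depend on the machine and the Perl build, so only the relative ordering in the table is worth reading.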

But while I am looking, let's see what is going on.

snip

> 1. One line need not be one record. Records may contain multi-line
> fields.
> 2. A sigh of relief, though: only multi-line fields are wrapped in
> double quotes.
> 3. Commas appear both inside and outside the fields. The ones inside
> the fields must not be treated as separators; again, fields with
> commas are wrapped in double quotes.

snip

The following code seems to speed up the parsing by two orders of
magnitude (2.214 seconds for the old code and 0.036 seconds for this
code on 100 records).  Also, there seems to be a bug in your original
code.  I set up a test file with 100 records of 30 fields each and it
found

found 33 fields in 1 records
found 34 fields in 1 records
found 36 fields in 3 records
found 37 fields in 5 records
found 38 fields in 10 records
found 39 fields in 9 records
found 40 fields in 12 records
found 41 fields in 17 records
found 42 fields in 15 records
found 43 fields in 13 records
found 44 fields in 7 records
found 45 fields in 5 records
found 46 fields in 1 records
found 48 fields in 1 records

===code to generate test file===
#!/usr/bin/perl

use strict;
use warnings;

my $fields    = 30;
my $fieldlen  = 30;
my @fieldtype = qw(normal quoted comma);
my $records   = shift;

for my $rec (1 .. $records) {
    for my $field (1 .. $fields) {
        my $type = $fieldtype[int rand @fieldtype];
        if ($type eq 'normal') {
            print 'n' x $fieldlen, ",";
        } elsif ($type eq 'quoted') {
            print '"';
            my $i = 0;
            until ($i > $fieldlen) {
                my $len = int rand $fieldlen;
                print 'q' x $len, "\n";
                $i += $len;
            }
            print '",';
        } elsif ($type eq 'comma') {
            print '"';
            my $i = 0;
            until ($i == $fieldlen) {
                my $len = int rand $fieldlen;
                $len = $fieldlen - $i if $i+$len > $fieldlen;
                print 'c' x ($len/2), ',', 'c' x ($len/2), "\n";
                $i += $len;
            }
            print '",';
        }
    }
    print "\n";
}

===code to parse test file===
#!/usr/bin/perl

use strict;
use warnings;

my $record = "";
my $quotes = 0;
my @records;
while (defined (my $line = <>)) {
    next if $record eq "" and $line =~ /^\s*$/;

    $record .= $line;

    #count the number of quotes
    $quotes += () = $line =~ /"/g;

    #if $quotes is even then we have a full record
    if ($quotes % 2 == 0) {
        $quotes = 0;
        chomp $record;
        my @fields;
        my $unbalanced = 0;
        for my $field (split /,/, $record) {
            my $count = $field =~ s/"//g;
            if ($count % 2) {
                if ($unbalanced) {
                    $unbalanced = 0;
                    $fields[-1] .= ",$field";
                    next;
                }
                $unbalanced = 1;
                push @fields, $field;
                next;
            }
            if ($unbalanced) {
                $fields[-1] .= ",$field";
            } else {
                push @fields, $field;
            }
        }
        push @records, { whole => $record, fields => \@fields };
        $record = "";
    }

}

for my $rec (@records) {
    print join "|", @{$rec->{fields}},"\n===\n";
}





Re: Outlook CSV Parser

2007-05-30 Thread Chas Owens

On 5/30/07, Ken Foskey [EMAIL PROTECTED] wrote:
snip

> CSV is a horrible format. It is far too unreliable: we have exported CSV
> from Excel that imported differently back into Excel.

snip

Just a pedantic nitpick, but CSV is an incredibly reliable format; the
problem is finding programs that actually use CSV rather than a CSV-like
format.  It works out to the same thing, but it isn't CSV's fault.
For an example of a programmer using a CSV-like format where he/she
should be using the real thing, look at my other post on this thread.
My code fails to handle escaped double quotes correctly.
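
To illustrate the escaped-quote problem: deleting every `"` from a field also deletes the quotes that belong to the data. A minimal sketch of a gentler approach follows, assuming the RFC 4180 convention that a literal quote inside a quoted field is written as `""`. `unquote_field` is a hypothetical helper name, and it is meant to be applied to a whole field after any comma-split pieces have been rejoined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove the wrapping quotes from a field and collapse doubled quotes,
# leaving unquoted fields untouched.
# unquote_field() is a hypothetical helper, not part of any module.
sub unquote_field {
    my ($field) = @_;
    if ($field =~ /\A"(.*)"\z/s) {   # /s so that . also matches embedded newlines
        $field = $1;
        $field =~ s/""/"/g;          # "" inside the quotes stands for one "
    }
    return $field;
}

my $raw = '"say ""hi"""';
(my $stripped = $raw) =~ s/"//g;     # what deleting every quote does
print "stripped: $stripped\n";
print "unquoted: ", unquote_field($raw), "\n";
```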





Outlook CSV Parser

2007-05-29 Thread Laxminarayan G Kamath A

Hi PERLers, 
We here at DeepRoot Linux were trying to parse Outlook's csv so
that I can add them to ldap addressbook.. After several futile tries
with lots of built-in packages, we decided it was up to us to
devise an algorithm. As time was of importance, we wrote a "well, it
works!" character-by-character parsing Perl algorithm.
Wondering how I could make it run faster, I thought I
would give regular expressions a try. The attached file is what I have
come up with, but it still takes more than 10 seconds on my 1.6 GHz P4
with 1 GB RAM to parse 6500+ lines of CSV and separate them out,
let alone importing them to LDAP. Any ways of optimising it further?
The stumbling blocks: there are several types of problems in
Outlook's CSV ..
1. One line need not be one record. Records may contain multi-line
fields.
2. A sigh of relief, though: only multi-line fields are wrapped in
double quotes.
3. Commas appear both inside and outside the fields. The ones inside
the fields must not be treated as separators; again, fields with
commas are wrapped in double quotes.
 
I hope I am on the right mailing list.. Else, please direct me to the
proper one.

-- 
Cheers,
Laxminarayan G Kamath A
e-mail: [EMAIL PROTECTED]
Work URL: http://deeproot.in





Re: Outlook CSV Parser

2007-05-29 Thread Laxminarayan G Kamath A
On Wed, 30 May 2007 11:10:00 +0530, Laxminarayan G Kamath A
[EMAIL PROTECTED] wrote:

> The attached file is what I have
> come up with, but it still takes more

... Had forgotten to attach the file.. 

-- 
Cheers,
Laxminarayan G Kamath A
e-mail: [EMAIL PROTECTED]
Work URL: http://deeproot.in
#!/usr/bin/perl

my ($readline, $line);
$readline = "";
$line = "";
@records = ();
@fields = ();
while (<STDIN>) {
	$readline = $_;
	if (($line eq "") && ($readline =~ /^\s*$/)) {
		next;
	}
	
	$line = $line . $readline;
	@fields=();
	if ( $line =~ /(^[^\"]*$)|(^([^\"]*(\"[^\"]*\")[^\"]*)+$)/s ) {
		chomp($line);
		print "$line\n";
		print "---\n";
		push (@records, $line);
		$last = "";
		$_ = $line;
		while (/\B\"([^\"]*)\"\B|([^\,\"]*)/g) {
			$res = "$1$2";
			push (@fields, $res);
			if ($res ne "") {
				/\B\"([^\"]*)\"\B|([^\,\"]*)/g;
			}
		}
		map({print "$_\n\n"} @fields);
		print "===\n";
		$line = "";
	}
	
}
