Re: extracting data from table embedded in Word document

2004-03-21 Thread Andrew Gaffney
R. Joseph Newton wrote:
Andrew Gaffney wrote:



The above code won't work because it splits on a comma. A lot of the fields contain a
comma somewhere within the actual data. If it was easy as that, I would have had this 
done
long ago ;)
--
Andrew Gaffney


Hi Andrew,

Don't count on it not being that easy.  Are you using the native capacities of the 
application to
their best.  I'm presuming here that a lawyer will have a full M$ Office suite, if 
they are u8sing
the tools at all.  You can paste tables from Word docs into Excel, and exoprt as CSV 
from there.
Excel should have a much broader range of data export filters.
It sound, though, like you will have a major job of normalization ahead.  I would 
foresee a bit of
hand work in the data design.  One interim step you might take, for multivalued 
fields, is to
concatenate them with some nuetral delimiter, such as a semi-colon.  This way, as you 
normalize to
break out any given field, you can use the Text::CSV module to get your fields, then 
split the
fields of interest on the semicolons.
If this material is already in an Office format, though, I would definitely recommend 
that you do as
much as possible within Office.  There is so much solid built-in functionality there 
that it
wouldn't make sense to low-level something you can do with a macro.
I tried copying from Word and pasting to Excel first. Excel would only let me export as 
tab delimeted text, which was fine. The problem was the fields with embedded '\r\n' which 
completely fscked up the output.

--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 



Re: extracting data from table embedded in Word document

2004-03-21 Thread R. Joseph Newton
Andrew Gaffney wrote:


>
> The above code won't work because it splits on a comma. A lot of the fields contain a
> comma somewhere within the actual data. If it was easy as that, I would have had 
> this done
> long ago ;)
>
> --
> Andrew Gaffney

Hi Andrew,

Don't count on it not being that easy.  Are you using the native capacities of the 
application to
their best.  I'm presuming here that a lawyer will have a full M$ Office suite, if 
they are u8sing
the tools at all.  You can paste tables from Word docs into Excel, and exoprt as CSV 
from there.
Excel should have a much broader range of data export filters.

It sound, though, like you will have a major job of normalization ahead.  I would 
foresee a bit of
hand work in the data design.  One interim step you might take, for multivalued 
fields, is to
concatenate them with some nuetral delimiter, such as a semi-colon.  This way, as you 
normalize to
break out any given field, you can use the Text::CSV module to get your fields, then 
split the
fields of interest on the semicolons.

If this material is already in an Office format, though, I would definitely recommend 
that you do as
much as possible within Office.  There is so much solid built-in functionality there 
that it
wouldn't make sense to low-level something you can do with a macro.

Joseph


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




Re: extracting data from table embedded in Word document

2004-03-21 Thread Andrew Gaffney
Charles K. Clarkson wrote:
Andrew Gaffney <[EMAIL PROTECTED]> wrtoe:
: 
: I managed to get Word to export it into a format where
: the fields are separated by '\r'. Yeah, kinda weird. I
: wrote the following:
: 
: open FILE, $file;
: my $counter = 0;
: 
: while () {
:while (/\r?([^\r]*)/sgc) {
:  $counter++;
:  if($counter == 10) {
:print "\n\n";
:$counter = 1;
:  }
:  my $temp = $1;
:  $temp =~ s/\n/~~~/sg;
:  $temp =~ s/\"//g;
:  $temp =~ s/\'/\\'/g;
:  print " $temp ";
:}
: }
: 
: This should print the contents of each field as it
: reads it, which it does seem to be doing. The only
: problem is that it seems to be printing "\n\n"
: after only 7 fields. Also, will that regex get
: everything I want it to? I need to capture
: everything between each set of '\r' including the
: first field which only has a trailing '\r'.

   Can you provicde some test data?
No, I can't. The data I'm working with is a lawyer's client list. I figured it out anyway. 
There were embedded '\r\n' in some of the field. My program was interpreting this as 2 
more fields, one with a '\n'. I just wrote a regex to filter any occurences of '\r\n' and 
the problem went away.

--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 



RE: extracting data from table embedded in Word document

2004-03-21 Thread Charles K. Clarkson
Andrew Gaffney <[EMAIL PROTECTED]> wrtoe:
: 
: I managed to get Word to export it into a format where
: the fields are separated by '\r'. Yeah, kinda weird. I
: wrote the following:
: 
: open FILE, $file;
: my $counter = 0;
: 
: while () {
:while (/\r?([^\r]*)/sgc) {
:  $counter++;
:  if($counter == 10) {
:print "\n\n";
:$counter = 1;
:  }
:  my $temp = $1;
:  $temp =~ s/\n/~~~/sg;
:  $temp =~ s/\"//g;
:  $temp =~ s/\'/\\'/g;
:  print " $temp ";
:}
: }
: 
: This should print the contents of each field as it
: reads it, which it does seem to be doing. The only
: problem is that it seems to be printing "\n\n"
: after only 7 fields. Also, will that regex get
: everything I want it to? I need to capture
: everything between each set of '\r' including the
: first field which only has a trailing '\r'.


   Can you provicde some test data?


HTH,

Charles K. Clarkson
-- 
Mobile Homes Specialist
254 968-8328


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




Re: extracting data from table embedded in Word document

2004-03-21 Thread Ralf Schaa
Andrew Gaffney wrote:

I'm writing a web-based client information system for a lawyer. His 
current client list is in a 137 page Word document with an embedded 
table. I can get it into a somewhat usable format by copying the 
entire table, pasting into Excel, and saving as tab delimeted text, 
but this has its problems.

[...]

This should print the contents of each field as it reads it, which it 
does seem to be doing. The only problem is that it seems to be 
printing "\n\n" after only 7 fields. Also, will that regex get 
everything I want it to? I need to capture everything between each set 
of '\r' including the first field which only has a trailing '\r'.
perhaps you want to a have look at the html exported from the word-doc. 
when the table is exported 'nice', you might want to start from there, 
since there are very powerfull html/xml parsers in perl.

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 



Re: extracting data from table embedded in Word document

2004-03-20 Thread Andrew Gaffney
Andrew Gaffney wrote:
I'm writing a web-based client information system for a lawyer. His 
current client list is in a 137 page Word document with an embedded 
table. I can get it into a somewhat usable format by copying the entire 
table, pasting into Excel, and saving as tab delimeted text, but this 
has its problems.

Some of the cells in the table have newlines in them. Because of this, 
when it's exported from Excel, the 2nd line will appear in the correct 
field, but on a line by itself:

Row 1FirstnameLastnameAddressCityStateZip
Phone   
AnotherPhone
Row 2First2LastaddyCityStateZip555-

So it looks like 3 records instead of 2. Does anyone have any ideas on 
how to pick apart the data to get it into the DB?
I managed to get Word to export it into a format where the fields are separated by '\r'. 
Yeah, kinda weird. I wrote the following:

open FILE, $file;
my $counter = 0;
while () {
  while (/\r?([^\r]*)/sgc) {
$counter++;
if($counter == 10) {
  print "\n\n";
  $counter = 1;
}
my $temp = $1;
$temp =~ s/\n/~~~/sg;
$temp =~ s/\"//g;
$temp =~ s/\'/\\'/g;
print " $temp ";
  }
}
This should print the contents of each field as it reads it, which it does seem to be 
doing. The only problem is that it seems to be printing "\n\n" after only 7 fields. Also, 
will that regex get everything I want it to? I need to capture everything between each set 
of '\r' including the first field which only has a trailing '\r'.

--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 



Re: extracting data from table embedded in Word document

2004-03-20 Thread Andrew Gaffney
R. Joseph Newton wrote:
Andrew Gaffney wrote:


I'm writing a web-based client information system for a lawyer.


[OT}
Eek!!  Be very, very careful about security, please.  I trust your employer has taken 
"duty of
care" isuues into account.
[/OT]

His current client list is
in a 137 page Word document with an embedded table. I can get it into a somewhat usable
format by copying the entire table, pasting into Excel, and saving as tab delimeted 
text,
but this has its problems.
Some of the cells in the table have newlines in them. Because of this, when it's 
exported
from Excel, the 2nd line will appear in the correct field, but on a line by itself:
Row 1,   Firstname,   Lastname,Address, City,State,   Zip, Phone,
Don't export as space,-delimited.  Use CSV, which is much more portable.
That was what I wanted to do in the first place, but Word doesn't support exporting a 
document with a table as CSV.

Row 2,   First middle,  Last,   addy,City,State,   Zip, 555-

So it looks like 3 records instead of 2. Does anyone have any ideas on how to pick 
apart
the data to get it into the DB?
The problem here is more one of datqa design, specifically data normalization, than
formatting.  Neither documents nor spreadsheets are really good storage methods for 
large
scale data.  For long-term robustness, this data should probably be in a normalized 
database.
Multi-valued fields generally indicate data that should be broken out into a table of 
its own:
Believe me, I cringed when I saw this Word doc. I'm writing this script to import the data 
in the Word doc into a MySQL DB in order to do away with the doc. I will personally be 
overwriting the Word docs with /dev/zero multiple times once I'm done ;)

Now as far as the particular mehtod to use for picking the data apart, that depends. 
Is the
phone number the only  multivalued field?  If so it should be fairly straightforward.  
  It
would help if we could actually see some of the data you have to work with.  Might be 
a bit of
work, since you will have to sustitute aliases--but realistic please, for any 
identifying
information.
The phone number isn't the only field like that. Just scanning through the doc, I see this 
problem in most of the fields at some point in the doc.

Greetings! C:\Documents and Settings\rjnewton>perl -w
my @rows;
push @rows, $_ while ;
chomp @rows;
my %phones;
my $current_row_tag;
my %data_rows;
foreach $row (@rows) {
   last unless $row =~ /\S/;
   if ($row =~ /^\w/) {
  my ($row_tag, $given_name, $last_name, $streetaddress, $city, $state,
  $zip, $phone) = split /,\s*/, $row;
 $phones{$row_tag} = [];
 push @{$phones{$row_tag}}, $phone;
 my $customer_data = {};
 $customer_data->{'given name'} = $given_name;
 $customer_data->{'last name'} = $last_name;
 $customer_data->{'street address'} = $streetaddress;
 $customer_data->{'City'} = $city;
 $customer_data->{'State'} = $state;
 $customer_data->{'Zip'} = $zip;
 $data_rows{$row_tag} = $customer_data;
 $current_row_tag = $row_tag;
   } else {
  $row =~ s/^\s*//;
  $row =~ s/\s*$//;
  push @{$phones{$current_row_tag}}, $row;
   }
}
foreach $client_id (sort keys %data_rows) {
   my $client = $data_rows{$client_id};
   print "$client->{'given name'} $client->{'last name'}\n";
   print "$_\n" foreach @{$phones{$client_id}};
}
The above code won't work because it splits on a comma. A lot of the fields contain a 
comma somewhere within the actual data. If it was easy as that, I would have had this done 
long ago ;)

--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 



Re: extracting data from table embedded in Word document

2004-03-20 Thread R. Joseph Newton
Andrew Gaffney wrote:

> I'm writing a web-based client information system for a lawyer.

[OT}
Eek!!  Be very, very careful about security, please.  I trust your employer has taken 
"duty of
care" isuues into account.
[/OT]

> His current client list is
> in a 137 page Word document with an embedded table. I can get it into a somewhat 
> usable
> format by copying the entire table, pasting into Excel, and saving as tab delimeted 
> text,
> but this has its problems.
>
> Some of the cells in the table have newlines in them. Because of this, when it's 
> exported
> from Excel, the 2nd line will appear in the correct field, but on a line by itself:
>
> Row 1,   Firstname,   Lastname,Address, City,State,   Zip, Phone,
> AnotherPhone

Don't export as space,-delimited.  Use CSV, which is much more portable.

>
> Row 2,   First middle,  Last,   addy,City,State,   Zip, 
> 555-
>
> So it looks like 3 records instead of 2. Does anyone have any ideas on how to pick 
> apart
> the data to get it into the DB?
>
> --
> Andrew Gaffney

Hi Andrew,

The problem here is more one of datqa design, specifically data normalization, than
formatting.  Neither documents nor spreadsheets are really good storage methods for 
large
scale data.  For long-term robustness, this data should probably be in a normalized 
database.
Multi-valued fields generally indicate data that should be broken out into a table of 
its own:

Assuming some unique ID for each client [*not* the name; names do not guarantee 
uniquesness],
such a table could be very simple.

client_id auto_number unique primary key
client_phone character[20]
primary_number boolean
I'm a little rusty on my SQL today, so the types I list above are logical types.  You 
will
have to check your SQL reference to find the appropriate available types.  These can 
easily be
linked with simple SQL for presentation purposes.  This structure also better supports
working with the data of the fields broken out this way.

I know this isn't a very Perl-ish response, but if you are going to rake on the task of
working with this data anyway, you might as well do it right, and build something that 
will
hold up over time.  Then you can use querying tools or speadheets to prepare any needed
presentation format.

Now as far as the particular mehtod to use for picking the data apart, that depends. 
Is the
phone number the only  multivalued field?  If so it should be fairly straightforward.  
  It
would help if we could actually see some of the data you have to work with.  Might be 
a bit of
work, since you will have to sustitute aliases--but realistic please, for any 
identifying
information.


Greetings! C:\Documents and Settings\rjnewton>perl -w
my @rows;
push @rows, $_ while ;
chomp @rows;
my %phones;
my $current_row_tag;
my %data_rows;
foreach $row (@rows) {
   last unless $row =~ /\S/;
   if ($row =~ /^\w/) {
  my ($row_tag, $given_name, $last_name, $streetaddress, $city, $state,
  $zip, $phone) = split /,\s*/, $row;
 $phones{$row_tag} = [];
 push @{$phones{$row_tag}}, $phone;
 my $customer_data = {};
 $customer_data->{'given name'} = $given_name;
 $customer_data->{'last name'} = $last_name;
 $customer_data->{'street address'} = $streetaddress;
 $customer_data->{'City'} = $city;
 $customer_data->{'State'} = $state;
 $customer_data->{'Zip'} = $zip;
 $data_rows{$row_tag} = $customer_data;
 $current_row_tag = $row_tag;
   } else {
  $row =~ s/^\s*//;
  $row =~ s/\s*$//;
  push @{$phones{$current_row_tag}}, $row;
   }
}

foreach $client_id (sort keys %data_rows) {
   my $client = $data_rows{$client_id};
   print "$client->{'given name'} $client->{'last name'}\n";
   print "$_\n" foreach @{$phones{$client_id}};
}

^Z
Row 1,   Firstname,   Lastname,Address, City,State,   Zip, Phone,
AnotherPhone
Row 2,   First middle,  Last,   addy,City,State,   Zip,
555-

^Z
Firstname Lastname
Phone
AnotherPhone
First middle Last
555-

HTH,

Joseph


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




RE: extracting data from table embedded in Word document

2004-03-20 Thread Charles K. Clarkson
Andrew Gaffney <[EMAIL PROTECTED]> wrote:
: 
: Some of the cells in the table have newlines in them.
: Because of this, when it's exported from Excel, the
: 2nd line will appear in the correct field, but on a
: line by itself:
: 
[snip]
: 
: So it looks like 3 records instead of 2. Does anyone
: have any ideas on how to pick apart the data to get
: it into the DB?


How about replacing the newlines in a copy of the
Word file. Swap them for something that is not in the
table, like "~~~". Then add them back as you read the
records into your database.


HTH,

Charles K. Clarkson
-- 
Mobile Homes Specialist
254 968-8328


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




extracting data from table embedded in Word document

2004-03-20 Thread Andrew Gaffney
I'm writing a web-based client information system for a lawyer. His current client list is 
in a 137 page Word document with an embedded table. I can get it into a somewhat usable 
format by copying the entire table, pasting into Excel, and saving as tab delimeted text, 
but this has its problems.

Some of the cells in the table have newlines in them. Because of this, when it's exported 
from Excel, the 2nd line will appear in the correct field, but on a line by itself:

Row 1   Firstname   LastnameAddress CityState   Zip Phone   
AnotherPhone
Row 2   First2  LastaddyCityState   Zip 555-
So it looks like 3 records instead of 2. Does anyone have any ideas on how to pick apart 
the data to get it into the DB?

--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548
--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]