Re: Parsing a PDF with empty fields

2011-04-14 Thread Ted Schuerzinger
(Apologies to the OP for first sending this to him and not the group.)

On Wed, 13 Apr 2011 19:16:25 -0400, in perl you wrote:

> Disclaimer: I'm using Acrobat 10 and Office 2010, so YMMV.
> 
> I opened the PDF in Acrobat and selected File, Save As, then chose
> "Tables in Excel Spreadsheet (*.xml)". I opened the resulting xml in
> Excel, and the data was properly lined up in columns. There were only
> 50 entries per sheet (one PDF page per sheet), but there were only 5
> sheets, and the fifth sheet had the last page of the PDF. So it was
> missing most of the data from the PDF.

I'm using Foxit, which doesn't have the options you describe.  What it 
does have is something I found incredibly irritating: a function called 
"Text Viewer".  Select this function, and the document appears as 
fixed-width plain text, which is exactly a format I could use: I'd just 
have to count the number of characters in what would be each column, and 
create the array that way.  However, the "Text Viewer" function *doesn't 
allow you to copy text*.  (At least, not with the freeware version.)
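
A minimal sketch of the column-counting idea, assuming the fixed-width text could be captured somehow. The column widths, the helper name parse_fixed, and the sample line (including which column is blank) are invented for illustration and are not measured from the real WTA file.

```perl
use strict;
use warnings;

# Hypothetical column widths; the real ones would have to be counted
# from the actual fixed-width output.
my $template = 'A5 A6 A30 A4 A6 A4 A6 A6 A6 A6';

# Cut one fixed-width line into its ten columns.
sub parse_fixed {
    my ($line) = @_;
    # 'A' strips trailing spaces, so an all-blank column comes back as ''.
    my @fields = unpack $template, $line;
    for (@fields) {
        s/^\s+//;              # drop leading spaces (right-aligned numbers)
        $_ = 0 if $_ eq '';    # a blank column becomes an explicit 0
    }
    return @fields;
}

# Made-up sample line; the position of the blank column is an assumption.
my $line = sprintf '%-5s%-6s%-30s%-4s%-6s%-4s%-6s%-6s%-6s%-6s',
    3, '(3)', 'ZVONAREVA, VERA', 'RUS', 7815, 20, 320, '', 125, 60;
print join("\t", parse_fixed($line)), "\n";
```

Because the columns are cut by position rather than by whitespace, a blank column cannot shift the later values to the left.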

And it's claimed that PDF stands for "Portable" document format.  At 
every turn I find PDFs to be less portable than plain text.

-- 
Ted S.
fedya at hughes dot net
Now blogging at http://justacineast.blogspot.com
___
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs


RE: Parsing a PDF with empty fields

2011-04-13 Thread Greg Aiken
other ideas:

ocr the page with an app that saves not only the ascii text, but each word's
x,y position within the page (position info will tell if field is empty or
data)
http://code.google.com/p/keenerview/wiki/keenerocr
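
a rough sketch of how position info could be used, assuming some tool has already produced an x coordinate for each word in a row. the column positions in @col_x and the helper name place_in_columns are hypothetical, not taken from the WTA file or from any particular ocr tool.

```perl
use strict;
use warnings;

# Hypothetical x positions (in points) of the four trailing numeric columns.
my @col_x = (400, 450, 500, 550);

# Takes [x, text] pairs for one row and returns one value per column,
# with 0 left in any column that no word landed in.
sub place_in_columns {
    my @words = @_;
    my @row = (0) x @col_x;
    for my $w (@words) {
        my ($x, $text) = @$w;
        # Find the column whose x position is closest to this word.
        my $best = 0;
        for my $i (1 .. $#col_x) {
            $best = $i if abs($x - $col_x[$i]) < abs($x - $col_x[$best]);
        }
        $row[$best] = $text;
    }
    return @row;
}

# Three words with made-up coordinates; nothing falls near x = 450.
print join(',', place_in_columns([402, 320], [498, 125], [551, 60])), "\n";
```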

a gui tool that helps one investigate the inner structure of a pdf (again,
before text is rendered in a PDF, there's a concept of placing the position
of the char string to be rendered)
http://www.cheapimpostor.com/PDFInspector/

Adobe's PDF sdk page, which somewhere has the actual PDF spec
http://www.adobe.com/devnet/acrobat.html

as you've discovered, there's a concept of 'compressing' the various object
types in a pdf file.  one may compress ascii text content, which has been
done here, and that makes parsing the file non-trivial.

to make matters worse, it might be that not each 'word' is directed to be
printed as a 'string' (eg a set of characters making up a human
distinguishable 'word'), but in fact, because both Postscript and PDF support
precision leading and kerning controls, you might find that literally each
'character' is set to a specific location before being rendered (thus making
it hard to 'find a word' in the PDF text object stream).
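
if the characters really are placed one at a time, one possible way to get words back is to glue together characters whose x positions are close. this is only a sketch: the helper name glyphs_to_words, the gap threshold and the coordinates are all made up, and real glyph widths vary.

```perl
use strict;
use warnings;

# Takes a maximum gap and a list of [x, char] pairs from one text line.
# Characters closer together than the gap are joined into one word.
sub glyphs_to_words {
    my ($max_gap, @glyphs) = @_;
    my (@words, $prev_x);
    for my $g (sort { $a->[0] <=> $b->[0] } @glyphs) {
        my ($x, $ch) = @$g;
        if (defined $prev_x && $x - $prev_x <= $max_gap) {
            $words[-1] .= $ch;      # close to the previous glyph: same word
        } else {
            push @words, $ch;       # large gap (or first glyph): new word
        }
        $prev_x = $x;
    }
    return @words;
}

# Made-up positions: four digits 6 points apart, then a jump, then two more.
my @glyphs = ([100, '7'], [106, '8'], [112, '1'], [118, '5'],
              [150, '2'], [156, '0']);
print join(' ', glyphs_to_words(7, @glyphs)), "\n";
```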

please let me know what you found as a best solution.

-Original Message-
From: perl-win32-users-boun...@listserv.activestate.com
[mailto:perl-win32-users-boun...@listserv.activestate.com] On Behalf Of Ted
Schuerzinger
Sent: Wednesday, April 13, 2011 1:53 PM
To: Perl-Win32-Users@listserv.ActiveState.com
Subject: Parsing a PDF with empty fields

From time to time I need to parse the rankings PDF files from the
Women's Tennis Association.  I would do it by copying all the text,
pasting it into a text file, and then using a Perl script to take each
line, splitting it by spaces into an array, and then manipulating those
arrays as need be.  (Some tennis players have multiple spaces in their
names, such as the Spanish players who use their matronymics, but
reversing the array gets around that fairly easily.)  That meant getting
rid of fields I don't need (e.g. a player's nationality), putting in
tabs, and printing the modified list to a different text file so that I
could open it in a spreadsheet program as a tab-delimited
file.

However, the WTA has changed its PDFs.  There used to be numerical
fields where some of the values would be a 0.  The WTA has replaced these
0's with *empty fields*.  Consider a player who has values for every
field:

1 (1) WOZNIACKI, CAROLINE DEN 9930 23 470 200 280 200

The various fields are in order, 
This week's ranking, 
Last week's ranking, 
Name, 
Nationality, 
Total ranking points, 
Events played, 
Points earned last week, 
Points coming off the rankings for next week, 
Points earned in 16th-best result, and 
Points earned in 17th-best result.
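
For reference, the split-and-reverse approach described above can be sketched roughly as follows when every field is present. This is a reconstruction for illustration, not the actual script; the helper name parse_old_line and the output layout are assumptions.

```perl
use strict;
use warnings;

# Parse one ranking line in which all ten fields are present.
sub parse_old_line {
    my ($line) = @_;
    # Work from the end of the line, so the name can be any number of words.
    my @f = reverse split ' ', $line;
    my ($r17, $r16, $off, $last, $events, $total, $nat) = splice @f, 0, 7;
    # What remains, put back in order: rank, (last rank), name words.
    my ($rank, $prev, @name) = reverse @f;
    $prev =~ tr/()//d;               # "(1)" becomes "1"
    # Nationality ($nat) is dropped from the output on purpose.
    return join "\t", $rank, $prev, "@name",
                      $total, $events, $last, $off, $r16, $r17;
}

print parse_old_line(
    '1 (1) WOZNIACKI, CAROLINE DEN 9930 23 470 200 280 200'), "\n";
```

Counting seven fields back from the end only works while all of the numeric fields are really there.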

Now it's fairly obvious that not everybody plays every week, so there
are going to be players with 0's in either the "points earned last week"
field or the "points coming off the rankings" (the rankings are a
rolling 52-week system, so this field is the number of points the player
earned in the same calendar week last year).  Also, there are players
who haven't played 16 events and so have a 0 for the 16th or 17th
result.  In the past, such fields would have an actual 0, which makes
parsing the PDF easy.  Now, those are blank fields.

In a PDF file, it's visually obvious which fields are empty:

 

However, copying to text gets rid of these empty fields, making the old
Perl script I used now useless.  If you download the ~140KB PDF, you'll
see that #3 Zvonareva and #11 Peer each have one field (and they're
different for the two players) that would have a 0, but when copied to a
text file, you get these results:

3 (3) ZVONAREVA, VERA RUS 7815 20 320 125 60
11 (11) PEER, SHAHAR ISR 3030 22 60 60 60

You can't tell which field would have had the 0.  (And if you look far
enough down the rankings, wait until you get to #172 Sloane Stephens,
who has exactly 16 events, which means she's got a 0 for the 17th
tournament, but this particular week has an event added but none coming
off the rankings.)
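
To make the ambiguity concrete, here is a small sketch that counts the numbers after the three-letter nationality code. It can say how many fields are missing, but nothing in the copied text says which ones. The helper name missing_count is made up, and the pattern assumes the nationality is always three capital letters.

```perl
use strict;
use warnings;

# Return how many of the six numeric fields are absent from a copied line.
sub missing_count {
    my ($line) = @_;
    # Capture everything after the nationality code: digits and spaces only.
    my ($tail) = $line =~ /\b[A-Z]{3}\s+([\d\s]+)$/ or return;
    my @nums = split ' ', $tail;
    return 6 - @nums;
}

for my $line ('1 (1) WOZNIACKI, CAROLINE DEN 9930 23 470 200 280 200',
              '3 (3) ZVONAREVA, VERA RUS 7815 20 320 125 60',
              '11 (11) PEER, SHAHAR ISR 3030 22 60 60 60') {
    print missing_count($line), " missing: $line\n";
}
```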

Any good idea on how to get around this?  I presume there's a
PDF-parsing module, but I don't do all that much Perl programming,
limiting myself to text parsing, regexes and a bit more, so I'm not very
good with modules.

(Many years ago, the WTA rankings used to be fixed-width text files.  Boy
do I miss those days.)

-- 
Ted S.
fedya at hughes dot net
Now blogging at http://justacineast.blogspot.com


RE: Parsing a PDF with empty fields

2011-04-13 Thread Howard Tanner
Disclaimer: I'm using Acrobat 10 and Office 2010, so YMMV.

I opened the PDF in Acrobat and selected File, Save As, then chose "Tables
in Excel Spreadsheet (*.xml)". I opened the resulting xml in Excel, and the
data was properly lined up in columns. There were only 50 entries per sheet
(one PDF page per sheet), but there were only 5 sheets, and the fifth sheet
had the last page of the PDF. So it was missing most of the data from the
PDF.

I was able to highlight the table in Acrobat, right click in one of the
highlighted fields, and select "Open Table in Spreadsheet".  This opened
Excel automatically with the table data properly populated. I was then able
to export the table as a CSV, which gives you the data in a format you can
use. This only worked one page at a time, but it did work for every
page in the PDF, so it is probably your best bet.
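
Once the data is in CSV form, putting the zeros back could look something like the sketch below. It assumes the export leaves blank cells as empty fields and puts double quotes around the name (because of its comma); I have not checked what Excel writes in every case. The helper names and the sample line, including which field is blank, are made up.

```perl
use strict;
use warnings;

# Split one CSV line, honouring double-quoted fields that contain commas.
sub csv_fields {
    my ($line) = @_;
    chomp $line;
    my @fields;
    while ($line =~ /\G(?:"((?:[^"]|"")*)"|([^,]*))(,|$)/g) {
        my ($quoted, $plain, $sep) = ($1, $2, $3);
        my $f = defined $quoted ? $quoted : $plain;
        $f =~ s/""/"/g;              # a doubled quote is a literal quote
        push @fields, $f;
        last if $sep eq '';          # no comma followed: end of the line
    }
    return @fields;
}

# Replace every empty field with an explicit 0.
sub fill_zeros {
    return map { length $_ ? $_ : 0 } csv_fields($_[0]);
}

# Made-up sample line; the position of the empty field is an assumption.
my $csv = '3,(3),"ZVONAREVA, VERA",RUS,7815,20,320,,125,60';
print join("\t", fill_zeros($csv)), "\n";
```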

This will be tedious for all 25 pages in the PDF, especially considering you
have to do this more than once. I recommend looking into AutoIt for
automating the conversion to CSV. AutoIt is my favorite automation tool, and
it's free.

Good luck!
