RE: How to pull Text from a PDF using Perl?

2007-01-05 Thread Wagner, David --- Senior Programmer Analyst --- WGO
-----Original Message-----
From: Dave Gray [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 04, 2007 17:01
To: beginners@perl.org
Cc: Wagner, David --- Senior Programmer Analyst --- WGO
Subject: Re: How to pull Text from a PDF using Perl?

On 1/4/07, Wagner, David --- Senior Programmer Analyst --- WGO
<[EMAIL PROTECTED]> wrote:
> I need to look at the text from page 1 of a couple of thousand
> PDFs and run a regex against it to find the data.
> Before sending this I tried a number of other things, but they
> either died or showed me data like the above.
>
> Any insight or simple script which will display the text would
> be greatly appreciated.

I had to do this the other day and got frustrated with the modules I
found and ended up using pdftotext which comes with xpdf, like so:

  # pdftotext ends each page with a form feed (Ctrl-L, "\f")
  my @pages = split /\f/, `$pdftotext -layout $inputfile -`;
  for my $page (@pages) {
    # do stuff
  }

Without the -layout switch, parsing any sort of tabular data becomes a
lot more annoying.

Cheers,
Dave
-

Thanks much, David. I searched the internet and found the
download site. I downloaded and unzipped it, then made an initial run,
and it got the data I expected. Now I will give it a try on a larger scale.

I appreciate the time and response.

  Wags ;)
David R Wagner
Senior Programmer Analyst
FedEx Freight
1.408.323.4225x2224 TEL
1.408.323.4449   FAX
http://fedex.com/us 

**
This message contains information that is confidential and proprietary to FedEx 
Freight or its affiliates.  It is intended only for the recipient named and for 
the express  purpose(s) described therein.  Any other use is prohibited.
**


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/




Re: How to pull Text from a PDF using Perl?

2007-01-04 Thread Dave Gray

On 1/4/07, Wagner, David --- Senior Programmer Analyst --- WGO
<[EMAIL PROTECTED]> wrote:

> I need to look at the text from page 1 of a couple of thousand
> PDFs and run a regex against it to find the data.
> Before sending this I tried a number of other things, but they
> either died or showed me data like the above.
>
> Any insight or simple script which will display the text would
> be greatly appreciated.


I had to do this the other day and got frustrated with the modules I
found and ended up using pdftotext which comes with xpdf, like so:

  # pdftotext ends each page with a form feed (Ctrl-L, "\f")
  my @pages = split /\f/, `$pdftotext -layout $inputfile -`;
  for my $page (@pages) {
    # do stuff
  }

Without the -layout switch, parsing any sort of tabular data becomes a
lot more annoying.
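
For the stated goal (a regex over page 1 of many PDFs), the idea above can be sketched a little more fully. This is only a sketch under assumptions: it assumes pdftotext is on the PATH and that your Perl supports list-form pipe open on your platform, it uses pdftotext's -f and -l options to limit output to page 1, and the helper names (split_pages, first_page_text) and the "Invoice No." pattern are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# pdftotext ends each page with a form feed ("\f"), so splitting on it
# yields one string per page.
sub split_pages {
    my ($text) = @_;
    return split /\f/, $text;
}

# Return the text of page 1 of $pdf_file, or undef if pdftotext fails.
# Assumes pdftotext is on the PATH. List-form open bypasses the shell,
# so file names with spaces or quotes need no escaping.
sub first_page_text {
    my ($pdf_file) = @_;
    open(my $fh, '-|', 'pdftotext', '-f', '1', '-l', '1',
         '-layout', $pdf_file, '-')
        or return;
    my $text = do { local $/; <$fh> };   # slurp all output at once
    close($fh) or return;                # false if pdftotext exited non-zero
    return $text;
}

# Hypothetical pattern; replace it with whatever you are searching for.
my $pattern = qr/Invoice\s+No\.?\s*(\d+)/;

# Process every PDF named on the command line.
for my $pdf_file (@ARGV) {
    my $text = first_page_text($pdf_file);
    if (!defined $text) {
        warn "could not extract text from $pdf_file\n";
        next;
    }
    if ($text =~ $pattern) {
        print "$pdf_file: $1\n";
    }
    else {
        print "$pdf_file: no match\n";
    }
}
```

It would be invoked with the PDF files as arguments, for example perl page1.pl a.pdf b.pdf (the script name is hypothetical). Asking pdftotext for page 1 only avoids converting the rest of each document; split_pages is there for the case where you do want every page from a single run.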

Cheers,
Dave





How to pull Text from a PDF using Perl?

2007-01-04 Thread Wagner, David --- Senior Programmer Analyst --- WGO
I have tried both PDF::API2 and CAM::PDF, and I must be misunderstanding how to 
use these modules. Here is how I attempted it using CAM::PDF:

Source portion:
…
use CAM::PDF;


$MyPDF = CAM::PDF->new($MyFileIn); # a PDF file which has text

$MyPDFPgCnt = $MyPDF->numPages();

my $contentTree = $MyPDF->getPageContentTree(1);
$contentTree->render("CAM::PDF::Renderer::Text");

I get a lot of blank lines, and the characters I do get look like:

 3 U L Q W ♥ ' D W H ↔ ♥ ¶ § ↕ § § ↕ § ‼ ‼ ↓


   & K L O G ♥ $ F F R 
X Q W V
7 L P H ↔ ♥ ¶ § ↔ ¶ ∟ 3 0

I need to look at the text from page 1 of a couple of thousand PDFs 
and run a regex against it to find the data.
Before sending this I tried a number of other things, but they either died or 
showed me data like the above.

Any insight or simple script which will display the text would be 
greatly appreciated.

 Thanks.

  Wags ;)
David R Wagner
Senior Programmer Analyst
FedEx Freight
1.408.323.4225x2224 TEL
1.408.323.4449   FAX
http://fedex.com/us 

