Hi,

I've been looking for a tool to convert PDF files to plain text so that
a miner bot can track data of interest.

The text in the PDF files is laid out in columns, and that layout has to
survive the conversion for the text to be parsed.  I'm dealing with about
1000 PDF files, so I'd really like to avoid converting them by hand.

The script tools I've tried so far all mangle the formatting.  Pdfedit
is the only tool I've found that respects whitespace, and it does what I
want, except that version 0.4.5 clips lines, and I can't get pdf_to_text
to compile under 0.4.1.

A specimen PDF file that shows the problem is here:
http://tinyurl.com/29a4xhw

If I use v 0.4.1 and save as text, it works fine.

If I use pdf_to_text (v 0.4.5), the text is clipped around column 117,
   and saving as text from the 0.4.5 GUI gives the same clipped results.

Attached is part of the diff comparing the output from 0.4.1 and
0.4.5.
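
For anyone who wants to check the two outputs across many files, here is
a minimal sketch of how the clipped lines could be listed automatically.
It assumes the 0.4.1 and 0.4.5 text outputs have the same number of lines
in the same order; the function name and the file arguments are made up
for illustration and are not part of pdfedit.  It only catches pure
truncations, so a line whose content also moved will not be reported.

```python
import sys


def find_clipped_lines(full_text, clipped_text):
    """Return (line_number, full_length, clipped_length) for every line
    where the second text is a strictly shorter prefix of the first.

    Trailing whitespace is ignored so padding differences do not count.
    """
    results = []
    pairs = zip(full_text.splitlines(), clipped_text.splitlines())
    for number, (full, clipped) in enumerate(pairs, start=1):
        full = full.rstrip()
        clipped = clipped.rstrip()
        if len(clipped) < len(full) and full.startswith(clipped):
            results.append((number, len(full), len(clipped)))
    return results


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python find_clipped.py out-0.4.1.txt out-0.4.5.txt
    with open(sys.argv[1]) as f_full, open(sys.argv[2]) as f_clipped:
        report = find_clipped_lines(f_full.read(), f_clipped.read())
    for number, full_len, clipped_len in report:
        print(f"line {number}: {full_len} chars clipped to {clipped_len}")
```

If the reported clipped lengths all cluster near one value, that would
point to a fixed line-width limit rather than a per-file problem.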

I went to 0.4.5 specifically to use the standalone tool pdf_to_text.

I'm wondering if you have any insight/fix/workaround for this.


        thanks,
                JE



-- 
john edstrom <[email protected]>
27,29c27,29
<                                                                                                                  Federal Communications Commission
<                                                                                                                  445 Twelfth Street SW
<                       PUBLIC NOTICE                                                                              Washington, D.C. 20554
---
>                                                                                                                  Federal Co
>                       PUBLIC NOTICE                                                                              445 Twelfth
>                                                                                                                  Washingto
31,32c31,32
<      REPORT NO. 27286                             Broadcast Applications                                                             7/27/2010
< STATE FILE NUMBER         E/P CALL LETTERS      APPLICANT AND LOCATION                 N A T U R E        O F     A P P L I C A T I O N
---
>      REPORT NO. 27286                             Broadcast Applications
> STATE FILE NUMBER         E/P CALL LETTERS      APPLICANT AND LOCATION                 N A T U R E        O F     A P P L I
35c35
<                                                 FELLOWSHIP                              From: HORIZON CHRISTIAN FELLOWSHIP
---
>                                                 FELLOWSHIP                              From: HORIZON CHRISTIAN FELLOW
.... etc.
_______________________________________________
Pdfedit-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pdfedit-support
