Re: Script to extract text from PDF files

Svenn Are Bjerkem Wed, 26 Sep 2007 13:58:28 -0700

On Sep 25, 9:18 pm, [EMAIL PROTECTED] wrote:
> On Sep 25, 3:02 pm, Paul Hankin <[EMAIL PROTECTED]> wrote:
>
> > Googling for 'pdf to text python' and following the first link
> > gives http://pybrary.net/pyPdf/
>
> Doesn't work that well, I've tried it, you should too... the author
> even admits this:
>
> extractText() [#]
>
>     Locate all text drawing commands, in the order they are provided
> in the content stream, and extract the text. This works well for some
> PDF files, but poorly for others, depending on the generator used.
> This will be refined in the future. Do not rely on the order of text
> coming out of this function, as it will change if this function is
> made more sophisticated. -
> source http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html


I downloaded and installed this package and found its text extraction
more or less useless. Comparing the code with the PDF spec shows that
text extraction is at a very early stage of implementation. Luckily,
it is possible to override the text-extraction method of the base
class without having to fiddle with the original code. I tried to
contact the developer to offer help with implementing text
extraction, but he didn't answer my emails.
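
A rough sketch of what overriding the method could look like. This is
not pyPdf's actual code: PageObject below is a hypothetical stand-in
for the library's page class, content_stream is an assumed attribute
holding the decoded content stream as a string, and the regular
expressions ignore escaped parentheses, hex strings and font encodings.
It only illustrates the subclass-and-override idea, using the PDF
text-showing operators Tj and TJ and the line-moving operators T*, Td
and TD.

```python
import re


class PageObject:
    """Hypothetical stand-in for the library's page class (not real pyPdf code)."""

    def __init__(self, content_stream):
        # Assumed attribute: the decoded page content stream as text.
        self.content_stream = content_stream

    def extractText(self):
        # Naive baseline: only picks up "(string) Tj" operators.
        return "".join(re.findall(r"\((.*?)\)\s*Tj", self.content_stream))


class BetterPageObject(PageObject):
    """Overrides extractText without touching the base class."""

    # Matches "(string) Tj", "[ ... ] TJ", or a line-moving operator.
    _TOKEN = re.compile(
        r"\((?P<tj>[^)]*)\)\s*Tj"
        r"|\[(?P<arr>[^\]]*)\]\s*TJ"
        r"|(?P<nl>T\*|T[dD]\b)"
    )

    def extractText(self):
        parts = []
        for m in self._TOKEN.finditer(self.content_stream):
            if m.group("tj") is not None:
                # Simple string shown with Tj.
                parts.append(m.group("tj"))
            elif m.group("arr") is not None:
                # TJ array: join the string pieces, drop the kerning numbers.
                parts.append("".join(re.findall(r"\(([^)]*)\)", m.group("arr"))))
            else:
                # T*, Td or TD: treat as the start of a new line.
                parts.append("\n")
        return "".join(parts)


stream = "BT (Hello) Tj T* [(Wor) -20 (ld)] TJ ET"
print(repr(PageObject(stream).extractText()))
print(repr(BetterPageObject(stream).extractText()))
```

On the sample stream the baseline sees only the Tj string, while the
subclass also picks up the TJ array and the line break. The same
replacement could probably be done by assigning a new function to the
method on the existing class instead of subclassing; either way the
installed package stays unmodified.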
--
Svenn

-- 
http://mail.python.org/mailman/listinfo/python-list
