On Sep 25, 9:18 pm, [EMAIL PROTECTED] wrote:
> On Sep 25, 3:02 pm, Paul Hankin <[EMAIL PROTECTED]> wrote:
>
> > Googling for 'pdf to text python' and following the first link
> > gives http://pybrary.net/pyPdf/
>
> Doesn't work that well, I've tried it, you should too... the author
> even admits this:
>
> extractText() [#]
>
> Locate all text drawing commands, in the order they are provided
> in the content stream, and extract the text. This works well for some
> PDF files, but poorly for others, depending on the generator used.
> This will be refined in the future. Do not rely on the order of text
> coming out of this function, as it will change if this function is
> made more sophisticated.
>
> - source http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html
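For reference, the approach that the quoted documentation describes can be sketched roughly as follows. This is not pyPdf's actual code, only a minimal illustration under some assumptions: the content stream is already decompressed and available as a plain string, only the Tj and TJ text-showing operators are handled, and the names naive_extract and unescape are made up for this example. It also shows why the output order depends on the generator: text comes out in content-stream order, not reading order.

```python
import re

# A PDF literal string: "(...)" with backslash escapes, no nested parentheses.
LITERAL = r"\((?:\\.|[^\\()])*\)"
LITERAL_RE = re.compile(LITERAL)

# Either "(text) Tj" (group 1) or "[(te) -20 (xt)] TJ" (group 2).
SHOW_TEXT = re.compile(r"(%s)\s*Tj|\[((?:%s|[^\]()])*)\]\s*TJ" % (LITERAL, LITERAL))


def unescape(literal):
    """Strip the outer parentheses and undo the \\( \\) \\\\ escapes only."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def naive_extract(stream):
    """Collect the text shown by Tj/TJ, in the order it appears in the stream."""
    parts = []
    for match in SHOW_TEXT.finditer(stream):
        if match.group(1) is not None:
            # Simple case: a single string shown with Tj.
            parts.append(unescape(match.group(1)))
        else:
            # TJ takes an array of strings and kerning numbers; keep the strings.
            parts.extend(unescape(s) for s in LITERAL_RE.findall(match.group(2)))
    return "".join(parts)
```

For instance, naive_extract("BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET") gives "HelloWorld". Fonts, encodings, hex strings and positioning are all ignored here, which is roughly where a real implementation has to start doing work.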
I downloaded and installed this package and found its text extraction more or less useless. Looking into the code and comparing it with the PDF spec shows a very early implementation of text extraction. Luckily, it is possible to override the text-extraction method in the base class without having to fiddle with the original code.

I tried to contact the developer to offer some help with implementing text extraction, but he didn't answer my emails.

--
Svenn
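P.S. A minimal sketch of the kind of override meant here, replacing the method on the class at runtime so the installed source stays untouched. To keep it runnable without the library, PageObject below is a hypothetical stand-in, and its content attribute and the better_extract_text function are invented for illustration. With the real package the assignment would presumably target the library's own page class instead (an assumption worth checking against the installed version).

```python
class PageObject(object):
    """Hypothetical stand-in for the library's page class."""

    def __init__(self, content):
        # Assumed attribute: the page's raw text content as a string.
        self.content = content

    def extractText(self):
        # Stands in for the shipped extractor: returns the content unprocessed.
        return self.content


def better_extract_text(self):
    """Replacement extractor; as a placeholder it only normalises whitespace."""
    return " ".join(self.content.split())


# Keep a handle on the original in case it is still needed, then swap it out.
original_extract_text = PageObject.extractText
PageObject.extractText = better_extract_text

page = PageObject("Hello   \n world")
print(page.extractText())
```

Because the method is replaced on the class rather than on a subclass, page objects created elsewhere (for example by a reader object) pick up the new behaviour as well.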