On Wed Sep 26 23:50:16 CEST 2007, byte8bits wrote:

> On Sep 26, 4:49 pm, Svenn Are Bjerkem <svenn.bjer... at googlemail.com>
> wrote:
>
> > I have downloaded this package and installed it and found that the
> > text-extraction is more or less useless. Looking into the code and
> > comparing with the PDF spec show a very early implementation of text
> > extraction. Luckily it is possible to overwrite the textextraction
> > method in the base class without having to fiddle with the original
> > code. I tried to contact the developer to offer some help on
> > implementing text extraction, but he didn't answer my emails.

That's disappointing to hear, but it's understandable. I must have one
or two outstanding requests to add features to pdftools from a year ago.
I keep meaning to look into making the necessary changes, but it's not
something I'm looking forward to.

> Well, feel free to send any ideas or help to me! It seems simple... Do
> a binary read. Find 'stream' and 'endstream' sections.
> zlib.decompress() all the streams.

Assuming that they're FlateEncoded...

> Find BT and ET markers (Begin Text
> & End Text) and finally locate the parens within those and string the
> text together.

Which works fine if the generator put in space characters. Otherwise, it
seems to me that you need to figure out where any spaces should go.

> This works great on 3 out of 10 PDF documents, but my
> main issue seems to be the zlib compressed streams. Some of them don't
> seem to be FlateDecodeable (although they claim to be) or the header
> is somehow incorrect. But, once I get a good stream and decompress it,
> things are OK from that point on. Seriously, if you have ideas, please
> let me know. I'll be glad to share what I've got so far.

You need to take a good parser and work on a higher level text
extraction library.

> Not many people seem to be interested. I'll stop adding to this
> thread... I don't want to beat a dead horse. Anyone interested in
> helping, can contact me via email.

On the contrary, lots of people are interested in this sort of thing:

http://phaseit.net/claird/comp.text.pdf/PDF_converters.html
http://sourceforge.net/projects/pdfplayground
http://www.adaptive-enterprises.com.au/~d/software/pdffile/
http://pybrary.net/pyPdf/
http://www.boddie.org.uk/david/Projects/Python/pdftools/

I discussed working with the author of pdfplayground, but things never
really got going.
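
For readers following the thread, the naive approach quoted above (binary
read, find the stream sections, inflate them, collect the parenthesised
strings between BT and ET) can be sketched roughly as follows. This is
only an illustration under stated assumptions, not code from pdftools or
from the original poster: the name extract_text and the regular
expressions are hypothetical, streams that do not inflate are simply
skipped, and string literals are assumed to contain no nested
parentheses.

```python
import re
import zlib

# Hypothetical patterns for illustration; a real parser would follow the
# object structure and the /Length and /Filter entries instead.
# The lookbehind stops the "stream" inside "endstream" from matching.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# Text objects are delimited by the BT and ET operators.
TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# Literal strings: "(...)" ending at the first unescaped ")".
# Simplification: nested balanced parentheses are not handled.
STRING_RE = re.compile(rb"\((.*?)(?<!\\)\)", re.DOTALL)


def extract_text(pdf_bytes):
    """Naively pull literal strings out of Flate-compressed streams."""
    pieces = []
    for stream in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj() tolerates the end-of-line bytes that sit
            # between the compressed data and the "endstream" keyword.
            content = zlib.decompressobj().decompress(stream.group(1))
        except zlib.error:
            # Assumption: anything that does not inflate is skipped
            # (other filters, images, damaged streams).
            continue
        for block in TEXT_BLOCK_RE.finditer(content):
            for literal in STRING_RE.finditer(block.group(1)):
                pieces.append(literal.group(1))
    # Strings are joined exactly as found; no spaces are inferred.
    return b"".join(pieces).decode("latin-1")
```

Calling extract_text(open(path, "rb").read()) on a file (path being
whatever PDF is at hand) returns the concatenated strings. Because the
pieces are joined as found, the output has word breaks only where the
generator wrote space characters, which is the limitation raised above,
and escape sequences inside the strings are left undecoded.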

I'd like to be part of a team working on a PDF library for Python, but
my views on software licensing mean that I'd prefer to use a strong
copyleft license rather than the permissive licenses found attached to
most of the above libraries.

David

-- 
http://mail.python.org/mailman/listinfo/python-list