On Sep 26, 11:50 pm, [EMAIL PROTECTED] wrote:
> On Sep 26, 4:49 pm, Svenn Are Bjerkem <[EMAIL PROTECTED]>
> wrote:
>
> > I have downloaded this package and installed it and found that the
> > text-extraction is more or less useless. Looking into the code and
> > comparing with the PDF spec show a very early implementation of text
> > extraction. Luckily it is possible to overwrite the textextraction
> > method in the base class without having to fiddle with the original
> > code. I tried to contact the developer to offer some help on
> > implementing text extraction, but he didn't answer my emails.
> > --
> > Svenn
>
> Well, feel free to send any ideas or help to me! It seems simple... Do
> a binary read. Find 'stream' and 'endstream' sections.
> zlib.decompress() all the streams. Find BT and ET markers (Begin Text
> & End Text) and finally locate the parens within those and string the
> text together. This works great on 3 out of 10 PDF documents, but my
> main issue seems to be the zlib compressed streams. Some of them don't
> seem to be FlateDecodeable (although they claim to be) or the header
> is somehow incorrect. But, once I get a good stream and decompress it,
> things are OK from that point on. Seriously, if you have ideas, please
> let me know. I'll be glad to share what I've got so far.
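For reference, the steps quoted above can be sketched roughly as follows. This is only a minimal illustration of that naive approach, not the code from either of us: the function name extract_text_naive and the regular expressions are made up for this example. It assumes every stream is either Flate-compressed or can be skipped, ignores the /Length and /Filter entries, does not handle nested parentheses or hex strings, and concatenates strings in file order without looking at any positioning operators.

```python
import re
import zlib

# Hypothetical helper patterns for this sketch (not from any PDF library).
# A real parser would use the /Length entry instead of searching for 'endstream'.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# Literal strings in parentheses; backslash escapes allowed, nesting is not.
LITERAL_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
UNESCAPE_RE = re.compile(rb"\\([()\\])")


def extract_text_naive(pdf_bytes):
    """Concatenate the literal strings found between BT and ET markers."""
    chunks = []
    for stream in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj leaves bytes after the end of the zlib data
            # (such as the end-of-line before 'endstream') in unused_data.
            data = zlib.decompressobj().decompress(stream.group(1))
        except zlib.error:
            continue  # not Flate data, or a damaged stream: skip it
        for block in TEXT_BLOCK_RE.finditer(data):
            for literal in LITERAL_RE.finditer(block.group(1)):
                chunks.append(UNESCAPE_RE.sub(rb"\1", literal.group(1)))
    return b"".join(chunks).decode("latin-1")


if __name__ == "__main__":
    # Tiny hand-made fragment, just enough structure for the sketch.
    content = b"BT\n/F1 12 Tf\n(Hello ) Tj\n(World \\(PDF\\)) Tj\nET"
    fragment = (b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
                + zlib.compress(content) + b"\nendstream\nendobj\n")
    print(extract_text_naive(fragment))
```

Because the strings are joined in the order they appear in the file, this says nothing about reading order on the page, which is exactly where multi-column layouts cause trouble.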
So far I have found that extracting text from the IEEE journal papers is not as simple as described above. The IEEE journals are typeset in typical journal style, with two-column body text, a one-column abstract, and a blob of header and author information. Spread figures, formulas, and footnotes around the page and you are basically using every block text layout command there is in PDF.

I wanted to get pdftotext from the xpdf package to see what that tool does with the IEEE PDFs, and then decide whether I should dive into its sources to see how they get things right. So far I have not got that far. The purpose of my work was to extract the abstract of each paper and put it into a database for later searching, but IEEE also has a search engine on their journal DVD => postpone Python work. I got my Gentoo machine back on track, so that may change again...

--
Svenn
--
http://mail.python.org/mailman/listinfo/python-list