Hi there,

I've been using PDFBox for a couple of weeks now (I just switched to 0.8, so some of this might only be true for 0.7.3). What I'm trying to do is analyze papers and extract the document title and authors as well as the list of references, in order to establish relationships between several documents: who references whom, and which is the one paper you have to read.

The problems I stumbled upon:
Columns - quite often those docs use a two-column layout. Often it is recognized and the text is extracted one column after the other, which is good. But there are documents which apparently do not contain what you call beads, even though they use two columns. For those, the text is extracted line by line, ignoring the columns. I found that turning off sortByPosition resolves part of the problem, but only if the order of the text in the document is already correct. I don't know if this is due to invalid documents or an error in the code.

I'm using a custom extension of PDFTextStripper. As a workaround for the sorting problem, I wrote a method that analyzes and sorts the text (a List of TextPosition) while respecting the two-column layout; it is called in flushText() instead of Collections.sort(). I also changed the TextPositionComparator to use a larger value (2) for the tolerance comparison, so that superscripts end up on the correct line.
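To make the workaround a bit more concrete, here is a minimal sketch of the idea, not my actual code and not PDFBox API. Glyph is a hypothetical stand-in for TextPosition (just x, y and the text), and the sketch assumes that the column gap sits at the middle of the page, that y grows downward, and that a tolerance of 2 is enough to keep superscripts on their line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TwoColumnSort {
    /** Hypothetical stand-in for TextPosition: only the fields this sketch needs. */
    public static final class Glyph {
        public final float x, y;
        public final String text;
        public Glyph(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Returns the glyphs in reading order: left column first, then right column. */
    public static List<Glyph> sortTwoColumns(List<Glyph> glyphs, float pageWidth, float tolerance) {
        float mid = pageWidth / 2f; // assumption: the column gap is at the page middle
        List<Glyph> left = new ArrayList<>();
        List<Glyph> right = new ArrayList<>();
        for (Glyph g : glyphs) {
            (g.x < mid ? left : right).add(g);
        }
        List<Glyph> result = new ArrayList<>(sortColumn(left, tolerance));
        result.addAll(sortColumn(right, tolerance));
        return result;
    }

    /** Sorts one column top to bottom, grouping glyphs within the tolerance into one line. */
    private static List<Glyph> sortColumn(List<Glyph> column, float tolerance) {
        List<Glyph> byY = new ArrayList<>(column);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y));
        List<Glyph> result = new ArrayList<>();
        List<Glyph> line = new ArrayList<>();
        for (Glyph g : byY) {
            // Start a new line once the glyph is too far below the first glyph of the current line.
            if (!line.isEmpty() && Math.abs(g.y - line.get(0).y) > tolerance) {
                flushLine(line, result);
            }
            line.add(g);
        }
        flushLine(line, result);
        return result;
    }

    /** Appends the current line to the result in left-to-right order and empties it. */
    private static void flushLine(List<Glyph> line, List<Glyph> result) {
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        result.addAll(line);
        line.clear();
    }
}
```

Grouping into lines first and only then sorting by x avoids putting the tolerance into the comparator itself, where it makes the ordering non-transitive. A real document would also need the column boundary detected from the text rather than assumed.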

Next - Font size
TextPosition comes with several attributes, like height, yScale and font size. So far I couldn't figure out which one to use to determine the font size. Most of the time, getFontSize() returns 1, which is not really useful. I also came across large areas of text with height set to 0. So I went for yScale, but for some documents this returns 1 for the whole text as well. I don't need absolute values; I'm just interested in the biggest font, which is usually used for the title of the paper.
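Since I only need the relative size, one fallback I can think of is to use whichever attribute actually varies within the document. The following is only a sketch of that heuristic: Glyph is again a hypothetical stand-in for TextPosition, the order in which the attributes are tried is arbitrary, and I have not verified that it picks the title on real papers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

class TitleFontHeuristic {
    /** Hypothetical stand-in for TextPosition, holding the three size-like attributes. */
    public static final class Glyph {
        public final String text;
        public final float fontSize, yScale, height;
        public Glyph(String text, float fontSize, float yScale, float height) {
            this.text = text;
            this.fontSize = fontSize;
            this.yScale = yScale;
            this.height = height;
        }
    }

    /**
     * Returns the concatenated text of all glyphs with the largest value of the
     * first attribute that is not constant across the document, or "" if all
     * three attributes are constant (or the list is empty).
     */
    public static String largestText(List<Glyph> glyphs) {
        List<ToDoubleFunction<Glyph>> metrics = new ArrayList<>();
        metrics.add(g -> g.fontSize); // assumption: try the nominal font size first
        metrics.add(g -> g.yScale);
        metrics.add(g -> g.height);

        for (ToDoubleFunction<Glyph> metric : metrics) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (Glyph g : glyphs) {
                double v = metric.applyAsDouble(g);
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
            if (max > min) { // this attribute varies, so it carries some size information
                StringBuilder sb = new StringBuilder();
                for (Glyph g : glyphs) {
                    if (metric.applyAsDouble(g) == max) {
                        sb.append(g.text);
                    }
                }
                return sb.toString();
            }
        }
        return "";
    }
}
```

This does not answer which attribute is the right one; it only sidesteps the cases where an attribute is 1 (or 0) for the whole text.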


Torsten
