+1 Andreas

There can never be an automatic way of reassembling structured text from arbitrary PDFs. In our https://bitbucket.org/petermr/svg2xml-dev/ project we are trying to do this for English-language scientific documents, and we use a number of heuristics based on whitespace, font sizes, weights, English dictionaries, etc.
To show the impossibility, here is a chunk of "text":

PET     AGE
dog     3
cat     5
rat     1

Most people would interpret that as a table, but only because there are implicit signals (uppercase labels, central white space) and linguistic coherence (all PETs seem to be animals, all AGEs seem to be numbers). But imagine if it were in Cyrillic, CJK, etc. It might even be read vertically.

Other common problems include:

* rotated text (e.g. along the sides of the page)
* floating boxes (e.g. a box surrounded by whitespace in the middle of running text)
* S P A C E S   F O R   E F F E C T
* hyphens at line end (do you remove them? not in chemistry!)
* indentation or outdentation
* numbering (e.g. 1.2.3 at para start)

On Sun, May 4, 2014 at 12:15 PM, Andreas Lehmkuehler <[email protected]> wrote:

> Hi,
>
> On 02.05.2014 13:18, Qingchao Kong wrote:
>
>> Paul,
>> I think I am aware of the difference between
>> "stripper.setSortByPosition(true)" and
>> "stripper.setSortByPosition(false)". It is best explained when you try
>> to extract a PDF that has multiple columns, e.g. two columns.
>>
>> When you have "stripper.setSortByPosition(false)", the extraction
>> result usually follows the reading order, which is fine. But when you
>> have "stripper.setSortByPosition(true)", PDFBox will extract text from
>> top to bottom, ignoring the columns, which is not what I expected.
>>
> I'm afraid there is a misunderstanding. PDFBox can't extract text in a
> context-sensitive way, e.g. detecting columns, headers or footers.
>
> Just for clarification:
>
> sortByPosition = false
>
> PDFBox extracts the text in the order in which it appears in the PDF. In
> most cases the text will be sorted by default, but that need not be true
> for every PDF. Updated PDFs in particular are no longer sorted.
>
> sortByPosition = true
>
> PDFBox extracts the text and tries to sort it using the position of each
> character. This works fine for simple texts. It gets more complicated and
> may lead to a false result if one of the following is used:
>
> - different text sizes in the same line
> - different font sizes in the same line
> - super/subscripts
> - multicolumns
> - ....
>
> BR
> Andreas Lehmkühler
>

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
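
P.S. For anyone who wants to see the two-column effect discussed in this thread without a PDF to hand, here is a toy sketch in Python. It is not PDFBox code and does not claim to reproduce PDFBox's actual algorithm: the `Chunk` type, the `STREAM` data and the `extract` function are all made up for illustration, and it assumes that each text chunk carries exact x/y coordinates and that chunks on the same line share exactly the same y value.

```python
from collections import namedtuple

# Hypothetical text chunk: x and y are page coordinates, y grows downwards.
Chunk = namedtuple("Chunk", "x y text")

# A two-column page as it might sit in the content stream:
# the whole left column first, then the whole right column.
STREAM = [
    Chunk(0, 0, "left-1"), Chunk(0, 1, "left-2"),
    Chunk(100, 0, "right-1"), Chunk(100, 1, "right-2"),
]

def extract(chunks, sort_by_position):
    """Join chunks into lines, optionally sorting by position first."""
    if sort_by_position:
        # Naive position sort: top to bottom, then left to right.
        # Nothing here knows about columns.
        chunks = sorted(chunks, key=lambda c: (c.y, c.x))
    lines = []
    last_y = None
    for c in chunks:
        if c.y == last_y:
            lines[-1].append(c.text)   # same baseline: continue the line
        else:
            lines.append([c.text])     # new baseline: start a new line
            last_y = c.y
    return [" ".join(line) for line in lines]

stream_order = extract(STREAM, sort_by_position=False)
position_order = extract(STREAM, sort_by_position=True)
print(stream_order)
print(position_order)
```

In stream order the left column comes out before the right one, which happens to be the reading order for this made-up page. With the position sort, the chunks from both columns that share a baseline are merged into one line, which is the "ignoring the columns" behaviour described above. Neither mode detects the columns; stream order only looks right when the producer wrote the columns in reading order.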

