On Tue, Feb 4, 2014 at 9:03 AM, Johnny Bekkestad < [email protected]> wrote:
> Hi, I have a big problem trying to read a "table" within a pdf.
>
> There is a problem when the content of a cell wraps over multiple rows,
> I am not able to associate the correct text with the correct value.
> This becomes extra hard when there is also a page break.
> Here is an example
>
> ID  Title                         Name
> 1   Text 1                        Name 1
> 2   A very very long text 2       Name 2
> 3   A very very very long text 3  This is also a very long name
> 4   Short text 4                  Another very long name
>
> I am trying to get these as a text and it is quite hard to associate the
> correct values with the columns
>
> Anyone had this problem too?
>

Yes - everyone. The problem is that PDF has no concept of "table". We have
to guess it's a table because it has some "lines" and aligned text. (The
lines are probably "paths" - a more primitive construct.) The characters may
be in any order. We have to deduce that your cell content consists of single
sentences and not two independent items (e.g. from the lack of full stops,
the lowercase start of the second line and, in desperate cases, whether an
NLP parser can make sense of it).

There is no standard way of doing this. TabulaPDF (which uses PDFBox) -
http://tabula.nerdpower.org/ - is among the most advanced open source
projects. I do some of this myself in https://bitbucket.org/petermr/ami2.
We hope to pool our software and experiences so we don't all have to
reinvent algorithms and heuristics. It's mindbogglingly tedious to do this.

>
> /Johnny
>
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
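
To make the kind of guessing described above concrete, here is a minimal sketch in Python. It is not how Tabula or ami2 work. It assumes you have already extracted positioned text runs (page, x, y, text) with some PDF library and that you know roughly where each column starts. The names Fragment, group_lines and merge_wrapped are made up for this illustration. The heuristic is an assumption too: a visual line with nothing in the ID column is treated as a continuation of the previous row, which also covers a row that continues after a page break.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned text run (hypothetical structure, not a library type)."""
    page: int
    x: float  # left edge of the run
    y: float  # vertical position, measured from the top of the page
    text: str


def assign_column(x, column_starts, tol=1.0):
    """Return the index of the rightmost column whose left edge is <= x."""
    idx = 0
    for i, start in enumerate(column_starts):
        if x >= start - tol:
            idx = i
    return idx


def group_lines(fragments, column_starts, y_tol=2.0):
    """Group fragments into visual lines; each line is a list of cell strings."""
    lines = []
    line_start = None  # (page, y) of the first fragment on the current line
    for f in sorted(fragments, key=lambda f: (f.page, f.y, f.x)):
        new_line = (
            line_start is None
            or f.page != line_start[0]
            or abs(f.y - line_start[1]) > y_tol
        )
        if new_line:
            lines.append([""] * len(column_starts))
            line_start = (f.page, f.y)
        col = assign_column(f.x, column_starts)
        lines[-1][col] = (lines[-1][col] + " " + f.text).strip()
    return lines


def merge_wrapped(lines, key_col=0):
    """Assumed heuristic: a line with an empty key column continues the row above."""
    rows = []
    for line in lines:
        if line[key_col] or not rows:
            rows.append(list(line))
        else:
            for i, text in enumerate(line):
                if text:
                    rows[-1][i] = (rows[-1][i] + " " + text).strip()
    return rows


# Row 3 wraps across a page break; row 4 wraps only in the Name column.
fragments = [
    Fragment(1, 50, 100, "3"),
    Fragment(1, 100, 100, "A very very very"),
    Fragment(1, 300, 100, "This is also a"),
    Fragment(2, 100, 20, "long text 3"),
    Fragment(2, 300, 20, "very long name"),
    Fragment(2, 50, 40, "4"),
    Fragment(2, 100, 40, "Short text 4"),
    Fragment(2, 300, 40, "Another very"),
    Fragment(2, 300, 52, "long name"),
]
rows = merge_wrapped(group_lines(fragments, column_starts=[50, 100, 300]))
for row in rows:
    print(row)
```

This only works when the ID sits on the first line of its cell and when repeated page headers have been filtered out beforehand. If the ID is vertically centred in a tall cell, or the first column can itself wrap, you need the other cues mentioned above (ruling lines, punctuation, capitalisation).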

