Re: Identify not visible characters - Overlapped characters

John Logan Wed, 28 Dec 2016 22:22:59 -0800

Hi Paulo,

Is your layout analysis focused on extracting tabular data (records) from a PDF 
file?  Or are you trying to handle more general layouts?

PDFBOX-2998 contains detailed discussion about enhancing the extraction 
algorithms, including adding advanced layout analysis.  The argument against 
this is that it's very hard to simultaneously achieve high quality and general 
applicability.

The current text extractor allows a developer to override the text output 
methods, but the core is fairly monolithic.  It'd be nice to rework the text 
extraction so that the process is more modular, and so that alternate 
processes could include components from externally developed classes and 
libraries.  This way, PDFBox doesn't need to solve the general layout analysis 
problem, but it would be easier to develop extensions that solve specific 
problems well.

For what it's worth, the way I currently approach it is to define a 
PdfTextFeatureExtractor that extends PDFStreamEngine.  In particular, the new 
class overrides the showGlyph() method to write a YAML file that contains 
detailed information for each rendered glyph.
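
To make the per-glyph output concrete, here is a minimal sketch of how one 
glyph could be serialized as a YAML list entry.  The names GlyphRecord and 
GlyphYaml and the field set (page, unicode, font, bbox) are assumptions made 
for illustration, not the actual format described above.  The sketch uses no 
PDFBox classes; in a PDFStreamEngine subclass, the showGlyph() override 
would build such a record from its arguments and append the resulting string 
to the output file.

```java
import java.util.Locale;

/** One rendered glyph. The field set is hypothetical, chosen for illustration. */
final class GlyphRecord {
    final int page;
    final String unicode;
    final String fontName;
    final float x, y, width, height;

    GlyphRecord(int page, String unicode, String fontName,
                float x, float y, float width, float height) {
        this.page = page;
        this.unicode = unicode;
        this.fontName = fontName;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

/** Serializes glyph records as entries of a YAML sequence. */
final class GlyphYaml {
    private GlyphYaml() { }

    /** Double-quotes a string, escaping everything outside printable ASCII. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '"' || cp == '\\') {
                sb.append('\\').append((char) cp);
            } else if (cp >= 0x20 && cp <= 0x7e) {
                sb.append((char) cp);
            } else if (cp <= 0xFFFF) {
                // YAML 16-bit escape: backslash, 'u', four hex digits
                sb.append(String.format("\\u%04X", cp));
            } else {
                // YAML 32-bit escape: backslash, 'U', eight hex digits
                sb.append(String.format("\\U%08X", cp));
            }
        }
        return sb.append('"').toString();
    }

    /** Returns one YAML sequence entry (a mapping) describing the glyph. */
    static String toYaml(GlyphRecord g) {
        return "- page: " + g.page + "\n"
             + "  unicode: " + quote(g.unicode) + "\n"
             + "  font: " + quote(g.fontName) + "\n"
             + String.format(Locale.ROOT, "  bbox: [%.2f, %.2f, %.2f, %.2f]\n",
                             g.x, g.y, g.width, g.height);
    }
}
```

Writing one self-contained entry per glyph keeps the file appendable while the 
content stream is processed, and quoting every string avoids surprises with 
glyphs such as ":" or "#" that have special meaning in YAML.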

From there one can develop whatever one wants for layout extraction and all of 
the other segmentation and classification tasks.  The core layout analysis 
techniques I chose for my work are based on the paper "Two Geometric 
Algorithms for Layout Analysis", by Thomas Breuel.

Best regards,

John

________________________________
From: [email protected] <[email protected]>
Sent: Wednesday, December 28, 2016 1:29:54 PM
To: [email protected]
Subject: RE: Identify not visible characters - Overlapped characters

Hi Manuel,

I'm sorry for my mistake and many thanks for your help and attention.

The best tool that I know of for extracting text from a PDF while maintaining 
the correct layout (I didn't test Monarch) is part of a CAAT package: Caseware 
IDEA. However, this software is very expensive and does a lot of other things.

All the other tools that I tested (and I tested several) get the positioning 
analysis wrong.

It would be good to develop a tool that produces results similar to those 
obtained with IDEA.

The work that you developed can help others achieve that result.

Paulo
-----Original Message-----
From: Manuel Aristarán [mailto:[email protected]] On behalf of Manuel Aristarán
Sent: Wednesday, December 28, 2016 20:37
To: [email protected]
Subject: Re: Identify not visible characters - Overlapped characters

Hi Paulo,

> On Dec 28, 2016, at 9:52 AM, [email protected] wrote:
>
> Unfortunately, Tabula uses a totally different approach (image
> analysis) [...]

Sorry for going (sort of) off-topic, but that's not correct. In fact, Tabula 
does not support images. Thanks to PDFBox, it "mines" text and graphical 
elements, and uses a set of heuristics that attempt to reconstruct a tabular 
structure.

> Tabula also does incoherent analysis when a table is larger than one
> page, for that reason Tabula is far from being a good tool for text
> extraction with correct positioning.

We always welcome bug reports (and patches!) :) [1]

Thanks!

[1] https://github.com/tabulapdf/tabula-java/issues

—
Manuel Aristarán <[email protected]>
http://jazzido.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
