Hi Paulo,
Is your layout analysis focused on extracting tabular data (records) from a PDF file, or are you trying to handle more general layouts?

PDFBOX-2998 contains a detailed discussion about enhancing the extraction algorithms, including adding advanced layout analysis. The argument against this is that it's very hard to achieve high quality and general applicability at the same time.

The current text extractor allows a developer to override the text output methods, but the core is fairly monolithic. It'd be nice to rework the text extraction so that the process is more modular, and so that alternate processes could include components from externally developed classes and libraries. This way, PDFBox doesn't need to solve the general layout analysis problem, but it would be easier to develop extensions that solve specific problems well.

For what it's worth, the way I currently approach it is to define a PdfTextFeatureExtractor that extends PDFStreamEngine. In particular, the new class overrides the showGlyph() method to write a YAML file that contains detailed information for each rendered glyph. From there one can develop whatever one wants for layout extraction and all of the other segmentation and classification tasks. The core layout analysis techniques I chose for my work are based on the paper "Two Geometric Algorithms for Layout Analysis", by Thomas Breuel.

Best regards,
John

________________________________
From: [email protected] <[email protected]>
Sent: Wednesday, December 28, 2016 1:29:54 PM
To: [email protected]
Subject: RE: Identify not visible characters - Overlapped characters

Hi Manuel,

I'm sorry for my mistake and many thanks for your help and attention.

The best tool that I know to extract text from a PDF (I didn't test Monarch), maintaining the correct layout, is inside a CAAT software: Caseware IDEA. However, this software is very expensive and does a lot of other things.
All the other tools that I tested (and I tested several) get the positioning analysis wrong. It would be good to develop a tool that produces results similar to those obtained with IDEA. The work that you developed can help others to achieve that result.

Paulo

-----Original Message-----
From: Manuel Aristarán [mailto:[email protected]] On behalf of Manuel Aristarán
Sent: Wednesday, December 28, 2016 20:37
To: [email protected]
Subject: Re: Identify not visible characters - Overlapped characters

Hi Paulo,

> On Dec 28, 2016, at 9:52 AM, [email protected] wrote:
>
> Unfortunately, Tabula uses a totally different approach (image
> analysis) [...]

Sorry for going (sort of) off-topic, but that's not correct. In fact, Tabula does not support images. Thanks to PDFBox, it "mines" text and graphical elements, and uses a set of heuristics that attempt to reconstruct a tabular structure.

> Tabula also do incoherent analysis when a table is larger than one
> page, for that reason Tabula is far from being a good tool for text
> extraction with correct positioning.

We always welcome bug reports (and patches!) :) [1]

Thanks!

[1] https://github.com/tabulapdf/tabula-java/issues

--
Manuel Aristarán <[email protected]>
http://jazzido.com

