There is no built-in functionality for retrieving tabular data with
PDFBox because there is (usually) no table markup in PDF documents.
Instead, tables are usually represented as absolutely positioned text
and lines around that text forming the borders of the table.

It is possible to find all lines forming a table. Exactly how that
might work depends heavily on the document in question. For instance,
some documents use three overlapping lines instead of a thick line.
The answer to my recent question about finding lines in a document
explains how to use PDF operators for this. While it is certainly
possible with PDFBox, I haven't managed to do it myself yet, so I
cannot give more detailed information.
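
As an illustration of the "three overlapping lines" case, here is a rough
sketch of how already-extracted horizontal segments might be collapsed into
single logical lines. It does not touch PDFBox at all: the LineMerger and
HLine names, and the tolerance parameter, are made up for this example, and
it assumes the segments have already been pulled out of the page somehow.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of PDFBox: collapses near-coincident
// horizontal segments (e.g. three thin lines drawn as one thick rule).
final class LineMerger {

    // A horizontal segment at height y spanning x1..x2 (x1 <= x2).
    static final class HLine {
        final double y, x1, x2;
        HLine(double y, double x1, double x2) {
            this.y = y;
            this.x1 = x1;
            this.x2 = x2;
        }
    }

    // Merges segments whose y lies within `tolerance` of the first segment
    // of a group and whose x ranges overlap or touch (within `tolerance`).
    // The merged line keeps the y of the first (lowest-y) segment in its group.
    static List<HLine> merge(List<HLine> lines, double tolerance) {
        List<HLine> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingDouble((HLine l) -> l.y));
        List<HLine> out = new ArrayList<>();
        for (HLine line : sorted) {
            boolean merged = false;
            for (int i = 0; i < out.size() && !merged; i++) {
                HLine m = out.get(i);
                boolean closeY = Math.abs(m.y - line.y) <= tolerance;
                boolean overlapX = line.x1 <= m.x2 + tolerance
                        && m.x1 <= line.x2 + tolerance;
                if (closeY && overlapX) {
                    // Extend the existing line to cover both x ranges.
                    out.set(i, new HLine(m.y,
                            Math.min(m.x1, line.x1),
                            Math.max(m.x2, line.x2)));
                    merged = true;
                }
            }
            if (!merged) {
                out.add(line);
            }
        }
        return out;
    }
}
```

The same idea applies to vertical segments with x and y swapped. A suitable
tolerance depends on the document and would have to be found by experiment.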

Another, somewhat more complex, option is:
1. Remove all text on a page.
2. Render the page to a graphic format.
3. Find horizontal and vertical lines in the graphic using a line
detection algorithm such as the Hough transform.
4. Find the intersections of the detected lines. They form a tabular grid
whose cells you can read on the original page with PDFTextStripperByArea.
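
Step 4 could look roughly like the sketch below. It assumes the simplest
case, a fully ruled grid in which every detected line spans the whole table,
so the cells are just the rectangles between neighbouring line positions.
GridCells and its cells method are invented names for this example. The
positions would come from the line detection step, and they would still need
converting from image pixels to the page coordinates PDFBox expects before
each rectangle is registered with PDFTextStripperByArea.addRegion.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: turns detected line positions
// into cell rectangles, assuming a regular, fully ruled grid.
final class GridCells {

    // xs: x positions of the vertical lines; ys: y positions of the
    // horizontal lines. Cells are returned row by row in order of
    // increasing y, and left to right within each row.
    static List<Rectangle2D> cells(double[] xs, double[] ys) {
        double[] sx = xs.clone();
        double[] sy = ys.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        List<Rectangle2D> out = new ArrayList<>();
        for (int r = 0; r + 1 < sy.length; r++) {
            for (int c = 0; c + 1 < sx.length; c++) {
                // Each cell is bounded by two adjacent vertical lines
                // and two adjacent horizontal lines.
                out.add(new Rectangle2D.Double(
                        sx[c], sy[r],
                        sx[c + 1] - sx[c], sy[r + 1] - sy[r]));
            }
        }
        return out;
    }
}
```

Tables with merged cells or partial rules would need a real intersection
test per line pair instead of this simple product of positions.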

BR,
Ilija.

On 26. 1. 2012., at 22:39, Ray Weidner wrote:

> Hi,
> 
> I'm currently using PDFBox for an application that detects table structures
> in PDF documents.  So far, I do this by extending PDFTextStripper, and
> using the character position and font data to heuristically detect
> table-like text formatting.  This is working pretty well, but we want to
> improve this, if possible, by analyzing vector graphics to detect
> table-like grid lines.  This will definitely improve accuracy, and make it
> easier to parse more complex table structures.
> 
> So how can I do this, and is it even possible?  I'm not at all an expert of
> PDFBox or the PDF standard, so I don't yet know if this can be done (for
> instance, if table grids are usually formed from background images, this
> is probably not feasible within our time frame).  Please bear with my
> newbishness.
> 
> Thanks in advance!
> 
> Ray Weidner
