There is no built-in functionality in PDFBox for retrieving tabular data, because PDF documents (usually) contain no table markup. Instead, tables are usually represented as absolutely positioned text, with lines around that text forming the borders of the table.
It is possible to find all lines forming a table. Exactly how that might work depends heavily on the document in question. For instance, some documents use three overlapping lines instead of one thick line. See the answer to my recent question about finding lines in a document for how to use PDF operators to do this. While it is certainly possible with PDFBox, I haven't been able to do it yet, so I cannot give more detailed information.

Another (somewhat complex) option is:

1. Remove all text on a page.
2. Render the page to a graphic format.
3. Find horizontal and vertical lines in the graphic using a line detection algorithm such as the Hough transform.
4. Find the intersections of the detected lines. They form a tabular grid whose cells you can read with PDFTextStripperByArea.

BR,
Ilija.

On 26. 1. 2012., at 22:39, Ray Weidner wrote:

> Hi,
>
> I'm currently using PDFBox for an application that detects table structures
> in PDF documents. So far, I do this by extending PDFTextStripper, and
> using the character position and font data to heuristically detect
> table-like text formatting. This is working pretty well, but we want to
> improve this, if possible, by analyzing vector graphics to detect
> table-like grid lines. This will definitely improve accuracy, and make it
> easier to parse more complex table structures.
>
> So how can I do this, and is it even possible? I'm not at all an expert of
> PDFBox or the PDF standard, so I don't yet know if this can be done (for
> instance, if table grids are usually formed from background images, this
> is probably not feasible within our time frame). Please bear with my
> newbishness.
>
> Thanks in advance!
>
> Ray Weidner
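Step 4 of the rendering approach (turning detected lines into a grid) can be sketched in plain Java. This is only an illustration under strong assumptions: the line detection has already happened, every line spans the whole table (no merged cells), and the lines are given simply as the x positions of vertical lines and the y positions of horizontal lines. The class `TableGrid`, its methods, and the `tolerance` parameter are hypothetical names, not part of PDFBox. The merge step is one way to handle documents that draw several overlapping lines in place of one thick line.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class TableGrid {

    /**
     * Sorts the coordinates and collapses each run of values whose
     * neighbours lie within `tolerance` of each other into the run's mean.
     */
    static double[] mergeClose(double[] coords, double tolerance) {
        double[] sorted = coords.clone();
        Arrays.sort(sorted);
        List<Double> merged = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            double sum = sorted[i];
            int count = 1;
            int j = i + 1;
            // Extend the run while the next value is close to the previous one.
            while (j < sorted.length && sorted[j] - sorted[j - 1] <= tolerance) {
                sum += sorted[j];
                count++;
                j++;
            }
            merged.add(sum / count);
            i = j;
        }
        return merged.stream().mapToDouble(Double::doubleValue).toArray();
    }

    /**
     * Builds one rectangle per grid cell, row by row, from the positions of
     * vertical lines (xs) and horizontal lines (ys).
     * Assumes a full grid: every line crosses the whole table.
     */
    static List<Rectangle2D> cells(double[] xs, double[] ys, double tolerance) {
        double[] cols = mergeClose(xs, tolerance);
        double[] rows = mergeClose(ys, tolerance);
        List<Rectangle2D> result = new ArrayList<>();
        for (int r = 0; r + 1 < rows.length; r++) {
            for (int c = 0; c + 1 < cols.length; c++) {
                result.add(new Rectangle2D.Double(
                        cols[c], rows[r],
                        cols[c + 1] - cols[c], rows[r + 1] - rows[r]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three nearly coincident vertical lines stand in for one thick border.
        double[] xs = {50, 50.4, 50.8, 200, 350};
        double[] ys = {100, 130, 160};
        for (Rectangle2D cell : cells(xs, ys, 1.0)) {
            System.out.println(cell);
        }
    }
}
```

Each resulting rectangle could then be registered with `PDFTextStripperByArea.addRegion(name, rect)` and read back with `getTextForRegion(name)` after calling `extractRegions(page)`. If the lines were detected in a rendered image, the pixel coordinates would first have to be scaled back to the page coordinates the stripper expects.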

