Sorry, I must have given the wrong impression. Most PDFs do represent grid lines as vector graphics -- they use path operators to "move a pen" and "stroke a line". JAI seems like a bad choice to me because it is difficult to port and deploy. Take a look at OpenCV for line detection algorithms; most of them are also easy to find on the web.
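
To make the "move a pen" and "stroke a line" idea concrete, here is a minimal sketch in plain Java. It is not PDFBox API: the class `StrokedLineSketch` and its method are hypothetical names, and it assumes the content stream has already been decoded to text, that no `cm` transformation is in effect, and that only the `m` (move to), `l` (line to) and `S` (stroke) operators matter. Rectangles (`re`), curves and clipping are ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of PDFBox: collects stroked straight
// segments from an already-decoded page content stream.
class StrokedLineSketch {

    // Each returned segment is {x1, y1, x2, y2} in untransformed user space.
    static List<double[]> strokedSegments(String contentStream) {
        List<double[]> stroked = new ArrayList<>();
        List<double[]> pending = new ArrayList<>();   // path built since the last paint operator
        Deque<Double> operands = new ArrayDeque<>();  // numbers waiting for their operator
        double curX = 0, curY = 0;

        for (String token : contentStream.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            try {
                // PDF is postfix: operands come first, the operator follows.
                operands.addLast(Double.parseDouble(token));
                continue;
            } catch (NumberFormatException notANumber) {
                // Not a number, so treat the token as an operator below.
            }
            switch (token) {
                case "m": // move the pen without drawing
                    if (operands.size() >= 2) {
                        curY = operands.removeLast();
                        curX = operands.removeLast();
                    }
                    break;
                case "l": // add a straight segment from the current point
                    if (operands.size() >= 2) {
                        double y = operands.removeLast();
                        double x = operands.removeLast();
                        pending.add(new double[] {curX, curY, x, y});
                        curX = x;
                        curY = y;
                    }
                    break;
                case "S": // stroke: the pending segments become visible lines
                    stroked.addAll(pending);
                    pending.clear();
                    break;
                case "n":
                case "f":
                case "f*": // path ended or filled without a stroke: discard it
                    pending.clear();
                    break;
                default:   // every other operator is ignored in this sketch
                    break;
            }
            operands.clear(); // operands never carry over to the next operator
        }
        return stroked;
    }
}
```

For the stream `10 20 m 110 20 l S` this returns a single segment from (10, 20) to (110, 20). Filtering the result for segments where `y1 == y2` or `x1 == x2` would give candidate horizontal and vertical grid lines.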
Ilija.

On Thu, Jan 26, 2012 at 11:58 PM, Ray Weidner wrote:
> Thanks Ilija. It sounds like your suggestion might be the best approach.
> I was under the impression that PDF documents represented grid lines with
> something like vector graphics. I suppose there is no reason to expect
> this to always be the case, and I must make allowances for image
> backgrounds. Now all I need to do is find or write some code for line
> detection in images... I've spent some time looking for this, but so far, no
> dice. Java Advanced Imaging looks promising, but I'm still learning what
> that's all about. Any suggestions are welcome.
>
> Ray
>
>
> On Thu, Jan 26, 2012 at 5:44 PM, Ilija Pavlic wrote:
>
>> There is no built-in functionality to retrieve tabular data with
>> pdfbox because there is (usually) no table mark-up in pdf documents.
>> Instead, tables are usually represented as absolutely positioned text
>> and lines around that text forming the borders of the table.
>>
>> It is possible to find all lines forming a table. Exactly how that
>> might work depends heavily on the document in question. For instance,
>> some documents use three overlapping lines instead of a thick line.
>> See the answer to my recent question about finding lines in a document
>> on how to use pdf operators to find lines in a document. While it is
>> certainly possible with pdfbox, I haven't been able to do it yet.
>> Therefore I cannot give more detailed information.
>>
>> Another (a bit complex) option is:
>> 1. Remove all text on a page.
>> 2. Render the page to a graphic format.
>> 3. Find horizontal and vertical lines in the graphic using a line
>> detection algorithm like the Hough transform.
>> 4. Find intersections of detected lines -- they will form a tabular grid
>> from which you can read with PDFTextStripperByArea.
>>
>> BR,
>> Ilija.
>>
>> On 26. 1. 2012., at 22:39, Ray Weidner wrote:
>>
>> > Hi,
>> >
>> > I'm currently using PDFBox for an application that detects table
>> > structures in PDF documents. So far, I do this by extending
>> > PDFTextStripper, and using the character position and font data to
>> > heuristically detect table-like text formatting. This is working pretty
>> > well, but we want to improve this, if possible, by analyzing vector
>> > graphics to detect table-like grid lines. This will definitely improve
>> > accuracy, and make it easier to parse more complex table structures.
>> >
>> > So how can I do this, and is it even possible? I'm not at all an expert
>> > in PDFBox or the PDF standard, so I don't yet know if this can be done
>> > (for instance, if table grids are usually formed from background images,
>> > this is probably not feasible within our time frame). Please bear with my
>> > newbishness.
>> >
>> > Thanks in advance!
>> >
>> > Ray Weidner
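
For the render-and-detect option quoted above (steps 3 and 4), the following is a rough sketch in plain Java of what the detection and grid-building could look like. It deliberately swaps the Hough transform for a much simpler run-length scan, so it only finds axis-aligned lines on an unskewed rendering. The class `GridCellSketch`, the darkness threshold of 128 and the `minRun` parameter are all assumptions made for illustration, and the cell construction assumes every detected line spans the whole table.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds axis-aligned grid lines in a rendered page
// by scanning for long runs of dark pixels (a stand-in for Hough).
class GridCellSketch {

    // A pixel counts as dark when its mean RGB value is below 128 (arbitrary cut-off).
    static boolean isDark(BufferedImage img, int x, int y) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3 < 128;
    }

    // Returns the rows (horizontal == true) or columns (horizontal == false)
    // that contain a dark run of at least minRun pixels. Adjacent hits are
    // merged, so a thick line is reported once, at its first row or column.
    static List<Integer> linePositions(BufferedImage img, boolean horizontal, int minRun) {
        int outer = horizontal ? img.getHeight() : img.getWidth();
        int inner = horizontal ? img.getWidth() : img.getHeight();
        List<Integer> positions = new ArrayList<>();
        boolean previousHit = false;
        for (int o = 0; o < outer; o++) {
            int run = 0;
            int best = 0;
            for (int i = 0; i < inner; i++) {
                boolean dark = horizontal ? isDark(img, i, o) : isDark(img, o, i);
                run = dark ? run + 1 : 0;
                best = Math.max(best, run);
            }
            boolean hit = best >= minRun;
            if (hit && !previousHit) {
                positions.add(o);
            }
            previousHit = hit;
        }
        return positions;
    }

    // Builds one rectangle per cell between consecutive line positions,
    // row by row. Assumes a complete grid with no merged cells.
    static List<Rectangle> cells(List<Integer> xs, List<Integer> ys) {
        List<Rectangle> out = new ArrayList<>();
        for (int r = 0; r + 1 < ys.size(); r++) {
            for (int c = 0; c + 1 < xs.size(); c++) {
                out.add(new Rectangle(xs.get(c), ys.get(r),
                        xs.get(c + 1) - xs.get(c), ys.get(r + 1) - ys.get(r)));
            }
        }
        return out;
    }
}
```

The resulting rectangles are in image pixels. They would still need to be scaled back to page coordinates before being used as regions for PDFTextStripperByArea, and a real document would likely need tolerance for broken or partial lines.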

