[iText-questions] FW: More on extracting content from PDF

Richard Braman Sun, 19 Feb 2006 14:59:10 -0800


Thanks to iText and JPedal, my conversion project is off to a good start. I have used the code Bruno provided to get going on parsing the fill-in forms and Pub 1346, and my thanks go out to him, as well as to Bill Ensley for giving me some tips on JPedal imaging and to Mark Storer for explaining some of the PDF parsing basics to me.

I am now on to the next challenge: parsing the text in the instructions. The IRS releases all of the instructions in PDF as well as in SGML. It is my understanding that the IRS uses the SGML files to create the final PDFs, but since I don't have a publishing background, the concepts here are a bit foreign to me. The SGML files may have the data in a format better suited for parsing, but even in the SGML package, the tax tables, which are very important, are included as PDFs that are incorporated into the SGML file. The file I am concerned with for my prototype is http://www.irs.gov/pub/irs-pdf/i1040ez.pdf (292KB), which contains the instructions for filling out Form 1040EZ. The SGML precursor file can be found at http://www.irs.gov/pub/irs-sgml/i1040ez.exe and, when extracted, yields an XML file and some PDFs with graphics and tax tables, which I guess get incorporated into the final PDF when they compile the SGML.

The tax tables for income tax and the earned income credit are obviously very important to tax software. I cannot figure out why the IRS does not provide these to software developers as CSV, XML, or something else easier to deal with, but that is a question I will be asking them at a meeting of all Tax Admins in May.

I would appreciate anyone's help with coming up with code using iText, JPedal, or PDFBox for parsing the tax tables, which can be found on pages 24-32 of the instructions. For the instructions themselves, I am going to look at using the SGML files, as they seem to be in a more structured format.
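
To make the request concrete, here is a minimal sketch in Java of the step that would come after text extraction. It assumes the text of the table pages has already been pulled out line by line with one of the libraries above, and that each table row arrives as four whitespace-separated dollar amounts on a single line. Both are assumptions: the real pages print several banks of columns side by side, so actual extracted text may need regrouping first. The class name TaxTableRows, the column names, and the sample values are all hypothetical and are not real IRS figures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns already-extracted tax-table text lines into CSV. */
class TaxTableRows {

    // Assumed row layout: four whole-dollar amounts separated by whitespace
    // ("at least", "but less than", then one tax amount per filing status).
    private static final Pattern ROW = Pattern.compile(
            "^\\s*(\\d[\\d,]*)\\s+(\\d[\\d,]*)\\s+(\\d[\\d,]*)\\s+(\\d[\\d,]*)\\s*$");

    /** Keeps only lines that look like table rows; each row becomes four ints. */
    static List<int[]> parse(List<String> lines) {
        List<int[]> rows = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (!m.matches()) {
                continue; // skip headings, blank lines, and other non-row text
            }
            int[] row = new int[4];
            for (int i = 0; i < 4; i++) {
                // Drop thousands separators before converting to a number.
                row[i] = Integer.parseInt(m.group(i + 1).replace(",", ""));
            }
            rows.add(row);
        }
        return rows;
    }

    /** Writes the rows as CSV with a header line (column names are assumed). */
    static String toCsv(List<int[]> rows) {
        StringBuilder sb = new StringBuilder(
                "at_least,but_less_than,single,married_filing_jointly\n");
        for (int[] r : rows) {
            sb.append(r[0]).append(',').append(r[1]).append(',')
              .append(r[2]).append(',').append(r[3]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration only, not real IRS figures.
        List<String> sample = List.of(
                "At least But less than Single Married filing jointly",
                "1,000 1,050 101 102",
                "",
                "1,050 1,100 106 107");
        System.out.print(toCsv(parse(sample)));
    }
}
```

The row pattern is deliberately strict, so heading and footer lines are dropped rather than misread as data. The extraction step itself is left out because it depends on which library is chosen.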

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)

http://www.taxcodesoftware.org
Free Open Source Tax Software
