Thanks to iText and JPedal, my conversion project is off to a good start. I
have used the code provided by Bruno to get me going on parsing the fill-in
forms and Pub 1346, and my thanks go out to him, as well as to Bill Ensley
for giving me some tips on JPedal imaging and Mark Storer for explaining some
of the PDF parsing basics to me.
I am now on to the next challenge: parsing the text in the instructions. All
of the instructions are released by the IRS in PDF as well as in SGML. It is
my understanding that the IRS uses the SGML files to create the final PDFs,
but since I don't have a publishing background, the concepts here are a bit
foreign. The SGML files may have the data in a format better suited for
parsing, but even in the SGML package, the tax tables, which are very
important, are included as PDFs that are incorporated into the SGML file.
The file I am concerned with for my prototype is
http://www.irs.gov/pub/irs-pdf/i1040ez.pdf (292KB), which contains the
instructions for filling out Form 1040EZ. The SGML precursor file can be
found at http://www.irs.gov/pub/irs-sgml/i1040ez.exe. When extracted, it
produces an XML file and some PDFs with graphics and tax tables, which I
guess get incorporated into the final PDF when they compile the SGML.
The tax tables for income tax and the earned income credit are obviously very
important to tax software. I cannot figure out why the IRS does not provide
these to software developers as CSV, XML, or something easier to deal with,
but that is a question I will be asking them at a meeting of all Tax Admins
in May.
I would appreciate anyone's help with coming up with code using iText,
JPedal, or PDFBox for parsing the tax tables, which can be found on pages
24-32 of the instructions. For the instructions themselves, I am going to
look at using the SGML files, as they seem to be in a more structured
format.
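To make the request more concrete, here is a rough sketch of the step I
imagine would come after text extraction. It does not call iText, JPedal, or
PDFBox at all; it assumes some library has already turned pages 24-32 into
plain text lines. It also assumes each table row comes out as four
whole-dollar amounts on one line (at least, but less than, single, married
filing jointly), which may not hold if the extractor merges the side-by-side
column groups on a page. The class and method names are my own invention,
and the sample numbers are made up for illustration, not taken from the real
table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns already-extracted text lines into table rows.
class TaxTableRowParser {

    /** One bracket of the table: income range plus tax for each filing status. */
    static final class Row {
        final int atLeast;
        final int lessThan;
        final int single;
        final int marriedJoint;

        Row(int atLeast, int lessThan, int single, int marriedJoint) {
            this.atLeast = atLeast;
            this.lessThan = lessThan;
            this.single = single;
            this.marriedJoint = marriedJoint;
        }
    }

    // Assumed row layout: exactly four integers separated by whitespace,
    // each optionally containing thousands commas (e.g. "1,025").
    private static final Pattern ROW = Pattern.compile(
            "^\\s*(\\d[\\d,]*)\\s+(\\d[\\d,]*)\\s+(\\d[\\d,]*)\\s+(\\d[\\d,]*)\\s*$");

    /** Keeps lines that look like table rows; skips headers, captions, blanks. */
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<Row>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (!m.matches()) {
                continue;
            }
            rows.add(new Row(toInt(m.group(1)), toInt(m.group(2)),
                             toInt(m.group(3)), toInt(m.group(4))));
        }
        return rows;
    }

    /** Formats a row as one CSV record without thousands separators. */
    static String toCsv(Row r) {
        return r.atLeast + "," + r.lessThan + "," + r.single + "," + r.marriedJoint;
    }

    private static int toInt(String s) {
        return Integer.parseInt(s.replace(",", ""));
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for extracted PDF text.
        List<String> sample = Arrays.asList(
                "At least But less than Single Married filing jointly",
                "1,000 1,025 101 101",
                "1,025 1,050 104 104");
        for (Row r : parse(sample)) {
            System.out.println(toCsv(r));
        }
    }
}
```

What I am missing is the part that feeds real lines into something like
parse() from the PDF, and any advice on whether the extracted text keeps the
rows intact.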
Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org
Free Open Source Tax Software