Hi,

Disclaimer: I have very limited knowledge of the PDF standard and only command the basics of PDFBox; however, I have had my share of thrills with the IRS.

What's the final purpose of those filled-out PDFs? Do you intend to be MeF (http://www.irs.gov/pub/irs-pdf/p4164.pdf) compliant? Are we talking about such PDFs (http://www.irs.gov/pub/irs-pdf/, which btw are/could be quite a test bed for PDFBox)?

If MeF sounds intriguing to you, "simply" model and validate the input with the IRS' XSD for MeF and model your application around such stable data governance.

Generally the IRS does extensive post-processing on the input documents, so I wouldn't bother too much. But depending on the kind of service you offer, your mileage will vary. Now, we had our share of fun with the IRS when filling out claims from "untrusted sources".

If you provide a certified tax service, you might also need to adhere to processing standards set forth by NIST, as in NIST SP-800-xx (53, for example), outlined for agencies in http://www.irs.gov/pub/irs-pdf/p1075.pdf.

If you're just doing it for some friends, apply the basic sanitizing aspects you figure out and go from there. Improve it over time, depending on the feedback of the IRS' process.

Best regards
Roberto

On Sun, Aug 9, 2015 at 11:10 PM, Stuart Small <[email protected]> wrote:

> I am putting together a system that automatically generates some tax forms
> off of user input. The original PDFs are provided by the IRS, I will just
> be plugging user input into relevant fields.
>
> PDF is a large file format that I don't fully understand. I've been
> surprised before by some of the things it is capable. So that got me
> thinking, is there any sanitation I need to perform to the user input
> before generating the PDF? Or any special cases I should keep in mind when
> filling in forms with arbitrary strings from an untrusted source.
>
> Thanks in advance!
>

