You need to consider the history of PDF. The original design was for "electronic paper": a way to create a "frozen instance" of your document that would look the same on any computer and print exactly as it looked on screen. As such, there was no need to incorporate semantic information about the structure of the document, only the information necessary to render it.
However, as the use of PDF developed, it became clear that there was also a need to incorporate structural/semantic information so that the content could be used in a consistent fashion (vs. having to "guess", and everyone guessing differently). Thus logical structure was added in PDF 1.3 and Tagged PDF in PDF 1.4. Unfortunately, not all PDF producers put such information into the file :(. Like any format, "garbage in, garbage out".

What type of government documents are you talking about? Different departments create different types of documents, and those, of course, vary from country to country. Consider that in the USA you have tax forms from the IRS, transcripts from Congress, technical materials from the DOD, etc.

And what types of "manipulation" are you expecting? Some documents aren't designed for manipulation, such as the plans for a Sherman Tank, while for others, such as forms, it makes sense to enable extraction and processing of the data.

Leonard

-----Original Message-----
From: Mike Marchywka [mailto:marchy...@hotmail.com]
Sent: Tuesday, March 10, 2009 6:26 AM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] modifed sample, question on PDF contents

----------------------------------------
> Date: Tue, 10 Mar 2009 08:34:11 +0100
> From: i...@1t3xt.info
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>
> Mike Marchywka wrote:
>> Is there any information in the
>> PDF that tells me how this stuff is supposed to be organized
>> to extract the INFORMATION or is this just a bunch of hopelessly jumbled
>> text that can only be read by a human, not a computer?
>
> It's just a bunch of glyphs and lines drawn on a canvas;
> there is no structure in the content UNLESS your PDF is tagged.

Ok, thanks, I'll try to find tags, but I was hoping there was some hierarchy to the layout and a traversal pattern or something. Are there particular classes in iText I should grep for?
This would seem like a very limited format in which to present INFORMATION in things like government documents. Surely, there must be some mechanism to extract machine readable information so that other flexible non-proprietary tools can manipulate information easily if the format is being used for public documents. This is probably more of a marketing discussion than a technical one but I would be curious to understand the situation if anyone wants to talk off-list.

Thanks.

> --
> This answer is provided by 1T3XT BVBA
> http://www.1t3xt.com/ - http://www.1t3xt.info
>
> ------------------------------------------------------------------------------
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> Buy the iText book: http://www.1t3xt.com/docs/book.php
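
On the point that there is no structure in the content unless the PDF is tagged: one rough way to see whether a given file even claims to be tagged is to look for the two catalog markers a tagged file normally carries, the /StructTreeRoot entry and "/Marked true" in the /MarkInfo dictionary. The sketch below is only an illustration and does not use iText; the class and method names (TaggedPdfCheck, hasStructTreeRoot, declaresMarked) are made up for this example. It scans the raw bytes, so it is a heuristic: it can miss the markers when the catalog is stored in a compressed object stream (possible from PDF 1.5 on) or when the file is encrypted, and it can be fooled by the same text appearing elsewhere in the file. A real check should parse the catalog with a PDF library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText: a crude byte-level check for the
// markers that a tagged PDF normally carries in its document catalog.
final class TaggedPdfCheck {

    // /StructTreeRoot is the catalog entry that points at the structure tree.
    private static final Pattern STRUCT_TREE_ROOT =
            Pattern.compile("/StructTreeRoot\\b");

    // "/Marked true" inside /MarkInfo is how a file declares itself tagged.
    private static final Pattern MARKED_TRUE =
            Pattern.compile("/Marked\\s+true\\b");

    // ISO-8859-1 maps every byte to exactly one char, so binary stream data
    // in the file cannot cause a decoding failure.
    private static String decode(byte[] pdfBytes) {
        return new String(pdfBytes, StandardCharsets.ISO_8859_1);
    }

    // True if the raw bytes contain a /StructTreeRoot name anywhere.
    static boolean hasStructTreeRoot(byte[] pdfBytes) {
        return STRUCT_TREE_ROOT.matcher(decode(pdfBytes)).find();
    }

    // True if the raw bytes contain "/Marked true" anywhere.
    static boolean declaresMarked(byte[] pdfBytes) {
        return MARKED_TRUE.matcher(decode(pdfBytes)).find();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TaggedPdfCheck.java file.pdf");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("StructTreeRoot found: " + hasStructTreeRoot(bytes));
        System.out.println("Marked true found:    " + declaresMarked(bytes));
    }
}
```

If both checks come back false on an uncompressed, unencrypted file, the content is most likely just glyphs and lines on a canvas, and any extraction has to guess at reading order from positions.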