As a newcomer to the list I'm not sure how apropos this is, but until I hear otherwise I'll assume it is OK. This is probably more political than iText-relevant.
----------------------------------------
> From: [email protected]
> To: [email protected]
> Date: Tue, 10 Mar 2009 04:34:57 -0700
> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>
> You need to consider the history of PDF...
>
> The original design was for "electronic paper" - something where you could
> create a "frozen instance" of your document that would look the same on any
> computer and print as it looked. As such, there was no need to incorporate
> semantic information about the structure of the document - only information
> necessary to render it.

Isn't this what a BMP file is (LOL)? I have to admit that my experience with Reader 7 on Win 2K and other attributes of the format left me searching for any other alternatives. Every time I say or write "PDF" I still think of scanned documents that look like they came in over a FAX machine. I guess a more appropriate comparison, rather than BMP, could be your SVG approach: all you have here is glyphs instead of shapes. For artwork or pictures, this is fine, but not for information that is more accurately textual. When would someone decide to publish a PDF file instead of an SVG "document"?

>
> However, as the use of PDF developed it became clear that there was a need to
> also incorporate structural/semantic information to be able to make use of
> the content in a consistent fashion (vs. having to "guess", and everyone
> guessing differently) and thus the tagging/structure features were added in
> PDF 1.4. Unfortunately, not all PDF producers will put such information into
> the file :(. Like any format, "garbage in, garbage out".
>
> What type of government documents are you talking about? Different
> departments create different types of documents, and those, of course, vary
> country to country. Consider in the USA, you have tax forms from the IRS,
> transcripts from Congress, technical materials from the DOD, etc.

Well, the FDA publishes clinical trial data for approved drugs in formats that include scanned PDF files, which are pretty much useless for any real analysis by outside entities even with decent OCR software. The FCC, last time I looked, even accepts submissions that disallow extraction of images or text. Fortunately I haven't seen a PDF submission in the SEC company filings in a long time, and they have even gone to XBRL XML filings. Computers may be able to automate data processing, not just remove information. A recent summary of my attitude, with limited references, is buried in with some other topics here, if you are interested:
http://www.sec.gov/comments/s7-04-09/s70409-2.pdf
[ note that I did not submit this as a PDF file, LOL ]

>
> And what types of "manipulation" are you expecting? Some documents aren't
> designed for manipulation, such as the plans for a Sherman Tank - while
> others, such as forms make sense to enable extraction and processing of the
> data.

While I'm sure this is just a flippant example (as I often give, LOL), it does illustrate the presumption that people need or want pictures and limited data rather than robust model information, when in fact the opposite is true for this example. You might want to restrict access, but this is actually a perfect example of where you NEED automated interaction with information, and pictures/views/renderings are really not the main issue. An image document like a PDF or a screen shot from a CAD system is not what you want for storing and manipulating plans. "Plans" would require even more versatile machine readability, with human readability being just a small component. Presumably, you would like to archive, manipulate, and reuse pieces and partially assembled units and make these things automatically from the plans. At minimum, something like a CNC mill or automated material ordering system would have to "read" the plans.

The US IRS offers PDF tax forms.
I'd like to be able to maintain my own tax information and extract it from a filled-in 1040, and not just waste time typing into an information black hole in some proprietary or unworkable format. Taxes are mostly numbers, and numbers can be manipulated for many purposes if not buried in a bunch of irrelevant formatting information. I'd probably cry if I found out the IRS bought special scanner equipment and high-speed printers to print electronic submissions only so they could be scanned back in, just because the PDF format doesn't let them separate information from graphics. But I also would not be surprised if that is exactly what they do.

>
> Leonard
>
> -----Original Message-----
> From: Mike Marchywka [mailto:[email protected]]
> Sent: Tuesday, March 10, 2009 6:26 AM
> To: [email protected]
> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>
>
> ----------------------------------------
>> Date: Tue, 10 Mar 2009 08:34:11 +0100
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>>
>> Mike Marchywka wrote:
>>> Is there any information in the
>>> PDF that tells me how this stuff is supposed to be organized
>>> to extract the INFORMATION or is this just a bunch of hopelessly jumbled
>>> text that can only be read by a human, not a computer?
>>
>> It's just a bunch of glyphs and lines drawn on a canvas;
>> there is no structure in the content UNLESS your PDF is tagged.
>
> Ok, thanks I'll try to find tags but I was hoping there
> was some hierarchy to the layout and a traversal pattern
> or something. Are there particular classes in itext I should
> grep for?
>
> This would seem like a very limited format in which to
> present INFORMATION in things like government documents.
> Surely, there must be some mechanism to extract machine
> readable information so that other flexible non-proprietary
> tools can manipulate information easily if the format
> is being used for public documents.
>
> This is probably more of a marketing discussion than a technical
> one but I would be curious to understand the situation if anyone
> wants to talk off-list.
>
> Thanks.
>
>> --
>> This answer is provided by 1T3XT BVBA
>> http://www.1t3xt.com/ - http://www.1t3xt.info
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> iText-questions mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/itext-questions
>>
>> Buy the iText book: http://www.1t3xt.com/docs/book.php
