Re: [iText-questions] modifed sample, question on PDF contents

Mike Marchywka Tue, 10 Mar 2009 05:14:54 -0700


As a newcomer to the list I'm not sure how apropos this
is, but until I hear otherwise I'll assume it is OK.
This is probably more political than iText-relevant.

----------------------------------------
> From: [email protected]
> To: [email protected]
> Date: Tue, 10 Mar 2009 04:34:57 -0700
> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>
> You need to consider the history of PDF...
>
> The original design was for "electronic paper" - something where you could 
> create a "frozen instance" of your document that would look the same on any 
> computer and print as it looked. As such, there was no need to incorporate 
> semantic information about the structure of the document - only information 
> necessary to render it.

Isn't this what a BMP file is (LOL)? I have to admit that
my experience with Reader 7 on Win 2K and other attributes
of the format left me searching for any other alternatives.
Every time I say or write "PDF" I still think of scanned
documents that look like they came in over a fax machine.

I guess a more appropriate comparison, rather than BMP,
could be your SVG approach: all you have here is glyphs
instead of shapes. For artwork or pictures this is fine, but
not for information that is more accurately textual.
When would someone decide to publish a PDF file instead of
an SVG "document?"

>
> However, as the use of PDF developed it became clear that there was a need to 
> also incorporate structural/semantic information to be able to make use of 
> the content in a consistent fashion (vs. having to "guess", and everyone 
> guessing differently) and thus the tagging/structure features were added in 
> PDF 1.4. Unfortunately, not all PDF producers will put such information into 
> the file :(. Like any format, "garbage in, garbage out".
>
> What type of government documents are you talking about? Different 
> departments create different types of documents, and those, of course, vary 
> country to country. Consider in the USA, you have tax forms from the IRS, 
> transcripts from Congress, technical materials from the DOD, etc.

Well, the FDA publishes clinical trial data for approved drugs
in formats that include scanned PDF files, which are pretty much
useless for any real analysis by outside entities even with decent
OCR software. The FCC, last time I looked, even accepts submissions
that disallow extraction of images or text. Fortunately I 
haven't seen a PDF submission in the SEC company filings in a long
time and they have even gone to XBRL XML filings. 

Computers may be able to automate data processing, not just
remove information. A recent summary of my attitude, with
limited references, is here, buried in with some other topics,
if you are interested:

http://www.sec.gov/comments/s7-04-09/s70409-2.pdf

[note that I did not submit this as a PDF file, LOL]

>
> And what types of "manipulation" are you expecting? Some documents aren't 
> designed for manipulation, such as the plans for a Sherman Tank - while 
> others, such as forms make sense to enable extraction and processing of the 
> data.

While I'm sure this is just a flippant example (as I often
give, LOL), it does illustrate the presumption that
people need or want pictures and limited data, not robust model
information, when in fact the opposite is true for this example.
You might want to restrict access, but this is actually a
perfect example of where you NEED automated interaction with
the information; pictures, views, and renderings are really not
the main issue. An image document like a PDF or a
screenshot from a CAD system is not what you want for storing
and manipulating plans.
"Plans" would require even more versatile machine
readability, with human readability being just a small component.
Presumably, you would like to archive, manipulate, and reuse
pieces and partially assembled units, and make these things
automatically from the plans. At minimum, something like
a CNC mill or automated material ordering system would have to
"read" the plans.

The US IRS offers PDF tax forms.
I'd like to be able to maintain my own tax information and
extract it from a filled-in 1040 and not just waste time typing
into an information black hole in some proprietary or unworkable
format. Taxes are mostly numbers, and numbers can be manipulated
for many purposes if not buried in a bunch of irrelevant formatting
information. I'd probably cry if I found out the IRS bought
special scanner equipment and high-speed printers to print electronic
submissions only so they could be scanned back in just because
the PDF format doesn't let them separate information from graphics.
But, I also would not be surprised if that is exactly what they do.

>
> Leonard
>
> -----Original Message-----
> From: Mike Marchywka [mailto:[email protected]]
> Sent: Tuesday, March 10, 2009 6:26 AM
> To: [email protected]
> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>
>
> ----------------------------------------
>> Date: Tue, 10 Mar 2009 08:34:11 +0100
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>>
>> Mike Marchywka wrote:
>>> Is there any information in the
>>> PDF that tells me how this stuff is supposed to be organized
>>> to extract the INFORMATION or is this just a bunch of hopelessly jumbled
>>> text that can only be read by a human, not a computer?
>>
>> It's just a bunch of glyphs and lines drawn on a canvas;
>> there is no structure in the content UNLESS your PDF is tagged.
>
> OK, thanks, I'll try to find tags, but I was hoping there
> was some hierarchy to the layout and a traversal pattern
> or something. Are there particular classes in iText I should
> grep for?
>
> This would seem like a very limited format in which to
> present INFORMATION in things like government documents.
> Surely, there must be some mechanism to extract machine
> readable information so that other flexible non-proprietary
> tools can manipulate information easily if the format
> is being used for public documents.
>
> This is probably more of a marketing discussion than a technical
> one but I would be curious to understand the situation if anyone
> wants to talk off-list.
>
> Thanks.
>
>
>
>> --
>> This answer is provided by 1T3XT BVBA
>> http://www.1t3xt.com/ - http://www.1t3xt.info
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> iText-questions mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/itext-questions
>>
>> Buy the iText book: http://www.1t3xt.com/docs/book.php
>

