You need to consider the history of PDF...

The original design was for "electronic paper" - something where you could 
create a "frozen instance" of your document that would look the same on any 
computer and print as it looked.  As such, there was no need to incorporate 
semantic information about the structure of the document - only information 
necessary to render it.

However, as the use of PDF developed, it became clear that there was a need to 
also incorporate structural/semantic information so that the content could be 
used in a consistent fashion (vs. having to "guess", and everyone guessing 
differently). Thus logical structure was added in PDF 1.3, and Tagged PDF 
followed in PDF 1.4.  Unfortunately, not all PDF producers put such 
information into the file :(.  Like any format, "garbage in, garbage out".
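
As a rough first check on whether a given file carries any of that structure, 
something like the sketch below might help. It is plain Python, not iText, and 
it is only a heuristic: it scans the raw bytes for the /StructTreeRoot key and 
a "/Marked true" entry, so it will miss files whose catalog is stored in a 
compressed object stream, and it says nothing about how good the tags are. The 
function name looks_tagged is made up for this example.

```python
import re

# "/Marked true" normally sits in the catalog's /MarkInfo dictionary.
# Assumption: the catalog is stored uncompressed in the file.
_MARKED_TRUE = re.compile(rb"/Marked\s+true")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes contain both a structure tree
    root key and a '/Marked true' entry. False negatives are possible
    when these live inside compressed object streams."""
    has_struct_root = b"/StructTreeRoot" in pdf_bytes
    is_marked = _MARKED_TRUE.search(pdf_bytes) is not None
    return has_struct_root and is_marked


# Tiny hand-written catalog fragment, just to exercise the function.
sample = (
    b"1 0 obj << /Type /Catalog /MarkInfo << /Marked true >> "
    b"/StructTreeRoot 2 0 R >> endobj"
)
print(looks_tagged(sample))
```

For a real file you would pass in the bytes read from disk, e.g. 
looks_tagged(open("some.pdf", "rb").read()), where "some.pdf" is a placeholder. 
A False result from this check does not prove the file is untagged.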

What type of government documents are you talking about?  Different departments 
create different types of documents, and those, of course, vary country to 
country.  Consider in the USA, you have tax forms from the IRS, transcripts 
from Congress, technical materials from the DOD, etc.  

And what types of "manipulation" are you expecting?  Some documents aren't 
designed for manipulation, such as the plans for a Sherman Tank, while for 
others, such as forms, it makes sense to enable extraction and processing of 
the data.

Leonard

-----Original Message-----
From: Mike Marchywka [mailto:marchy...@hotmail.com] 
Sent: Tuesday, March 10, 2009 6:26 AM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] modifed sample, question on PDF contents


----------------------------------------
> Date: Tue, 10 Mar 2009 08:34:11 +0100
> From: i...@1t3xt.info
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>
> Mike Marchywka wrote:
>> Is there any information in the
>> PDF that tells me how this stuff is supposed to be organized
>> to extract the INFORMATION or is this just a bunch of hopelessly jumbled
>> text that can only be read by a human, not a computer?
>
> It's just a bunch of glyphs and lines drawn on a canvas;
> there is no structure in the content UNLESS your PDF is tagged.

Ok, thanks. I'll try to find tags, but I was hoping there
was some hierarchy to the layout and a traversal pattern
or something. Are there particular classes in iText I should
grep for? 

This would seem like a very limited format in which to 
present INFORMATION in things like government documents.
Surely, there must be some mechanism to extract machine
readable information so that other flexible non-proprietary
tools can manipulate information easily if the format
is being used for public documents. 

This is probably more of a marketing discussion than a technical
one but I would be curious to understand the situation if anyone
wants to talk off-list. 

Thanks. 



> --
> This answer is provided by 1T3XT BVBA
> http://www.1t3xt.com/ - http://www.1t3xt.info
>
> ------------------------------------------------------------------------------
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> Buy the iText book: http://www.1t3xt.com/docs/book.php
