Re: [iText-questions] modifed sample, question on PDF contents

Mike Marchywka Sun, 15 Mar 2009 16:35:59 -0700


----------------------------------------
> From: [email protected]
> To: [email protected]
> Date: Tue, 10 Mar 2009 03:15:25 -0700
> Subject: Re: [iText-questions] modifed sample, question on PDF contents
>
> First, if you are going to be doing text extraction with PDF - you REALLY 
> need to read the relevant sections of the PDF Reference/ISO 32000-1 as it 
> will explain everything you need to know.
>
> Now, as Bruno points out, at its "core" the PDF page is just a series of 
> drawing instructions (e.g. moveto, drawstring, moveto, drawline, etc.) and so 
> any determination of how these elements go together must be done by various 
> heuristic models. It's complex, but many developers have written solutions.
>
> HOWEVER, PDF DOES support a concept called 'structured PDF' where the various 
> drawing operations are grouped into logical concepts such as paragraphs, 
> tables, etc. In such documents, you now have the information you need to make 
> higher level logical extraction possible w/o the need to "guess".


Just to make sure I'm on the right track, and to summarize for anyone
else who may be interested in getting information out of pictures,
it looks like the "article, thread, bead" notion would be the place to
look. In the PDF Reference I have, it is in chapter 8, "Interactive Features",
but I seem to recall that when I first read this I did find a section
on structured documents. The PDF32000 document seems to point to section
14.7 for "logical structure." Are there other keywords that may be worth
examining?

FWIW, invoking methods by reflection, with one handler specialized to each
of the types that iText extracts from the PDF document, seems to work quite
well. I started with a limited set of handlers and then, as I encountered
unhandled types, I could grep the iText source code for public methods, and
AFAIK that gave me some idea of how to dump or traverse the new type. Not
exactly a "design method", but for an empirical/hacking approach it gave me
some idea of what was going on.
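
For anyone curious what that kind of reflective dispatch looks like, here is
a minimal sketch. It is not the code I actually ran, and it does not use any
iText classes: String, Integer and List are just stand-ins for the extracted
object types, and the names ReflectiveDumpSketch, handle and dump are made up
for illustration.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: dispatch by reflection to a handler per runtime type.
// String/Integer/List stand in for whatever types the PDF library returns.
class ReflectiveDumpSketch {

    // One public "handle" overload per type we currently know how to dump.
    public static String handle(String s) {
        return "string(" + s + ")";
    }

    public static String handle(Integer n) {
        return "number(" + n + ")";
    }

    public static String handle(List<?> items) {
        StringBuilder sb = new StringBuilder("list[");
        for (int i = 0; i < items.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(dump(items.get(i))); // recurse through the dispatcher
        }
        return sb.append("]").toString();
    }

    // Find a one-argument "handle" method whose parameter type accepts obj
    // and invoke it; report the class name when no handler exists yet.
    public static String dump(Object obj) {
        if (obj == null) {
            return "null";
        }
        for (Method m : ReflectiveDumpSketch.class.getDeclaredMethods()) {
            Class<?>[] params = m.getParameterTypes();
            if (m.getName().equals("handle")
                    && params.length == 1
                    && params[0].isInstance(obj)) {
                try {
                    return (String) m.invoke(null, obj);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    throw new RuntimeException(e);
                }
            }
        }
        return "unhandled(" + obj.getClass().getName() + ")";
    }

    public static void main(String[] args) {
        System.out.println(dump(Arrays.asList("a", 1)));
        System.out.println(dump(3.5)); // no handler for Double yet
    }
}
```

The "unhandled" branch is the useful part for the hacking approach: each time
it reports a new class name, that is the cue to go look at that type's public
methods and add another handle overload. The handler parameter types in this
sketch do not overlap, so it does not matter which match is found first.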

Now on to the spec and documentation...
Thanks.

BTW, do you have any offhand thoughts or comments on using a
PDF document to contain information on molecules?
It also occurred to me that this format, combined with the right
authoring tools, could make a decent language for GUI code generators.
Someone was asking about this on a BlackBerry forum and,
just thinking off the top of my head, it wouldn't seem
unreasonable, if you already have a graphical document design tool
with form capability, to use it to generate J2ME code (the opposite
of what the other guy was suggesting, LOL).
