[iText-questions] [SPAM] Re: Searching PDF Contents.

mkl Tue, 23 Oct 2012 03:04:03 -0700

Hi Christian,

Christian Eric Paran wrote
> My Problem here is that I do not know what kind of Strategy it needs. I
> can not find a specific sample that would help me understand how it works.
> What are other class and methods that could help me Search the PDF and
> Extract a Paragraph/Sentences?
> Can you give some examples(Maybe Links) of how you did it?


The content of a PDF (generally) does not contain information on which parts
of it form a paragraph or a sentence; instead, all it contains are positioned
letter groups (each with a font and an affine transformation which positions
those letters, rotates, skews, and stretches them). The foremost task of a
text extraction strategy (or, more generally, a RenderListener) is to somehow
make sense of these letter groups and return some text or other information.
iText includes some such strategies:

 * SimpleTextExtractionStrategy: A very simple strategy which assumes the
letter groups already are in the correct order in the PDF and can thus
simply be concatenated in the order they are received; only some spaces or
line feeds are added. Font information and affine deformation are ignored.
 * LocationTextExtractionStrategy: A slightly more complex strategy which
collects the letter groups and eventually combines them by their starting
coordinate when asked for the text. Font information and affine deformation
are ignored.
 * FilteredTextRenderListener: This is actually merely a wrapper for some
other strategy; it filters the incoming letter groups before forwarding them
to the wrapped listener, e.g. to restrict extraction to a given region on
the page.
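
To make the difference between the first two strategies concrete, here is a
small plain-Java model. It is not iText code: the Chunk class, its fields, and
the tolerance values are assumptions made up for illustration, and the real
LocationTextExtractionStrategy does more than this sketch. It only shows why
ordering by position can repair text whose letter groups arrive out of order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ChunkOrderingSketch {
    /** Hypothetical stand-in for one positioned letter group; not an iText class. */
    static final class Chunk {
        final String text;
        final float x;      // start of the chunk on the baseline
        final float y;      // baseline height; PDF y grows upwards
        final float width;  // horizontal extent of the chunk

        Chunk(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Models the "simple" idea: trust the order in which chunks arrive. */
    static String inReceivedOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            // A single space is a crude stand-in for real space/line-feed handling.
            if (sb.length() > 0) sb.append(' ');
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Models the "location" idea: sort top to bottom, then left to right. */
    static String byLocation(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null) {
                if (Math.abs(c.y - prev.y) > 0.5f) {
                    sb.append('\n');                     // different baseline: new line
                } else if (c.x - (prev.x + prev.width) > 1f) {
                    sb.append(' ');                      // visible gap: word break
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }
}
```

Both methods ignore fonts and any rotation or skew, just as the two built-in
strategies are described above as ignoring font information and affine
deformation.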

As you see, there is no strategy in the iText base distribution that does
more with respect to text analysis. Depending on your requirements, you can do
this analysis on the string returned by the text extraction, e.g. by
splitting at periods '.' or at line feeds '\n'.
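
A minimal sketch of such post-processing in plain Java follows. It works on
any string, however it was extracted; the class and method names are made up
for this example. Splitting after '.', '!' or '?' is deliberately naive and
will also cut after abbreviations such as "e.g.".

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractedTextSplitter {
    /** Splits at line feeds and drops empty lines. */
    static List<String> lines(String text) {
        List<String> out = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) out.add(trimmed);
        }
        return out;
    }

    /** Splits after sentence-ending punctuation that is followed by whitespace. */
    static List<String> sentences(String text) {
        // A line feed in extracted text usually marks a visual line break,
        // not a sentence end, so whitespace is flattened first.
        String flat = text.replaceAll("\\s+", " ").trim();
        List<String> out = new ArrayList<>();
        if (flat.isEmpty()) return out;
        for (String sentence : flat.split("(?<=[.!?])\\s+")) {
            out.add(sentence);
        }
        return out;
    }

    /** Returns the sentences that contain the keyword, ignoring case. */
    static List<String> sentencesContaining(String text, String keyword) {
        String needle = keyword.toLowerCase();
        List<String> out = new ArrayList<>();
        for (String sentence : sentences(text)) {
            if (sentence.toLowerCase().contains(needle)) out.add(sentence);
        }
        return out;
    }
}
```

For a search use case, sentencesContaining(pageText, "invoice") would return
every sentence of pageText that mentions the word, where pageText is whatever
string the chosen extraction strategy produced.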

If you need more, copy the LocationTextExtractionStrategy as a start (it
already collects text chunks and does the chunk analysis in the end) and
expand the analysis to return the information you need.
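
One possible direction for such an expanded analysis, sketched in plain Java
rather than as a modified iText class: once chunks have been merged into lines
with a known baseline, lines can be grouped into paragraphs wherever the
vertical gap is larger than the normal line spacing. The Line class and the
maxGap threshold are assumptions for illustration; a real document may need
other cues (indentation, font changes) as well.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphGroupingSketch {
    /** Hypothetical stand-in for one already assembled text line. */
    static final class Line {
        final String text;
        final float baselineY;  // PDF y grows upwards, so lower lines have smaller y

        Line(String text, float baselineY) {
            this.text = text;
            this.baselineY = baselineY;
        }
    }

    /**
     * Groups lines into paragraphs. The lines must be ordered top to bottom.
     * A new paragraph starts when the distance to the previous baseline
     * exceeds maxGap.
     */
    static List<String> paragraphs(List<Line> lines, float maxGap) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        for (Line line : lines) {
            if (prev != null && prev.baselineY - line.baselineY > maxGap) {
                out.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append(' ');
            current.append(line.text);
            prev = line;
        }
        if (current.length() > 0) out.add(current.toString());
        return out;
    }
}
```

With 12 pt line spacing, a maxGap of around 15 would keep consecutive lines
together and break at a blank line; the right value depends on the document.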

Regards,   Michael

PS: You can find some examples at http://itextpdf.com/book/chapter.php?id=15
--- look for Extract* in the column titled "Examples".


