Re: [iText-questions] Query regarding feasibility

Mark Storer Tue, 22 Jun 2010 11:28:32 -0700

In theory, yeah.  In practice?  Ouch.

Some of this is relatively easy.  Font, size, color?  No problem... ish.


You'll need to modify SimpleTextExtractingPdfContentStreamProcessor
(STEPCSP) a bit.

Turn the data from STEPCSP into words.  It'll return the locations of
various bits of text, but they might not be words per se, particularly
if the word changes style (size, color, etc) somewhere in the middle.
You'll also need to keep track of the current font size/color at this
point.
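
To make the "chunks into words" step concrete, here's a rough sketch in
plain Java.  Nothing in it is iText API: Chunk, Word, assemble() and the
gapFactor threshold are all made-up names, and it assumes you've already
got the processor handing you chunks in reading order with position,
font size and color attached, and with no whitespace inside a chunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are part of iText.
class WordAssembler {

    // One piece of text as a content stream processor might report it (assumed shape).
    static final class Chunk {
        final String text;
        final float x, width, baseline, fontSize;
        final int rgb;

        Chunk(String text, float x, float width, float baseline, float fontSize, int rgb) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.baseline = baseline;
            this.fontSize = fontSize;
            this.rgb = rgb;
        }
    }

    // A word is one or more adjacent chunks; the chunks keep their own style.
    static final class Word {
        final List<Chunk> parts = new ArrayList<>();

        String text() {
            StringBuilder sb = new StringBuilder();
            for (Chunk c : parts) sb.append(c.text);
            return sb.toString();
        }

        // True if the size or color changes somewhere inside the word.
        boolean mixedStyle() {
            Chunk first = parts.get(0);
            for (Chunk c : parts) {
                if (c.fontSize != first.fontSize || c.rgb != first.rgb) return true;
            }
            return false;
        }
    }

    // gapFactor is an assumed tuning knob: the largest horizontal gap, as a
    // fraction of the font size, that still counts as "same word".
    static List<Word> assemble(List<Chunk> chunks, float gapFactor) {
        List<Word> words = new ArrayList<>();
        Word current = null;
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (c.text.trim().isEmpty()) { // a whitespace chunk ends the current word
                current = null;
                prev = null;
                continue;
            }
            boolean sameLine = prev != null
                    && Math.abs(c.baseline - prev.baseline) < 0.5f * prev.fontSize;
            boolean adjacent = sameLine
                    && (c.x - (prev.x + prev.width)) <= gapFactor * prev.fontSize;
            if (current == null || !adjacent) {
                current = new Word();
                words.add(current);
            }
            current.parts.add(c);
            prev = c;
        }
        return words;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Hel", 0f, 15f, 700f, 12f, 0x000000));
        chunks.add(new Chunk("lo", 15f, 10f, 700f, 12f, 0xFF0000));
        chunks.add(new Chunk(" ", 25f, 3f, 700f, 12f, 0x000000));
        chunks.add(new Chunk("world", 28f, 25f, 700f, 12f, 0x000000));
        for (Word w : assemble(chunks, 0.2f)) {
            System.out.println(w.text() + " mixed=" + w.mixedStyle());
        }
    }
}
```

Keeping the chunks inside each Word (rather than flattening to a String)
is what lets you answer "what size/color is this search hit?" even when
the style changes mid-word.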

As for "underline/strikethrough", you're in for some Heavy Lifting.  The
current PdfContentStreamProcessor doesn't do line art at all.  You'll
have to add that capability.  Once you're getting information on lines,
you have to look at where the lines are, where the text is, and figure
out whether or not a given hunk of text is underlined,
strike-through'ed, or what.
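
Once you do have line segments, the "where are the lines, where is the
text" comparison might look something like the sketch below.  Again,
this is plain Java with invented names (LineDecoration, classify), not
anything iText gives you, and the thresholds (80% horizontal overlap, a
quarter of the font size below the baseline, 20-50% above it) are
guesses you'd have to tune against real files.  It assumes horizontal
text and PDF user space, where y increases upward.

```java
// Hypothetical sketch: none of these names are part of iText.
class LineDecoration {

    enum Decoration { NONE, UNDERLINE, STRIKETHROUGH }

    // Compares one horizontal line segment against one run of text.
    // All thresholds below are assumptions, not values from any spec.
    static Decoration classify(float textLeft, float textRight, float baseline, float fontSize,
                               float lineLeft, float lineRight, float lineY) {
        float textWidth = textRight - textLeft;
        float overlap = Math.min(textRight, lineRight) - Math.max(textLeft, lineLeft);
        // The line has to run under (or through) most of the text.
        if (textWidth <= 0 || overlap < 0.8f * textWidth) {
            return Decoration.NONE;
        }
        float offset = lineY - baseline; // positive means above the baseline
        if (offset <= 0 && offset >= -0.25f * fontSize) {
            return Decoration.UNDERLINE; // on or just below the baseline
        }
        if (offset >= 0.2f * fontSize && offset <= 0.5f * fontSize) {
            return Decoration.STRIKETHROUGH; // roughly mid-height of lowercase letters
        }
        return Decoration.NONE;
    }

    public static void main(String[] args) {
        // Text from x=100 to x=200, baseline at y=500, 12pt font.
        System.out.println(classify(100f, 200f, 500f, 12f, 100f, 200f, 498.5f));
        System.out.println(classify(100f, 200f, 500f, 12f, 100f, 200f, 504f));
        System.out.println(classify(100f, 200f, 500f, 12f, 100f, 200f, 480f));
    }
}
```

Bear in mind a producer may draw its "line" as a thin filled rectangle
instead of a stroked path, so you'd probably want to reduce both to a
segment before calling something like this.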

And all this completely ignores a number of Nightmare Scenarios.  All
that information could be contained within a raster image (jpg or
whatever).  It might ALL be path information (Java's Graphics2D
interface can produce text that is entirely line art).  The only thing
that will help you at that point is OCR... OCR that knows what an
underline is, how big a character is, and what color.  Ouch.

If you can limit your PDF inputs (say they're all coming from the same
program), then you can safely ignore the OCR stuff.  This'll cover a
good 80-90% of the random PDFs out there in the world anyway.

On the other hand, if you want to recognize anything from anyone, you
have a very thorny problem to work through.  You HAVE to do OCR.

Extracting data from a PDF is /hard/.  The ContentStreamProcessor
classes are a step in the right direction, but it's still a long
journey.

--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;
 

> -----Original Message-----
> From: SpiritDev SoftwareSolutions [mailto:spiritde...@gmail.com]
> Sent: Tuesday, June 22, 2010 4:44 AM
> To: itext-questions@lists.sourceforge.net
> Subject: [iText-questions] Query regarding feasibility
> 
> Hi, I need to add a feature in my existing product which can get the font
> size, color and underline status of the search term in a pdf file for all
> of the occurrences of the search term. Is this possible to extract these
> features for a particular word from pdf file by using itextsharp. If this
> kind of information can be obtained by using this library then Is there
> any link where I can get help in this direction.
> 
> Thanks and Regards,
> Jivan Goyal,
> SpiritDev Software Solutions.
> 

