In theory, yeah. In practice? Ouch. Some of this is relatively easy. Font, size, color? No problem... ish.
You'll need to modify SimpleTextExtractingPdfContentStreamProcessor a bit. Turn the data from STEPCSP into words. It'll return the locations of various bits of text, but they might not be words per se, particularly if the word changes style (size, color, etc.) somewhere in the middle. You'll also need to keep track of the current font size/color at this point.

As for "underline/strikethrough", you're in for some Heavy Lifting. The current PdfContentStreamProcessor doesn't do line art at all. You'll have to add that capability. Once you're getting information on lines, you have to look at where the lines are, where the text is, and figure out whether or not a given hunk of text is underlined, struck through, or what.

And all this completely ignores a number of Nightmare Scenarios. All that information could be contained within a raster image (jpg or whatever). It might ALL be path information (Java's Graphics2D interface can produce text that is entirely line art). The only thing that will help you at that point is OCR... OCR that knows what an underline is, how big a character is, and what color. Ouch.

If you can limit your PDF inputs (say they're all coming from the same program), then you can safely ignore the OCR stuff. Skipping OCR will still cover a good 80-90% of the random PDFs out there in the world anyway.

On the other hand, if you want to recognize anything from anyone, you have a very thorny problem to work through. You HAVE to do OCR.

Extracting data from a PDF is /hard/. The ContentStreamProcessor classes are a step in the right direction, but it's still a long journey.
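
For what it's worth, the "where are the lines, where is the text" comparison might look something like the rough Java sketch below. Everything in it is hypothetical: TextRun, Segment, and classify are made-up names, not iText classes, and the thresholds (the line has to cover 80% of the run, an underline sits within 0.3 font sizes below the baseline, a strikethrough between 0.15 and 0.5 font sizes above it) are guesses you'd have to tune against real files. It also assumes you've already pulled text positions and line segments out of the content stream, which is the hard part.

```java
import java.util.List;

// Hypothetical sketch: none of these types come from iText.
class DecorationSketch {

    enum Decoration { NONE, UNDERLINE, STRIKETHROUGH }

    /** A run of text: horizontal extent, baseline height, and font size (PDF user-space units). */
    static final class TextRun {
        final double left, right, baseline, fontSize;
        TextRun(double left, double right, double baseline, double fontSize) {
            this.left = left;
            this.right = right;
            this.baseline = baseline;
            this.fontSize = fontSize;
        }
    }

    /** A straight line segment taken from the page's path data. */
    static final class Segment {
        final double x1, y1, x2, y2;
        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /** Guess whether a single segment decorates the given run. */
    static Decoration classify(TextRun run, Segment seg) {
        // Ignore anything that is not close to horizontal (assumed tolerance: 0.5 units).
        if (Math.abs(seg.y1 - seg.y2) > 0.5) {
            return Decoration.NONE;
        }
        // Require the line to span most of the run's width (assumed: 80%).
        double overlap = Math.min(run.right, Math.max(seg.x1, seg.x2))
                       - Math.max(run.left, Math.min(seg.x1, seg.x2));
        if (overlap < 0.8 * (run.right - run.left)) {
            return Decoration.NONE;
        }
        // Height of the line relative to the baseline, as a fraction of the font size.
        // PDF user space has y increasing upward, so negative means below the baseline.
        double offset = ((seg.y1 + seg.y2) / 2 - run.baseline) / run.fontSize;
        if (offset >= -0.3 && offset <= 0.0) {
            return Decoration.UNDERLINE;
        }
        if (offset >= 0.15 && offset <= 0.5) {
            return Decoration.STRIKETHROUGH;
        }
        return Decoration.NONE;
    }

    /** Return the first decoration found among all segments on the page. */
    static Decoration classify(TextRun run, List<Segment> segments) {
        for (Segment seg : segments) {
            Decoration d = classify(run, seg);
            if (d != Decoration.NONE) {
                return d;
            }
        }
        return Decoration.NONE;
    }
}
```

You'd call classify(run, segments) once per text run after the processor has finished a page. Bear in mind that some PDF producers draw underlines as thin filled rectangles rather than stroked lines, so you may need to turn those into segments first.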

--Mark Storer
  Senior Software Engineer
  Cardiff.com

import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;

> -----Original Message-----
> From: SpiritDev SoftwareSolutions [mailto:spiritde...@gmail.com]
> Sent: Tuesday, June 22, 2010 4:44 AM
> To: itext-questions@lists.sourceforge.net
> Subject: [iText-questions] Query regarding feasibility
>
> Hi, I need to add a feature in my existing product which can get the font
> size, color and underline status of the search term in a pdf file for all
> of the occurrences of the search term. Is this possible to extract these
> features for a particular word from pdf file by using itextsharp. If this
> kind of information can be obtained by using this library then Is there
> any link where I can get help in this direction.
>
> Thanks and Regards,
> Jivan Goyal,
> SpiritDev Software Solutions.