On 20/07/2011 20:58, Dániel Kékesi wrote:
Dear All,

I am using iTextSharp in my application and find its text extraction capabilities excellent. I am facing a problem, though: the PdfTextExtractor.GetTextFromPage method returns pieces of text that are far apart on the page separated by only a single space. Take the following example (as displayed in Acrobat):

User name: abcdef                               Password: Cool1234

In the PDF there are no spaces between "abcdef" and "Password".

No, there aren't. But there aren't any tabs either.
Both PDF strings are added at coordinates with (about) the same Y coordinate,
but with X coordinates that put them far apart.

If I extract the above text using PdfTextExtractor.GetTextFromPage I'll get the following result:

User name: abcdef Password: Cool1234

That's correct.

So the distance between the two words was cut down to a single space. What I need is for words that are separated not by a space but by a larger distance to be separated by a TAB in the resulting text.

That's not trivial. You'd need to examine the X coordinates of the text chunks on each line.

I am guessing that I should abandon PdfTextExtractor.GetTextFromPage and use the LocationTextExtractionStrategy class combined with TextRenderInfo.

Yes, TextRenderInfo will give you the info about the coordinates, but you'll have to do plenty of programming. Either you'll have to do that programming yourself, or you'll have to hire somebody to do it for you.
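
To give a rough idea of the kind of programming involved, here is a minimal sketch of the gap-to-TAB logic. It is written in Python rather than C# and does not call iTextSharp at all. The Chunk type, the join_with_tabs function and all threshold values are hypothetical names and numbers chosen for illustration. It assumes you have already collected, for every text chunk, its text and the start and end coordinates of its baseline (the sort of information TextRenderInfo exposes) in your own extraction strategy.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for the data collected from one text chunk."""
    text: str
    x_start: float  # X of the left end of the baseline
    x_end: float    # X of the right end of the baseline
    y: float        # Y of the baseline


def join_with_tabs(chunks, space_width=3.0, tab_factor=3.0, y_tolerance=2.0):
    """Rebuild text lines, emitting a TAB where the horizontal gap is large.

    All thresholds are assumptions and would need tuning per document
    (for example, derived from the font's actual space width).
    """
    # PDF Y coordinates grow upward, so sort descending to read top to bottom.
    ordered = sorted(chunks, key=lambda c: -c.y)

    # Group chunks whose baselines are (about) at the same height.
    lines, current = [], []
    for chunk in ordered:
        if current and abs(chunk.y - current[-1].y) > y_tolerance:
            lines.append(current)
            current = []
        current.append(chunk)
    if current:
        lines.append(current)

    result = []
    for line in lines:
        line.sort(key=lambda c: c.x_start)  # left to right within the line
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x_start - prev.x_end
            if gap > tab_factor * space_width:
                parts.append("\t")   # far apart: treat as a column break
            elif gap > space_width / 2:
                parts.append(" ")    # ordinary word gap
            parts.append(cur.text)
        result.append("".join(parts))
    return "\n".join(result)
```

With chunks for "User name:", "abcdef" and "Password: Cool1234" placed on one baseline, the first two a few units apart and the third much further to the right, this returns the first two joined by a space and the third preceded by a TAB. Real documents are messier (rotated text, overlapping chunks, varying font sizes), so treat this as a starting point only.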