On 16/03/2011 12:13, DivyaKambhatla wrote:
> We simply extract the text from a PDF and use it to load a tag in an xml
> file.
>
> Let us say, a PDF has text,
>
> "....of ferritic alloys, largely due to the complexities associated with the
> solid state phase transformations that occur in multipass welding"
>
> When i extract the text from the PDF using iText5.0.5, the output i get
> is..
>
> "in the case of ferritic alloys,
> largelyduetothecomplexitiesassociatedwiththesolidstatephasetransformationsthatoccurin
> multipass welding".

Let me take two lines from the PDF syntax:

    0 -1.3605 TD
    [(austenitic)-366.9(stainless)-374.2(steels.)-370(The)-366.7(progress)-368.2(has)-372.2(been)-368.3(less)-371.7(convincing)-368.1(in)-373.3(the)-373.9(case)-368.5(of)-372.4(ferritic)-372.2(alloys,)]TJ
    0 -1.3657 TD
    [(largely)-261.8(due)-263.3(to)-263.7(the)-260.1(complexities)-260.9(associated)-258.8(with)-264.1(the)-265.3(solid)-262.2(state)-260.2(phase)-261.5(transformations)-256.7(that)-263.6(occur)-262.9(in)]TJ

As you can see, each line of text is expressed as an array in your document. There are NO space characters. Spacing is handled by the numbers between the strings, expressed in glyph space.

iText examines these arrays, sees that there are no space characters whatsoever, and then makes a call based on the values for the spacing. When two parts of a string are too close to each other, iText assumes both are part of the same word. In the first line the adjustments are around -370 and you get separate words; in the second line they are only around -260, and the parts end up glued together.

The PDF seems to be produced by iText 2.0.7, but I wonder who programmed the application that used iText, or which tool was used.
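
To make the effect concrete, here is a minimal Python sketch of that kind of decision. It is not iText's code: the function name `tj_to_text` and the threshold of 300 are assumptions picked purely for illustration (a real extractor would more plausibly derive the threshold from the width of the font's space character), and the parser ignores escape sequences and nested parentheses. The only PDF fact it relies on is that a negative number in a TJ array moves the next glyph to the right, by that many thousandths of a text space unit.

```python
import re

# Matches either a literal string "(...)" or a number inside a TJ array.
# Simplification: escape sequences and nested parentheses are not handled.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?)")


def tj_to_text(tj_array, gap_threshold=300.0):
    """Join the strings of a TJ array, inserting a space wherever the
    adjustment between two strings opens a gap wider than gap_threshold.

    gap_threshold is a hypothetical value in thousandths of a text space
    unit; it is NOT the value iText uses.
    """
    parts = []
    pending_gap = 0.0
    for match in TOKEN.finditer(tj_array):
        text, number = match.group(1), match.group(2)
        if number is not None:
            # The adjustment is subtracted from the current position, so a
            # negative number widens the gap before the next string.
            pending_gap -= float(number)
        else:
            if parts and pending_gap > gap_threshold:
                parts.append(" ")
            parts.append(text)
            pending_gap = 0.0
    return "".join(parts)


wide = "[(in)-373.3(the)-373.9(case)-368.5(of)-372.4(ferritic)-372.2(alloys,)]"
narrow = "[(largely)-261.8(due)-263.3(to)-263.7(the)-260.1(complexities)]"

print(tj_to_text(wide))
print(tj_to_text(narrow))
print(tj_to_text(narrow, gap_threshold=200.0))
```

With the assumed threshold of 300, the fragment from the first line comes out as separate words and the fragment from the second line comes out glued together, which matches the reported behaviour; lowering the threshold to 200 separates the words of the second fragment as well.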
