On 16/03/2011 12:13, DivyaKambhatla wrote:
>   We simply extract the text from a PDF and use it to load a tag in an xml
> file.
>
>    Let us say, a PDF has text,
>
> "....of ferritic alloys, largely due to the complexities associated with the
> solid state phase transformations that occur in multipass welding"
>
>    When i extract the text from the PDF using iText5.0.5, the output i get
> is..
>
> "in the case of ferritic alloys,
> largelyduetothecomplexitiesassociatedwiththesolidstatephasetransformationsthatoccurin
> multipass welding".

Let me take two lines from the PDF syntax:

0 -1.3605 TD
[(austenitic)-366.9(stainless)-374.2(steels.)-370(The)-366.7(progress)-368.2(has)-372.2(been)-368.3(less)-371.7(convincing)-368.1(in)-373.3(the)-373.9(case)-368.5(of)-372.4(ferritic)-372.2(alloys,)]TJ
0 -1.3657 TD
[(largely)-261.8(due)-263.3(to)-263.7(the)-260.1(complexities)-260.9(associated)-258.8(with)-264.1(the)-265.3(solid)-262.2(state)-260.2(phase)-261.5(transformations)-256.7(that)-263.6(occur)-262.9(in)]TJ

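As you can see, each line of text is expressed as an array in your document.
The strings contain NO space characters. The gaps between the words are
produced by the numbers in the array, which are expressed in thousandths of
a unit of text space.
iText examines these arrays, sees that there are no space characters
whatsoever, and then makes a call based on the values for the spacing.
When two parts of a string are too close to each other, iText assumes
both are part of the same word. The gaps in the second line (about 260) are
narrower than those in the first line (about 370), which would explain why
only the second line comes out with its words glued together.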
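
To make the idea concrete, here is a minimal sketch of a gap-based heuristic
of this kind. It is NOT iText's actual code: the function name
extract_tj_text and the space_threshold parameter are hypothetical, the
threshold values are chosen only for illustration, and the parser ignores
escaped or nested parentheses.

```python
import re

# Matches either a literal string "(...)" or a number inside a TJ array.
# Simplification: escaped or nested parentheses are not handled.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?)")


def extract_tj_text(tj_array, space_threshold):
    """Join the strings of a TJ array, adding a space where the gap is wide.

    space_threshold is a hypothetical cut-off, in thousandths of a unit of
    text space. It is an assumption of this sketch, not a value used by iText.
    """
    parts = []
    pending_gap = 0.0
    for match in TOKEN.finditer(tj_array):
        text, number = match.group(1), match.group(2)
        if number is not None:
            # In horizontal writing a negative number moves the next
            # string further to the right, i.e. it widens the gap.
            pending_gap += -float(number)
        else:
            if parts and pending_gap >= space_threshold:
                parts.append(" ")
            parts.append(text)
            pending_gap = 0.0
    return "".join(parts)


# Shortened versions of the two arrays quoted above.
line1 = "[(ferritic)-372.2(alloys,)]TJ"
line2 = "[(largely)-261.8(due)-263.3(to)-263.7(the)]TJ"

print(extract_tj_text(line1, 300))  # gap of 372.2 is above the cut-off
print(extract_tj_text(line2, 300))  # gaps of about 262 are below it
print(extract_tj_text(line2, 200))  # a lower cut-off separates the words
```

With a cut-off of 300 the first array keeps its word boundary while the
second collapses into a single word; lowering the cut-off to 200 separates
the words in the second array as well. A real extractor would more likely
derive the cut-off from the font (for example from the width of the space
glyph) than use a fixed number.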

The PDF seems to be produced by iText 2.0.7, but I wonder who programmed
the application that used iText, or which tool was used.
