I understand, but is there anything I can do in my code to get the string as shown in ExtractText?
I am subclassing PDFTextStripper, similar to what is done in PrintTextLocations, and the string coming into writeString(String string, List<TextPosition> textPositions) is the one where all the spaces occur.

Thanks

On Tue, Mar 29, 2016 at 10:03 AM, Tilman Hausherr <[email protected]> wrote:

> Here's what I got with ExtractText command line application:
>
> ______
> ______ 03-09 3,411.69
> ELECTRONIC DEPOSIT FDMS-SETTLEMENT DEPOSIT 376249462999
> 03-10 1,645.22 ELECTRONIC DEPOSIT FDMS-SETTLEMENT DEPOSIT
> 376249462999
>
> However I think I understand the cause of your problem, because there's
> output like this:
>
> String[461.20358,340.904 fs=1.0 xscale=1.0 height=4.44 space=4.7999997
> width=4.799988]6
> String[461.20428,340.904 fs=1.0 xscale=1.0 height=6.48 space=7.2
> width=7.200012]
>
> i.e. space and a character at the same place. See this content stream:
>
> BT
> 0 0 0 rg
> /F0 1 Tf
> 1 0 0 1 29.204 460.096 Tm
> ( ______ ) Tj
> 1 0 0 1 29.204 451.096 Tm
> ( ______ ) Tj
> /F1 1 Tf
> 1 0 0 1 29.204 451.096 Tm
> ( 03-09 3,411.69 ELECTRONIC DEPOSIT FDMS-SETTLEMENT
> DEPOSIT 376249462999 ) Tj
> 1 0 0 1 29.204 442.096 Tm
> ( 03-10 1,645.22 ELECTRONIC DEPOSIT FDMS-SETTLEMENT
> DEPOSIT 376249462999 ) Tj
> ET
>
> There are two lines that start at the same position 29.204 451.096, one
> with blanks, one with a text. That is a bug by the creator of the file.
>
> Tilman
>
> On 29.03.2016 at 18:48, Joel Hirsh wrote:
>
>> I thought it was attached to the first email, but it is also available at
>>
>> https://www.dropbox.com/s/btqwaxfsubt3rwx/extra%20spaces.pdf?dl=0
>>
>> On Tue, Mar 29, 2016 at 9:13 AM, Tilman Hausherr <[email protected]>
>> wrote:
>>
>>> Please upload that file somewhere.
>>>
>>> Tilman
>>>
>>> On 29.03.2016 at 17:24, Joel Hirsh wrote:
>>>
>>>> I have a couple of PDF files that have this problem. These are
>>>> multi-page PDF files, and on one page (the first) there are a few lines
>>>> that get extra spaces between almost every character as seen from
>>>> PrintTextLocations.
>>>>
>>>> Attached is a snippet from one of those files, the first line has the
>>>> problem, the second line does not.
>>>>
>>>> In this file, the first line gets a string that is
>>>> 0 3- 09 3 ,4 1 1. 6 9 EL E CT R ON I C D EP O SI T
>>>> F DM S -S E TT L EM E NT D E PO S IT 37 6 24 9 46 2 99
>>>> 9
>>>>
>>>> While the second line gets the text without any extra spaces.
>>>>
>>>> The two lines also have different spacing values as reported by
>>>> PrintTextLocations. In the full file, all the good lines have one
>>>> value, the bad lines a different value.
>>>>
>>>> I cannot see any difference between the lines in Acrobat, doing
>>>> copy/paste, Nitro editing.
>>>>
>>>> This problem shows up in 2.0.0 and the latest 2.0.1 snapshot, and some
>>>> older versions I tried as well (i.e. I don't think it is any kind of
>>>> regression)
>>>>
>>>> Thanks
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
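
For what it is worth, here is a minimal sketch of one possible workaround based on the diagnosis quoted above (a blank and a character drawn at the same place). It is not PDFBox code and not what ExtractText does internally: `Glyph`, `OverlappingSpaceFilter`, `dropOverlappingSpaces` and the 50% overlap threshold are all hypothetical names and choices made up for illustration. The idea is simply to discard any blank glyph whose horizontal extent mostly overlaps a visible glyph on the same baseline.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlappingSpaceFilter {

    /** Hypothetical stand-in for the text, position and width of one glyph. */
    public static final class Glyph {
        public final String text;
        public final float x;
        public final float y;
        public final float width;

        public Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }

        boolean isBlank() {
            return text.trim().isEmpty();
        }
    }

    /** Keeps every visible glyph and every blank that does not sit on top of a visible glyph. */
    public static List<Glyph> dropOverlappingSpaces(List<Glyph> glyphs, float yTolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (!g.isBlank() || !overlapsVisibleGlyph(g, glyphs, yTolerance)) {
                kept.add(g);
            }
        }
        return kept;
    }

    private static boolean overlapsVisibleGlyph(Glyph space, List<Glyph> glyphs, float yTolerance) {
        for (Glyph other : glyphs) {
            // Only compare against visible glyphs on (roughly) the same baseline.
            if (other.isBlank() || Math.abs(other.y - space.y) > yTolerance) {
                continue;
            }
            float overlap = Math.min(space.x + space.width, other.x + other.width)
                    - Math.max(space.x, other.x);
            // Assumed threshold: more than half of the narrower glyph is covered.
            if (overlap > 0.5f * Math.min(space.width, other.width)) {
                return true;
            }
        }
        return false;
    }

    /** Concatenates the glyph texts in list order. */
    public static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        // Positions and widths taken from the two String[...] entries quoted above.
        line.add(new Glyph("6", 461.20358f, 340.904f, 4.799988f));
        line.add(new Glyph(" ", 461.20428f, 340.904f, 7.200012f));
        System.out.println("[" + join(dropOverlappingSpaces(line, 0.5f)) + "]");
    }
}
```

In a PDFTextStripper subclass the `Glyph` fields would presumably come from TextPosition (getUnicode(), getXDirAdj(), getYDirAdj(), getWidthDirAdj()), and the filtering would have to happen before the stripper assembles lines, for example in an override of processTextPosition. That wiring is an assumption and is untested against the file in question.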

