[ https://issues.apache.org/jira/browse/PDFBOX-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973407#comment-14973407 ]
Tilman Hausherr edited comment on PDFBOX-3061 at 10/25/15 8:24 PM:
-------------------------------------------------------------------

I'm not questioning the logic that text extraction uses to calculate the space width (which is very obscure indeed). In this case, it comes down to an incomplete font descriptor: the widths start at code 36, and there is no "/MissingWidth" entry. My current impression is that the two versions use different strategies to get a width. For 2.0, the software calculates the average of all /Width entries != 0, which is 558.7313. For 1.8, the software calculates the width for code 0 (because 32 isn't in the TT font) from an advanceWidths table, which comes from the hmtx table of the TT font, and that is 326.66.

was (Author: tilman):
    I'm not questioning the logic that text extraction uses to calculate the space width (which is very obscure indeed). In this case, it comes down to an incomplete font descriptor: the widths start at code 36, and there is no "MissingWidth" entry. My current impression is that the two versions use different strategies to get a width. For 2.0, the software calculates the average of all Width entries != 0, which is 558.7313. For 1.8, the software calculates the width for code 0 (because 32 isn't in the TT font) from an advanceWidths table, which comes from the hmtx table of the TT font, and that is 326.66.

> Word concatenation in 2.0 not in 1.8
> ------------------------------------
>
> Key: PDFBOX-3061
> URL: https://issues.apache.org/jira/browse/PDFBOX-3061
> Project: PDFBox
> Issue Type: Sub-task
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Attachments: PDFBOX-3061-092465-reduced.pdf
>
>
> Attached file is reduced from govdocs file 092465.pdf.
> Text extraction with 1.8:
> {code}
> day. Some market watchers were
> {code}
> Text extraction with 2.0:
> {code}
> day. Somemarketwatcherswere
> {code}
> Text extraction with Adobe Reader:
> {code}
> day. Somemarket watchers were
> {code}
> PrintTextLocations 1.8:
> {code}
> String[36.0,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=6.4154396]d
> String[42.41544,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.2499504]a
> String[47.66539,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.837944]y
> String[53.503334,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=2.6249733].
> String[60.01537,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.5124474]S
> String[65.52782,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.7329483]o
> String[71.260765,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=9.271416]m
> String[80.53218,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.0294495]e
> String[87.505165,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=9.271416]m
> String[96.77868,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.2499466]a
> String[102.028625,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=4.147461]r
> String[106.17609,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.837944]k
> String[112.01403,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.0294495]e
> String[117.04348,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=3.422966]t
> String[122.40893,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=8.75692]w
> String[131.16585,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.249954]a
> String[136.4158,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=3.4229736]t
> String[139.83878,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=4.661957]c
> String[144.50073,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=6.1109467]h
> String[150.61168,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.0294495]e
> String[155.64113,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=4.147461]r
> String[159.78859,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=4.45195]s
> String[166.18617,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=8.756912]w
> String[174.94308,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.0294495]e
> String[179.97253,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=4.147461]r
> String[184.12,169.67963 fs=1.0 xscale=10.4999 height=6.3524394 space=3.4298992 width=5.0294495]e
> {code}
> PrintTextLocations 2.0:
> {code}
> String[36.0,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=6.4154396]d
> String[42.41544,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.2499504]a
> String[47.66539,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.837944]y
> String[53.503334,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=2.6249733].
> String[60.01537,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.5124474]S
> String[65.52782,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.7329483]o
> String[71.260765,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=9.271416]m
> String[80.53218,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.0294495]e
> String[87.505165,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=9.271416]m
> String[96.77868,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.2499466]a
> String[102.028625,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=4.147461]r
> String[106.17609,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.837944]k
> String[112.01403,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.0294495]e
> String[117.04348,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=3.422966]t
> String[122.40893,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=8.75692]w
> String[131.16585,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.249954]a
> String[136.4158,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=3.4229736]t
> String[139.83878,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=4.661957]c
> String[144.50073,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=6.1109467]h
> String[150.61168,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.0294495]e
> String[155.64113,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=4.147461]r
> String[159.78859,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=4.45195]s
> String[166.18617,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=8.756912]w
> String[174.94308,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.0294495]e
> String[179.97253,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=4.147461]r
> String[184.12,169.67963 fs=1.0 xscale=10.4999 height=6.4035034 space=5.8666234 width=5.0294495]e
> {code}
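
As a sanity check on the numbers in the comment: the `space=` values in the two PrintTextLocations listings look like the respective widths (326.66 for 1.8, 558.7313 for 2.0) scaled by `xscale / 1000`. The sketch below reproduces that arithmetic and then applies a deliberately simplified word-break rule to the gap between the "e" of "Some" and the "m" of "market". The class name, the helper methods, and the 0.5 tolerance are assumptions made for illustration; this is not meant to describe the actual text extraction logic in either version.

```java
public class SpaceWidthCheck {

    // Scale a width given in 1/1000 text units by the horizontal scale
    // reported in the listing (xscale=10.4999).
    static double spaceWidth(double glyphWidth, double xscale) {
        return glyphWidth / 1000.0 * xscale;
    }

    // Horizontal gap between the end of one glyph and the start of the next.
    static double gap(double x, double width, double nextX) {
        return nextX - (x + width);
    }

    // Hypothetical rule (an assumption, not the real algorithm):
    // report a word break when the gap exceeds tolerance * space width.
    static boolean isWordBreak(double gap, double space, double tolerance) {
        return gap > tolerance * space;
    }

    public static void main(String[] args) {
        double xscale = 10.4999;

        // Widths quoted in the comment above.
        double space18 = spaceWidth(326.66, xscale);   // 1.8: hmtx width for code 0
        double space20 = spaceWidth(558.7313, xscale); // 2.0: average of non-zero widths

        // "e" at x=80.53218 with width=5.0294495, followed by "m" at x=87.505165.
        double g = gap(80.53218, 5.0294495, 87.505165);

        double tolerance = 0.5; // assumed value, for illustration only

        System.out.printf("space 1.8 = %.4f, space 2.0 = %.4f, gap = %.4f%n",
                space18, space20, g);
        System.out.println("word break with 1.8 space width: "
                + isWordBreak(g, space18, tolerance));
        System.out.println("word break with 2.0 space width: "
                + isWordBreak(g, space20, tolerance));
    }
}
```

Under these assumptions, the gap of roughly 1.94 between "e" and "m" exceeds half of the 1.8 space width but not half of the 2.0 space width, which is consistent with 1.8 emitting a space there and 2.0 concatenating the words.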