[ https://issues.apache.org/jira/browse/PDFBOX-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-3749:
------------------------------------
Attachment: helloworld.pdf
helloworld-marked-1.png
Here's a file that works as expected.
{code}
The quick brown fox jumps over the lazy dog
String[100.0,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=7.3320007]T
String[107.332,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]h
String[114.004,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]e
String[120.675995,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.3359985]
String[124.01199,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]q
String[130.68399,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]u
String[137.35599,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=2.6640015]i
String[140.01999,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.0]c
String[146.01999,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.0]k
String[152.01999,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.3359985]
String[155.35599,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]b
String[162.02798,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.9960022]r
String[166.02399,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]o
String[172.69598,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=8.664001]w
String[181.35999,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]n
String[188.03198,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.3359985]
String[191.36798,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.3359985]f
String[194.70398,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]o
String[201.37598,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.0]x
String[207.37598,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.3359985]
String[210.71198,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=2.6640015]j
String[213.37598,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]u
String[220.04797,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=9.996002]m
String[230.04398,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]p
String[236.71597,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.0]s
String[242.71597,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.3359985]
String[246.05197,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]o
String[252.72397,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.0]v
String[258.72397,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]e
String[265.39597,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.9960022]r
String[269.39197,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.3359985]
String[272.72797,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.3359985]t
String[276.06396,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]h
String[282.73596,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]e
String[289.40796,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.3359985]
String[292.74396,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=2.6640015]l
String[295.40796,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]a
String[302.07996,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.0]z
String[308.07996,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.0]y
String[314.07996,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=3.3359985]
String[317.41595,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]d
String[324.08795,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]o
String[330.75995,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004
width=6.671997]g
{code}
> void writeString(String text, List<TextPosition> textPositions) is not called
> per line
> --------------------------------------------------------------------------------------
>
> Key: PDFBOX-3749
> URL: https://issues.apache.org/jira/browse/PDFBOX-3749
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.4
> Environment: Windows 10 64-bit
> Reporter: Harun Reşit Zafer
> Priority: Minor
> Labels: extraction, style
> Attachments: contract_00105_SEDAR-marked-1.png,
> contract_00105_SEDAR.pdf, helloworld-marked-1.png, helloworld.pdf
>
>
> We overrode the {{void writeString(String text, List<TextPosition>
> textPositions)}} method of {{PDFTextStripper}} to extract additional
> position and style information from PDFs. We expected this method to be
> called once per line, with the {{List<TextPosition> textPositions}}
> parameter containing every character in the line, including the spaces.
> This is indeed the case for thousands of documents. However, for one
> particular document it is not: {{writeString}} is called once per word, and
> {{textPositions}} contains just the letters of that word.
> I am not sure whether this counts as a bug, because the final extracted
> text is not affected.
> The problematic PDF is attached.
> The problematic PDF is attached.
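
If the callback granularity cannot be relied on, one possible workaround is to collect the positions from every {{writeString}} call and regroup them into lines by baseline. The following is only a minimal sketch under stated assumptions, not PDFBox code: {{Glyph}} and {{LineRegrouper}} are hypothetical names, {{Glyph}} stands in for the values one would copy out of each {{TextPosition}} (for example via {{getXDirAdj()}}, {{getYDirAdj()}}, {{getWidthDirAdj()}} and {{getUnicode()}}), glyphs are assumed to arrive in reading order, and both tolerances are guesses that would need tuning per document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the TextPosition values used in this sketch. */
final class Glyph {
    final float x;      // left edge of the glyph
    final float y;      // baseline of the glyph
    final float width;  // advance width of the glyph
    final String text;  // character(s) the glyph represents

    Glyph(float x, float y, float width, String text) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.text = text;
    }
}

/** Hypothetical helper: regroups chunks of glyphs into lines by baseline. */
final class LineRegrouper {
    private final float yTolerance;
    private final List<List<Glyph>> lines = new ArrayList<>();

    LineRegrouper(float yTolerance) {
        this.yTolerance = yTolerance;
    }

    /** Call once per chunk, whether the chunk is a single word or a whole line. */
    void accept(List<Glyph> chunk) {
        for (Glyph g : chunk) {
            List<Glyph> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            // A glyph stays on the current line if its baseline is close to the line's first glyph.
            boolean sameLine = current != null
                    && Math.abs(current.get(0).y - g.y) <= yTolerance;
            if (!sameLine) {
                current = new ArrayList<>();
                lines.add(current);
            }
            current.add(g);
        }
    }

    List<List<Glyph>> lines() {
        return lines;
    }

    /** Rebuilds the line text, inserting a space wherever the horizontal gap exceeds gapThreshold. */
    static String textOf(List<Glyph> line, float gapThreshold) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null && g.x - (prev.x + prev.width) > gapThreshold) {
                sb.append(' ');
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

In a {{PDFTextStripper}} subclass, the overridden {{writeString}} would convert its {{textPositions}} into {{Glyph}} objects and pass them to {{accept}}; the regrouped lines can then be read once the page has been processed. The gap check exists because, when the callback fires per word, the space between words is not part of either chunk.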