[ 
https://issues.apache.org/jira/browse/PDFBOX-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3749:
------------------------------------
    Attachment: helloworld.pdf
                helloworld-marked-1.png

Here's a file that works as expected.
{code}
The quick brown fox jumps over the lazy dog
String[100.0,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=7.3320007]T
String[107.332,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]h
String[114.004,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]e
String[120.675995,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.3359985] 
String[124.01199,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]q
String[130.68399,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]u
String[137.35599,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=2.6640015]i
String[140.01999,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.0]c
String[146.01999,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.0]k
String[152.01999,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.3359985] 
String[155.35599,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]b
String[162.02798,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.9960022]r
String[166.02399,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]o
String[172.69598,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=8.664001]w
String[181.35999,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]n
String[188.03198,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.3359985] 
String[191.36798,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.3359985]f
String[194.70398,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]o
String[201.37598,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.0]x
String[207.37598,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.3359985] 
String[210.71198,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=2.6640015]j
String[213.37598,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]u
String[220.04797,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=9.996002]m
String[230.04398,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]p
String[236.71597,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.0]s
String[242.71597,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.3359985] 
String[246.05197,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]o
String[252.72397,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.0]v
String[258.72397,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]e
String[265.39597,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.9960022]r
String[269.39197,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.3359985] 
String[272.72797,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.3359985]t
String[276.06396,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]h
String[282.73596,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]e
String[289.40796,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.3359985] 
String[292.74396,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=2.6640015]l
String[295.40796,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]a
String[302.07996,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.0]z
String[308.07996,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.0]y
String[314.07996,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=3.3359985] 
String[317.41595,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]d
String[324.08795,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]o
String[330.75995,92.0 fs=12.0 xscale=12.0 height=7.9833984 space=3.3360004 width=6.671997]g
{code}


> void writeString(String text, List<TextPosition> textPositions) is not called 
> per line
> --------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3749
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3749
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.4
>         Environment: Windows 10 64-bit
>            Reporter: Harun Reşit Zafer
>            Priority: Minor
>              Labels: extraction, style
>         Attachments: contract_00105_SEDAR-marked-1.png, 
> contract_00105_SEDAR.pdf, helloworld-marked-1.png, helloworld.pdf
>
>
> We overrode the {{void writeString(String text, List<TextPosition> 
> textPositions)}} method of {{PDFTextStripper}} to extract additional 
> position and style information from PDFs. We assumed this method would be 
> called once per line, with {{textPositions}} containing every glyph on that 
> line, including the spaces. 
> This holds for thousands of documents. However, for one particular 
> document it does not: {{writeString}} is called once per word, and 
> {{textPositions}} contains only the letters of that word. 
> I am not sure whether this counts as a bug, because the final extracted 
> text is not affected. 
> The problematic PDF is attached. 
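
If per-line glyph lists are needed regardless of how often {{writeString}} fires, one possible workaround is to buffer the glyphs and cut a line only when the stripper signals a line break. The sketch below is a minimal illustration and makes no claim about PDFBox internals. {{Glyph}} and {{LineCollector}} are hypothetical names, with {{Glyph}} standing in for {{TextPosition}} so the example compiles without PDFBox. It assumes the three callbacks would be wired to overridden {{writeString}}, {{writeWordSeparator}} and {{writeLineSeparator}} in a {{PDFTextStripper}} subclass, and that {{flush()}} is also called at the end of each page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for PDFBox's TextPosition; holds only what the sketch needs. */
final class Glyph {
    final String unicode;
    final float x;
    final float width;

    Glyph(String unicode, float x, float width) {
        this.unicode = unicode;
        this.x = x;
        this.width = width;
    }
}

/** Hypothetical helper: merges per-word callbacks into one glyph list per line. */
final class LineCollector {
    private final List<Glyph> current = new ArrayList<>();
    private final List<List<Glyph>> lines = new ArrayList<>();

    /** Assumed to be called from an overridden writeString(text, textPositions). */
    void onString(List<Glyph> glyphs) {
        current.addAll(glyphs);
    }

    /** Assumed to be called from an overridden writeWordSeparator(). */
    void onWordSeparator() {
        if (current.isEmpty()) {
            return;
        }
        // Synthesize a zero-width space right after the previous glyph,
        // since a separator inserted by the stripper has no glyph of its own.
        Glyph last = current.get(current.size() - 1);
        current.add(new Glyph(" ", last.x + last.width, 0f));
    }

    /** Assumed to be called from an overridden writeLineSeparator(). */
    void onLineSeparator() {
        flush();
    }

    /** Closes the line in progress, if any; also call this at the end of a page. */
    void flush() {
        if (!current.isEmpty()) {
            lines.add(new ArrayList<>(current));
            current.clear();
        }
    }

    List<List<Glyph>> lines() {
        return lines;
    }

    /** Joins the glyphs of one collected line back into its text. */
    static String text(List<Glyph> line) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.unicode);
        }
        return sb.toString();
    }
}
```

With this arrangement, a document that delivers a whole line in one {{writeString}} call and one that delivers it word by word both end up as one glyph list per line. In the second case the spaces are synthetic and carry no real width.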



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
