[
https://issues.apache.org/jira/browse/PDFBOX-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221702#comment-17221702
]
Tilman Hausherr commented on PDFBOX-5002:
-----------------------------------------
Seems nice. I need to review the test results with my own, bigger test set.
The different extraction in the "EU" file could be problematic (although the
result looks better). This is a test file from the Tabula project (there are
many, but I kept that one as an early indicator of trouble). They don't want
any extraction differences.
The good thing is that the {{testTabula()}} test passes (it uses a different
algorithm to get font heights). But I'd need to test the Tabula build too,
which has more tests.
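
A check for "no extraction differences" could be sketched roughly as below,
assuming the text extracted before and after the change is already available
as two strings. The class {{ExtractionDiff}} and its method {{diffLines}} are
hypothetical names for illustration; they are not part of PDFBox or Tabula, and
this is a plain positional line comparison, not a real diff algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox or Tabula.
final class ExtractionDiff {

    /**
     * Compares two extraction results line by line (by position) and
     * returns one message per line number whose content differs.
     */
    static List<String> diffLines(String before, String after) {
        // "\\R" matches any line terminator; limit -1 keeps trailing empty lines.
        String[] a = before.split("\\R", -1);
        String[] b = after.split("\\R", -1);
        List<String> diffs = new ArrayList<>();
        int max = Math.max(a.length, b.length);
        for (int i = 0; i < max; i++) {
            // A line present on only one side is reported as "<missing>" on the other.
            String left = i < a.length ? a[i] : "<missing>";
            String right = i < b.length ? b[i] : "<missing>";
            if (!left.equals(right)) {
                diffs.add("line " + (i + 1) + ": \"" + left + "\" -> \"" + right + "\"");
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // The two strings quoted in the issue description.
        String before = "(...) some text awith smaller font (...)";
        String after = "(...) some text with a smaller font (...)";
        diffLines(before, after).forEach(System.out::println);
    }
}
```

An empty result would mean the extraction is unchanged for that file; any
entry would have to be reviewed by hand, since a difference can be either an
improvement or a regression.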
> PDFTextStripper sometimes fuses two words on different lines
> ------------------------------------------------------------
>
> Key: PDFBOX-5002
> URL: https://issues.apache.org/jira/browse/PDFBOX-5002
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.21
> Reporter: Thierry Guérin
> Priority: Minor
> Fix For: 2.0.22
>
> Attachments: small&Big.pdf
>
>
> This happens when a text in a big font is followed by at least two lines of
> text in a smaller font: the last word of the first line is merged with the
> first word of the second line.
> On the attached PDF, the extracted text is:
> {noformat}
> (...) some text awith smaller font (...)
> {noformat}
> instead of:
>
> {noformat}
> (...) some text with a smaller font (...)
> {noformat}
> I often encounter this kind of problem on invoices, where the company address
> (small text at the top right) is next to the company name & logo (big
> centered text at the top).
>