[jira] [Commented] (PDFBOX-4101) Word ordering / line detection failures in text extraction

Tilman Hausherr (JIRA) Tue, 06 Feb 2018 14:02:19 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354611#comment-16354611 ]


Tilman Hausherr commented on PDFBOX-4101:
-----------------------------------------

I assume this output was produced in unsorted mode. In that mode the extracted 
text follows the order of the glyphs in the content stream, which may or may 
not be a useful reading order for humans.
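
To illustrate the difference between content-stream order and position order, 
here is a minimal, self-contained sketch. It does not use PDFBox at all: the 
Glyph class, the coordinates, and the sorting rule (top to bottom, then left to 
right) are simplified assumptions made up for this example, not how 
PDFTextStripper is implemented. In PDFBox 2.0.x itself, sorted extraction is 
switched on with PDFTextStripper.setSortByPosition(true).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphOrderDemo {
    // Hypothetical stand-in for a positioned piece of text; not a PDFBox class.
    static final class Glyph {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Models unsorted mode: concatenate the text in the order it was drawn.
    static String inStreamOrder(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    // Simplified model of sorted mode: top to bottom, then left to right.
    static String sortedByPosition(List<Glyph> glyphs) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                .thenComparingDouble(g -> g.x));
        return inStreamOrder(copy);
    }

    // Made-up data: the initial "D" is drawn last in the stream,
    // although it sits at the start of the line on the page.
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("uis", 20, 10));
        glyphs.add(new Glyph(" sit", 40, 10));
        glyphs.add(new Glyph(" amet", 60, 10));
        glyphs.add(new Glyph("D", 10, 10));
        return glyphs;
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(sample()));    // prints "uis sit ametD"
        System.out.println(sortedByPosition(sample())); // prints "Duis sit amet"
    }
}
```

Sorting by position helps in a case like the misplaced initial letter, but a 
plain top-to-bottom, left-to-right sort like this one would interleave lines 
from side-by-side columns, so it is not a complete answer for multi-column 
pages.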

> Word ordering / line detection failures in text extraction
> ----------------------------------------------------------
>
>                 Key: PDFBOX-4101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Alexandre
>            Priority: Major
>         Attachments: fails_line_detection-sort.txt, 
> fails_line_detection-unsort.txt, fails_line_detection.pdf, hardtests-11.png
>
>
> Dear Apache contributors,
> I am a happy user of PDFBox, mainly for text extraction. In some cases the 
> word ordering is incorrect, and line detection can fail as well.
> Examples from the attachments:
>  * 1st page: the initial letter "D" is not output before "uis sit amet..." but 
> at the end of the page;
>  * 2nd page: the phrase "scolaire ferry" is output just before "réouverture du 
> musée", which is wrong because the two are not in the same column.
> Handling these cases would be more than welcome :D A.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

