[ 
https://issues.apache.org/jira/browse/PDFBOX-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354619#comment-16354619
 ] 

Alexandre edited comment on PDFBOX-4101 at 2/6/18 10:13 PM:
------------------------------------------------------------

I understand what you said! Yes, I used the unsorted mode for the picture. 
So the sorted mode failed in both cases. It probably sorts glyphs 
according to their distance from the top of the page.
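
To make that guess concrete, here is a minimal sketch of a naive "sort by distance from the top of the page" strategy. It is only an illustration of the hypothesis, not PDFBox's actual sorting code; the `Glyph` type, the `sort_top_down` helper, and all coordinates are made up for the example.

```python
# Illustration only: NOT PDFBox's implementation, just the naive
# top-of-page ordering guessed at in the comment above.
from typing import NamedTuple


class Glyph(NamedTuple):
    text: str
    x: float  # distance from the left edge of the page (hypothetical units)
    y: float  # distance of the baseline from the top edge of the page


def sort_top_down(glyphs):
    """Order glyphs by distance from the top, then left to right."""
    return sorted(glyphs, key=lambda g: (g.y, g.x))


# Hypothetical two-column page: column A should be read fully before column B.
two_columns = [
    Glyph("A1", 50, 100), Glyph("A2", 50, 120),    # left column
    Glyph("B1", 300, 100), Glyph("B2", 300, 120),  # right column
]

# Hypothetical drop cap: a large "D" whose baseline sits on the third line.
drop_cap = [
    Glyph("D", 50, 130),
    Glyph("line1", 80, 100), Glyph("line2", 80, 115), Glyph("line3", 80, 130),
]

print([g.text for g in sort_top_down(two_columns)])  # columns get interleaved
print([g.text for g in sort_top_down(drop_cap)])     # "D" lands after lines 1-2
```

With such an ordering, text from different columns gets mixed and a drop cap is emitted after the lines it visually precedes. This does not reproduce the exact output in the attachments, but it shows why ordering purely by vertical position can break both cases.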


was (Author: arelaxend):
I understand what you said! Well, yes I used the unsorted mode for the picture. 
So, I understood that the sorted mode failed for the two cases. It probably 
sorts glyphs according to their distance to the top of the page.

> Word ordering / line detection failures in text extraction
> ----------------------------------------------------------
>
>                 Key: PDFBOX-4101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Alexandre
>            Priority: Major
>         Attachments: fails_line_detection-sort.txt, 
> fails_line_detection-unsort.txt, fails_line_detection.pdf, hardtests-11.png
>
>
> Dear Apache contributors,
> I am a (y) user of PDFBox, mainly for text extraction. The word 
> ordering is incorrect in some cases, and line detection may fail too.
> Attachments:
>  * 1st page: the first letter D is not written before "uis sit amet..." but 
> at the end of the page;
>  * 2nd page: the sentence "scolaire ferry" is just before "réouverture du 
> musée", which is wrong because it's not in the same column;
> Handling these cases would be more than welcome :D A.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
