[jira] [Commented] (PDFBOX-4101) Word ordering / line detection failures in text extraction

2018-02-06 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354990#comment-16354990
 ] 

Tilman Hausherr commented on PDFBOX-4101:
-----------------------------------------

There is no fixed rule that the sorted mode is better than the unsorted mode. 
Sometimes the unsorted mode is better, e.g. if a multi-column PDF was created 
with the text in perfect reading order. (Open your file with Adobe Reader and 
try to select the three lines of the left column of page 2: you can't. It will 
select three other segments as well.) The sorted mode is better if you want the 
text ordered by its position on the page. However, the sort comparison is not 
properly transitive when glyphs have different sizes (PDFBOX-1512).
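
To make the transitivity remark concrete, here is a minimal, self-contained 
sketch. It is not PDFBox's actual comparator: the Glyph class, the half-height 
tolerance rule, and all coordinates are assumptions invented for this example. 
It only shows how a "same line" test whose tolerance depends on glyph height 
can yield a cycle a < b < c < a, which no sort can order consistently.

```java
import java.util.Comparator;

final class ToleranceComparatorDemo {

    /** Hypothetical glyph: left edge x, baseline y (growing downward), height. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;
        final double height;

        Glyph(String text, double x, double y, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Assumed rule (not taken from PDFBox): two glyphs share a line when their
     * baselines differ by at most half the height of the taller glyph.
     * Same line: order by x. Different lines: order by y.
     */
    static final Comparator<Glyph> BY_POSITION = new Comparator<Glyph>() {
        @Override
        public int compare(Glyph a, Glyph b) {
            double tolerance = 0.5 * Math.max(a.height, b.height);
            if (Math.abs(a.y - b.y) <= tolerance) {
                return Double.compare(a.x, b.x);
            }
            return Double.compare(a.y, b.y);
        }
    };

    /** True when a < b, b < c and c < a all hold, i.e. transitivity is broken. */
    static boolean isCycle(Glyph a, Glyph b, Glyph c) {
        return BY_POSITION.compare(a, b) < 0
                && BY_POSITION.compare(b, c) < 0
                && BY_POSITION.compare(c, a) < 0;
    }

    public static void main(String[] args) {
        Glyph a = new Glyph("a", 30, 100.0, 4);   // small glyph, upper line
        Glyph b = new Glyph("b", 10, 103.0, 4);   // small glyph, lower line
        Glyph c = new Glyph("C", 20, 101.5, 20);  // large glyph overlapping both
        System.out.println("cycle: " + isCycle(a, b, c)); // prints "cycle: true"
    }
}
```

The two small glyphs are far enough apart to count as different lines, but the 
large glyph counts as "same line" with each of them, so the three pairwise 
answers contradict each other.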

> Word ordering / line detection failures in text extraction
> ----------------------------------------------------------
>
> Key: PDFBOX-4101
> URL: https://issues.apache.org/jira/browse/PDFBOX-4101
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.8
> Reporter: Alexandre
> Priority: Major
> Attachments: fails_line_detection-sort.txt, 
> fails_line_detection-unsort.txt, fails_line_detection.pdf, hardtests-11.png
>
>
> Dear Apache contributors,
> I am a (y) user of pdfbox, mainly for the purpose of text extraction. The word 
> ordering is not correct in some cases, and the line detection may fail too.
> In the attachments:
>  * 1st page: the first letter "D" is not output before "uis sit amet..." but 
> at the end of the page;
>  * 2nd page: the phrase "scolaire ferry" comes just before "réouverture du 
> musée", which is wrong because they are not in the same column.
> Handling these cases would be more than welcome :D A.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4101) Word ordering / line detection failures in text extraction

2018-02-06 Thread Alexandre (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354619#comment-16354619
 ] 

Alexandre commented on PDFBOX-4101:
-----------------------------------

I understand what you said. Yes, I used the unsorted mode. So, as I understand 
it, the sorted mode fails for the two cases.




[jira] [Commented] (PDFBOX-4101) Word ordering / line detection failures in text extraction

2018-02-06 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354611#comment-16354611
 ] 

Tilman Hausherr commented on PDFBOX-4101:
-----------------------------------------

I assume this output was created in the unsorted mode. In that mode the text 
comes out in the order in which the glyphs appear in the content stream. This 
order may or may not be useful for humans.
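
The difference between the two orders can be sketched without any PDF library. 
This is an illustration only: the Chunk class and the coordinates are invented, 
and a real drop cap would have its own size and baseline. In PDFBox the mode is 
chosen with PDFTextStripper.setSortByPosition(boolean); the toy code below only 
mimics the idea of "stream order" versus "position order".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class StreamOrderDemo {

    /** Hypothetical text chunk with the position of its left baseline point. */
    static final class Chunk {
        final String text;
        final double x;
        final double y; // grows downward in this sketch

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Chunks listed in the order an imaginary content stream draws them. */
    static List<Chunk> sample() {
        List<Chunk> chunks = new ArrayList<Chunk>();
        chunks.add(new Chunk("uis sit amet", 20, 100));
        chunks.add(new Chunk("consectetur", 10, 112));
        chunks.add(new Chunk("D", 10, 100)); // drawn last, placed first
        return chunks;
    }

    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** "Unsorted": keep the drawing order as is. */
    static String inStreamOrder(List<Chunk> chunks) {
        return join(chunks);
    }

    /** "Sorted": top to bottom, then left to right (exact y match only). */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> copy = new ArrayList<Chunk>(chunks);
        copy.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return join(copy);
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(sample()));
        System.out.println(sortedByPosition(sample()));
    }
}
```

The exact-y comparison works here only because the toy baselines are identical; 
real glyphs need some tolerance, which is where ordering problems can start.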




[jira] [Commented] (PDFBOX-4101) Word ordering / line detection failures in text extraction

2018-02-06 Thread Alexandre (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354601#comment-16354601
 ] 

Alexandre commented on PDFBOX-4101:
-----------------------------------

It does recognize columns, but I have no idea which algorithm is behind it. 
Whitespace detection? Still, it does recognize the columns *below* "scolaire 
ferry" and "réouverture du musée".

In the attachment hardtests-11.png you can see the lines, one color per 
distinct line.




[jira] [Commented] (PDFBOX-4101) Word ordering / line detection failures in text extraction

2018-02-06 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354597#comment-16354597
 ] 

Tilman Hausherr commented on PDFBOX-4101:
-----------------------------------------

Try the sort option. However, you still won't be satisfied: PDFBox doesn't 
know about columns. You as a human know this, but PDFBox doesn't have any 
heuristics to detect it. It can be done with "article beads", but your PDF 
doesn't have them. The same goes for the "D": how would PDFBox "know" that it 
belongs at the very beginning?
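
The column problem can be sketched with plain Java. Everything here is 
hypothetical: the Line class, the coordinates, and above all the splitX 
boundary, which the caller is assumed to know in advance because nothing in 
this sketch detects it. A pure top-to-bottom sort interleaves the two columns, 
while grouping by a known boundary first keeps each column together.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ColumnOrderDemo {

    /** Hypothetical text line with the position of its left edge. */
    static final class Line {
        final String text;
        final double x;
        final double y; // grows downward in this sketch

        Line(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    static final Comparator<Line> TOP_DOWN =
            Comparator.comparingDouble((Line l) -> l.y)
                    .thenComparingDouble(l -> l.x);

    /** Two columns, two rows, listed in an arbitrary drawing order. */
    static List<Line> sample() {
        List<Line> lines = new ArrayList<>();
        lines.add(new Line("right-2", 300, 112));
        lines.add(new Line("left-1", 50, 100));
        lines.add(new Line("right-1", 300, 100));
        lines.add(new Line("left-2", 50, 112));
        return lines;
    }

    static List<String> texts(List<Line> lines) {
        List<String> out = new ArrayList<>();
        for (Line l : lines) {
            out.add(l.text);
        }
        return out;
    }

    /** Position sort with no notion of columns: rows are read straight across. */
    static List<String> sortedByPosition(List<Line> lines) {
        List<Line> copy = new ArrayList<>(lines);
        copy.sort(TOP_DOWN);
        return texts(copy);
    }

    /** Split at a boundary supplied by the caller, then read column by column. */
    static List<String> columnAware(List<Line> lines, double splitX) {
        List<Line> left = new ArrayList<>();
        List<Line> right = new ArrayList<>();
        for (Line l : lines) {
            (l.x < splitX ? left : right).add(l);
        }
        left.sort(TOP_DOWN);
        right.sort(TOP_DOWN);
        List<String> out = texts(left);
        out.addAll(texts(right));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortedByPosition(sample()));
        System.out.println(columnAware(sample(), 200.0));
    }
}
```

When the layout is known beforehand, PDFBox's PDFTextStripperByArea can extract 
text from rectangles supplied by the caller, which is one way to apply the same 
idea to a real document.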



