[jira] [Commented] (PDFBOX-3680) Extracted text in wrong order [header, footer, content]

Maruan Sahyoun (JIRA) Wed, 08 Feb 2017 01:44:07 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857731#comment-15857731
 ]


Maruan Sahyoun commented on PDFBOX-3680:
----------------------------------------

Did you use the {{sort}} option when extracting the text?
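
For context: in PDFBox 2.x the sort option corresponds to calling {{PDFTextStripper.setSortByPosition(true)}} before extracting, or to passing {{-sort}} to the ExtractText command-line tool. Without it, text comes out in the order it appears in the page's content stream, which need not match the visual layout. The sketch below only illustrates the idea of position-based ordering with plain Java. It is not PDFBox's implementation: the class {{PositionSortSketch}}, its methods, and the coordinates are hypothetical, and only the three text strings are taken from the report. Whether the option resolves the problem for the attached PDF is not established here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of position-based ordering; this is NOT PDFBox code.
final class PositionSortSketch {

    /** A piece of text plus the (assumed) position at which it is drawn. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the fragment texts ordered top-to-bottom, then left-to-right. */
    static List<String> sortByPosition(List<Fragment> contentStreamOrder) {
        // Copy first so the caller's list keeps its original order.
        List<Fragment> sorted = new ArrayList<>(contentStreamOrder);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        List<String> texts = new ArrayList<>();
        for (Fragment f : sorted) {
            texts.add(f.text);
        }
        return texts;
    }

    /** Fragments in the order reported in the issue: header, footer, content. */
    static List<Fragment> sampleContentStreamOrder() {
        List<Fragment> fragments = new ArrayList<>();
        // Coordinates are invented for illustration only.
        fragments.add(new Fragment("Bundesrecht konsolidiert", 70, 30));
        fragments.add(new Fragment("www.ris.bka.gv.at Seite 1 von 35", 70, 800));
        fragments.add(new Fragment("Gesamte Rechtsvorschrift [...] und Rechtsnachfolge", 70, 100));
        return fragments;
    }
}
```

Calling {{PositionSortSketch.sortByPosition(PositionSortSketch.sampleContentStreamOrder())}} returns the texts as header, content, footer, because the footer's larger distance from the top of the page places it last regardless of where it sits in the input list.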

> Extracted text in wrong order [header, footer, content]
> -------------------------------------------------------
>
>                 Key: PDFBOX-3680
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3680
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.1
>            Reporter: Dominik Bauer
>         Attachments: 1_to_3_Text.txt, DSG 2000, Fassung vom 27.01.2017.pdf
>
>
> When I extract the text from the attached PDF, the text is in the wrong 
> order. 
> Every page has a header, which is "Bundesrecht konsolidiert", some content, 
> and a footer, which is "www.ris.bka.gv.at Seite x von y". The footer consists 
> of a URL and the page number in German ("Seite x von y" means "page x of y").
> In my view, the extracted text should follow the visual reading order of the 
> page. The correct order would be header, content, footer. 
> When I open the file in Adobe Reader and copy the text from the page, the 
> copied text is in that correct order as well.
> The extracted text is:
> {quote}
>  Bundesrecht konsolidiert 
> www.ris.bka.gv.at Seite 1 von 35 
> Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
> {quote}
> Judging by the layout of the page, the extracted text should be:
> {quote}
>  Bundesrecht konsolidiert 
> Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
> www.ris.bka.gv.at Seite 1 von 35 
> {quote}
> The PDF itself and the extracted text of the first three pages are attached 
> to this ticket.



