[
https://issues.apache.org/jira/browse/PDFBOX-3680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857731#comment-15857731
]
Maruan Sahyoun commented on PDFBOX-3680:
----------------------------------------
Did you use the {{sort}} option when extracting the text?
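For context, in PDFBox 2.x the sort option corresponds to {{-sort}} on the ExtractText command-line tool and to {{PDFTextStripper.setSortByPosition(true)}} in the API. Without it, text comes out in content-stream order, which need not match the visual layout. Below is a minimal, library-free sketch of the idea only; it is not how PDFBox implements sorting. The {{Fragment}} class, the {{readingOrder}} method, and the coordinates are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    static final class Fragment {
        final String text;
        final float x; // distance from the left edge of the page (assumed units)
        final float y; // distance from the top edge of the page (assumed units)

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the fragment texts ordered top to bottom, then left to right. */
    static List<String> readingOrder(List<Fragment> contentStreamOrder) {
        // Copy first so the caller's list keeps its original order.
        List<Fragment> sorted = new ArrayList<>(contentStreamOrder);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));

        List<String> texts = new ArrayList<>();
        for (Fragment f : sorted) {
            texts.add(f.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Order as reported in the issue: header, footer, content.
        // The coordinates are made up; only their relative order matters.
        List<Fragment> page = Arrays.asList(
                new Fragment("Bundesrecht konsolidiert", 50f, 30f),
                new Fragment("www.ris.bka.gv.at Seite 1 von 35", 50f, 800f),
                new Fragment("Gesamte Rechtsvorschrift [...] und Rechtsnachfolge", 50f, 100f));

        for (String line : readingOrder(page)) {
            System.out.println(line);
        }
    }
}
```

If the header and footer are drawn before the body in the content stream, position-based sorting of this kind would be expected to move the footer after the content.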
> Extracted text in wrong order [header, footer, content]
> -------------------------------------------------------
>
> Key: PDFBOX-3680
> URL: https://issues.apache.org/jira/browse/PDFBOX-3680
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.1
> Reporter: Dominik Bauer
> Attachments: 1_to_3_Text.txt, DSG 2000, Fassung vom 27.01.2017.pdf
>
>
> When I extract the text from the attached PDF, the text is in the wrong
> order.
> Every page has a header, which is "Bundesrecht konsolidiert", some content, and
> a footer, which is "www.ris.bka.gv.at Seite x von y". The footer consists of
> a URL and the page number in German.
> In my view, the extracted text should follow the order in which we read the
> page. The correct order would be header, content, footer.
> When I open the file in Adobe Reader and copy the text from the page, the text
> is also in the same order.
> The extracted text is:
> {quote}
> Bundesrecht konsolidiert
> www.ris.bka.gv.at Seite 1 von 35
> Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
> {quote}
> Looking at the page, the extracted text should be:
> {quote}
> Bundesrecht konsolidiert
> Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
> www.ris.bka.gv.at Seite 1 von 35
> {quote}
> The PDF itself and the extracted text of the first three pages are attached to
> this ticket.