[ https://issues.apache.org/jira/browse/PDFBOX-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144804#comment-16144804 ]
Tilman Hausherr commented on PDFBOX-3912:
-----------------------------------------
No, the text extraction is done by page. If you can't attach your document,
could you try this:
- split it with PDFSplit (with similar parameters)
- run ExtractText on the resulting document and check whether the other
page is there
- copy & paste from Adobe Reader on the resulting document and check
whether the other page is there
My suspicion is that your document may have invisible text...
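
The split-and-extract check above could be scripted roughly as follows. This is only a sketch: the jar file name, the input file name, the page number and the output prefix are placeholders rather than values taken from the report, and the loop globs for whatever files PDFSplit writes instead of assuming an exact output name.

```shell
#!/bin/sh
# Placeholder values (assumptions): adjust to your own setup.
PDFBOX_JAR="${PDFBOX_JAR:-pdfbox-app-2.0.7.jar}"
INPUT="${INPUT:-document.pdf}"
PAGE="${PAGE:-7}"
PREFIX="${PREFIX:-check}"

check_page() {
    # 1. Split out only the page in question into its own PDF.
    java -jar "$PDFBOX_JAR" PDFSplit -startPage "$PAGE" -endPage "$PAGE" \
        -outputPrefix "$PREFIX" "$INPUT" || return 1

    # 2. Run ExtractText on every file the split produced and print
    #    the extracted text to the console for inspection.
    for f in "$PREFIX"-*.pdf; do
        [ -e "$f" ] || continue
        echo "== $f =="
        java -jar "$PDFBOX_JAR" ExtractText -console "$f"
    done
}
```

Calling check_page prints the text of the split-out page. If the other page's text still appears there, the Adobe Reader copy & paste step on the same file is the next comparison to make.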
> Command line : ExtractText, Duplicated text
> -------------------------------------------
>
> Key: PDFBOX-3912
> URL: https://issues.apache.org/jira/browse/PDFBOX-3912
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.7
> Reporter: Hasan Karaoğlu
> Attachments: bugzilla867751.html
>
>
> When I convert some pages of a PDF file to HTML, I get duplicated
> pages. For example, when I convert the seventh page of a PDF file, the
> output also contains the sixth page's content.