[jira] [Commented] (PDFBOX-3912) Command line : ExtractText, Duplicated text

Tilman Hausherr (JIRA) Mon, 28 Aug 2017 23:18:57 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144804#comment-16144804
 ]


Tilman Hausherr commented on PDFBOX-3912:
-----------------------------------------

No, the text extraction is done page by page. If you can't attach your 
document, could you try this:
- split it with PDFSplit (using similar page parameters)
- run ExtractText on the resulting document and check whether the other 
page's text is there
- copy & paste from Adobe Reader on the resulting document and check whether 
the other page's text is there

My suspicion is that your document may have invisible text...
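
To make the "is the other page there" check less dependent on eyeballing, 
here is a minimal sketch. It assumes you have already saved the extracted 
text of the problem page and of the neighbouring page as plain strings. The 
class and method names (PageOverlapCheck, sharedLines) are hypothetical and 
not part of PDFBox; it only compares text and does not read PDFs itself.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: compares two pieces of
// already-extracted text and reports the lines they have in common.
public class PageOverlapCheck {

    /**
     * Returns the non-blank, trimmed lines of otherPageText that also
     * occur as whole lines in extractedText, in order of appearance.
     */
    static Set<String> sharedLines(String extractedText, String otherPageText) {
        // Collect the distinct non-blank lines of the page under test.
        Set<String> extracted = new LinkedHashSet<>();
        for (String line : extractedText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                extracted.add(trimmed);
            }
        }
        // Keep only those lines of the other page that appear there too.
        Set<String> shared = new LinkedHashSet<>();
        for (String line : otherPageText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && extracted.contains(trimmed)) {
                shared.add(trimmed);
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for real extraction output.
        String pageSeven = "Heading\nText from page six\nText from page seven\n";
        String pageSix = "Heading\nText from page six\n";
        System.out.println(sharedLines(pageSeven, pageSix));
    }
}
```

A few shared lines (running headers, for instance) are normal. If most of 
the neighbouring page shows up, that would be consistent with the duplicated 
content described in the report.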

> Command line : ExtractText, Duplicated text
> -------------------------------------------
>
>                 Key: PDFBOX-3912
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3912
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>            Reporter: Hasan Karaoğlu
>         Attachments: bugzilla867751.html
>
>
> When I convert some pages of a PDF file to HTML, the output contains 
> duplicated pages. For example, when I convert the seventh page of a PDF 
> file, the result also contains the sixth page's content.



