[jira] [Commented] (TIKA-1671) Wrapped lines in PDF files not processed correctly

Tim Allison (JIRA) Fri, 17 Jul 2015 04:16:16 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631204#comment-14631204
 ]


Tim Allison commented on TIKA-1671:
-----------------------------------

And a few other points...

Encoding instructions within a PDF can be botched so that the page displays 
perfectly well but the extracted text is garbage, no matter how intelligent 
the extraction software is.  Or, if the PDF was generated by a scanner, the 
text layer comes from OCR and can be noisy.

And, of course, many PDFs are "image-only". 
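
As a rough illustration of the points above, here is a minimal sketch of a 
sanity check that a caller could run on text after extraction. It is not part 
of Tika; the class name, method names and the 0.5 threshold are all 
hypothetical. It only flags empty output (possibly an image-only PDF) and 
output dominated by symbols or replacement characters. It cannot detect a 
botched encoding that maps to ordinary-looking but wrong letters.

```java
public class ExtractedTextCheck {

    // Hypothetical threshold: below this share of letters/digits,
    // treat the extracted text as suspect.
    static final double MIN_ALNUM_RATIO = 0.5;

    /** True if nothing but whitespace was extracted (possibly an image-only PDF). */
    static boolean looksImageOnly(String text) {
        return text == null || text.trim().isEmpty();
    }

    /** Share of non-whitespace code points that are letters or digits. */
    static double alnumRatio(String text) {
        int total = 0;
        int alnum = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace is ignored entirely
            }
            total++;
            if (Character.isLetterOrDigit(cp)) {
                alnum++;
            }
        }
        return total == 0 ? 0.0 : (double) alnum / total;
    }

    /** True if text was extracted but is mostly symbols or replacement characters. */
    static boolean looksGarbled(String text) {
        return !looksImageOnly(text) && alnumRatio(text) < MIN_ALNUM_RATIO;
    }

    public static void main(String[] args) {
        System.out.println(looksImageOnly("  \n "));
        System.out.println(looksGarbled("Wrapped lines in PDF files"));
        System.out.println(looksGarbled("\uFFFD\uFFFD\uFFFD #$% \uFFFD\uFFFD"));
    }
}
```

A check like this would only be a first filter; noisy OCR text in particular 
can pass it easily.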



> Wrapped lines in PDF files not processed correctly
> --------------------------------------------------
>
>                 Key: TIKA-1671
>                 URL: https://issues.apache.org/jira/browse/TIKA-1671
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: James Baker
>              Labels: pdf, wrapping
>         Attachments: Test Document.pdf
>
>
> Text that wraps over multiple lines in PDF documents is not extracted 
> correctly by Tika. The expected behaviour would be for it to be extracted as 
> a single line, but instead a line break is inserted at each wrap point.
> This makes it hard, if not impossible, to reassemble text into its intended 
> form, as it is not known whether a line break in the extracted text is one 
> that appeared in the document or one that was inserted by Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
