[jira] Resolved: (TIKA-584) Tika parse of some PDF files removes all spaces between words

Jukka Zitting (JIRA) Wed, 19 Jan 2011 04:59:14 -0800

     [ https://issues.apache.org/jira/browse/TIKA-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting resolved TIKA-584.
--------------------------------

    Resolution: Duplicate
      Assignee: Jukka Zitting

Like TIKA-583, this is a duplicate of TIKA-548, fixed in trunk.

> Tika parse of some PDF files removes all spaces between words
> -------------------------------------------------------------
>
>                 Key: TIKA-584
>                 URL: https://issues.apache.org/jira/browse/TIKA-584
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: Windows XP 3, OpenSuse 11.2
>            Reporter: Ajay Vohra
>            Assignee: Jukka Zitting
>         Attachments: JavaEE6Tutorial.pdf
>
>
> For some PDF files (not all), when the Tika.parse(InputStream) method is 
> used, the content extracted from the returned reader has all spaces 
> removed. One example where this happens is JavaEE6Tutorial.pdf (available 
> from Oracle), and there are many other files where this bug can be seen. 
> I have also tried the Tika 0.9 snapshot and the bug remains.
> When PDFTextStripper is used directly, the extracted content is correct, 
> with the spaces between words retained.
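
One rough way to check whether a given file shows the symptom described above is to measure how much whitespace the extracted text contains. The following is a minimal sketch using only the Java standard library. The class name SpaceCheck, the method whitespaceRatio, and the 0.05 threshold are hypothetical and not part of Tika or PDFBox. The StringReader inputs stand in for the reader that Tika.parse(InputStream) would return.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Tika: estimates whether extracted text
// has lost its word-separating whitespace.
public class SpaceCheck {

    /** Returns the fraction of characters read that are whitespace (0.0 for empty input). */
    static double whitespaceRatio(Reader reader) {
        long total = 0;
        long spaces = 0;
        try {
            int c;
            while ((c = reader.read()) != -1) {
                total++;
                if (Character.isWhitespace(c)) {
                    spaces++;
                }
            }
        } catch (IOException e) {
            // Wrap so callers need not handle a checked exception in this sketch.
            throw new UncheckedIOException(e);
        }
        return total == 0 ? 0.0 : (double) spaces / total;
    }

    public static void main(String[] args) {
        // Assumption: in real use, the reader would come from Tika.parse(InputStream).
        Reader good = new StringReader("the spaces between words retained");
        Reader bad = new StringReader("thespacesbetweenwordsretained");

        // Assumed threshold: ordinary English prose is well above 5% whitespace.
        double threshold = 0.05;
        System.out.println("good suspicious? " + (whitespaceRatio(good) < threshold));
        System.out.println("bad suspicious?  " + (whitespaceRatio(bad) < threshold));
    }
}
```

A ratio near zero on a document that is known to contain running prose suggests the extraction dropped the spaces. The threshold is a guess and would need tuning for documents that are mostly tables or code.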

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
