PDF text sometimes has extra space between letters
--------------------------------------------------

                 Key: TIKA-724
                 URL: https://issues.apache.org/jira/browse/TIKA-724
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
         Attachments: extraSpaces.pdf

I have a PDF with simple text "Here is some formatted text", but when
I extract with Tika I get extra spaces inserted:

{noformat}
H e re  i s  so me  fo rma tte d  te x t
{noformat}

When I created the text in this PDF (I used the PDFpen tool on OS X),
I set the style of the text to "loosen" (i.e., slightly increase the
space between letters), so I think Tika (PDFBox) is trying to "respect"
that whitespace. It'd be nice to be able to turn this off (as long as
it doesn't break other places where we DO want the whitespace).

When I copy/paste the text from the PDF, it is copied correctly.
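
To make the suspected cause concrete, here is a minimal, self-contained sketch of a gap-based spacing heuristic. It is NOT PDFBox's or Tika's actual code: the class GapSpacingSketch, its Glyph type, the layout and extract methods, and all the numbers are hypothetical, and it assumes fixed-width glyphs and word breaks detected purely from horizontal gaps. PDFBox's PDFTextStripper does have tolerance settings of this general kind (setSpacingTolerance, setAverageCharTolerance), but whether tuning them fixes this particular PDF is untested here.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSpacingSketch {

    /** Hypothetical glyph record: character, left x position, advance width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Lays out text with fixed-width glyphs plus extra letter spacing
     * (the "loosen" style). Spaces advance the pen but emit no glyph,
     * so the extractor has to infer them from gaps.
     */
    static List<Glyph> layout(String text, double glyphWidth, double extraSpacing) {
        List<Glyph> glyphs = new ArrayList<Glyph>();
        double x = 0;
        for (char c : text.toCharArray()) {
            if (c != ' ') {
                glyphs.add(new Glyph(c, x, glyphWidth));
            }
            x += glyphWidth + extraSpacing;
        }
        return glyphs;
    }

    /**
     * Rebuilds text from glyphs, inserting a space whenever the gap to the
     * previous glyph exceeds spaceWidth * tolerance.
     */
    static String extract(List<Glyph> glyphs, double spaceWidth, double tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Here is some formatted text";
        // Normal spacing, threshold 2.5: extracted as written.
        System.out.println(extract(layout(text, 5, 0), 5, 0.5));
        // Loosened by 3 units: every letter gap now exceeds the threshold.
        System.out.println(extract(layout(text, 5, 3), 5, 0.5));
        // Higher tolerance (threshold 4): loosened text extracts cleanly again.
        System.out.println(extract(layout(text, 5, 3), 5, 0.8));
    }
}
```

In this toy model the loosened text comes out with a space between every letter, whereas the real output above is irregular, presumably because real glyph widths vary. It also shows the risk mentioned above: raising the tolerance too far (1.0 in this model) would swallow the genuine word gaps in normally spaced text.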


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
