[ 
https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130514#comment-13130514
 ] 

Michael McCandless commented on TIKA-724:
-----------------------------------------

I dug into this one some more.

Handling space between words is tricky in PDF!  This is because a PDF
need not actually include space characters; instead it can (and does!)
simply place the glyphs at x/y positions with added whitespace between
them.  This happens easily even for languages that separate words with
whitespace.

Yet sometimes PDFs do include the space characters themselves (the attached
PDF is one such example).  Ideally we would somehow detect this (e.g. if
such PDFs are encoded differently internally), but I don't know how to do
this, or whether it's even possible.

So for the time being I made a simple addition to PDFParser: a new
option, set/getEnableAutoSpace, which defaults to enabled (i.e. it keeps
today's behavior).  This way, if an app hits PDFs like the one attached
here, or somehow knows its PDFs always include explicit space
characters, it can at least turn this option off.
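
To make the trade-off concrete, here is a toy sketch of gap-based space
insertion with an on/off switch.  It is not the PDFBox or Tika code:
the class AutoSpaceSketch, the Glyph type, the layout helper and the
GAP_RATIO threshold are all made up for illustration, and the real
heuristic is more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of position-based text extraction; NOT the actual PDFBox logic. */
final class AutoSpaceSketch {

    /** Hypothetical glyph: a character drawn at position x with a given width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Assumed threshold: a gap wider than 30% of the previous glyph is a word break. */
    static final double GAP_RATIO = 0.3;

    /** Concatenates glyphs, optionally inserting a space wherever the gap is wide. */
    static String extract(List<Glyph> glyphs, boolean enableAutoSpace) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (enableAutoSpace && prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > GAP_RATIO * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    /** Lays out text left to right, adding `tracking` extra space after each glyph. */
    static List<Glyph> layout(String text, double glyphWidth, double tracking) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), i * (glyphWidth + tracking), glyphWidth));
        }
        return glyphs;
    }
}
```

In this toy model, "loosened" text that already carries explicit space
characters, e.g. layout("Hi all", 10.0, 4.0), extracts as "H i   a l l"
with the switch on and as "Hi all" with it off.  A PDF that has no space
characters and relies purely on glyph positions needs the switch on to
keep its word breaks, which is why the default stays enabled.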

                
> PDF text sometimes has extra space between letters
> --------------------------------------------------
>
>                 Key: TIKA-724
>                 URL: https://issues.apache.org/jira/browse/TIKA-724
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: TIKA-724.patch, extraSpaces.pdf
>
>
> I have a PDF with simple text "Here is some formatted text", but when
> I extract with Tika I get extra spaces inserted:
> {noformat}
> H e re  i s  so me  fo rma tte d  te x t
> {noformat}
> When I created the text in this PDF (I used the PDFpen tool on OS X),
> I set the style of the text to "loosen" (ie, increase space slightly
> between the letters), so I think Tika (PDFBox) is trying to "respect"
> that whitespace, but it'd be nice to turn this off (if it won't mess
> up other places where we DO want the whitespace).
> When I copy/paste the text is copied correctly.


        
