[ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130514#comment-13130514 ]
Michael McCandless commented on TIKA-724:
-----------------------------------------

I dug into this one some more.

Handling space between words is tricky in PDF! This is because a PDF
need not actually include space characters; instead it can (and does!)
simply place the glyphs at x/y positions with added whitespace between
them. This easily happens for whitespace-based languages too.

Yet, sometimes PDFs do include the space characters themselves (the
attached PDF is one such example).

Ideally we would be able to detect this somehow (e.g. if the PDF is
encoded differently internally, or something similar), but I don't know
how to do this, or whether it's even possible.

So for the time being I made a simple addition to PDFParser, adding an
option set/getEnableAutoSpace, defaulting to enabled (i.e. keeping
today's behavior). So at least if an app hits PDFs like the one
attached here, or somehow knows that its PDFs always include explicit
space characters, it can turn this option off.

> PDF text sometimes has extra space between letters
> --------------------------------------------------
>
>                 Key: TIKA-724
>                 URL: https://issues.apache.org/jira/browse/TIKA-724
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: TIKA-724.patch, extraSpaces.pdf
>
>
> I have a PDF with simple text "Here is some formatted text", but when
> I extract with Tika I get extra spaces inserted:
> {noformat}
> H e re i s so me fo rma tte d te x t
> {noformat}
> When I created the text in this PDF (I used the PDFpen tool on OS X),
> I set the style of the text to "loosen" (i.e., increase the space
> slightly between the letters), so I think Tika (PDFBox) is trying to
> "respect" that whitespace, but it'd be nice to turn this off (if it
> won't mess up other places where we DO want the whitespace).
> When I copy/paste the text, it is copied correctly.
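
To make the trade-off behind the enableAutoSpace option concrete, here is a minimal, self-contained sketch of a gap-based spacing heuristic. It is an illustration only and is not the algorithm PDFBox or Tika actually uses: the class AutoSpaceSketch, the Glyph type, the layout and extract helpers, and the gapFactor threshold are all hypothetical names invented for this example. Only the flag name mirrors the option described in the comment. The sketch assumes fixed-width glyphs laid out on a single line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of gap-based space insertion; not PDFBox's real logic. */
final class AutoSpaceSketch {

    /** A positioned glyph: the character, its left x coordinate, and its width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Lays out text left to right with a fixed glyph width plus extra
     * letter spacing ("loosened" text has extraSpacing greater than zero).
     * Explicit space characters in the text become glyphs too.
     */
    static List<Glyph> layout(String text, double width, double extraSpacing) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), i * (width + extraSpacing), width));
        }
        return glyphs;
    }

    /**
     * Rebuilds the text. When enableAutoSpace is true, a space is inserted
     * between two non-space glyphs whose horizontal gap exceeds
     * gapFactor times the previous glyph's width.
     */
    static String extract(List<Glyph> glyphs, boolean enableAutoSpace, double gapFactor) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (enableAutoSpace && prev != null && prev.ch != ' ' && g.ch != ' ') {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> loose = layout("Here is", 10.0, 4.0);
        System.out.println(extract(loose, true, 0.3));   // prints "H e r e i s"
        System.out.println(extract(loose, false, 0.3));  // prints "Here is"
    }
}
```

With loosened letter spacing, the heuristic mistakes the extra tracking for word gaps, which is the symptom reported in this issue. Disabling it is only safe when the PDF carries its own explicit space characters, as the attached one does; for PDFs without them, the same switch would run all the words together.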