[ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151950#comment-13151950 ]
Ravish Bhagdev commented on TIKA-724:
-------------------------------------

Is there a way to control this flag from Solr? I would have expected to be able to add something to solrconfig.xml to control it.

As I typed this I realized this might not be the place to ask, so: is there a way to control this from the command line in tika-app?

> PDF text sometimes has extra space between letters
> --------------------------------------------------
>
>                 Key: TIKA-724
>                 URL: https://issues.apache.org/jira/browse/TIKA-724
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: TIKA-724.patch, extraSpaces.pdf
>
>
> I have a PDF with simple text "Here is some formatted text", but when
> I extract with Tika I get extra spaces inserted:
> {noformat}
> H e re i s so me fo rma tte d te x t
> {noformat}
> When I created the text in this PDF (I used the PDFpen tool on OS X),
> I set the style of the text to "loosen" (ie, increase space slightly
> between the letters), so I think Tika (PDFBox) is trying to "respect"
> that whitespace, but it'd be nice to turn this off (if it won't mess
> up other places where we DO want the whitespace).
> When I copy/paste the text is copied correctly.