[ https://issues.apache.org/jira/browse/SOLR-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160036#comment-13160036 ]
Robert Muir commented on SOLR-2930:
-----------------------------------
My bad, I confused this bug with the PDFBox 'character deletion'
one (TIKA-767); that one is still unfortunately not in Tika 1.0, it seems.
> Allow controlling an important PDF processing parameter in Tika that splits
> the words in text and is now supported in version 1.0 of Tika.
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-2930
> URL: https://issues.apache.org/jira/browse/SOLR-2930
> Project: Solr
> Issue Type: Improvement
> Components: contrib - Solr Cell (Tika extraction)
> Affects Versions: 3.5
> Reporter: Ravish Bhagdev
> Labels: pdf, text-splitting, tika
>
> Tika 1.0 has fixed a major issue in the parsing of PDF files
> that was splitting words incorrectly:
> https://issues.apache.org/jira/browse/TIKA-724
> This causes text to be indexed incorrectly in Solr, and it becomes especially
> visible when using features such as spellcheck.
> They have added a parameter, set via setEnableAutoSpace, that fixes
> the problem, but there is currently no way of setting it when using Solr.
> As discussed in the thread on the above issue, it would be nice if we could
> control this parameter (and, in future, others) via Solr configuration.
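
To make the request concrete, here is a minimal sketch of one way a
name/value setting from configuration could be mapped onto a parser setter.
Everything in it is an assumption made for illustration: StubPdfParser is a
stand-in class (not the Tika parser), applyBooleanParams is a hypothetical
helper, and the stub's default value is a guess. Only the setter name
setEnableAutoSpace comes from the issue text. It does not show how Solr Cell
or Tika actually wire this up.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

public class ParserConfigSketch {

    // Hypothetical stand-in for a PDF parser. Only the setter name is taken
    // from the issue text; the default value here is an assumption.
    public static class StubPdfParser {
        private boolean enableAutoSpace = true;

        public void setEnableAutoSpace(boolean value) {
            this.enableAutoSpace = value;
        }

        public boolean getEnableAutoSpace() {
            return enableAutoSpace;
        }
    }

    // Hypothetical helper: for each "name -> value" entry, look up a public
    // set<Name>(boolean) method on the target and call it with the parsed value.
    public static void applyBooleanParams(Object target, Map<String, String> params)
            throws Exception {
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String name = entry.getKey();
            String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            Method method = target.getClass().getMethod(setter, boolean.class);
            method.invoke(target, Boolean.parseBoolean(entry.getValue()));
        }
    }

    public static void main(String[] args) throws Exception {
        StubPdfParser parser = new StubPdfParser();

        // Stands in for values that would be read from a configuration file.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("enableAutoSpace", "false");

        applyBooleanParams(parser, params);
        System.out.println(parser.getEnableAutoSpace());
    }
}
```

Looking setters up by name means that further parameters of the same type
could be exposed later without changing the helper, which is the "and in
future others" part of the request; a real implementation would also need to
handle non-boolean types and unknown names.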