[jira] [Commented] (SOLR-7137) Upgrade to Tika 1.7 in 4_10_3 branch

Uwe Schindler (JIRA) Sun, 22 Feb 2015 12:06:21 -0800

    [ https://issues.apache.org/jira/browse/SOLR-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332336#comment-14332336 ]


Uwe Schindler commented on SOLR-7137:
-------------------------------------

Most of that patch consists of hashes of JAR files and some license changes. In 
fact, it is indeed enough to update the ivy.properties file with all upgraded 
versions; only the release process and validation tasks of Solr would then fail 
to pass.

As I said before, it is also enough to download Tika 1.7 and drop its JAR files 
into the contrib/extraction/lib folder of your Solr installation :-)
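
The drop-in approach above could be scripted roughly as follows. This is only 
a minimal sketch under stated assumptions: the function names (artifact_name, 
drop_in_jars), the backup folder, and all paths are hypothetical and not part 
of Solr or Tika. Moving older versions of the same artifact out of the lib 
folder is also an assumption made here to avoid having two versions of one 
library side by side; the comment itself only says to drop the new JAR files in.

```python
from __future__ import annotations

import re
import shutil
from pathlib import Path

# Matches file names such as "tika-core-1.7.jar" and captures the artifact
# part ("tika-core"). The version is assumed to start at the first "-<digit>".
_JAR_NAME = re.compile(r"^(?P<artifact>.+?)-\d[\w.\-]*\.jar$")


def artifact_name(jar: Path) -> str:
    """Return the artifact part of a JAR file name (hypothetical helper)."""
    match = _JAR_NAME.match(jar.name)
    return match.group("artifact") if match else jar.stem


def drop_in_jars(new_jars_dir: Path, solr_lib_dir: Path, backup_dir: Path) -> list[str]:
    """Copy all JARs from new_jars_dir into solr_lib_dir.

    JARs already in solr_lib_dir that share an artifact name with an incoming
    JAR but carry a different file name (assumed to be another version) are
    moved to backup_dir first. Returns the names of the JARs moved aside.
    """
    new_jars = sorted(new_jars_dir.glob("*.jar"))
    incoming = {artifact_name(jar): jar.name for jar in new_jars}

    moved: list[str] = []
    for old in sorted(solr_lib_dir.glob("*.jar")):
        name = artifact_name(old)
        if name in incoming and incoming[name] != old.name:
            backup_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(old), str(backup_dir / old.name))
            moved.append(old.name)

    # Copy the new JARs in after the superseded ones are out of the way.
    for jar in new_jars:
        shutil.copy2(jar, solr_lib_dir / jar.name)
    return moved
```

It might be called as drop_in_jars(Path("tika-1.7-jars"), 
Path("solr/contrib/extraction/lib"), Path("replaced-jars")), where all three 
paths are placeholders for your own layout. Solr would need a restart to pick 
up the new JAR files.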

> Upgrade to Tika 1.7 in 4_10_3 branch
> ------------------------------------
>
>                 Key: SOLR-7137
>                 URL: https://issues.apache.org/jira/browse/SOLR-7137
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 4.10.3
>            Reporter: Chris A. Mattmann
>            Assignee: Uwe Schindler
>         Attachments: SOLR-7137.Mattmann.022115.patch.txt
>
>
> I have been trying out SolrCell as an alternative for ingesting around 40M 
> images using Tesseract/OCR and Tika. I noticed that in 4.10.3 Tika is pinned 
> to 1.5. With Tika 1.5 and SolrCell 4.10.3, only about 5600 images of a subset 
> of 50,000 are ingested when I run a series of 50k cURL commands against the 
> extract handler. I had a feeling it had something to do with the fact that 
> some of the characters extracted are oddball characters (4@#@#/ ^^^^) due to 
> Tesseract not always extracting the right text. But then I remembered 
> Tesseract didn't land in Tika until 1.7.
> So regardless, I thought I'd upgrade the 4.10.x branch to Tika 1.7. This is a 
> trivial patch to do so, attached (Tika + compress updates). Now all 50K 
> images on the 50K subset are ingested, but I'm noticing something else weird. 
> Despite the fact that Tesseract is called, and despite the fact that on 
> certain images I can verify text is extracted by running Tesseract from the 
> command line on that file, all I am getting in the "content" field of 
> SolrCell is a bunch of "\n \n \n \n \n \n" text. So the text is extracted, 
> there are weird characters, but they don't make it into Solr. Extremely odd.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

