[ https://issues.apache.org/jira/browse/SOLR-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332332#comment-14332332 ]
Uwe Schindler commented on SOLR-7137:
-------------------------------------

See also SOLR-6488 (must be applied first).

> Upgrade to Tika 1.7 in 4_10_3 branch
> ------------------------------------
>
>                 Key: SOLR-7137
>                 URL: https://issues.apache.org/jira/browse/SOLR-7137
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 4.10.3
>            Reporter: Chris A. Mattmann
>            Assignee: Uwe Schindler
>         Attachments: SOLR-7137.Mattmann.022115.patch.txt
>
>
> I have been trying out SolrCell as an alternative to ingesting around 40M
> images using Tesseract/OCR and Tika. I noticed in 4.10.3 Tika is pinned to
> 1.5. In 1.5 Tika and in SolrCell 4.10.3, only about 5600 images of a subset
> of 50,000 are ingested when I run a series of 50k cURL commands to the
> extract handler. I had a feeling it has something to do with the fact that
> some of the characters extracted are oddball characters (4@#@#/ ^^^^) due to
> Tesseract not always extracting the right text. But then I remembered
> Tesseract didn't land in Tika until 1.7.
> So regardless, I thought I'd upgrade the 4.10.x branch to Tika 1.7. This is a
> trivial patch to do so, attached (Tika + compress updates). Now all 50K
> images on the 50K subset are ingested, but I'm noticing something else weird.
> Despite the fact that Tesseract is called, and despite the fact that on
> certain images I can verify text is extracted by running Tesseract from the
> command line on that file, all I am getting in the "content" field of
> SolrCell is a bunch of "\n \n \n \n \n \n" text. So the text is extracted,
> there are weird characters, but they don't make it into Solr. Extremely odd.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)