[jira] [Commented] (SOLR-7139) ExtractingRequestHandler default solrconfig.xml ignores div tags which breaks TikaOCR

Uwe Schindler (JIRA) Thu, 26 Feb 2015 06:12:40 -0800

    [ https://issues.apache.org/jira/browse/SOLR-7139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338419#comment-14338419 ]


Uwe Schindler commented on SOLR-7139:
-------------------------------------

Hi,
I analyzed the whole thing. The simplest fix is to remove the startDocument() 
method entirely, because it does nothing useful: all setup for a new document 
is already done by the constructor.
The startDocument() setup looks as if the original author wanted to "reuse" 
instances, but this is never done (I checked extraction and morphlines).
I will attach a patch that removes startDocument() and documents in the 
javadocs that an instance can only process *one* document.
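
To illustrate the idea, here is a minimal sketch of a single-use SAX handler. 
It is not the actual Solr code: the class name SingleUseTextHandler and the 
demo() helper are hypothetical, and the sketch only assumes the standard 
org.xml.sax API. All state is set up in the constructor, and startDocument() 
is not overridden, so an additional startDocument() event cannot wipe text 
that was already collected.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical single-use handler: one instance processes exactly ONE
 * document. All per-document state is created in the constructor, so there
 * is deliberately no startDocument() override that would reset it.
 */
class SingleUseTextHandler extends DefaultHandler {
    // Per-document state, initialized once when the instance is constructed.
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Collect character data from the SAX event stream.
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    /** Small demonstration: a second startDocument() event loses nothing. */
    static String demo() {
        SingleUseTextHandler handler = new SingleUseTextHandler();
        try {
            handler.startDocument();   // inherited no-op from DefaultHandler
            handler.characters("abc".toCharArray(), 0, 3);
            handler.startDocument();   // a further event; state is untouched
            handler.characters("def".toCharArray(), 0, 3);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.getText();
    }
}
```

With a resetting startDocument() the second call would discard "abc"; in this 
sketch demo() returns both chunks, at the cost that each instance must be 
thrown away after one document.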

> ExtractingRequestHandler default solrconfig.xml ignores div tags which breaks 
> TikaOCR
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-7139
>                 URL: https://issues.apache.org/jira/browse/SOLR-7139
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 4.10.3
>            Reporter: Chris A. Mattmann
>            Assignee: Uwe Schindler
>            Priority: Critical
>             Fix For: 4.10.4, 5.1, 5.0.1
>
>         Attachments: SOLR-7139.Mattmann.022115.patch.txt
>
>
> While testing my large scale Tika/SolrCell indexing (great work on 
> /extraction guys, really really appreciate it) on my 40M image dataset, I was 
> pulling my frickin' hair out trying to figure out why the TesseractOCR 
> extracted content wasn't actually making it into the index. Well I figured it 
> out lol (many many System.out.printlns later) - it's the disabling of div 
> tags (=>ignored) in the default solrconfig.xml. This basically renders 
> TesseractOCR output in SolrCell useless since it is surrounded by a div tag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
