Alexandre Rafalovitch created SOLR-4530:
-------------------------------------------

             Summary: DIH: Provide configuration to use Tika's IdentityHtmlMapper
                 Key: SOLR-4530
                 URL: https://issues.apache.org/jira/browse/SOLR-4530
             Project: Solr
          Issue Type: Improvement
          Components: contrib - DataImportHandler
    Affects Versions: 4.1
            Reporter: Alexandre Rafalovitch
            Priority: Minor
             Fix For: 4.2


When using TikaEntityProcessor in DIH, the default HTML Mapper strips out most 
of the HTML. That makes sense when the goal is just to store the extracted 
content as a text blob, but DIH allows more fine-tuned content extraction 
(e.g. with a nested XPathEntityProcessor), which needs the markup intact.

Recent Tika versions allow setting an alternative HTML Mapper implementation 
that passes all the HTML through. It would be useful to be able to select that 
implementation from the DIH configuration.
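
To make the request concrete, here is a minimal sketch of the selection logic 
being asked for. It is a simplified stand-in, not Tika's or DIH's actual code: 
the nested Mapper interface, the SAFE element list, and the "default" / 
"identity" configuration values are all assumptions made for illustration. 
With the real library, the equivalent step would be registering the mapper on 
the parse context, i.e. 
context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE).

```java
import java.util.Locale;
import java.util.Set;

class HtmlMapperSketch {

    // Simplified stand-in for Tika's HtmlMapper contract (NOT the real interface).
    interface Mapper {
        // Returns the normalized element name to emit, or null to drop the tag.
        String mapSafeElement(String name);
    }

    // Hypothetical subset of "safe" elements; the real default mapper has its own list.
    private static final Set<String> SAFE = Set.of("p", "h1", "h2", "a", "table");

    // Models the default behaviour: only elements on the safe list survive.
    static final Mapper DEFAULT = name -> {
        String lower = name.toLowerCase(Locale.ENGLISH);
        return SAFE.contains(lower) ? lower : null;
    };

    // Models the identity behaviour: every element is passed through, lower-cased.
    static final Mapper IDENTITY = name -> name.toLowerCase(Locale.ENGLISH);

    // Picks a mapper from a configuration value (the value names are assumptions).
    static Mapper resolve(String configValue) {
        if (configValue == null || configValue.equalsIgnoreCase("default")) {
            return DEFAULT;
        }
        if (configValue.equalsIgnoreCase("identity")) {
            return IDENTITY;
        }
        throw new IllegalArgumentException("Unknown HTML mapper: " + configValue);
    }

    public static void main(String[] args) {
        // Compare what each mapper does with a few element names.
        for (String tag : new String[] {"P", "DIV", "SPAN"}) {
            System.out.println(tag
                    + " default=" + resolve("default").mapSafeElement(tag)
                    + " identity=" + resolve("identity").mapSafeElement(tag));
        }
    }
}
```

With the identity variant selected, elements such as div and span would reach 
a nested XPathEntityProcessor instead of being dropped, which is the point of 
exposing the choice in the DIH configuration.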

