Alexandre Rafalovitch created SOLR-4530:
-------------------------------------------
Summary: DIH: Provide configuration to use Tika's IdentityHtmlMapper
Key: SOLR-4530
URL: https://issues.apache.org/jira/browse/SOLR-4530
Project: Solr
Issue Type: Improvement
Components: contrib - DataImportHandler
Affects Versions: 4.1
Reporter: Alexandre Rafalovitch
Priority: Minor
Fix For: 4.2
When using TikaEntityProcessor in DIH, the default HTML mapper strips out most
of the HTML. That makes sense when the goal is just to store the extracted
content as a text blob, but DIH allows more fine-grained content extraction
(e.g. with a nested XPathEntityProcessor), which needs the original markup.

Recent Tika versions allow setting an alternative HTML mapper implementation,
IdentityHtmlMapper, that passes all of the HTML through unchanged. It would be
useful to be able to select that implementation from the DIH configuration.
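
To make the request concrete, here is a rough sketch of what such a setting might look like in a DIH data-config file. The htmlMapper attribute and its "identity" value are assumptions made up for illustration; they are not an existing DIH option at the time of this report, and the file path is a placeholder. The other attributes (processor, url, format) and the "text" column are the ones TikaEntityProcessor already uses.

```xml
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <!-- htmlMapper="identity" is a HYPOTHETICAL attribute illustrating the
         proposal: ask Tika to use IdentityHtmlMapper instead of the default
         mapper, so the HTML elements are kept in the extracted output. -->
    <entity name="page"
            processor="TikaEntityProcessor"
            url="/path/to/page.html"
            format="xml"
            htmlMapper="identity">
      <!-- "text" is the column in which TikaEntityProcessor returns the
           extracted body. -->
      <field column="text" name="content"/>
    </entity>
  </document>
</dataConfig>
```

With the markup preserved, the output of this entity could then be handed to a nested XPathEntityProcessor, as described above. On the Tika side, this would presumably amount to registering IdentityHtmlMapper as the HtmlMapper in the ParseContext passed to the parser.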