Markus Klose created SOLR-3976:
----------------------------------

             Summary: HTMLStripTransformer strips the "tika" field not the 
field to index -> cannot have both (stripped and unstripped)
                 Key: SOLR-3976
                 URL: https://issues.apache.org/jira/browse/SOLR-3976
             Project: Solr
          Issue Type: Bug
          Components: contrib - DataImportHandler
    Affects Versions: 3.6
            Reporter: Markus Klose
            Priority: Minor


I ran into a problem while indexing an HTML file with the DataImportHandler and 
got unexpected output. I wanted to create one field with the original 
content and one field with the same content but without HTML markup.

If I enable the HTMLStripTransformer on field text2, the other field (text1) is 
stripped as well.


Example configuration:

<dataConfig>
        <dataSource type="BinFileDataSource" name="bin"/>
        <document>
                <entity name="f" processor="FileListEntityProcessor"
                        recursive="true" rootEntity="false"
                        dataSource="null" baseDir="...." fileName=".*.html"
                        onError="skip">

                        <entity name="tika-test" processor="TikaEntityProcessor"
                                url="${f.fileAbsolutePath}"
                                format="html" dataSource="bin" onError="skip"
                                transformer="HTMLStripTransformer,TemplateTransformer">

                                <field column="id" template="${f.file}"/>

                                <field column="text" name="text1"/>
                                <field column="text" name="text2" stripHTML="true"/>
                        </entity>
                </entity>
        </document>
</dataConfig>
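
A possible workaround, sketched here and not tested: it assumes that HTMLStripTransformer modifies the row value identified by the "column" attribute (which would explain why text1 and text2, both reading column "text", end up stripped), and that TemplateTransformer can copy the current row's "text" value into a separate column before stripping runs. The column name "text_plain" is hypothetical, and the changed transformer order is part of the assumption. Only the inner entity is shown; it would replace the "tika-test" entity in the configuration above.

```xml
<!-- Untested sketch. Assumption: stripHTML acts on the row value named by "column",
     so the stripped copy gets its own column. "text_plain" is a made-up column name. -->
<entity name="tika-test" processor="TikaEntityProcessor"
        url="${f.fileAbsolutePath}"
        format="html" dataSource="bin" onError="skip"
        transformer="TemplateTransformer,HTMLStripTransformer">

        <field column="id" template="${f.file}"/>

        <!-- Original content with HTML markup, left untouched. -->
        <field column="text" name="text1"/>

        <!-- TemplateTransformer runs first and copies "text" into "text_plain";
             HTMLStripTransformer then strips only that copy. -->
        <field column="text_plain" name="text2"
               template="${tika-test.text}" stripHTML="true"/>
</entity>
```

If this behaves as assumed, text1 keeps the markup and text2 receives the stripped copy. Whether the variable ${tika-test.text} resolves as intended in 3.6 would need to be verified.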
