[ 
https://issues.apache.org/jira/browse/SOLR-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481356#comment-13481356
 ] 

Markus Klose commented on SOLR-3976:
------------------------------------

If this sounds like "help me index an HTML file", I am sorry. I just thought 
it was a bug and should be posted here. Please close if necessary.


We created a workaround with a sub-entity, like this:

<dataConfig>
        <dataSource type="BinFileDataSource" name="bin"/>
        <document>
                <entity name="f" processor="FileListEntityProcessor"
                        recursive="true" rootEntity="false"
                        dataSource="null" baseDir="..." fileName=".*.html"
                        onError="skip" transformer="TemplateTransformer">

                        <entity name="tika-test" processor="TikaEntityProcessor"
                                url="${f.fileAbsolutePath}"
                                format="html" dataSource="bin" onError="skip"
                                transformer="TemplateTransformer,RegexTransformer,DateFormatTransformer,HTMLStripTransformer">

                                <field column="id" template="${f.file}"/>

                                <field column="text" name="text1"/>

                                <entity name="tika2" processor="TikaEntityProcessor"
                                        url="${f.fileAbsolutePath}"
                                        format="html" dataSource="bin" onError="skip"
                                        transformer="TemplateTransformer,HTMLStripTransformer">
                                        <field column="text" name="text2" stripHTML="false"/>
                                </entity>
                        </entity>
                </entity>
        </document>
</dataConfig>
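
To illustrate the behaviour reported in the issue, here is a small Python model. It is only a sketch under assumptions, not the DataImportHandler code: it assumes a transformer rewrites the row value keyed by the source column, and that field names are mapped from that column afterwards. The helpers strip_html, html_strip_transformer and build_document are hypothetical names, and the regex is a crude stand-in for real HTML stripping.

```python
import re


def strip_html(value):
    # Crude tag removal for illustration only; not Solr's stripping logic.
    return re.sub(r"<[^>]+>", "", value)


def html_strip_transformer(row, fields):
    # Assumption: the transformer overwrites row[column] for every field
    # flagged stripHTML="true", so the change is visible to all fields
    # that read from the same column.
    for field in fields:
        if field.get("stripHTML") == "true":
            column = field["column"]
            row[column] = strip_html(row[column])
    return row


def build_document(row, fields):
    # Assumption: target field names are filled from the (already
    # transformed) source column.
    return {f["name"]: row[f["column"]] for f in fields if f["column"] in row}


html = "<p>Hello <b>Solr</b></p>"

# One entity, two fields on the same column: both end up stripped.
shared_fields = [
    {"column": "text", "name": "text1"},
    {"column": "text", "name": "text2", "stripHTML": "true"},
]
shared_doc = build_document(
    html_strip_transformer({"text": html}, shared_fields), shared_fields
)

# Two entities, each with its own row: the values stay independent.
raw_fields = [{"column": "text", "name": "text1"}]
stripped_fields = [{"column": "text", "name": "text2", "stripHTML": "true"}]
split_doc = {}
split_doc.update(
    build_document(html_strip_transformer({"text": html}, raw_fields), raw_fields)
)
split_doc.update(
    build_document(
        html_strip_transformer({"text": html}, stripped_fields), stripped_fields
    )
)

print(shared_doc)
print(split_doc)
```

In this model the single-entity case loses the original markup in text1, while giving each field its own row (as the sub-entity does) keeps the two values separate.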
                
> HTMLStripTransformer strips the "tika" field not the field to index -> cannot 
> have both (stripped and unstripped)
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3976
>                 URL: https://issues.apache.org/jira/browse/SOLR-3976
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>    Affects Versions: 3.6
>            Reporter: Markus Klose
>            Priority: Minor
>
> I ran into a situation where I needed to index an HTML file using the 
> DataImportHandler and got unexpected output. I wanted to create one field 
> with the original content and one field with the same content but without 
> HTML markup.
> If I enable the HTMLStripTransformer on field text2, the other one (text1) is 
> stripped as well.
> Example configuration:
> <dataConfig>
>       <dataSource type="BinFileDataSource" name="bin"/>
>       <document>
>               <entity name="f" processor="FileListEntityProcessor"
>                       recursive="true" rootEntity="false"
>                       dataSource="null" baseDir="...." fileName=".*.html"
>                       onError="skip">
>
>                       <entity name="tika-test" processor="TikaEntityProcessor"
>                               url="${f.fileAbsolutePath}"
>                               format="html" dataSource="bin" onError="skip"
>                               transformer="HTMLStripTransformer,TemplateTransformer">
>
>                               <field column="id" template="${f.file}"/>
>
>                               <field column="text" name="text1"/>
>                               <field column="text" name="text2" stripHTML="true"/>
>                       </entity>
>               </entity>
>       </document>
> </dataConfig>
