On 9/6/2013 7:09 AM, Andreas Owen wrote:
> i've managed to get it working if i use the regexTransformer and string is on 
> the same line in my tika entity. but when the string is multilined it isn't 
> working even though i tried ?s to set the flag dotall.
> 
> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" 
> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" 
> transformer="RegexTransformer">
>       <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" 
> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
> </entity>
>                       
> then i tried it like this and i get a stackoverflow
> 
> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" 
> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
> 
> in javascript this works but maybe because i only used a small string.

Sounds like we've got an XY problem here.

http://people.apache.org/~hossman/#xyproblem

How about you tell us *exactly* what you'd actually like to have happen
and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the HTML
tags out.  Perhaps the HTMLStripCharFilter?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
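
As a rough sketch, a field type using that char filter might look something
like this in schema.xml.  The type name "text_html_stripped" is made up for
illustration, and the tokenizer and lowercase filter are only a typical
choice, not something your setup requires:

```xml
<!-- Hypothetical field type: the name is an assumption, adjust to taste. -->
<fieldType name="text_html_stripped" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <!-- Removes HTML markup before the text reaches the tokenizer. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Splits the remaining text into individual word tokens. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Makes matching case-insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the char filter runs before the tokenizer, the tags never become
searchable terms, and no regex over the whole document is needed.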

To repeat something I said earlier: By using the KeywordTokenizer, you won't
be able to search for individual words on your HTML input.  The entire
input string is treated as a single token, and therefore ONLY exact
entire-field matches (or certain wildcard matches) will be possible.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory

Note that no matter what you do to your data with the analysis chain,
Solr will always return the original text, exactly as it was sent for
indexing, in search results.  Analysis only changes the indexed terms,
not the stored value.  If you need to affect what gets stored as well,
perhaps you need an Update Processor.
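
A minimal sketch of such a chain in solrconfig.xml, assuming Solr 4.0 or
later, where HTMLStripFieldUpdateProcessorFactory is available.  The chain
name "strip-html" is made up, and "text_html" is simply the field name from
your config, so treat both as placeholders:

```xml
<!-- Hypothetical chain name; the field name is taken from the config above. -->
<updateRequestProcessorChain name="strip-html">
  <!-- Strips HTML markup from the field value before it is stored. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">text_html</str>
  </processor>
  <!-- Standard processors that log and then actually run the update. -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain only takes effect if it is selected, normally with the
update.chain parameter, for example in the defaults of the request handler
that does your indexing.  If you want a regex replacement rather than tag
stripping, solr.RegexReplaceProcessorFactory can be used in the same
position.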

Thanks,
Shawn
