On 9/6/2013 7:09 AM, Andreas Owen wrote:
> i've managed to get it working if i use the regexTransformer and string is on
> the same line in my tika entity. but when the string is multilined it isn't
> working even though i tried ?s to set the flag dotall.
>
> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
> transformer="RegexTransformer">
> <field column="text_html" regex="<body>(.+)</body>"
> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
> </entity>
>
> then i tried it like this and i get a stackoverflow
>
> <field column="text_html" regex="<body>((.|\n|\r)+)</body>"
> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>
> in javascript this works but maybe because i only used a small string.
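
On the regex itself, here is a minimal sketch in plain Java, outside of Solr, of how the dotall flag behaves when written inline as (?s) at the start of the pattern. It assumes RegexTransformer passes the regex attribute to java.util.regex unchanged, which is not verified here. The class and method names (DotallSketch, replaceBody, replaceBodySingleLine) are made up for illustration. The (.|\n|\r)+ form is a known way to hit a StackOverflowError on long input in Java, because the engine recurses once per repetition of the alternation group, so the sketch avoids it.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch, not Solr code.
public class DotallSketch {

    // Without (?s), '.' does not match line terminators, so a <body>
    // that spans several lines is never matched.
    private static final Pattern SINGLE_LINE =
            Pattern.compile("<body>(.+)</body>");

    // With the inline (?s) flag, '.' also matches line terminators.
    private static final Pattern DOTALL =
            Pattern.compile("(?s)<body>(.+)</body>");

    // Replaces the whole <body>...</body> block with the placeholder
    // used in the original config.
    static String replaceBody(String html) {
        return DOTALL.matcher(html).replaceAll("QQQQQQQQQQQQQQQ");
    }

    // Same replacement, but with the pattern that lacks the flag.
    static String replaceBodySingleLine(String html) {
        return SINGLE_LINE.matcher(html).replaceAll("QQQQQQQQQQQQQQQ");
    }

    public static void main(String[] args) {
        String multi = "<html><body>line one\nline two</body></html>";
        System.out.println(replaceBodySingleLine(multi)); // input comes back unchanged
        System.out.println(replaceBody(multi));           // body block is replaced
    }
}
```

If the flag works in a test like this but not in the entity config, the difference is more likely in how the config file is read than in the pattern.
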

Sounds like we've got an XY problem here.

http://people.apache.org/~hossman/#xyproblem

How about you tell us *exactly* what you'd actually like to have happen
and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the HTML
tags out. Perhaps the HTMLStripCharFilter?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Something that I already said: By using the KeywordTokenizer, you won't
be able to search for individual words on your HTML input. The entire
input string is treated as a single token, and therefore ONLY exact
entire-field matches (or certain wildcard matches) will be possible.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory

Note that no matter what you do to your data with the analysis chain,
Solr will always return the text that was originally indexed in search
results. If you need to affect what gets stored as well, perhaps you
need an Update Processor.

Thanks,
Shawn