Yes, but that filters HTML in general and not the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:

> Hmmm, have you looked at:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>
> Not quite the <body>, perhaps, but might it help?
>
>
> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
>
>> ok i have html pages with <html>.....<!--body-->content i
>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>> that between the body-comments. i thought regexTransformer would be the
>> best because xpath doesn't work in tika and i cant nest a
>> xpathEntetyProcessor to use xpath. what i have also found out is that the
>> htmlparser from tika cuts my body-comments out and tries to make well
>> formed html, which i would like to switch off.
>>
>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>
>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>> i've managed to get it working if i use the regexTransformer and string
>>>> is on the same line in my tika entity. but when the string is multilined it
>>>> isn't working even though i tried ?s to set the flag dotall.
>>>>
>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>         dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>>         transformer="RegexTransformer">
>>>>   <field column="text_html" regex="<body>(.+)</body>"
>>>>          replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>> </entity>
>>>>
>>>> then i tried it like this and i get a stackoverflow
>>>>
>>>> <field column="text_html" regex="<body>((.|\n|\r)+)</body>"
>>>>        replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>
>>>> in javascript this works but maybe because i only used a small string.
>>>
>>> Sounds like we've got an XY problem here.
>>>
>>> http://people.apache.org/~hossman/#xyproblem
>>>
>>> How about you tell us *exactly* what you'd actually like to have happen
>>> and then we can find a solution for you?
>>>
>>> It sounds a little bit like you're interested in stripping all the HTML
>>> tags out. Perhaps the HTMLStripCharFilter?
>>>
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>
>>> Something that I already said: By using the KeywordTokenizer, you won't
>>> be able to search for individual words on your HTML input. The entire
>>> input string is treated as a single token, and therefore ONLY exact
>>> entire-field matches (or certain wildcard matches) will be possible.
>>>
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>
>>> Note that no matter what you do to your data with the analysis chain,
>>> Solr will always return the text that was originally indexed in search
>>> results. If you need to affect what gets stored as well, perhaps you
>>> need an Update Processor.
>>>
>>> Thanks,
>>> Shawn
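
For the multi-line case described in the quoted thread, a minimal sketch of the kind of pattern involved may help. It is written in Python purely for illustration; RegexTransformer uses Java regular expressions, where the inline (?s) flag should behave the same way. It assumes the <!--body--> and <!--/body--> markers are still present in the text being matched, which the thread suggests Tika's HTML parser may not guarantee. The name extract_body and the sample string are made up for the example.

```python
import re
from typing import Optional

# (?s) is the inline DOTALL flag, so "." also matches newlines.
# ".*?" is reluctant, so the match stops at the first closing marker.
# The marker comments are the ones mentioned in the thread.
BODY_RE = re.compile(r"(?s)<!--body-->(.*?)<!--/body-->")


def extract_body(html: str) -> Optional[str]:
    """Return the text between the body marker comments, or None if absent."""
    match = BODY_RE.search(html)
    return match.group(1) if match else None


# Hypothetical multi-line input, not taken from a real page.
sample = "<html><head></head>\n<!--body-->line one\nline two<!--/body-->\n</html>"
print(extract_body(sample))
```

Using (?s) with a single reluctant group avoids the (.|\n|\r)+ alternation. In Java's regex engine a repeated group containing an alternation is matched recursively, which is a likely cause of the stack overflow on long pages.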