yes, but that strips html in general; it does not isolate the specific tag i want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:

> Hmmm, have you looked at:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> 
> Not quite the <body>, perhaps, but might it help?
> 
> 
> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
> 
>> ok, i have html pages with <html>.....<!--body-->content i
>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>> the text between the body comments. i thought RegexTransformer would be the
>> best because xpath doesn't work in tika and i can't nest an
>> XPathEntityProcessor to use xpath. what i have also found out is that the
>> html parser from tika cuts my body comments out and tries to make well
>> formed html, which i would like to switch off.
>> 
>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>> 
>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>> i've managed to get it working if i use the RegexTransformer and the
>>>> string is on a single line in my tika entity. but when the string spans
>>>> multiple lines it isn't working, even though i tried ?s to set the
>>>> DOTALL flag.
>>>> 
>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>         dataSource="dataUrl" onError="skip" htmlMapper="identity"
>>>>         format="html" transformer="RegexTransformer">
>>>>     <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>            replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>> </entity>
>>>> 
>>>> then i tried it like this and i get a stack overflow:
>>>> 
>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>        replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>> 
>>>> in javascript this works, but maybe only because i used a small string.
>>> 
>>> Sounds like we've got an XY problem here.
>>> 
>>> http://people.apache.org/~hossman/#xyproblem
>>> 
>>> How about you tell us *exactly* what you'd actually like to have happen
>>> and then we can find a solution for you?
>>> 
>>> It sounds a little bit like you're interested in stripping all the HTML
>>> tags out.  Perhaps the HTMLStripCharFilter?
>>> 
>>> 
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>> 
>>> Something that I already said: By using the KeywordTokenizer, you won't
>>> be able to search for individual words on your HTML input.  The entire
>>> input string is treated as a single token, and therefore ONLY exact
>>> entire-field matches (or certain wildcard matches) will be possible.
>>> 
>>> 
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>> 
>>> Note that no matter what you do to your data with the analysis chain,
>>> Solr will always return the text that was originally indexed in search
>>> results.  If you need to affect what gets stored as well, perhaps you
>>> need an Update Processor.
>>> 
>>> Thanks,
>>> Shawn
>> 
>> 
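
A note on the multi-line match discussed in the quoted thread: the
`(.|\n|\r)+` attempt repeats a group containing an alternation, which
java.util.regex matches recursively, so a long input can exhaust the stack.
A possible alternative is the inline `(?s)` flag (DOTALL) combined with a
reluctant quantifier. The sketch below is plain Java rather than a Solr
config; the class name `BodyCommentExtractor` and the sample input are made
up for illustration, and it assumes the `<!--body-->` / `<!--/body-->`
markers are still present in the text being matched (the thread notes that
Tika's HTML parser may remove them first).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr or Tika.
final class BodyCommentExtractor {
    // (?s) enables DOTALL so '.' also matches line terminators.
    // The reluctant .*? stops at the first closing marker.
    private static final Pattern BODY =
        Pattern.compile("(?s)<!--body-->(.*?)<!--/body-->");

    /** Returns the text between the markers, or null if they are absent. */
    static String extract(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample page with the content spread over two lines.
        String html = "<html>\n<head></head>\n"
            + "<!--body-->line one\nline two<!--/body-->\n</html>";
        System.out.println(extract(html));
    }
}
```

If the same pattern is placed in a DIH `regex` attribute, each `<` has to be
written as `&lt;`, as in the entity definitions quoted above.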
