Re: charfilter doesn't do anything

Andreas Owen Wed, 11 Sep 2013 01:20:31 -0700

perfect, i tried it before but always at the tail of the expression with no 
effect. thanks a lot. a last question, do you know how to keep the html 
comments from being filtered before the transformer has done its work?



On 10. Sep 2013, at 3:17 PM, Jack Krupansky wrote:

> Okay, I can repro the problem. Yes, it appears that the pattern replace char 
> filter does not default to dotall mode for pattern matching, so <body> on 
> one line and </body> on another line cannot be matched.
> 
> Now, whether that is by design or a bug or an option for enhancement is a 
> matter for some committer to comment on.
> 
> But, the good news is that you can in fact set dotall mode in your pattern 
> by starting it with "(?s)", which means that dot accepts line break 
> characters as well.
> 
> So, here are my revised field types:
> 
> <fieldType name="text_html_body" class="solr.TextField" 
> positionIncrementGap="100" >
> <analyzer>
>   <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldType>
> 
> <fieldType name="text_html_body_strip" class="solr.TextField" 
> positionIncrementGap="100" >
> <analyzer>
>   <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>   <charFilter class="solr.HTMLStripCharFilterFactory" />
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldType>
> 
> The first type accepts everything within <body>, including nested HTML 
> formatting, while the latter strips nested HTML formatting as well.
> 
> The tokenizer will in fact strip out white space, but that happens after all 
> character filters have completed.
> 
> -- Jack Krupansky
> 
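The effect of the leading "(?s)" can be tried outside Solr. The following is only a minimal sketch: it uses Python's re module as a stand-in for the Java regex engine that Solr uses (an assumption; for this particular pattern the two should behave the same way), and the sample strings and the helper name keep_body are made up for illustration.

```python
import re

# Same pattern as in the field types above, with the XML escapes
# (&lt; and &gt;) decoded back to < and >.
PATTERN_NO_FLAG = r"^.*<body>(.*)</body>.*$"
# "(?s)" at the very start switches on dotall mode for the whole pattern.
PATTERN_DOTALL = r"(?s)" + PATTERN_NO_FLAG


def keep_body(pattern, text):
    """Replace the whole match with capture group 1, like replacement="$1"."""
    return re.sub(pattern, r"\1", text)


# Hypothetical test inputs, not taken from a real index.
single_line = "abc <body>A test.</body> def"
multi_line = "<html>\r\nnav\r\n<body>\r\nA test.\r\n</body>\r\nfooter</html>"

print(repr(keep_body(PATTERN_NO_FLAG, single_line)))  # body text only
print(repr(keep_body(PATTERN_NO_FLAG, multi_line)))   # unchanged: no match
print(repr(keep_body(PATTERN_DOTALL, multi_line)))    # body text with its line breaks
```

Without the flag the multi-line input comes back untouched, because the dot stops at the first line break and the pattern never reaches <body>. With the flag, everything outside the body element is dropped.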
> -----Original Message----- From: Andreas Owen
> Sent: Tuesday, September 10, 2013 7:07 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> ok i am getting there now but if there are newlines involved the regex stops 
> as soon as it reaches a "\r\n" even if i try [\t\r\n.]* in the regex. I have 
> to get rid of the newlines. why isn't whitespaceTokenizerFactory the right 
> element for this?
> 
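A side note on the [\t\r\n.]* attempt in the message above: inside a character class the dot is a literal ".", so that class only matches tabs, carriage returns, line feeds and full stops. A small sketch, again using Python's re module as a stand-in for the Java regex engine (an assumption), with made-up sample text and a hypothetical helper matches_fully:

```python
import re

# Inside [...] the dot loses its "any character" meaning and is a literal ".".
LITERAL_DOT_CLASS = r"[\t\r\n.]*"
# [\s\S] is a common way to write "any character, line breaks included".
ANY_CHAR_CLASS = r"[\s\S]*"

# Hypothetical sample text with a Windows line break.
sample = "line one\r\nline two"


def matches_fully(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


print(matches_fully(LITERAL_DOT_CLASS, sample))       # letters are not in the class
print(matches_fully(LITERAL_DOT_CLASS, "..\r\n\t."))  # only dots and whitespace
print(matches_fully(ANY_CHAR_CLASS, sample))          # any character is accepted
```

This is why the regex stopped at the first ordinary character after the line break: the class never allowed letters in the first place.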
> 
> On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:
> 
>> Use XML then. Although you will need to escape the XML special characters as 
>> I did in the pattern.
>> 
>> The point is simply: Quickly and simply try to find the simple test scenario 
>> that illustrates the problem.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Andreas Owen
>> Sent: Monday, September 09, 2013 7:05 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> i tried but that isn't working either, it wants a data stream, i'll have to 
>> check how to post json instead of xml
>> 
>> On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:
>> 
>>> Did you at least try the pattern I gave you?
>>> 
>>> The point of the curl was the data, not how you send the data. You can just 
>>> use the standard Solr simple post tool.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -----Original Message----- From: Andreas Owen
>>> Sent: Monday, September 09, 2013 6:40 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>> 
>>> i've downloaded curl and tried it in the command prompt and power shell on 
>>> my win 2008r2 server, that's why i used my dataimporter with a single line 
>>> html file and copy/pasted the lines into schema.xml
>>> 
>>> 
>>> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
>>> 
>>>> Did you in fact try my suggested example? If not, please do so.
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> -----Original Message----- From: Andreas Owen
>>>> Sent: Monday, September 09, 2013 4:42 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: charfilter doesn't do anything
>>>> 
>>>> i index html pages with a lot of lines and not just a string with the 
>>>> body-tag.
>>>> it doesn't work with proper html files, even though i took all the new 
>>>> lines out.
>>>> 
>>>> html-file:
>>>> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
>>>> 
>>>> solr update debug output:
>>>> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" 
>>>> content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" 
>>>> content=\"text/html; 
>>>> charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das 
>>>> will ich sehenfooter-content</body></html>"]
>>>> 
>>>> 
>>>> 
>>>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>>>> 
>>>>> I tried this and it seems to work when added to the standard Solr example 
>>>>> in 4.4:
>>>>> 
>>>>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>>>> 
>>>>> <fieldType name="text_html_body" class="solr.TextField" 
>>>>> positionIncrementGap="100" >
>>>>> <analyzer>
>>>>> <charFilter class="solr.PatternReplaceCharFilterFactory" 
>>>>> pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>>>>> <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>>> </analyzer>
>>>>> </fieldType>
>>>>> 
>>>>> That char filter retains only text between <body> and </body>. Is that 
>>>>> what you wanted?
>>>>> 
>>>>> Indexing this data:
>>>>> 
>>>>> curl 'localhost:8983/solr/update?commit=true' \
>>>>>   -H 'Content-type:application/json' -d '
>>>>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>>>> 
>>>>> And querying with these commands:
>>>>> 
>>>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>>>>> Shows all data
>>>>> 
>>>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>>>>> shows the body text
>>>>> 
>>>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>>>>> shows nothing (outside of body)
>>>>> 
>>>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>>>>> shows nothing (outside of body)
>>>>> 
>>>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>>>>> Shows nothing, HTML tag stripped
>>>>> 
>>>>> In your original query, you didn't show us what your default field, df 
>>>>> parameter, was.
>>>>> 
>>>>> -- Jack Krupansky
>>>>> 
>>>>> -----Original Message----- From: Andreas Owen
>>>>> Sent: Sunday, September 08, 2013 5:21 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: charfilter doesn't do anything
>>>>> 
>>>>> yes but that filter html and not the specific tag i want.
>>>>> 
>>>>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>>>> 
>>>>>> Hmmm, have you looked at:
>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>> 
>>>>>> Not quite the <body>, perhaps, but might it help?
>>>>>> 
>>>>>> 
>>>>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
>>>>>> 
>>>>>>> ok i have html pages with <html>.....<!--body-->content i
>>>>>>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>>>>>>> that between the body-comments. i thought regexTransformer would be the
>>>>>>> best because xpath doesn't work in tika and i can't nest an
>>>>>>> XPathEntityProcessor to use xpath. what i have also found out is that the
>>>>>>> htmlparser from tika cuts my body-comments out and tries to make well
>>>>>>> formed html, which i would like to switch off.
>>>>>>> 
>>>>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>>>> 
>>>>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>>>>> i've managed to get it working if i use the regexTransformer and string
>>>>>>>>> is on the same line in my tika entity. but when the string is multilined it
>>>>>>>>> isn't working even though i tried ?s to set the flag dotall.
>>>>>>>>> 
>>>>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>>>>>>         dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>>>>>>>         transformer="RegexTransformer">
>>>>>>>>>   <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>>>>>>          replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>>>> </entity>
>>>>>>>>> 
>>>>>>>>> then i tried it like this and i get a stackoverflow
>>>>>>>>> 
>>>>>>>>> <field column="text_html"
>>>>>>>>>        regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>>>>>>        replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>>>> 
>>>>>>>>> in javascript this works but maybe because i only used a small string.
>>>>>>>> 
>>>>>>>> Sounds like we've got an XY problem here.
>>>>>>>> 
>>>>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>>>> 
>>>>>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>>>>>> and then we can find a solution for you?
>>>>>>>> 
>>>>>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>>>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>>>> 
>>>>>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>>>>>> be able to search for individual words on your HTML input.  The entire
>>>>>>>> input string is treated as a single token, and therefore ONLY exact
>>>>>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>>>> 
>>>>>>>> Note that no matter what you do to your data with the analysis chain,
>>>>>>>> Solr will always return the text that was originally indexed in search
>>>>>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>>>>>> need an Update Processor.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Shawn
