Ok, I'm getting there now, but as soon as newlines are involved the regex stops 
at the first "\r\n", even if I put [\t\r\n.]* in the pattern. I have to get rid 
of the newlines. Why isn't WhitespaceTokenizerFactory the right element for 
this?
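
Just to make the intent concrete, this is the kind of charFilter I would expect 
to work (untested sketch; the inline (?s) flag is meant to let the dot match 
across "\r\n" as well):

<charFilter class="solr.PatternReplaceCharFilterFactory"
    pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />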


On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:

> Use XML then. Although you will need to escape the XML special characters as 
> I did in the pattern.
> 
> The point is simply: Quickly and simply try to find the simple test scenario 
> that illustrates the problem.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Andreas Owen
> Sent: Monday, September 09, 2013 7:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> I tried, but that isn't working either; it wants a data stream. I'll have to 
> check how to post JSON instead of XML.
> 
> On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:
> 
>> Did you at least try the pattern I gave you?
>> 
>> The point of the curl was the data, not how you send the data. You can just 
>> use the standard Solr simple post tool.
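>> 
>> For example, with the post.jar from example/exampledocs in a standard 4.x 
>> download (untested here; adjust the file name and content type as needed):
>> 
>> java -Durl=http://localhost:8983/solr/update -Dtype=application/json -jar post.jar doc.json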
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Andreas Owen
>> Sent: Monday, September 09, 2013 6:40 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> I've downloaded curl and tried it in the command prompt and PowerShell on my 
>> Win 2008 R2 server; that's why I used my dataimporter with a single-line HTML 
>> file instead and copy/pasted the lines into schema.xml.
>> 
>> 
>> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
>> 
>>> Did you in fact try my suggested example? If not, please do so.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -----Original Message----- From: Andreas Owen
>>> Sent: Monday, September 09, 2013 4:42 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>> 
>>> I index HTML pages with a lot of lines, not just a string with the body tag.
>>> It doesn't work with proper HTML files, even though I took all the newlines 
>>> out.
>>> 
>>> html-file:
>>> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
>>> 
>>> solr update debug output:
>>> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" 
>>> content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; 
>>> charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das 
>>> will ich sehenfooter-content</body></html>"]
>>> 
>>> 
>>> 
>>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>>> 
>>>> I tried this and it seems to work when added to the standard Solr example 
>>>> in 4.4:
>>>> 
>>>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>>> 
>>>> <fieldType name="text_html_body" class="solr.TextField" 
>>>> positionIncrementGap="100" >
>>>> <analyzer>
>>>> <charFilter class="solr.PatternReplaceCharFilterFactory" 
>>>> pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>>>> <tokenizer class="solr.StandardTokenizerFactory"/>
>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>> </analyzer>
>>>> </fieldType>
>>>> 
>>>> That char filter retains only text between <body> and </body>. Is that 
>>>> what you wanted?
>>>> 
>>>> Indexing this data:
>>>> 
>>>> curl 'localhost:8983/solr/update?commit=true' -H 
>>>> 'Content-type:application/json' -d '
>>>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>>> 
>>>> And querying with these commands:
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json";
>>>> Shows all data
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json";
>>>> shows the body text
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json";
>>>> shows nothing (outside of body)
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json";
>>>> shows nothing (outside of body)
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json";
>>>> Shows nothing, HTML tag stripped
>>>> 
>>>> In your original query, you didn't show us what your default field, df 
>>>> parameter, was.
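>>>> 
>>>> For a quick check you can also pass it explicitly on the request, so the 
>>>> query doesn't depend on the configured default, e.g.:
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=test&df=body&indent=true&wt=json"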
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> -----Original Message----- From: Andreas Owen
>>>> Sent: Sunday, September 08, 2013 5:21 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: charfilter doesn't do anything
>>>> 
>>>> Yes, but that filters all HTML, not just the specific tag I want.
>>>> 
>>>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>>> 
>>>>> Hmmm, have you looked at:
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>> 
>>>>> Not quite the <body>, perhaps, but might it help?
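>>>>> 
>>>>> It goes into the analyzer like any other char filter, roughly (untested 
>>>>> sketch):
>>>>> 
>>>>> <analyzer>
>>>>> <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>> <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>>> </analyzer>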
>>>>> 
>>>>> 
>>>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
>>>>> 
>>>>>> Ok, I have HTML pages like <html>.....<!--body-->content I 
>>>>>> want....<!--/body-->.....</html>. I want to extract (index and store) only
>>>>>> what is between the body comments. I thought the RegexTransformer would be
>>>>>> the best fit, because XPath doesn't work in Tika and I can't nest an
>>>>>> XPathEntityProcessor to use XPath. I've also found out that the HTML parser
>>>>>> from Tika cuts my body comments out and tries to produce well-formed HTML,
>>>>>> which I would like to switch off.
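>>>>>> 
>>>>>> What I have in mind with the RegexTransformer is roughly this (untested
>>>>>> sketch; as far as I understand, with a single capturing group and no
>>>>>> replaceWith the transformer puts just that group into the column, and the
>>>>>> comment markers have to be escaped in the config XML):
>>>>>> 
>>>>>> <field column="text_html" sourceColName="text"
>>>>>> regex="(?s)&lt;!--body--&gt;(.*)&lt;!--/body--&gt;" />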
>>>>>> 
>>>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>>> 
>>>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>>>> I've managed to get it working if I use the RegexTransformer and the
>>>>>>>> string is on a single line in my Tika entity, but when the string spans
>>>>>>>> multiple lines it isn't working, even though I tried ?s to set the
>>>>>>>> DOTALL flag.
>>>>>>>> 
>>>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>>>>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>>>>>> transformer="RegexTransformer">
>>>>>>>> <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>>>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>>> </entity>
>>>>>>>> 
>>>>>>>> Then I tried it like this and I get a stack overflow error:
>>>>>>>> 
>>>>>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>>>>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>>> 
>>>>>>>> In JavaScript this works, but maybe only because I used a small string.
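>>>>>>>> 
>>>>>>>> (By "?s" I mean the inline-flag form, i.e.
>>>>>>>> regex="(?s)&lt;body&gt;(.+)&lt;/body&gt;", which should behave like the
>>>>>>>> \n|\r alternation above without the extra nested group.)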
>>>>>>> 
>>>>>>> Sounds like we've got an XY problem here.
>>>>>>> 
>>>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>>> 
>>>>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>>>>> and then we can find a solution for you?
>>>>>>> 
>>>>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>>>>> 
>>>>>>> 
>>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>>> 
>>>>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>>>>> be able to search for individual words on your HTML input.  The entire
>>>>>>> input string is treated as a single token, and therefore ONLY exact
>>>>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>>>> 
>>>>>>> 
>>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>>> 
>>>>>>> Note that no matter what you do to your data with the analysis chain,
>>>>>>> Solr will always return the text that was originally indexed in search
>>>>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>>>>> need an Update Processor.
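>>>>>>> 
>>>>>>> One example of such a chain in solrconfig.xml (an untested sketch; the
>>>>>>> field name is just a placeholder, and this particular processor strips
>>>>>>> all HTML rather than extracting only the body):
>>>>>>> 
>>>>>>> <updateRequestProcessorChain name="strip-html">
>>>>>>>  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
>>>>>>>    <str name="fieldName">text_html</str>
>>>>>>>  </processor>
>>>>>>>  <processor class="solr.LogUpdateProcessorFactory"/>
>>>>>>>  <processor class="solr.RunUpdateProcessorFactory"/>
>>>>>>> </updateRequestProcessorChain>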
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Shawn 
