I've downloaded curl and tried it in the command prompt and PowerShell on my Win 2008 R2 server; I couldn't get it to work, which is why I used my data importer with a single-line HTML file and copy/pasted the lines into schema.xml.
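For what it's worth, a likely reason curl misbehaved in the command prompt (my assumption; the thread doesn't show the exact failure): cmd.exe doesn't treat single quotes as quoting characters, so Jack's `-d '...'` example has to be requoted with double quotes, escaping the inner ones:

```
curl "localhost:8983/solr/update?commit=true" -H "Content-type:application/json" -d "[{\"id\":\"doc-1\",\"body\":\"abc <body>A test.</body> def\"}]"
```

Double-quoting the query URLs (as Jack's examples already do) also keeps cmd.exe from treating the `&` in them as a command separator. PowerShell does accept single quotes, but its escaping rules for arguments passed to native executables differ again.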
On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

> Did you in fact try my suggested example? If not, please do so.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Andreas Owen
> Sent: Monday, September 09, 2013 4:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> I index HTML pages with a lot of lines, not just a string with the body tag.
> It doesn't work with proper HTML files, even though I took all the new lines out.
>
> html-file:
> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
>
> solr update debug output:
> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]
>
> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>
>> I tried this and it seems to work when added to the standard Solr example in 4.4:
>>
>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>
>> <fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer>
>>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>>       pattern="^.*<body>(.*)</body>.*$" replacement="$1" />
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> That char filter retains only text between <body> and </body>. Is that what you wanted?
>>
>> Indexing this data:
>>
>> curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>
>> And querying with these commands:
>>
>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>> shows all data
>>
>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>> shows the body text
>>
>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>> shows nothing (outside of body)
>>
>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>> shows nothing (outside of body)
>>
>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>> shows nothing, HTML tag stripped
>>
>> In your original query, you didn't show us what your default field (df parameter) was.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Andreas Owen
>> Sent: Sunday, September 08, 2013 5:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>>
>> Yes, but that filters out HTML generally, not just the specific tag I want.
>>
>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>
>>> Hmmm, have you looked at:
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>
>>> Not quite the <body>, perhaps, but might it help?
>>>
>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
>>>
>>>> OK, I have HTML pages like <html>.....<!--body-->content I want....<!--/body-->.....</html>, and I want to extract (index, store) only what is between the body comments. I thought regexTransformer would be best, because XPath doesn't work in Tika and I can't nest an XPathEntityProcessor to use XPath. What I have also found out is that the HTML parser from Tika cuts my body comments out and tries to produce well-formed HTML, which I would like to switch off.
>>>>
>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>
>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>> I've managed to get it working if I use the regexTransformer and the string is on the same line in my tika entity. But when the string is multi-line it doesn't work, even though I tried ?s to set the DOTALL flag.
>>>>>>
>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>>>   dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>>>>   transformer="RegexTransformer">
>>>>>>   <field column="text_html" regex="<body>(.+)</body>"
>>>>>>     replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>> </entity>
>>>>>>
>>>>>> Then I tried it like this and I get a stack overflow:
>>>>>>
>>>>>> <field column="text_html" regex="<body>((.|\n|\r)+)</body>"
>>>>>>   replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>
>>>>>> In JavaScript this works, but maybe only because I used a small string.
>>>>>
>>>>> Sounds like we've got an XY problem here.
>>>>>
>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>
>>>>> How about you tell us *exactly* what you'd actually like to have happen, and then we can find a solution for you?
>>>>>
>>>>> It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter?
>>>>>
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>
>>>>> Something that I already said: By using the KeywordTokenizer, you won't be able to search for individual words on your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible.
>>>>>
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>
>>>>> Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor.
>>>>>
>>>>> Thanks,
>>>>> Shawn
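The regex behavior discussed in this thread can be checked outside Solr: PatternReplaceCharFilterFactory and the DIH RegexTransformer both run on java.util.regex, so a plain-Java sketch (class name and sample inputs are mine, not from the thread) shows why the pattern matches single-line input but silently passes multi-line input through, and how (?s) with reluctant groups avoids both the non-match and the deep backtracking that ((.|\n|\r)+) can trigger:

```java
// Sketch only: uses the same java.util.regex engine that Solr's
// PatternReplaceCharFilterFactory and DIH RegexTransformer rely on.
public class BodyRegexDemo {

    // Jack's pattern: fine for single-line input, but '.' does not match
    // line terminators by default, so it never matches a multi-line file
    // and the input comes through unchanged.
    static String extractDefault(String in) {
        return in.replaceAll("^.*<body>(.*)</body>.*$", "$1");
    }

    // (?s) turns on DOTALL so '.' also matches \r and \n; the reluctant
    // (.*?) groups keep backtracking shallow, unlike ((.|\n|\r)+), which
    // can overflow the stack on large inputs.
    static String extractDotall(String in) {
        return in.replaceAll("(?s)^.*?<body>(.*?)</body>.*$", "$1");
    }

    public static void main(String[] args) {
        String one = "abc <body>A test.</body> def";
        String many = "<html>nav\r\n<body>nur das will ich sehen</body>\r\nfooter</html>";

        System.out.println(extractDefault(one));               // A test.
        System.out.println(extractDefault(many).equals(many)); // true: no match, unchanged
        System.out.println(extractDotall(many));               // nur das will ich sehen
    }
}
```

Note that in a real schema.xml the angle brackets in the pattern attribute would need XML-escaping (&lt;body&gt;), since the pattern lives inside an attribute value.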