Re: charfilter doesn't do anything

Andreas Owen Mon, 09 Sep 2013 16:07:02 -0700

I tried, but that isn't working either; it wants a data stream. I'll have to 
check how to post JSON instead of XML.


On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

> Did you at least try the pattern I gave you?
> 
> The point of the curl was the data, not how you send the data. You can just 
> use the standard Solr simple post tool.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Andreas Owen
> Sent: Monday, September 09, 2013 6:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> I've downloaded curl and tried it in the command prompt and PowerShell on my 
> Win 2008 R2 server. That's why I used my data importer with a single-line HTML 
> file and copy/pasted the lines into schema.xml.
> 
> 
> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
> 
>> Did you in fact try my suggested example? If not, please do so.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Andreas Owen
>> Sent: Monday, September 09, 2013 4:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> I index HTML pages with a lot of lines, not just a string with the 
>> body tag.
>> It doesn't work with proper HTML files, even though I took all the newlines 
>> out.
>> 
>> html-file:
>> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
>> 
>> solr update debug output:
>> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" 
>> content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; 
>> charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das 
>> will ich sehenfooter-content</body></html>"]
>> 
>> 
>> 
>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>> 
>>> I tried this and it seems to work when added to the standard Solr example 
>>> in 4.4:
>>> 
>>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>> 
>>> <fieldType name="text_html_body" class="solr.TextField" 
>>> positionIncrementGap="100" >
>>> <analyzer>
>>> <charFilter class="solr.PatternReplaceCharFilterFactory" 
>>> pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>>> <tokenizer class="solr.StandardTokenizerFactory"/>
>>> <filter class="solr.LowerCaseFilterFactory"/>
>>> </analyzer>
>>> </fieldType>
>>> 
>>> That char filter retains only text between <body> and </body>. Is that what 
>>> you wanted?
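
A note on multi-line input: in Java regular expressions, which this char filter uses, "." does not match line terminators by default. With the pattern above, "^.*" therefore cannot reach a <body> tag that sits on a later line, so the pattern matches nothing and nothing is replaced. A variant that enables DOTALL through the inline (?s) flag might look like the sketch below. It assumes the field really receives the raw multi-line HTML, the field type name text_html_body_multiline is made up for illustration, and it has not been tested against a running Solr.

```xml
<!-- Sketch only: same analyzer as above, with (?s) so that "." also matches \r and \n.
     The name "text_html_body_multiline" is hypothetical. -->
<fieldType name="text_html_body_multiline" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Like any analysis-chain change, this would only affect the indexed terms, not the stored value returned in search results.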
>>> 
>>> Indexing this data:
>>> 
>>> curl 'localhost:8983/solr/update?commit=true' -H 
>>> 'Content-type:application/json' -d '
>>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>> 
>>> And querying with these commands:
>>> 
>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>>> Shows all data
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>>> Shows the body text
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>>> Shows nothing (outside of body)
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>>> Shows nothing (outside of body)
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>>> Shows nothing, HTML tag stripped
>>> 
>>> In your original query, you didn't show us what your default field, df 
>>> parameter, was.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -----Original Message----- From: Andreas Owen
>>> Sent: Sunday, September 08, 2013 5:21 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>> 
>>> Yes, but that strips all HTML tags rather than isolating the specific tag I want.
>>> 
>>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>> 
>>>> Hmmm, have you looked at:
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>> 
>>>> Not quite the <body>, perhaps, but might it help?
>>>> 
>>>> 
>>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
>>>> 
>>>>> OK, I have HTML pages with <html>.....<!--body-->content I
>>>>> want....<!--/body-->.....</html>. I want to extract (index, store) only
>>>>> what is between the body comments. I thought RegexTransformer would be the
>>>>> best because XPath doesn't work in Tika and I can't nest an
>>>>> XPathEntityProcessor to use XPath. What I have also found out is that the
>>>>> HTML parser from Tika cuts my body comments out and tries to make
>>>>> well-formed HTML, which I would like to switch off.
>>>>> 
>>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>> 
>>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>>> I've managed to get it working if I use the RegexTransformer and the string
>>>>>>> is on the same line in my Tika entity. But when the string is multi-line it
>>>>>>> isn't working, even though I tried ?s to set the DOTALL flag.
>>>>>>> 
>>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>>>>  dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>>>>>  transformer="RegexTransformer">
>>>>>>>  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>>>>   replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>>> </entity>
>>>>>>> 
>>>>>>> Then I tried it like this and I get a stack overflow:
>>>>>>> 
>>>>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>>>>  replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>>>>>>> 
>>>>>>> In JavaScript this works, but maybe only because I used a small string.
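
On the stack overflow: Java's regex engine evaluates a repeated group that contains an alternation, such as (.|\n|\r)+, recursively, so a long page can exhaust the stack. An inline (?s) at the very start of the pattern lets a plain ".+" span line breaks without the alternation. The sketch below is the same field as above with only the regex changed; it is untested and assumes the source column really holds the multi-line HTML.

```xml
<!-- Sketch only: (?s) turns on DOTALL, so ".+" can cross \r and \n. -->
<field column="text_html" regex="(?s)&lt;body&gt;(.+)&lt;/body&gt;"
       replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
```

If this still fails to match, the markup reaching the transformer may differ from the original file, since the Tika HTML parser rewrites the page.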
>>>>>> 
>>>>>> Sounds like we've got an XY problem here.
>>>>>> 
>>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>> 
>>>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>>>> and then we can find a solution for you?
>>>>>> 
>>>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>>>> 
>>>>>> 
>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>> 
>>>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>>>> be able to search for individual words on your HTML input.  The entire
>>>>>> input string is treated as a single token, and therefore ONLY exact
>>>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>>> 
>>>>>> 
>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>> 
>>>>>> Note that no matter what you do to your data with the analysis chain,
>>>>>> Solr will always return the text that was originally indexed in search
>>>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>>>> need an Update Processor.
>>>>>> 
>>>>>> Thanks,
>>>>>> Shawn
