Re: how to present html content in browse

Lance Norskog Fri, 04 May 2012 19:04:15 -0700

You need positions and offsets to do highlighting. A CharFilter does
not preserve positions.


I think you have to analyze the raw HTML with a separate analyzer, in
addition to the stripping one. I think this is how it works: define a
new analyzer stack that uses the StandardTokenizer, the lower case
filter, and the stemmer/synonym filters etc. Then store the HTML field
with that text type. You then search on the stripped field, but
highlight from the raw field with 'hl.fl'.

Here's the cool part: you do not actually need to index the raw HTML,
only store it. If you do not index a field, the Highlighter analyzes
the HTML when it needs the positions and offsets.
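
A minimal, untested schema.xml sketch of that setup, assuming the field
names "text_html" and "text_stripped" used elsewhere in this thread. The
type name "text_html_raw" is made up for illustration, and the
"text_stripped" type is assumed to be the HTML-stripping field type
quoted further down. Unlike the plain "string" type suggested earlier,
the raw field here gets a text type so the Highlighter has an analyzer
to run over the stored HTML. The filters in the raw type should mirror
whatever the search field uses.

```xml
<!-- schema.xml fragment (sketch; type and field names are assumptions) -->

<!-- Analyzer the highlighter would use when it re-analyzes the stored HTML.
     No HTMLStripCharFilter here: the raw markup is tokenized as-is. -->
<fieldType name="text_html_raw" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Raw HTML: stored for display and highlighting, not indexed. -->
<field name="text_html" type="text_html_raw" indexed="false" stored="true"/>

<!-- Stripped text: indexed for searching, not stored. -->
<field name="text_stripped" type="text_stripped" indexed="true" stored="false"/>

<!-- Feed the same input to both fields so the page is submitted only once. -->
<copyField source="text_html" dest="text_stripped"/>
```

A request along the lines of q=text_stripped:solr&hl=true&hl.fl=text_html
would then search the stripped field and ask for highlights from the raw
one. The query term is only an example.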

On Fri, May 4, 2012 at 2:25 PM, okayndc <bodymo...@gmail.com> wrote:
> Okay, thanks for the info.
>
> On Fri, May 4, 2012 at 4:42 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> Evidently there was a problem with highlighting of HTML that is supposedly
>> fixed in Solr 3.6 and trunk:
>>
>> https://issues.apache.org/jira/browse/SOLR-42
>>
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: okayndc
>> Sent: Friday, May 04, 2012 4:35 PM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: how to present html content in browse
>>
>> Is it possible to return the HTML field highlighted?
>>
>> On Fri, May 4, 2012 at 1:27 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>>
>>> 1. The raw html field (call it, "text_html") would be a "string" type
>>> field that is "stored" but not "indexed". This is the field you direct DIH
>>> to output to. This is the field you would return in your search results
>>> with the HTML to be displayed.
>>>
>>> 2. The stripped field (call it, "text_stripped") would be a "text" type
>>> field (where "text" is a field type you add that uses the HTML strip char
>>> filter as shown below) that is not "stored" but is "indexed". Add a
>>> copyField to your schema that copies from the raw html field to the
>>> stripped field (say, "text_html" to "text_stripped").
>>>
>>> For reference on HTML strip (HTMLStripCharFilterFactory), see:
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>>
>>>
>>> Which has:
>>>
>>> <fieldtype name="text" class="solr.TextField">
>>>   <analyzer>
>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>     <charFilter class="solr.MappingCharFilterFactory"
>>>                 mapping="mapping-ISOLatin1Accent.txt"/>
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory"/>
>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>   </analyzer>
>>> </fieldtype>
>>>
>>> Although, you might want to call that field type "text_stripped" to avoid
>>> confusion with a simple text field.
>>>
>>> You can add HTMLStripCharFilterFactory to some other field type that you
>>> might want to use, but this "charFilter" needs to be before the
>>> "tokenizer". The "text" field type above is just an example.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message-----
>>> From: okayndc
>>> Sent: Friday, May 04, 2012 1:01 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: how to present html content in browse
>>>
>>>
>>> Hello,
>>>
>>> I'm having a hard time understanding this, and I had this same question.
>>>
>>> When using DIH should the HTML field be stored in the raw HTML string
>>> field
>>> or the stripped field?
>>> Also what source field(s) need to be copied and to what destination?
>>>
>>> Thanks
>>>
>>>
>>> On Thu, May 3, 2012 at 10:15 PM, Lance Norskog <goks...@gmail.com> wrote:
>>>
>>>> Make two fields, one which stores the stripped HTML and another that
>>>> stores the parsed HTML. You can use <copyField> so that you do not
>>>> have to submit the html page twice.
>>>>
>>>> You would mark the stripped field 'indexed=true stored=false' and the
>>>> full text field the other way around. The full text field should be a
>>>> String type.
>>>>
>>>> On Thu, May 3, 2012 at 1:04 PM, srini <softtec...@gmail.com> wrote:
>>>> > I am indexing records from database using DIH. The content of my record
>>>> > is in html format. When I use browse I would like to show the content
>>>> > in html format, not in text format. Any ideas?
>>>> >
>>>> > --
>>>> > View this message in context:
>>>> > http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html
>>>> > Sent from the Solr - User mailing list archive at Nabble.com.
>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goks...@gmail.com
>>>>
>>>>
>>>>
>>>
>>



-- 
Lance Norskog
goks...@gmail.com
