You need positions and offsets to do highlighting. A CharFilter does not preserve positions.
I think you have to analyze the raw HTML with a different Analyzer, as well as the stripper. I think this is how it works: use a new Analyzer stack that uses the StandardAnalyzer, and the lower case filter and stemmer/synonym etc. Now, store the HTML field with that text type. You then search on the stripped field, but highlight from the raw field with 'hl.fl'.

Here's the cool part: you do not actually need to index the raw HTML, only store it. If you do not index a field, the Highlighter analyzes the HTML when it needs the positions and offsets.

On Fri, May 4, 2012 at 2:25 PM, okayndc <bodymo...@gmail.com> wrote:
> Okay, thanks for the info.
>
> On Fri, May 4, 2012 at 4:42 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> Evidently there was a problem with highlighting of HTML that is supposedly
>> fixed in Solr 3.6 and trunk:
>>
>> https://issues.apache.org/jira/browse/SOLR-42
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: okayndc
>> Sent: Friday, May 04, 2012 4:35 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: how to present html content in browse
>>
>> Is it possible to return the HTML field highlighted?
>>
>> On Fri, May 4, 2012 at 1:27 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>>
>>> 1. The raw html field (call it, "text_html") would be a "string" type
>>> field that is "stored" but not "indexed". This is the field you direct DIH
>>> to output to. This is the field you would return in your search results
>>> with the HTML to be displayed.
>>>
>>> 2. The stripped field (call it, "text_stripped") would be a "text" type
>>> field (where "text" is a field type you add that uses the HTML strip char
>>> filter as shown below) that is not "stored" but is "indexed". Add a
>>> CopyField to your schema that copies from the raw html field to the
>>> stripped field (say, "text_html" to "text_stripped".)
>>>
>>> For reference on HTML strip (HTMLStripCharFilterFactory), see:
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>>
>>> Which has:
>>>
>>> <fieldtype name="text" class="solr.TextField">
>>>   <analyzer>
>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory"/>
>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>   </analyzer>
>>> </fieldtype>
>>>
>>> Although, you might want to call that field type "text_stripped" to avoid
>>> confusion with a simple text field.
>>>
>>> You can add HTMLStripCharFilterFactory to some other field type that you
>>> might want to use, but this "charFilter" needs to be before the
>>> "tokenizer". The "text" field type above is just an example.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: okayndc
>>> Sent: Friday, May 04, 2012 1:01 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: how to present html content in browse
>>>
>>> Hello,
>>>
>>> I'm having a hard time understanding this, and I had this same question.
>>>
>>> When using DIH should the HTML field be stored in the raw HTML string
>>> field or the stripped field?
>>> Also what source field(s) need to be copied and to what destination?
>>>
>>> Thanks
>>>
>>> On Thu, May 3, 2012 at 10:15 PM, Lance Norskog <goks...@gmail.com> wrote:
>>>
>>>> Make two fields, one which stores the stripped HTML and another that
>>>> stores the parsed HTML. You can use <copyField> so that you do not
>>>> have to submit the html page twice.
>>>>
>>>> You would mark the stripped field 'indexed=true stored=false' and the
>>>> full text field the other way around. The full text field should be a
>>>> String type.
>>>>
>>>> On Thu, May 3, 2012 at 1:04 PM, srini <softtec...@gmail.com> wrote:
>>>> > I am indexing records from database using DIH. The content of my record is in
>>>> > html format. When I use browse
>>>> > I would like to show the content in html format, not in text format. Any
>>>> > ideas?
>>>> >
>>>> > --
>>>> > View this message in context:
>>>> > http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html
>>>> > Sent from the Solr - User mailing list archive at Nabble.com.
>>>>
>>>> --
>>>> Lance Norskog
>>>> goks...@gmail.com

--
Lance Norskog
goks...@gmail.com
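
For reference, the arrangement described in the reply at the top of this message (search on the stripped field, highlight from a raw HTML field that is stored but not indexed) might look roughly like the schema.xml fragment below. This is only a sketch under assumptions: the field type name "text_html_hl" is made up for illustration, the filter chain simply mirrors the one in Jack's example, and "text_stripped" is assumed to be Jack's "text" field type under the name he suggests. It has not been tested against any particular Solr version.

```xml
<!-- Hypothetical field type for the raw HTML field. It is not used for
     indexing here; the highlighter would use it to re-analyze the stored
     text when it needs positions and offsets. -->
<fieldType name="text_html_hl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Raw HTML: stored for display and highlighting, not indexed. -->
<field name="text_html" type="text_html_hl" indexed="false" stored="true"/>

<!-- Stripped text: indexed for searching, not stored. Assumes the
     HTML-stripping field type from Jack's message, named "text_stripped". -->
<field name="text_stripped" type="text_stripped" indexed="true" stored="false"/>

<!-- Feed the stripped field from the raw field so the page is submitted once. -->
<copyField source="text_html" dest="text_stripped"/>
```

A request against this sketch would search the stripped field and ask for highlights from the raw one, for example with the parameters q=text_stripped:solr&hl=true&hl.fl=text_html (the query term is arbitrary).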