Hi David,

I have many meta-tags in my html documents, like <meta name="tarih" content="2019-10-15T23:59:59Z">, which match the field definitions in the schema file.

As I understand it, you propose to index the whole html document as one text file and map it to a search field (do you?). That would take care of the html highlight issue, but I would lose the field information coming from the meta-tags.

So is it possible to index the html document as an html document (preserving the field data coming from the meta-tags and not stripping the html tags)? Then I could use solr.HTMLStripCharFilterFactory for analysis.

Thank you,

Serkan

-----Original Message-----
From: David Smiley [mailto:dsmi...@apache.org]
Sent: Sunday, May 24, 2020 5:26 PM
To: solr-user
Subject: Re: highlighting a whole html document using Unified highlighter

Instead of stripping the HTML for the stored value, leave it be and remove it during the analysis stage with solr.HTMLStripCharFilterFactory
<https://builds.apache.org/job/Solr-reference-guide-master/javadoc/charfilterfactories.html#solr-htmlstripcharfilterfactory>
This means the searchable text will only be the visible text, basically. And the highlighter will only highlight what's searchable.

I suggest doing some experimentation, searching for words that you know are directly adjacent (no spaces) to opening and closing tags, to make sure that the inserted HTML markup for the highlight balances correctly. Use a "phrase query" (quoted) as well, and see if you can highlight around markup like "phrase</p>query" to see what happens. You might need to set hl.weightMatches=false to ensure the words are highlighted separately.

I suspect you will find there is a problem, and the root cause is here: LUCENE-5734 <https://issues.apache.org/jira/browse/LUCENE-5734>
It's on my long TODO list but hasn't bitten me lately so I've neglected it.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Sun, May 24, 2020 at 7:20 AM Serkan KAZANCI <ser...@kazanci.com.tr> wrote:

> Thanks Jörn for the answer,
>
> I use the post tool to index html documents, so the html tags are stripped
> when indexed and stored. The remaining text is mapped to the field content
> by default.
>
> hl.fragsize=0 works perfectly for the indexed document, but I can only
> display a highlighted text-only version of the html document because the
> html tags are stripped.
>
> So is it possible to index and store the html document without stripping
> the html tags, so that when the document is displayed with hl.fragsize=0
> parameter, it is displayed as the original html document?
>
> Or
>
> Is it possible to give a whole html document as a parameter to the Unified
> highlighter so that the output is also a highlighted html document?
>
> Or
>
> Do you have a better idea to highlight the keywords of the whole html
> document?
>
> Thanks,
>
> Serkan
>
> -----Original Message-----
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: Sunday, May 24, 2020 1:22 PM
> To: solr-user@lucene.apache.org
> Subject: Re: highlighting a whole html document using Unified highlighter
>
> hl.fragsize=0
>
> https://lucene.apache.org/solr/guide/8_5/highlighting.html
>
> > On 24.05.2020 at 11:49, Serkan KAZANCI <ser...@kazanci.com.tr> wrote:
> >
> > Hi,
> >
> > I use solr to search over a million html documents. When a document is
> > searched and displayed, I want to highlight the keywords that are used to
> > find and access the document.
> >
> > Unified highlighter is fast, accurate and supports different languages but
> > only highlights passages with given parameters.
> >
> > How can I highlight a whole html document using Unified highlighter? I have
> > written some php code but it cannot do the complex word stemming functions.
> >
> > Thanks,
> >
> > Serkan
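
The experiment suggested in this thread (a quoted phrase query with hl.fragsize=0 and hl.weightMatches=false against a field that keeps the raw HTML as its stored value) might be scripted roughly as below. This is only a sketch under assumptions: the core URL and the field name content_html are hypothetical, and the code only builds the request URL rather than sending it. hl.fragsize and hl.weightMatches are the parameters named in the thread; hl, hl.method and hl.fl are standard Solr highlighting parameters.

```python
from urllib.parse import urlencode

# Hypothetical values: replace with your own core URL and field name.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"
HTML_FIELD = "content_html"  # assumed field that stores the unstripped HTML


def build_highlight_params(query, field=HTML_FIELD, weight_matches=False):
    """Return Solr request parameters that ask for whole-field highlighting."""
    return {
        "q": f"{field}:{query}",
        "hl": "true",
        "hl.method": "unified",   # use the Unified highlighter
        "hl.fl": field,           # field to highlight
        "hl.fragsize": "0",       # 0 = no fragmenting, highlight the whole value
        "hl.weightMatches": "true" if weight_matches else "false",
    }


def build_highlight_url(query, **kwargs):
    """Return the full select URL with URL-encoded highlighting parameters."""
    return SOLR_SELECT_URL + "?" + urlencode(build_highlight_params(query, **kwargs))


if __name__ == "__main__":
    # Quoted phrase query; try words that sit on either side of a tag,
    # such as "phrase</p>query" in the stored HTML.
    print(build_highlight_url('"phrase query"'))
```

Requesting the printed URL once as is and once with weight_matches=True should show whether the inserted highlight tags stay balanced around markup in the returned HTML.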