RE: highlighting a whole html document using Unified highlighter

Serkan KAZANCI Sun, 24 May 2020 07:52:30 -0700

Hi David,

I have many meta-tags in the html documents, such as <meta name="tarih" 
content="2019-10-15T23:59:59Z">, which match the field definitions in the 
schema file.

As I understand it, you propose indexing the whole html document as one text 
file and mapping it to a search field (is that right?). That would take care 
of the html highlighting issue, but I would lose the field information coming 
from the meta-tags.

So is it possible to index the html document as an html document, preserving 
the field data coming from the meta-tags and without stripping the html tags?

Then I could use solr.HTMLStripCharFilterFactory for analysis.
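
A minimal sketch of one way to get both, assuming the documents are prepared
on the client side and sent to Solr as JSON instead of going through the post
tool: the meta-tags are read into separate fields and the unmodified HTML goes
into the content field. The names MetaTagParser and build_solr_doc and the id
field are illustrative assumptions, not part of Solr.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.meta[name] = content


def build_solr_doc(doc_id, html):
    """Return a dict with one field per meta-tag plus the raw, unstripped HTML."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    doc = dict(parser.meta)
    doc["id"] = doc_id      # hypothetical unique key field
    doc["content"] = html   # kept as-is; tags would be removed only at analysis time
    return doc


sample = (
    '<html><head><meta name="tarih" content="2019-10-15T23:59:59Z"></head>'
    "<body><p>Hello</p></body></html>"
)
doc = build_solr_doc("doc-1", sample)
```

A dict like this could be posted as JSON to the collection's /update handler.
The content field would still need a field type whose analyzer uses
solr.HTMLStripCharFilterFactory, and a meta-tag named id or content would be
overwritten in this sketch.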

Thank you,

Serkan

-----Original Message-----
From: David Smiley [mailto:dsmi...@apache.org] 
Sent: Sunday, May 24, 2020 5:26 PM
To: solr-user
Subject: Re: highlighting a whole html document using Unified highlighter

Instead of stripping the HTML for the stored value, leave it be and remove
it during the analysis stage with solr.HTMLStripCharFilterFactory
<https://builds.apache.org/job/Solr-reference-guide-master/javadoc/charfilterfactories.html#solr-htmlstripcharfilterfactory>
This means the searchable text will only be the visible text, basically.
And the highlighter will only highlight what's searchable.

I suggest experimenting by searching for words that you know are directly
adjacent (no spaces) to opening and closing tags, to make sure that the
inserted HTML markup for the highlight balances correctly.  Use a
"phrase query" (quoted) as well, and see if you can highlight around markup
like "phrase</p>query" to see what happens.  You might need to set
hl.weightMatches=false to ensure the words are highlighted separately.  I
suspect you will find there is a problem, and the root cause is here:
LUCENE-5734 <https://issues.apache.org/jira/browse/LUCENE-5734>   It's on
my long TODO list but hasn't bitten me lately so I've neglected it.
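
The experiment described above could be set up with request parameters along
these lines. This is only a sketch: the field name content comes from earlier
in the thread, while the helper highlight_params, the host, and the collection
name htmldocs are placeholders.

```python
from urllib.parse import urlencode


def highlight_params(phrase, field="content", weight_matches=False):
    """Build /select parameters for a whole-document Unified highlighter test."""
    return {
        "q": f'{field}:"{phrase}"',   # quoted, i.e. a phrase query
        "hl": "true",
        "hl.method": "unified",
        "hl.fl": field,
        "hl.fragsize": "0",           # 0 = no fragmenting, use the whole field value
        "hl.weightMatches": "true" if weight_matches else "false",
    }


# Placeholder host and collection; adjust to the actual installation.
base_url = "http://localhost:8983/solr/htmldocs/select"
params = highlight_params("phrase query")
url = base_url + "?" + urlencode(params)
```

Running the same query once with weight_matches=True and once with the default
makes it easy to compare how the phrase is marked up in the returned
highlighting section.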

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Sun, May 24, 2020 at 7:20 AM Serkan KAZANCI <ser...@kazanci.com.tr>
wrote:

> Thanks Jörn for the answer,
>
> I use the post tool to index html documents, so the html tags are stripped
> when the documents are indexed and stored. The remaining text is mapped to
> the field content by default.
>
> hl.fragsize=0 works perfectly for the indexed document, but I can only
> display a highlighted text-only version of the html document because the
> html tags are stripped.
>
> So is it possible to index and store the html document without stripping
> the html tags, so that when the document is displayed with the
> hl.fragsize=0 parameter, it is displayed as the original html document?
>
> Or
>
> Is it possible to give a whole html document as a parameter to the Unified
> highlighter so that output is also a highlighted html document?
>
> Or
>
> Do you have a better idea for highlighting the keywords in the whole html
> document?
>
>  Thanks,
>
>  Serkan
>
> -----Original Message-----
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: Sunday, May 24, 2020 1:22 PM
> To: solr-user@lucene.apache.org
> Subject: Re: highlighting a whole html document using Unified highlighter
>
> hl.fragsize=0
>
> https://lucene.apache.org/solr/guide/8_5/highlighting.html
>
>
>
> > On 24.05.2020 at 11:49, Serkan KAZANCI <ser...@kazanci.com.tr> wrote:
> >
> > Hi,
> >
> >
> >
> > I use solr to search over a million html documents. When a document is
> > found and displayed, I want to highlight the keywords that were used to
> > find and access the document.
> >
> > Unified highlighter is fast, accurate and supports different languages,
> > but it only highlights passages with the given parameters.
> >
> > How can I highlight a whole html document using Unified highlighter? I
> > have written some php code, but it cannot handle complex word stemming.
> >
> >
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Serkan
> >
>
>
