Re: highlighting a whole html document using Unified highlighter

David Smiley Sun, 24 May 2020 08:58:31 -0700

These strategies are not mutually exclusive.  Yes, I do suggest putting the
whole HTML document into one searchable field to satisfy your highlighting
use-case.  But I imagine you will also want some document metadata in
separate fields.  It's up to you to parse that out somehow and add it.  You
mentioned you are using bin/post but, IMO, that capability is more for
quick experimentation / tutorials, some POCs, or very simple use-cases.  I
doubt you can do what I suggest while still using bin/post.  You might be
able to use "SolrCell" AKA ExtractingRequestHandler directly, which is what
bin/post does with HTML.
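
As a rough sketch of the "parse that out somehow" step, here is one way it
might look in Python using only the standard library.  The field name
content_html and the helper build_solr_doc are placeholders of mine, not
anything Solr defines.  The sketch assumes each meta name matches a field in
your schema, and that content_html uses a field type with
solr.HTMLStripCharFilterFactory in its analyzer.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.meta[name] = content


def build_solr_doc(doc_id, html, html_field="content_html"):
    """Build one document: the raw HTML in one field, meta tags in others.

    html_field is a hypothetical field name; pick whatever your schema uses.
    """
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()

    doc = {"id": doc_id, html_field: html}
    for name, value in collector.meta.items():
        # Never let a meta tag overwrite the id or the HTML field itself.
        if name not in doc:
            doc[name] = value
    return doc


if __name__ == "__main__":
    sample = (
        '<html><head><meta name="tarih" content="2019-10-15T23:59:59Z">'
        "</head><body><p>Hello</p></body></html>"
    )
    print(build_solr_doc("doc-1", sample))
```

The resulting dict keeps the markup intact in the stored value, so the
highlighter can return it with the tags still in place.  Sending it to Solr
is left to whatever client you prefer.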


Good luck!

~ David


On Sun, May 24, 2020 at 10:52 AM Serkan KAZANCI <ser...@kazanci.com.tr>
wrote:

> Hi David,
>
> I have many meta tags in the html documents, like <meta name="tarih"
> content="2019-10-15T23:59:59Z">, which match the field definitions in
> the schema file.
>
> As I understand it, you propose indexing the whole html document as one
> text file and mapping it to a search field (do you?). That would take care
> of the html highlight issue, but I would lose the field information coming
> from the meta tags.
>
> So is it possible to index the html document as an html document,
> preserving the field data coming from the meta tags and not stripping the
> html tags?
>
> Then I could use solr.HTMLStripCharFilterFactory for analysis.
>
> Thank You,
>
> Serkan,
>
>
>
>
> -----Original Message-----
> From: David Smiley [mailto:dsmi...@apache.org]
> Sent: Sunday, May 24, 2020 5:26 PM
> To: solr-user
> Subject: Re: highlighting a whole html document using Unified highlighter
>
> Instead of stripping the HTML for the stored value, leave it be and remove
> it during the analysis stage with solr.HTMLStripCharFilterFactory
> <https://builds.apache.org/job/Solr-reference-guide-master/javadoc/charfilterfactories.html#solr-htmlstripcharfilterfactory>.
> This means the searchable text will only be the visible text, basically.
> And the highlighter will only highlight what's searchable.
>
> I suggest experimenting with searches for words that you know
> are directly adjacent (no spaces) to opening and closing tags, to make sure
> that the inserted HTML markup for the highlight balances correctly.  Use a
> "phrase query" (quoted) as well, and see if you can highlight around markup
> like "phrase</p>query" to see what happens.  You might need to set
> hl.weightMatches=false to ensure the words are highlighted separately.  I
> suspect you will find there is a problem, and the root cause is here:
> LUCENE-5734 <https://issues.apache.org/jira/browse/LUCENE-5734>.  It's on
> my long TODO list but hasn't bitten me lately so I've neglected it.
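
To make the experiment above concrete, here is a minimal sketch in Python
that builds the query string for such a test request.  The field name
content_html and the helper build_highlight_query are placeholders, and the
phrase is assumed to contain no double quotes.  hl.method, hl.fl, hl.fragsize
and hl.weightMatches are the usual highlighting parameters.

```python
from urllib.parse import urlencode


def build_highlight_query(phrase, field="content_html", weight_matches=False):
    """Return a URL-encoded query string for a whole-field highlight test.

    field is a hypothetical field name; phrase must not contain '"'.
    """
    params = {
        # Quoted, so the words are searched as a phrase query.
        "q": f'{field}:"{phrase}"',
        "hl": "true",
        "hl.method": "unified",
        "hl.fl": field,
        # 0 means no fragmenting: the whole field value is returned.
        "hl.fragsize": "0",
        # false asks for the words to be highlighted separately.
        "hl.weightMatches": "true" if weight_matches else "false",
    }
    return urlencode(params)


if __name__ == "__main__":
    print(build_highlight_query("phrase query"))
```

Append the result to your collection's select URL, then run it once with
weight_matches=True and once with False to compare how the markup around
"phrase</p>query" comes out.
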
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Sun, May 24, 2020 at 7:20 AM Serkan KAZANCI <ser...@kazanci.com.tr>
> wrote:
>
> > Thanks Jörn for the answer,
> >
> > I use the post tool to index html documents, so the html tags are
> > stripped when indexed and stored. The remaining text is mapped to the
> > field content by default.
> >
> > hl.fragsize=0 works perfectly for the indexed document, but I can only
> > display a highlighted text-only version of the html document because the
> > html tags are stripped.
> >
> > So is it possible to index and store the html document without stripping
> > the html tags, so that when the document is displayed with hl.fragsize=0
> > parameter, it is displayed as the original html document?
> >
> > Or
> >
> > Is it possible to give a whole html document as a parameter to the
> > Unified highlighter so that output is also a highlighted html document?
> >
> > Or
> >
> > Do you have a better idea to highlight the keywords of the whole html
> > document?
> >
> >  Thanks,
> >
> >  Serkan
> >
> > -----Original Message-----
> > From: Jörn Franke [mailto:jornfra...@gmail.com]
> > Sent: Sunday, May 24, 2020 1:22 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: highlighting a whole html document using Unified highlighter
> >
> > hl.fragsize=0
> >
> > https://lucene.apache.org/solr/guide/8_5/highlighting.html
> >
> >
> >
> > > On 24.05.2020 at 11:49, Serkan KAZANCI <ser...@kazanci.com.tr> wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > I use solr to search over a million html documents. When a document is
> > > found and displayed, I want to highlight the keywords that were used
> > > to find and access the document.
> > >
> > >
> > >
> > > Unified highlighter is fast, accurate and supports different languages
> > > but only highlights passages with given parameters.
> > >
> > >
> > >
> > > How can I highlight a whole html document using Unified highlighter? I
> > > have written some php code, but it cannot handle the complex word
> > > stemming.
> > >
> > >
> > >
> > >
> > >
> > > Thanks,
> > >
> > >
> > >
> > > Serkan
> > >
> >
> >
>
>
