Re: highlighting a whole html document using Unified highlighter

Serkan KAZANCI Sun, 24 May 2020 09:17:48 -0700

All clear. 

Thanks David,


> On 24 May 2020, at 18:57, David Smiley <david.w.smi...@gmail.com> wrote:
> 
> These strategies are not mutually exclusive.  Yes, I do suggest putting the
> whole HTML into one searchable field to satisfy your highlighting
> use-case.  But I can imagine you will also want some document metadata in
> separate fields.  It's up to you to parse that out somehow and add it.  You
> mentioned you are using bin/post but, IMO, that capability is more for
> quick experimentation / tutorials, some POCs, or very simple use-cases.  I
> doubt you can do what I suggest while still using bin/post.  You might be
> able to use "SolrCell" AKA ExtractingRequestHandler directly, which is what
> bin/post does with HTML.
> 
> Good luck!
> 
> ~ David
> 
> 
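A minimal sketch of the "parse that out somehow and add it" step, assuming Python and only its standard library. It collects the <meta name="..." content="..."> pairs from a page and puts them next to the untouched HTML in one document. The field name content_html and the helper names are hypothetical, not anything Solr or bin/post defines; the resulting dict would still have to be sent to Solr as JSON through the update handler.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.meta[name] = content


def build_solr_doc(doc_id, html_text, html_field="content_html"):
    """Return a dict holding the raw HTML plus one field per meta tag.

    html_field is a hypothetical field name; it must exist in your schema.
    """
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()

    doc = {"id": doc_id, html_field: html_text}
    for name, value in collector.meta.items():
        # setdefault keeps "id" and the HTML field from being overwritten
        # by a meta tag that happens to use the same name.
        doc.setdefault(name, value)
    return doc


page = (
    '<html><head><meta name="tarih" content="2019-10-15T23:59:59Z">'
    "</head><body><p>Hello</p></body></html>"
)
doc = build_solr_doc("doc-1", page)
```

Here doc carries the tarih value as its own field while content_html still holds the complete markup, so the meta-tag fields and the whole-document highlighting field can coexist.
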
>> On Sun, May 24, 2020 at 10:52 AM Serkan KAZANCI <ser...@kazanci.com.tr>
>> wrote:
>> 
>> Hi David,
>> 
>> I have many meta-tags in html documents like  <meta name="tarih"
>> content="2019-10-15T23:59:59Z"> which match the field definitions in the
>> schema file.
>> 
>> As I understand it, you propose indexing the whole html document as one text
>> file and mapping it to a search field (do you?). That would take care of the
>> html highlight issue; however, I would lose the field information coming
>> from the meta-tags.
>> 
>> So is it possible to index the html document as an html document
>> (preserving the field data coming from meta-tags and not stripping the html
>> tags)?
>> 
>> Then I could use solr.HTMLStripCharFilterFactory for analysis.
>> 
>> Thank You,
>> 
>> Serkan,
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: David Smiley [mailto:dsmi...@apache.org]
>> Sent: Sunday, May 24, 2020 5:26 PM
>> To: solr-user
>> Subject: Re: highlighting a whole html document using Unified highlighter
>> 
>> Instead of stripping the HTML for the stored value, leave it be and remove
>> it during the analysis stage with solr.HTMLStripCharFilterFactory
>> <https://builds.apache.org/job/Solr-reference-guide-master/javadoc/charfilterfactories.html#solr-htmlstripcharfilterfactory>.
>> This means the searchable text will only be the visible text, basically.
>> And the highlighter will only highlight what's searchable.
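One way to set this up, sketched in Python as a Schema API request body. The field type name text_html, the field name content_html, and the choice of tokenizer and lowercase filter are assumptions for illustration; a real setup would add the language-specific stemming filters the documents need. The stored value stays as the original HTML and only the analysis chain strips the tags.

```python
import json


def html_text_schema_commands(type_name="text_html", field_name="content_html"):
    """Build a Schema API body: a field type that strips HTML during
    analysis, and a stored field that uses it. Both names are hypothetical."""
    return {
        "add-field-type": {
            "name": type_name,
            "class": "solr.TextField",
            "positionIncrementGap": "100",
            "analyzer": {
                # Removes markup before tokenizing; the stored value is untouched.
                "charFilters": [{"class": "solr.HTMLStripCharFilterFactory"}],
                # Tokenizer and filter below are assumed defaults, not a recommendation.
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [{"class": "solr.LowerCaseFilterFactory"}],
            },
        },
        "add-field": {
            "name": field_name,
            "type": type_name,
            "indexed": True,
            "stored": True,
        },
    }


payload = html_text_schema_commands()
body = json.dumps(payload, indent=2)
```

The body string would be POSTed to the collection's schema endpoint; the same definitions can equally be written by hand in the schema file.
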
>> 
>> I suggest doing some experimentation for searching for words that you know
>> are directly adjacent (no spaces) to opening and closing tags to make sure
>> that the inserted HTML markup for the highlight balances correctly.  Use a
>> "phrase query" (quoted) as well, and see if you can highlight around markup
>> like "phrase</p>query" to see what happens.  You might need to set
>> hl.weightMatches=false to ensure the words are highlighted separately.  I
>> suspect you will find there is a problem, and the root cause is here:
>> LUCENE-5734 <https://issues.apache.org/jira/browse/LUCENE-5734>.  It's on
>> my long TODO list but hasn't bitten me lately so I've neglected it.
>> 
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>> 
>> 
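The experiment suggested above could be driven by a small helper like the following Python sketch, which only builds the request parameters. hl.method, hl.fl, hl.fragsize and hl.weightMatches are Solr highlighting parameters; the field name content_html and the helper itself are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode


def highlight_params(query, field="content_html", weight_matches=False):
    """Parameters asking the Unified highlighter for the whole field value.

    field is a hypothetical field name holding the original HTML.
    """
    return {
        "q": query,
        "hl": "true",
        "hl.method": "unified",
        "hl.fl": field,
        # 0 disables fragmenting, so the whole field is returned highlighted.
        "hl.fragsize": "0",
        # false asks for each matching word to be highlighted separately.
        "hl.weightMatches": "true" if weight_matches else "false",
    }


# A quoted phrase query, as suggested for testing words next to tags.
params = highlight_params('"phrase query"')
query_string = urlencode(params)
```

Appending query_string to a collection's select URL and comparing the result with weight_matches=True is one way to see whether the inserted highlight markup stays balanced around tags.
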
>> On Sun, May 24, 2020 at 7:20 AM Serkan KAZANCI <ser...@kazanci.com.tr>
>> wrote:
>> 
>>> Thanks Jörn for the answer,
>>> 
>>> I use the post tool to index html documents, so the html tags are stripped
>>> when indexed and stored. The remaining text is mapped to the field content
>>> by default.
>>> 
>>> hl.fragsize=0 works perfectly for the indexed document, but I can only
>>> display a highlighted text-only version of the html document because the html
>>> tags are stripped.
>>> 
>>> So is it possible to index and store the html document without stripping
>>> the html tags, so that when the document is displayed with hl.fragsize=0
>>> parameter, it is displayed as original html document?
>>> 
>>> Or
>>> 
>>> Is it possible to give a whole html document as a parameter to the Unified
>>> highlighter so that output is also a highlighted html document?
>>> 
>>> Or
>>> 
>>> Do you have a better idea to highlight the keywords of the whole html
>>> document?
>>> 
>>> Thanks,
>>> 
>>> Serkan
>>> 
>>> -----Original Message-----
>>> From: Jörn Franke [mailto:jornfra...@gmail.com]
>>> Sent: Sunday, May 24, 2020 1:22 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: highlighting a whole html document using Unified highlighter
>>> 
>>> hl.fragsize=0
>>> 
>>> https://lucene.apache.org/solr/guide/8_5/highlighting.html
>>> 
>>> 
>>> 
>>>> Am 24.05.2020 um 11:49 schrieb Serkan KAZANCI <ser...@kazanci.com.tr>:
>>>> 
>>>> Hi,
>>>> 
>>>> I use solr to search over a million html documents. When a document is
>>>> searched and displayed, I want to highlight the keywords that were used to
>>>> find and access the document.
>>>> 
>>>> The Unified highlighter is fast, accurate and supports different languages,
>>>> but it only highlights passages with given parameters.
>>>> 
>>>> How can I highlight a whole html document using the Unified highlighter? I
>>>> have written some PHP code, but it cannot do the complex word stemming
>>>> functions.
>>>> 
>>>> Thanks,
>>>> 
>>>> Serkan
>>>> 
>>> 
>>> 
>> 
>>
