All clear. Thanks David,
> On 24 May 2020, at 18:57, David Smiley <david.w.smi...@gmail.com> wrote:
>
> These strategies are not mutually exclusive. Yes I do suggest having the
> HTML in whole go into one searchable field to satisfy your highlighting
> use-case. But I can imagine you will also want some document metadata in
> separate fields. It's up to you to parse that out somehow and add it. You
> mentioned you are using bin/post but, IMO, that capability is more for
> quick experimentation / tutorials, some POCs, or very simple use-cases. I
> doubt you can do what I suggest while still using bin/post. You might be
> able to use "SolrCell" AKA ExtractingRequestHandler directly, which is what
> bin/post does with HTML.
>
> Good luck!
>
> ~ David
>
>
>> On Sun, May 24, 2020 at 10:52 AM Serkan KAZANCI <ser...@kazanci.com.tr>
>> wrote:
>>
>> Hi David,
>>
>> I have many meta-tags in html documents like
>> <meta name="tarih" content="2019-10-15T23:59:59Z"> which matches the
>> field descriptions in schema file.
>>
>> As I understand, you propose to index the whole html document as one text
>> file and map it to a search field (do you?). That would take care of the
>> html highlight issue, however I would lose the field information coming
>> from meta-tags.
>>
>> So is it possible to index the html document as html document?
>> (preserving the field data coming from meta-tags and not strip the html
>> tags)
>>
>> Then I could use solr.HTMLStripCharFilterFactory for analysis.
>>
>> Thank You,
>>
>> Serkan,
>>
>>
>> -----Original Message-----
>> From: David Smiley [mailto:dsmi...@apache.org]
>> Sent: Sunday, May 24, 2020 5:26 PM
>> To: solr-user
>> Subject: Re: highlighting a whole html document using Unified highlighter
>>
>> Instead of stripping the HTML for the stored value, leave it be and remove
>> it during the analysis stage with solr.HTMLStripCharFilterFactory
>> <https://builds.apache.org/job/Solr-reference-guide-master/javadoc/charfilterfactories.html#solr-htmlstripcharfilterfactory>
>> This means the searchable text will only be the visible text, basically.
>> And the highlighter will only highlight what's searchable.
>>
>> I suggest doing some experimentation for searching for words that you know
>> are directly adjacent (no spaces) to opening and closing tags to make sure
>> that the inserted HTML markup for the highlight balances correctly. Use a
>> "phrase query" (quoted) as well, and see if you can highlight around markup
>> like "phrase</p>query" to see what happens. You might need to set
>> hl.weightMatches=false to ensure the words separately are highlighted. I
>> suspect you will find there is a problem, and the root cause is here:
>> LUCENE-5734 <https://issues.apache.org/jira/browse/LUCENE-5734>. It's on
>> my long TODO list but hasn't bitten me lately so I've neglected it.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Sun, May 24, 2020 at 7:20 AM Serkan KAZANCI <ser...@kazanci.com.tr>
>> wrote:
>>
>>> Thanks Jörn for the answer,
>>>
>>> I use post tool to index html documents, so the html tags are stripped
>>> when indexed and stored. The remaining text is mapped to the field
>>> content by default.
>>>
>>> hl.fragsize=0 works perfectly for the indexed document, but I can only
>>> display highlighted text-only version of html document because the html
>>> tags are stripped.
>>>
>>> So is it possible to index and store the html document without stripping
>>> the html tags, so that when the document is displayed with hl.fragsize=0
>>> parameter, it is displayed as original html document?
>>>
>>> Or
>>>
>>> Is it possible to give a whole html document as a parameter to the
>>> Unified highlighter so that output is also a highlighted html document?
>>>
>>> Or
>>>
>>> Do you have a better idea to highlight the keywords of the whole html
>>> document?
>>>
>>> Thanks,
>>>
>>> Serkan
>>>
>>> -----Original Message-----
>>> From: Jörn Franke [mailto:jornfra...@gmail.com]
>>> Sent: Sunday, May 24, 2020 1:22 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: highlighting a whole html document using Unified highlighter
>>>
>>> hl.fragsize=0
>>>
>>> https://lucene.apache.org/solr/guide/8_5/highlighting.html
>>>
>>>
>>>> On 24.05.2020 at 11:49, Serkan KAZANCI <ser...@kazanci.com.tr> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I use solr to search over a million html documents, when a document is
>>>> searched and displayed, I want to highlight the keywords that are used
>>>> to find and access the document.
>>>>
>>>> Unified highlighter is fast, accurate and supports different languages
>>>> but only highlights passages with given parameters.
>>>>
>>>> How can I highlight a whole html document using Unified highlighter? I
>>>> have written a php code but it cannot do the complex word stemming
>>>> functions.
>>>>
>>>> Thanks,
>>>>
>>>> Serkan
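
For anyone reading this thread later: below is a rough sketch of the approach discussed here, not something posted by David or Serkan. It parses the meta tags out on the client side (instead of using bin/post), keeps the raw HTML in one field, and builds the highlighting parameters mentioned in the thread. The field name content_html, the helper names build_solr_doc and highlight_params, and the allow-list of schema fields are all made up for illustration. It also assumes the field type for content_html runs solr.HTMLStripCharFilterFactory in its analyzer, which has to be configured in the schema separately.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            # If a name appears more than once, the last value wins.
            self.meta[name] = content


def build_solr_doc(doc_id, html, schema_fields):
    """Build one Solr document: raw HTML in a single field plus meta-tag fields.

    schema_fields is a hypothetical allow-list of meta names that exist as
    fields in the schema; other meta tags (e.g. viewport) are dropped.
    """
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    doc = {k: v for k, v in parser.meta.items() if k in schema_fields}
    doc["id"] = doc_id
    # Assumed field name; the HTML is stored untouched so the highlighter
    # can return the whole original document with markup inserted.
    doc["content_html"] = html
    return doc


def highlight_params(query, field="content_html"):
    """Request parameters for highlighting the whole field, as in the thread."""
    return {
        "q": query,
        "hl": "true",
        "hl.method": "unified",
        "hl.fl": field,
        "hl.fragsize": "0",           # no fragmenting: use the whole field
        "hl.weightMatches": "false",  # highlight phrase words separately
    }
```

The resulting dictionary could then be sent as JSON to the collection's update handler, and the parameters passed along with a normal select request. Whether the inserted highlight markup stays balanced around tags such as "phrase</p>query" still needs the experimentation David describes.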