These strategies are not mutually exclusive. Yes, I do suggest putting the whole HTML into one searchable field to satisfy your highlighting use case. But I imagine you will also want some document metadata in separate fields; it's up to you to parse that out somehow and add it. You mentioned you are using bin/post but, IMO, that capability is more for quick experimentation / tutorials, some POCs, or very simple use cases. I doubt you can do what I suggest while still using bin/post. You might be able to use "SolrCell" (AKA ExtractingRequestHandler) directly, which is what bin/post uses for HTML.
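To make the "parse that out somehow" step a bit more concrete, here is a minimal sketch in Python (standard library only) of one way it might be done outside bin/post. It is an illustration, not a tested indexing pipeline: the field name `content_html` and the helpers `MetaTagParser` and `build_solr_doc` are hypothetical, and it assumes your schema already defines a field for each meta-tag name (e.g. `tarih`) plus a stored text field whose analyzer applies solr.HTMLStripCharFilterFactory.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            self.meta[name] = content


def build_solr_doc(doc_id, html, html_field="content_html"):
    """Build one Solr document as a dict.

    The raw, unstripped HTML goes into `html_field` (a hypothetical field
    name), and each meta tag becomes its own field. Reserved keys such as
    "id" are never overwritten by a meta tag of the same name.
    """
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()

    doc = {"id": doc_id, html_field: html}
    for name, value in parser.meta.items():
        doc.setdefault(name, value)
    return doc
```

A list of such dicts, serialized with `json.dumps`, could then be sent to the collection's JSON update handler by whatever HTTP client you prefer. Because the stored value keeps its markup, the highlighter would return the original HTML with the highlight tags inserted.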
Good luck!

~ David

On Sun, May 24, 2020 at 10:52 AM Serkan KAZANCI <ser...@kazanci.com.tr> wrote:

> Hi David,
>
> I have many meta-tags in html documents like <meta name="tarih"
> content="2019-10-15T23:59:59Z"> which matches the field descriptions in
> schema file.
>
> As I understand, you propose to index the whole html document as one text
> file and map it to a search field (do you?). That would take care of the
> html highlight issue, however I would lose the field information coming
> from meta-tags.
>
> So is it possible to index the html document as html document?
> (preserving the field data coming from meta-tags and not strip the html
> tags)
>
> Then I could use solr.HTMLStripCharFilterFactory for analysis.
>
> Thank You,
>
> Serkan,
>
> -----Original Message-----
> From: David Smiley [mailto:dsmi...@apache.org]
> Sent: Sunday, May 24, 2020 5:26 PM
> To: solr-user
> Subject: Re: highlighting a whole html document using Unified highlighter
>
> Instead of stripping the HTML for the stored value, leave it be and remove
> it during the analysis stage with solr.HTMLStripCharFilterFactory
> <https://builds.apache.org/job/Solr-reference-guide-master/javadoc/charfilterfactories.html#solr-htmlstripcharfilterfactory>
> This means the searchable text will only be the visible text, basically.
> And the highlighter will only highlight what's searchable.
>
> I suggest doing some experimentation for searching for words that you know
> are directly adjacent (no spaces) to opening and closing tags to make sure
> that the inserted HTML markup for the highlight balance correctly. Use a
> "phrase query" (quoted) as well, and see if you can highlight around markup
> like "phrase</p>query" to see what happens. You might need to set
> hl.weightMatches=false to ensure the words separately are highlighted. I
> suspect you will find there is a problem, and the root cause is here:
> LUCENE-5734 <https://issues.apache.org/jira/browse/LUCENE-5734> It's on
> my long TODO list but hasn't bitten me lately so I've neglected it.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
> On Sun, May 24, 2020 at 7:20 AM Serkan KAZANCI <ser...@kazanci.com.tr>
> wrote:
>
> > Thanks Jörn for the answer,
> >
> > I use post tool to index html documents, so the html tags are stripped
> > when indexed and stored. The remaining text is mapped to the field
> > content by default.
> >
> > hl.fragsize=0 works perfect for the indexed document, but I can only
> > display highlighted text-only version of html document because the html
> > tags are stripped.
> >
> > So is it possible to index and store the html document without stripping
> > the html tags, so that when the document is displayed with hl.fragsize=0
> > parameter, it is displayed as original html document?
> >
> > Or
> >
> > Is it possible to give a whole html document as a parameter to the
> > Unified highlighter so that output is also a highlighted html document?
> >
> > Or
> >
> > Do you have a better idea to highlight the keywords of the whole html
> > document?
> >
> > Thanks,
> >
> > Serkan
> >
> > -----Original Message-----
> > From: Jörn Franke [mailto:jornfra...@gmail.com]
> > Sent: Sunday, May 24, 2020 1:22 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: highlighting a whole html document using Unified highlighter
> >
> > hl.fragsize=0
> >
> > https://lucene.apache.org/solr/guide/8_5/highlighting.html
> >
> > > On 24.05.2020 at 11:49, Serkan KAZANCI <ser...@kazanci.com.tr> wrote:
> > >
> > > Hi,
> > >
> > > I use solr to search over a million html documents, when a document is
> > > searched and displayed, I want to highlight the keywords that are used
> > > to find and access the document.
> > >
> > > Unified highlighter is fast, accurate and supports different languages
> > > but only highlights passages with given parameters.
> > >
> > > How can I highlight a whole html document using Unified highlighter? I
> > > have written a php code but it cannot do the complex word stemming
> > > functions.
> > >
> > > Thanks,
> > >
> > > Serkan