Hi Arturas, Both Erick and I had a go at improving the documentation here. I hope it's clearer. https://builds.apache.org/job/Solr-reference-guide-master/javadoc/highlighting.html The docs for hl.fl, hl.q, hl.qparser were all updated. The meat of the change was a new note in hl.fl including an example. It's kinda hard to document the problem you found but I hope the note will be somewhat illustrative.
~ David

On Mon, Mar 26, 2018 at 3:12 AM Arturas Mazeika <maze...@gmail.com> wrote:

> Hi Erick,
>
> Adding a field qualifier to the hl.q parameter solved the issue. My excitement is going through the roof! What a thorough answer: the explanation of Solr's behavior and of how it tries to interpret what I mean when I supply a keyword without the field qualifier. Very impressive. Would you care to (re)post this answer to Stack Overflow? If that is too much of a hassle, I'll do this in a couple of days myself on your behalf.
>
> I am impressed by how well, thoroughly, quickly and fully the question was answered.
>
> Steven's hint pushed me further in this direction: he suggested using the query part of Solr to filter and sort the relevant answers in the first step, and in the second step he'd highlight all the keywords using CTRL+F (in the browser or some alternative viewer). This brought me to the next question:
>
> How can one match query terms with the analyze-chained documents in an efficient and distributed manner? My current understanding of how to achieve this is the following:
>
> 1. Get the list of ids (contents) of the documents that match the query
> 2. Use http://localhost:8983/solr/#/trans/analysis to re-analyze the document and the query
> 3. Use the matching of the substrings from the original text to the last filter/tokenizer/analyzer in the analyze-chain to map the terms of the query
> 4. Emulate CTRL+F highlighting
>
> The Solr web interface offers quite a bit to advance towards this goal.
> If one fires this request:
>
> * analysis.fieldvalue=Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist[5] who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics).&
> * analysis.query=reletivity theory
>
> to one of the cores of Solr, one gets steps 1-3 done:
>
> http://localhost:8983/solr/trans_shard1_replica_n1/analysis/field?wt=xml&analysis.showmatch=true&analysis.fieldvalue=Albert%20Einstein%20(14%20March%201879%20%E2%80%93%2018%20April%201955)%20was%20a%20German-born%20theoretical%20physicist[5]%20who%20developed%20the%20theory%20of%20relativity,%20one%20of%20the%20two%20pillars%20of%20modern%20physics%20(alongside%20quantum%20mechanics).&analysis.query=reletivity%20theory&analysis.fieldtype=text_en
>
> Questions:
>
> 1. Is there a way to "load-balance" this? In the above URL, I need to specify a specific core. Is it possible to generalize it, so the core that receives the request is not necessarily the one that processes it? Or is this already distributed, in the sense that the receiving core and the processing cores are never the same?
>
> 2. The document was already analyze-chained. Is it possible to store this information so one does not need to re-analyze-chain it once more?
>
> Cheers
> Arturas
>
> On Fri, Mar 23, 2018 at 9:15 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Arturas:
>>
>> Try to field-qualify your hl.q parameter. That looks like:
>>
>> hl.q=trans:Kundigung
>> or
>> hl.q=trans:Kündigung
>>
>> I saw the exact behavior you describe when I did _not_ specify the field in the hl.q parameter, i.e.
>>
>> hl.q=Kundigung
>> or
>> hl.q=Kündigung
>>
>> didn't show all highlights.
>>
>> But when I did specify the field, it worked.
>>
>> Here's what I think is happening: Solr uses the default search field when parsing an un-field-qualified query. I.e.
>>
>> q=something
>>
>> is parsed as
>>
>> q=default_search_field:something.
>>
>> The default field is controlled in solrconfig.xml with the "df" parameter, you'll see entries like:
>> <str name="df">my_field</str>
>>
>> Also when I changed the "df" parameter to the field I was highlighting on, I didn't need to specify the field on the hl.q parameter.
>>
>> hl.q=Kundigung
>> or
>> hl.q=Kündigung
>>
>> The default field is usually "text", which knows nothing about the German-specific filters you've applied unless you changed it.
>>
>> So in the absence of a field qualification for the hl.q parameter, Solr was parsing the query according to the analysis chain specified in your default field, and probably passed ü through without transforming it. Since your indexing analysis chain for the highlighted field folded ü to just plain u, it wasn't found or highlighted.
>>
>> On the surface, this does seem like something that should be changed, I'll go ahead and ping the dev list.
>>
>> NOTE: I was trying this on Solr 7.1
>>
>> Best,
>> Erick
>>
>> On Fri, Mar 23, 2018 at 12:03 PM, Arturas Mazeika <maze...@gmail.com> wrote:
>>
>>> Hi Erick,
>>>
>>> Thanks for the update and the info. Your post shed quite a bit of light on the picture, and now I understand quite a bit more of what you are saying. Your explanation makes sense and can be quite useful in certain scenarios.
>>>
>>> What struck me in your description is that you are saying that the analyzer chain needs to be applied to the highlighting queries as well. The tragedy is that I am not able to get this for a German collection: if the query is set (no explicit highlighting query), the highlighting is correct. It is also correct if I replace the umlauts with the corresponding Latin characters. Getting the analyzer chain applied to the highlighting terms remains the challenge.
>>>
>>> Could you have a look at the following Stack Overflow link? Maybe something comes to your mind...
>>>
>>> https://stackoverflow.com/questions/49276093/solr-highlighting-terms-with-umlaut-not-found-not-highlighted
>>>
>>> Cheers,
>>> Arturas
>>>
>>> On Fri, Mar 23, 2018, 17:43 Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>>> bq: this is not a typical case that one searches for a keyword but highlights something else
>>>>
>>>> This isn't really an unusual case, apparently I misled you.
>>>>
>>>> What I was trying to convey is that the analysis chain used is firmly attached to a particular _field_. There's no way to say "use one analysis chain for the query and another for highlighting on the _same_ field".
>>>>
>>>> You can use two different fields with different analysis chains, one for each purpose. So something like
>>>>
>>>> q=f1:something&hl.fl=f2,f3&hl.q=other
>>>>
>>>> is certainly reasonable. It'll search for "something" in f1, and highlight "other" in f2 and f3.
>>>>
>>>> Each field processes its input with the analysis chain defined in the schema.
>>>>
>>>> The rest about stored="true" can be ignored, it's just me wandering off into the weeds about an optimization that only stores the data once rather than redundantly in multiple fields.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Fri, Mar 23, 2018 at 4:37 AM, Arturas Mazeika <maze...@gmail.com> wrote:
>>>>
>>>>> Hi Mathesis (Stefan),
>>>>>
>>>>> Thanks for the questions. This made me look at the problem from a distance and re-frame the situation. Good questions indeed.
>>>>>
>>>>> Trying to answer in a roundabout way: consider a user who describes herself as a BMW fan, convinced that all BMWs need to be the blackest color possible (for the sake of argument), who would like to search and later browse the entries in a discussion forum (of course not everything, only BMWs of the blackest color), and what interests her are the snippets that have "understood", "craziest" or the like as keywords (because she is looking for a dozen discussions that she saw before).
>>>>>
>>>>> What I was not able to achieve so far is: (i) combining query terms for filtering and highlighting, (ii) using the attribute's analyzer chain to rewrite the highlight query (or defining one in the search).
>>>>>
>>>>> The CTRL+F technique is a very powerful one, indeed. It works most of the time. The difficulties with it are query rewriting, enriching, etc.
>>>>>
>>>>> Cheers,
>>>>> Arturas
>>>>>
>>>>> On Fri, Mar 23, 2018 at 11:29 AM, Stefan Matheis <matheis.ste...@gmail.com> wrote:
>>>>>
>>>>>> Perhaps we try it the other way round .. what's your use case for this? I'm trying to think of a situation where I'd need this as a user?
>>>>>>
>>>>>> The only reason I see myself doing this is CTRL+F in a page when the search result is not immediately visible for me ;)
>>>>>>
>>>>>> On Mar 23, 2018 9:41 AM, "Arturas Mazeika" <maze...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Erick et al,
>>>>>>>
>>>>>>> From your answer I understand that it is not a typical case that one searches for a keyword but highlights something else. Since we have two parameters (q vs hl.q) I thought they were freely combinable. From your answer I understand that this is not really the case.
>>>>>>> My current understanding came from [1] that says:
>>>>>>>
>>>>>>> hl.q
>>>>>>>
>>>>>>> A query to use for highlighting. This parameter allows you to highlight different terms than those being used to retrieve documents.
>>>>>>>
>>>>>>> What I hear from you is something different: i.e., that it is not enough just to combine q with hl.q, and that there are caveats to achieving the task (multiple fields, FastVectorHighlighter).
>>>>>>>
>>>>>>> Your info is very helpful.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Arturas
>>>>>>>
>>>>>>> [1] https://lucene.apache.org/solr/guide/7_2/highlighting.html
>>>>>>>
>>>>>>> On Thu, Mar 22, 2018 at 4:07 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Basically you need to use a copyField, but in several variants:
>>>>>>>>
>>>>>>>> If you use the field _exclusively_ for highlighting then store the raw content there and have the field use whatever analyzer you want. You do _not_ need to have indexed="true" set for the field if you're highlighting on the fly. So you're searching against field1 (which has indexed="true" stored="false" set) but highlighting against field2 (which has indexed="false" stored="true" set). Of course any time you want to return the contents in a doc your fl needs to specify field2...
>>>>>>>>
>>>>>>>> The above does not bloat your index at all since the cost of stored="true" indexed="true" is the same as if you use two fields, each with only one option turned on.
>>>>>>>>
>>>>>>>> The second approach, if you want to use FastVectorHighlighter or the like, is simply to index both fields.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Erick
>>>>>>>>
>>>>>>>> On Thu, Mar 22, 2018 at 2:18 AM, Arturas Mazeika <maze...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Solr-Users,
>>>>>>>>>
>>>>>>>>> I've been playing with a German collection of documents, where I tried to search for one word (q=Tag) and highlight another (hl.q=Kundigung). Is this a "legal" use case? My key question is: how can I tell Solr which query analyzer to use for highlighting? Strictly speaking, I should use hl.q=Kündigung to conceptually look for relevant information, but in this case, no highlighting is returned (as all umlauts are left out in the index).
>>>>>>>>>
>>>>>>>>> Additional info:
>>>>>>>>>
>>>>>>>>> Solr version: 7.2
>>>>>>>>> URLs to query:
>>>>>>>>>
>>>>>>>>> http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.snippets=3&wt=xml&rows=1
>>>>>>>>>
>>>>>>>>> http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=K%C3%BCndigung&hl.snippets=3&wt=xml&rows=1
>>>>>>>>>
>>>>>>>>> Managed-schema:
>>>>>>>>>
>>>>>>>>> <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
>>>>>>>>>   <analyzer>
>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>     <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
>>>>>>>>>     <filter class="solr.GermanNormalizationFilterFactory"/>
>>>>>>>>>     <filter class="solr.GermanLightStemFilterFactory"/>
>>>>>>>>>   </analyzer>
>>>>>>>>> </fieldType>
>>>>>>>>>
>>>>>>>>> Other additional info:
>>>>>>>>> https://stackoverflow.com/questions/49276093/solr-highlighting-terms-with-umlaut-not-found-not-highlighted
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Arturas

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
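
For readers who want to try the fix discussed in this thread, here is a minimal sketch of building a select request whose hl.q is qualified with the highlighted field, so that it is analyzed with that field's chain instead of the default field's. The helper name `highlight_url` is hypothetical; the host, the `trans` core and the parameter values are taken from the URLs quoted above and would need adjusting for another setup. The sketch assumes the highlight term contains no query-parser special characters and does no escaping.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def highlight_url(base, search_query, hl_field, hl_term, snippets=3):
    """Build a /select URL whose hl.q is qualified with the highlighted field.

    Hypothetical helper for illustration; not part of Solr or any client library.
    """
    params = {
        "q": search_query,
        "hl": "true",
        "hl.fl": hl_field,
        # Field-qualify hl.q so the term is parsed against hl_field
        # rather than against the default search field ("df").
        "hl.q": f"{hl_field}:{hl_term}",
        "hl.snippets": snippets,
        "wt": "xml",
        "rows": 1,
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8 (e.g. the umlaut).
    return f"{base}/select?{urlencode(params)}"


url = highlight_url("http://localhost:8983/solr/trans", "trans:Zeit", "trans", "Kündigung")
print(url)
# Decode the query string again to inspect the hl.q value that Solr would receive.
print(parse_qs(urlsplit(url).query)["hl.q"])
```

Changing the "df" default in solrconfig.xml, as Erick describes, is the alternative when qualifying every hl.q is impractical.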