mapping and tuning payloads in Solr 8
Hi all,

In our Solr 6 setup we use string payloads to boost certain tokens (URIs). These strings are mapped to floats via a schema parameter "PayloadMapping", which is read by our custom WKSimilarity class (extending TFIDFSimilarity).

0.4 0.4 0.5 0 0.0 10.0 3.0 1.0 isAbout=15.0,coversFiscalPeriod=10.0,type=5.0,hasTheme=5.0,subject=4.0,mentions=2.0,creator=2.0

The reason for this indirection is convenience: by storing payload strings instead of floats, we can change and tune the boosts simply by updating the schema, without having to change the content set.

Inside WKSimilarity each payload string is mapped to its corresponding boost value, and the final boost is applied via the scorePayload method (where we can tune the boost curve via some additional schema parameters). This works well in Solr 6.

The problem: we are about to migrate to Solr 8, and since LUCENE-8014 it is no longer possible to override the scorePayload method in WKSimilarity (it has been removed from TFIDFSimilarity).

I wonder what alternatives there are for mapping string payloads to floats and using them in a tunable formula for boosting.

Thanks,
Tom Burgmans
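A minimal sketch of the mapping step described above, written in Python for brevity rather than as the Java of WKSimilarity (which is not shown in this thread). The function names and the fallback boost of 1.0 for unknown payloads are assumptions for illustration; only the mapping string itself comes from the message.

```python
# Illustrative only: function names and the default boost are hypothetical,
# not taken from WKSimilarity.

PAYLOAD_MAPPING = (
    "isAbout=15.0,coversFiscalPeriod=10.0,type=5.0,hasTheme=5.0,"
    "subject=4.0,mentions=2.0,creator=2.0"
)


def parse_payload_mapping(spec):
    """Turn 'name=value,name=value,...' into a dict of name -> float."""
    mapping = {}
    for pair in spec.split(","):
        name, _, value = pair.partition("=")
        mapping[name.strip()] = float(value)
    return mapping


def payload_boost(payload, mapping, default=1.0):
    """Look up the boost for a payload string; unknown payloads get the default."""
    return mapping.get(payload, default)


mapping = parse_payload_mapping(PAYLOAD_MAPPING)
print(payload_boost("isAbout", mapping))
print(payload_boost("unknownRelation", mapping))
```

Because the boosts live in one string, retuning means editing that string only; the indexed payloads stay untouched, which is the convenience the message describes.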
RE: Multiplicative Boosts broken since 7.3 (LUCENE-8099)
I'd like to bump this issue, since it is a showstopper for our upgrade from Solr 6. In https://issues.apache.org/jira/browse/SOLR-13126 I described a couple more use cases in which this bug appears. We see scores in the EXPLAIN that differ from the actual scores, and our analysis is that the EXPLAIN is in fact correct. It happens when a multiplicative boost is used (via the "boost" parameter) in combination with some function queries, like "query" and "field".

One example (tested on Solr 7.5.0), when running:

http://localhost:8983/solr/test/select?defType=edismax&fl=id,score,[explain style=text]&q=*:*&boost=sum(field(price),4)

then the expectation is that a document that doesn't have the price field gets a score of 4. The result however is:

  { "id": "docid123576", "score": 1.0, "[explain]": "4.0 = product of:\n 1.0 = boost\n 4.0 = product of:\n 1.0 = *:*\n4.0 = sum(float(price)=0.0,const(4))\n" }

EXPLAIN and score are not consistent.

Best regards
Tom

-----Original Message-----
From: Tobias Ibounig [mailto:t.ibou...@netconomy.net]
Sent: Tuesday 22 January 2019 10:14
To: solr-user@lucene.apache.org
Subject: Multiplicative Boosts broken since 7.3 (LUCENE-8099)

Hello,

As described in https://issues.apache.org/jira/browse/SOLR-13126, multiplicative boosts (in certain conditions) seem to be broken since 7.3. The error seems to have been introduced in https://issues.apache.org/jira/browse/LUCENE-8099.

Reverting the SOLR parts to the now deprecated BoostingQuery fixes the issue. The filed issue contains a test case and a patch with the revert (for testing purposes, not really a clean fix). We sadly couldn't find the actual cause, which seems to lie with the use of "FunctionScoreQuery" for boosting. We were able to patch our 7.5 installation with the patch. As others might be affected as well, we hope this can be helpful in resolving this bug.

To all SOLR/Lucene developers, thank you for your work. Looking through the code base gave me a new appreciation of it.

Best Regards,
Tobias

PS: This issue was already posted by a colleague, "Inconsistent debugQuery score with multiplicative boost", but I wanted to create a new post with a clearer title.
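To make the expected arithmetic in the example above concrete, here is a small Python sketch of what the EXPLAIN output describes: the base score of *:* multiplied by sum(field(price),4). It does not call Solr; the function names are hypothetical, and treating a missing price as 0.0 follows the "float(price)=0.0" line in the EXPLAIN.

```python
# Illustrative arithmetic only; these helpers are not Solr APIs.

def sum_boost(price, constant=4.0):
    """Mimic sum(field(price),4): a missing field value counts as 0.0."""
    field_value = price if price is not None else 0.0
    return field_value + constant


def boosted_score(base_score, boost_value):
    """A multiplicative boost multiplies the base query score."""
    return base_score * boost_value


base = 1.0                                       # score of *:* in the EXPLAIN
expected = boosted_score(base, sum_boost(None))  # document without a price field
reported = 1.0                                   # "score" field in the response

print("expected:", expected)
print("reported:", reported)
print("consistent:", expected == reported)
```

The EXPLAIN tree agrees with the multiplication, while the returned score equals the unboosted base score, which is the inconsistency both messages report.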
Change in EXPLAIN info since Solr 5
Hi group,

While exploring Solr 5.4.0, I noticed a subtle difference in the EXPLAIN debug information compared to the version we currently use (4.10.1).

Solr 4.10.1:

2.0739748 = (MATCH) max plus 1.0 times others of:
  2.0739748 = (MATCH) weight(text:test in 30) [DefaultSimilarity], result of:
    2.0739748 = score(doc=30,freq=3.0), product of:
      0.3556181 = queryWeight, product of:
        3.3671236 = idf(docFreq=17, maxDocs=192)
        0.105614804 = queryNorm
      5.832029 = fieldWeight in 30, product of:
        1.7320508 = tf(freq=3.0), with freq of:
          3.0 = termFreq=3.0
        3.3671236 = idf(docFreq=17, maxDocs=192)
        1.0 = fieldNorm(doc=30)

Solr 5.4.0:

2.0739748 = max plus 1.0 times others of:
  2.0739748 = weight(text:test in 30) [ClassicSimilarity], result of:
    2.0739748 = score(doc=30,freq=3.0), product of:
      0.3556181 = queryWeight, product of:
        3.3671236 = idf(docFreq=17, maxDocs=192)
        0.105614804 = queryNorm
      5.832029 = fieldWeight in 30, product of:
        1.7320508 = tf(freq=3.0), with freq of:
          3.0 = termFreq=3.0
        3.3671236 = idf(docFreq=17, maxDocs=192)
        1.0 = fieldNorm(doc=30)

The difference is the removal of (MATCH) from some of the EXPLAIN lines. That causes issues for us, since we have developed an EXPLAIN parser that relies on the presence of (MATCH) in the EXPLAIN.

Does anyone have a suggestion for how to put (MATCH) back into the explain info (for example, which file should we patch)?

Thanks,
Tom
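One alternative to patching Solr would be to make the EXPLAIN parser itself tolerant of both formats. The sketch below is an assumption about how such a parser could look (the real parser is not shown in this thread); it uses only Python's standard re module and treats "(MATCH) " as an optional prefix of the description.

```python
import re

# Hypothetical line format: "<number> = [(MATCH) ]<description>".
# The optional non-capturing group lets 4.10.1 and 5.4.0 lines parse alike.
EXPLAIN_LINE = re.compile(
    r"^\s*(?P<value>[0-9.eE+-]+) = (?:\(MATCH\) )?(?P<description>.*)$"
)


def parse_explain_line(line):
    """Return (value, description) for one EXPLAIN line, ignoring '(MATCH)'."""
    match = EXPLAIN_LINE.match(line)
    if match is None:
        raise ValueError("not an EXPLAIN line: %r" % line)
    return float(match.group("value")), match.group("description")


old_style = "2.0739748 = (MATCH) weight(text:test in 30) [DefaultSimilarity], result of:"
new_style = "2.0739748 = weight(text:test in 30) [ClassicSimilarity], result of:"

print(parse_explain_line(old_style))
print(parse_explain_line(new_style))
```

With this approach the similarity name still differs between versions (DefaultSimilarity versus ClassicSimilarity), so a parser that keys on that name would need the same kind of tolerance.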
Score results by only the highest scoring term
Hi All,

I wonder if it's in some way possible to search for multiple terms like:

( OR OR OR )

and, in case a document contains 2 or more of these terms, have only the highest scoring term contribute to the final relevancy score; any lower scoring terms should be discarded from the scoring algorithm.

Ideally I'd like an operator like ANY:

( ANY ANY ANY )

whose purpose is: return documents, sorted by the score of the highest scoring term.

Any thoughts about how to achieve this?

Tom Burgmans
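The scoring rule being asked for can be written down in a few lines. The Python sketch below is only a model of the desired behaviour, not a Solr feature; the function name and the tie parameter are assumptions. With tie=0.0 it is the "highest scoring term only" rule; with tie=1.0 it degenerates to a plain sum, matching the "max plus 1.0 times others" wording that Lucene's disjunction-max EXPLAIN output uses.

```python
def any_score(term_scores, tie=0.0):
    """Score a document by its best matching term.

    term_scores: per-term scores for one document (0.0 means no match).
    tie: fraction of the non-maximum scores to add back (0.0 = ignore them).
    """
    matched = [s for s in term_scores if s > 0.0]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie * (sum(matched) - best)


doc_a = [0.8, 2.1, 0.0]   # matches two terms
doc_b = [0.0, 0.0, 2.5]   # matches one, higher scoring, term

print(any_score(doc_a), any_score(doc_b))
```

Under this rule doc_b ranks above doc_a even though doc_a matches more terms, which is the ordering the message asks for.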
incomplete proximity boost for fielded searches
Consider query:

http://10.208.152.231:8080/solr/wkustaldocsphc_A/search?q=title:(Michigan Corporate Income Tax)&debugQuery=true&pf=title&ps=255&defType=edismax

The intention is to perform a search in field title and to apply a proximity boost within a window of 255 words. If I look at the debug information, I see:

BoostedQuery(boost(+((title:michigan title:corporate title:income title:tax)~4) (title:"corporate income tax"~255)~1.0))

Note that the first search term (michigan) is missing in the proximity boost clause. I can't believe this is intended behavior. Why is edismax splitting (title:Michigan) and (Corporate Income Tax) while determining what to use for proximity boost?

Thanks,
Tom
RE: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)
The main reason for using stopwords is to speed up queries, since we see that a huge part of the query time is consumed by highlighting stopwords. Also, when reading the fully highlighted document, we think the document is more readable when only meaningful words are highlighted. For searching I would in fact like to keep stopwords...

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Wednesday 13 March 2013 04:43
To: solr-user@lucene.apache.org
Subject: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)
Importance: Low

Or don't use stopwords. I haven't used stopwords for, oh, a dozen years or so. Removing stopwords was a hack developed for 16-bit computers and 40 megabyte disks. We don't need to do that any more.

wunder

On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:

> I would merge stop_en.txt and stop_fr.txt. Use same set of stop words for all
> fields that you search on.
>
> You might find this useful :
> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>
> --- On Wed, 3/13/13, Burgmans, Tom wrote:
>
>> From: Burgmans, Tom
>> Subject: strange edismax parsing when searching in multiple fields (#TB)
>> To: "solr-user@lucene.apache.org"
>> Date: Wednesday, March 13, 2013, 5:22 PM
>>
>> Hi group,
>>
>> Background:
>> I have a collection containing English and French documents.
>> I made sure to index the English content in field "body"
>> (fieldType=text_en) and the French content in field
>> "body_fr" (fieldType=text_fr).
>>
>> The user could be either English or French so the goal is to
>> execute the queries against both fields simultaneously
>> without knowing the query language upfront. The query is
>> analyzed differently for each field. For both fields a
>> stopFilter is configured, each with its own list of stopwords
>> (different per language).
>>
>> The issue:
>> When I search for 'a result' (without single quotes) in
>> field "body" and "body_fr" at the same time, then "a" is
>> considered a stopword in English and removed for field
>> "body", but not in French so both terms are still searched
>> inside "body_fr". What happens is that the query is parsed
>> (edismax) into this construction:
>>
>> ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)
>>
>> This query returns only French documents, although there are
>> many English documents in the index that contain the term
>> 'result' as well. How can that happen? I think it is related
>> to the way my query is parsed: there seems to be an
>> AND-relationship between (body_fr:a) and (body:result |
>> body_fr:result). There is no English document that has
>> (body_fr:a), so that's why they don't show up. For me a much
>> more logical parsed query would be:
>>
>> ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)
>>
>> How should I interpret this? Is it a bug in edismax? Is it
>> intended and if yes: why?
>>
>> Thanks for any hint,
>> Tom
>>
>> This email and any attachments may contain confidential or
>> privileged information
>> and is intended for the addressee only. If you are not the
>> intended recipient, please
>> immediately notify us by email or telephone and delete the
>> original email and attachments
>> without using, disseminating or reproducing its contents to
>> anyone other than the intended
>> recipient. Wolters Kluwer shall not be liable for the
>> incorrect or incomplete transmission of
>> of this email or any attachments, nor for unauthorized use
>> by its employees.
>>
>> Wolters Kluwer nv has its registered address in Alphen aan
>> den Rijn, The Netherlands, and is registered
>> with the Trade Registry of the Dutch Chamber of Commerce
>> under number 33202517.
>>

--
Walter Underwood
wun...@wunderwood.org
strange edismax parsing when searching in multiple fields (#TB)
Hi group,

Background:
I have a collection containing English and French documents. I made sure to index the English content in field "body" (fieldType=text_en) and the French content in field "body_fr" (fieldType=text_fr).

The user could be either English or French, so the goal is to execute the queries against both fields simultaneously without knowing the query language upfront. The query is analyzed differently for each field. For both fields a stopFilter is configured, each with its own list of stopwords (different per language).

The issue:
When I search for 'a result' (without single quotes) in field "body" and "body_fr" at the same time, then "a" is considered a stopword in English and removed for field "body", but not in French, so both terms are still searched inside "body_fr". What happens is that the query is parsed (edismax) into this construction:

((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)

This query returns only French documents, although there are many English documents in the index that contain the term 'result' as well. How can that happen? I think it is related to the way my query is parsed: there seems to be an AND-relationship between (body_fr:a) and (body:result | body_fr:result). There is no English document that has (body_fr:a), so that's why they don't show up. For me a much more logical parsed query would be:

((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)

How should I interpret this? Is it a bug in edismax? Is it intended and if yes: why?

Thanks for any hint,
Tom
RE: Search in String and Text_en fields simultaneously with edismax
Ah OK. I didn't have a good view of query parsing vs query generation. Thanks for clearing this up.

So it means that searching in a tokenized and a non-tokenized field simultaneously is not possible when I want
- the expression parsed as a phrase for the non-tokenized field
- the expression parsed as multiple tokens for the tokenized field
?

If possible, I'd like to avoid writing my own query parser.

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday 28 February 2013 05:05
To: solr-user@lucene.apache.org
Subject: Re: Search in String and Text_en fields simultaneously with edismax

Query text is always "tokenized" (more properly, "parsed"), unless the text is enclosed in quotes or spaces are escaped with backslash.

Try:

q=valueadd:"test . test2"

or

q=valueadd:test\ .\ test2

Parentheses simply provide grouping, either to control boolean operator evaluation order or to apply a field name to a sequence of query tokens (as you have written.)

The analyzer or field type is only consulted when the query is generated, not while it is being parsed. The same identical parsing rules apply to both tokenized and non-tokenized fields. What a field type's analyzer does with its value is irrelevant to query parsing.

-- Jack Krupansky

-----Original Message-----
From: Burgmans, Tom
Sent: Thursday, February 28, 2013 10:48 AM
To: solr-user@lucene.apache.org
Subject: Search in String and Text_en fields simultaneously with edismax

I have a field "valueadd" of type String and field "body" of type text_en (with tokenization and linguistic processing). When I search with edismax against field valueadd like this:

q=valueadd:(test . test2)

I see that the parsed query is

(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2) ? It looks like the query is tokenized while field type String doesn't have a tokenizer configured. I know I could construct my query as:

q=valueadd:"test . test2"

in which case the phrase is searched as a whole against valueadd. But why doesn't that happen without quotes?

The reason I ask: for a simultaneous search in multiple fields I'd like to include field valueadd in the qf parameter, which contains String and text_en fields, like:

&qf=valueadd body

How can I search both fields simultaneously without duplicating search terms, while the query is (whitespace) tokenized for "body" but searched as a phrase for "valueadd"?

Thanks,
Tom Burgmans
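The two forms suggested in the reply above (quoting the value, or escaping its spaces with a backslash) can be produced by a small client-side helper. This Python sketch is an illustration only: the helper names are hypothetical, it handles just spaces, double quotes and backslashes, and it does not attempt to cover the full set of query-syntax special characters.

```python
def as_quoted_value(field, text):
    """Build field:"text", escaping backslashes and double quotes inside."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '%s:"%s"' % (field, escaped)


def with_escaped_spaces(field, text):
    """Build field:text with every space preceded by a backslash."""
    return "%s:%s" % (field, text.replace(" ", "\\ "))


print(as_quoted_value("valueadd", "test . test2"))
print(with_escaped_spaces("valueadd", "test . test2"))
```

Either form keeps the parser from splitting the value on whitespace, so the String field receives it as a single term; the body field would still need its own, unquoted clause.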
Search in String and Text_en fields simultaneously with edismax
I have a field "valueadd" of type String and field "body" of type text_en (with tokenization and linguistic processing). When I search with edismax against field valueadd like this:

q=valueadd:(test . test2)

I see that the parsed query is

(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2) ? It looks like the query is tokenized while field type String doesn't have a tokenizer configured. I know I could construct my query as:

q=valueadd:"test . test2"

in which case the phrase is searched as a whole against valueadd. But why doesn't that happen without quotes?

The reason I ask: for a simultaneous search in multiple fields I'd like to include field valueadd in the qf parameter, which contains String and text_en fields, like:

&qf=valueadd body

How can I search both fields simultaneously without duplicating search terms, while the query is (whitespace) tokenized for "body" but searched as a phrase for "valueadd"?

Thanks,
Tom Burgmans
How do I let the FVH highlight individual terms instead of the complete phrase?
Hi group,

I'm trying to highlight my complete(!) XML document, which is indexed for that purpose in a special field called "wkxmlsource". I configured the "wkxmlsource" field like

And the text_xml fieldtype is almost equal to the text_en field, but with the as the first class in the index analyzer. That prevents highlighting inside XML tags.

First I tried the simple highlighter and that almost worked: I get my document back with my search terms and phrases highlighted, and each individual term gets its own highlight tags. But the problem is that the complete value of field "wkxmlsource" is not returned; the bottom part is cut off, no matter how big I set hl.fragsize.

So my next try was to use the FVH (hl.useFastVectorHighlighter=true) instead. That helped: it now returns the complete value of "wkxmlsource" with all my search terms/phrases highlighted. But in the case of a phrase search, it no longer highlights each individual term; it only puts highlight tags around the complete phrase. That could possibly lead to malformed XML.

An example: search for the phrase "across the country Santa Fe". It highlights like this in the document:

...spread across the country.Santa Fe Pacific...

How can I let the FVH highlight individual terms instead of the complete phrase? Ideally I'd like to have something like:

...spread across the country.Santa Fe Pacific...

which is still valid XML.

My boundaryscanner is configured like:

WORD en US

Thanks,
Tom

--
Tom Burgmans
Search Specialist
Tel: +31 (0)17 246 66 33
Mobile: +31 (0)6 306 821 78
Platform Technologies
Global Platform Organization
Zuidpoolsingel 2
2408 ZE, Alphen aan den Rijn
The Netherlands
tom.burgm...@wolterskluwer.com
www.wolterskluwer.com
RE: score calculation
I am also busy getting this clear. Here are my notes so far (partly copied, partly written myself):

queryWeight = the impact of the query against the field
  implementation: boost(query)*idf*queryNorm

  boost(query) = boost of the field at query-time
    Implication: hits in fields with higher boost get a higher score
    Rationale: a term in field A could be more relevant than the same term in field B

  idf = inverse document frequency = measure of how often the term appears across the index for this field
    implementation: log(numDocs/(docFreq+1))+1
    Implication: the greater the occurrence of a term in different documents, the lower its score
    Rationale: common terms are less important than uncommon ones

    numDocs = the total number of documents in the index, not including those that are marked as deleted but have not yet been purged. This is a constant (the same value for all documents in the index).
    docFreq = the number of documents in the index which contain the term in this field. This is a constant (the same value for all documents in the index containing this field)

  queryNorm = normalization factor so that queries can be compared
    implementation: 1/sqrt(sumOfSquaredWeights)
    Implication: doesn't impact the relevancy of this result
    Rationale: queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. This value is equal for all results of the query

fieldWeight = the score of a term matching the field
  implementation: tf*idf*fieldNorm

  tf = term frequency in a field = measure of how often a term appears in the field
    implementation: sqrt(freq)
    Implication: the more frequently a term occurs in a field, the greater its score
    Rationale: fields which contain more of a term are generally more relevant

    freq = termFreq = number of times the term occurs in the field for this document

  fieldNorm = impact of a hit in this field
    implementation: lengthNorm*boost(index)

    lengthNorm = measure of the importance of a term according to the total number of terms in the field
      implementation: 1/sqrt(numTerms)
      Implication: a term matched in a field with fewer terms has a higher score
      Rationale: a term in a field with fewer terms is more important than one in a field with more

      numTerms = number of terms in a field

    boost(index) = boost of the field at index-time
      Implication: hits in fields with higher boost get a higher score
      Rationale: a term in field A could be more relevant than the same term in field B

maxDocs = the number of documents in the index, including those that are marked as deleted but have not yet been purged. This is a constant (the same value for all documents in the index)
  Implication: (probably) doesn't play a role in the scoring calculation

coord = fraction of the terms in the query that were found in the document (omitted if equal to 1)
  implementation: overlap/maxOverlap
  Implication: of the terms in the query, a document that contains more terms will have a higher score
  Rationale: documents that match the most optional terms score highest

  overlap = the number of query terms matched in the document
  maxOverlap = the total number of terms in the query

FunctionQuery = could be any kind of custom ranking function, whose outcome is added to, or multiplied with, the default rank score.
  Implication: various

Look at the EXPLAIN information to see how the final score is calculated.

Tom

-----Original Message-----
From: Sangeetha [mailto:sangeetha...@gmail.com]
Sent: Thursday 13 December 2012 08:33
To: solr-user@lucene.apache.org
Subject: score calculation

I want to know how the score is calculated. What are fieldWeight, fieldNorm, queryWeight and queryNorm? And what is the formula to get the final score using fieldWeight, fieldNorm, queryWeight, queryNorm, idf and tf? Can anyone explain or provide some links?

Thanks,
Sangeetha

--
View this message in context: http://lucene.472066.n3.nabble.com/score-calculation-tp4026669.html
Sent from the Solr - User mailing list archive at Nabble.com.
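The formulas in the notes above can be combined into a short calculation. The Python sketch below is an illustration under stated assumptions, not Lucene code: the function names are made up, it covers a single term with no coord or function query, and the sample inputs (freq=3, docFreq=17, 192 documents, fieldNorm=1.0, queryNorm=0.105614804) are taken from the EXPLAIN output in the "Change in EXPLAIN info since Solr 5" message in this archive. Note that the EXPLAIN there labels the document count maxDocs; it is passed here in the position the notes call numDocs.

```python
import math


def idf(doc_freq, num_docs):
    """log(numDocs/(docFreq+1)) + 1, using the natural logarithm."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def tf(freq):
    """sqrt(freq)."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, num_docs, field_norm, query_norm, query_boost=1.0):
    """queryWeight * fieldWeight for one term, as laid out in the notes."""
    term_idf = idf(doc_freq, num_docs)
    query_weight = query_boost * term_idf * query_norm
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight


score = term_score(freq=3.0, doc_freq=17, num_docs=192,
                   field_norm=1.0, query_norm=0.105614804)
print("idf   =", idf(17, 192))
print("tf    =", tf(3.0))
print("score =", score)
```

With these inputs the intermediate values come out close to the EXPLAIN lines quoted in that message (idf about 3.3671, tf about 1.7321, score about 2.0740), which suggests the count labelled maxDocs is the one fed into idf in that version.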
RE: Can a field with defined synonym be searched without the synonym?
In our case it's the opposite. For our clients it is very important that every synonym gets an equal chance in the relevancy calculation. The fact that "nol" scores higher than "net operating loss", simply because its document frequency is lower, is unacceptable and a reason to look for ways to remove the IDF from the score calculation. But that is in fact something I'd rather not do, since IDF is such an elementary part of the algorithm (and very useful for non-synonym searches).

Pre-processing synonyms to apply 'reverse weighting' is also a strategy to consider, but I agree with Walter that this is very error-prone; things could easily get out of sync. Moreover, none of our Dev, QA, STG and PRD environments contain exactly the same content, so it would require a differently tuned synonym dictionary for each of them...meh...

In our previous search engine (FAST ESP) we basically switched off IDF, but I am still hoping that there is a more sophisticated solution with Solr.

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Thursday 13 December 2012 02:30
To: solr-user@lucene.apache.org
Subject: Re: Can a field with defined synonym be searched without the synonym?

All of the applications I've seen with user control over synonym expansion were recall-oriented. The "give me all matches for X" kind of problem. So ranking is not as important.

wunder

On Dec 12, 2012, at 5:23 PM, Roman Chyla wrote:

> Well, this IDF problem has more sides. So, let's say your synonym file
> contains multi-token synonyms (it does, right? or perhaps you don't need
> it? well, some people do)
>
> "TV, TV set, TV foo, television"
>
> if you use the default synonym expansion, when you index 'television'
>
> you have increased frequency of also 'set', 'foo', so, the IDF of 'TV' is
> the same as that of 'television' - but IDF of 'foo' and 'set' has changed
> (their frequency increased, their IDF decreased) -- TV's have in fact made
> 'foo' term very frequent and undesirable
>
> So, you might be sure that IDF of 'TV' and 'television' are the same, but
> you are not aware it has 'screwed' other (desirable) terms - so it really
> depends. And I wouldn't argue these cases are esoteric.
>
> And finally: there are use cases out there, where people NEED to switch off
> synonym expansion at will (find only these documents, that contain the word
> 'TV' and not that bloody 'foo'). This cannot be done if the index contains
> all synonym terms (unless you have a way to mark the original and the
> synonym in the index).
>
> roman
>
>
> On Wed, Dec 12, 2012 at 12:50 PM, Walter Underwood
> wrote:
>
>> Query parsers cannot fix the IDF problem or make query-time synonyms
>> faster. Query synonym expansion makes more search terms. More search terms
>> are more work at query time.
>>
>> The IDF problem is real; I've run up against it. The rarest variant of
>> the synonym has the highest score. This is probably the opposite of what you
>> want. For me, it was "TV" and "television". Documents with "TV" had higher
>> scores than those with "television".
>>
>> wunder
>>
>> On Dec 12, 2012, at 9:45 AM, Roman Chyla wrote:
>>
>>> @wunder
>>> It is a misconception (well, supported by that wiki description) that the
>>> query time synonym filter have these problems. It is actually the default
>>> parser, that is causing these problems. Look at this if you still think
>>> that index time synonyms are cure for all:
>>> https://issues.apache.org/jira/browse/LUCENE-4499
>>>
>>> @joe
>>> If you can use the flexible query parser (as linked in by @Swati) then all
>>> you need to do is to define a different field with a different tokenizer
>>> chain and then swap the field names before the analyzers processes the
>>> document (and then rewrite the field name back - for example, we have
>>> fields called "author" and "author_nosyn")
>>>
>>> roman
>>>
>>> On Wed, Dec 12, 2012 at 12:38 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>
>>>> Query time synonyms have known problems. They are slower, cause
>>>> incorrect IDF, and don't work for phrase synonyms.
>>>>
>>>> Apply synonyms at index time and you will have none of those problems.
>>>>
>>>> See:
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
>>>>
>>>> wunder
>>>>
>>>> On Dec 12, 2012, at 9:34 AM, Swati Swoboda wrote:
>>>>
>>>>> Query-time analyzers are still applied, even if you include a string
>>>>> in quotes. Would you expect "foo" to not match "Foo" just because it's
>>>>> enclosed in quotes?
>>>>>
>>>>> Also look at this, someone who had similar requirements:
>>>>> http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-td2919876.html
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: joe.cohe...@gmail.com [mailto:joe.cohe...@gmail.com]
>>>>> Sent: Wednesday, December 12, 2012 12:09 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Can a field with defined synon
RE: edismax: implicit AND changes into implicit OR
Yes /browse returns velocity stuff, but I mostly add &wt=xml in the query. And yes, I looked at the parsedquery feedback that &debugQuery=true provides. That basically confirms my idea that the implicit AND is indeed switched to an implicit OR in case an explicit OR is somewhere else present in the query. Even the default operator set to AND seems to be overruled. Thanks, I'll think about submitting a Jira. -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Wednesday 12 December 2012 06:43 To: solr-user@lucene.apache.org Subject: Re: edismax: implicit AND changes into implicit OR On 12/12/2012 10:27 AM, Burgmans, Tom wrote: > I have set in the schema (and > restarted Solr), and tested again with > > http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx&q.op=AND > > note the extra parameter. Still it returns the 7 documents that matches > (Thomas OR Michael), but not (Thomas AND Michael). > > The only way to enforce an implicit AND is by changing the query into > > http://localhost:8983/solr/collection1/browse?defType=edismax&q=(%2BThomas+%2BMichael)+OR+%2Bxxxmatchesnothingxxx > > But then the AND isn't implicit anymore...and I don't like to prefix all my > search terms with a +. It smells like a bug to me, so you should probably file an issue in Jira. I will admit that this is getting somewhat outside my experience level. I noticed the /browse there ... is this just what you have named your handler, or is this connected with the Velocity stuff? Have you tried adding &debugQuery=true to your URL and seeing what your different queries actually parse to? It may also be a good idea to add &echoParams=all so you can see all parameters that are going into the request. Thanks, Shawn This email and any attachments may contain confidential or privileged information and is intended for the addressee only. 
If you are not the intended recipient, please immediately notify us by email or telephone and delete the original email and attachments without using, disseminating or reproducing its contents to anyone other than the intended recipient. Wolters Kluwer shall not be liable for the incorrect or incomplete transmission of this email or any attachments, nor for unauthorized use by its employees. Wolters Kluwer nv has its registered address in Alphen aan den Rijn, The Netherlands, and is registered with the Trade Registry of the Dutch Chamber of Commerce under number 33202517.
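The debugging parameters mentioned in this exchange (&wt=xml, &debugQuery=true, &echoParams=all, plus q.op=AND) can be combined into a single request. Below is a minimal Python sketch of how such a URL might be assembled. The helper name `debug_url` is made up for illustration, and the host, core and /browse handler are simply copied from the thread, so adjust them to your own setup.

```python
from urllib.parse import urlencode

# Base URL copied from the thread; host, core and handler are assumptions
# about the poster's setup, not something every Solr install will have.
BASE = "http://localhost:8983/solr/collection1/browse"


def debug_url(q, base=BASE):
    """Build an edismax request that also asks Solr how it parsed q.

    debug_url is a hypothetical helper, not part of any Solr client library.
    """
    params = [
        ("defType", "edismax"),
        ("q", q),
        ("q.op", "AND"),         # the default operator being requested
        ("wt", "xml"),           # XML response instead of the Velocity page
        ("debugQuery", "true"),  # adds the parsedquery section to the response
        ("echoParams", "all"),   # echoes every parameter Solr received
    ]
    return base + "?" + urlencode(params)


url = debug_url("(Thomas Michael) OR xxxmatchesnothingxxx")
print(url)
```

Comparing the parsedquery output for the query with and without the trailing OR clause is what shows whether the two terms end up as required clauses or optional ones.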
RE: edismax: implicit AND changes into implicit OR
I have set the default operator to AND in the schema (and restarted Solr), and tested again with

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx&q.op=AND

Note the extra parameter. It still returns the 7 documents that match (Thomas OR Michael), not (Thomas AND Michael).

The only way to enforce an implicit AND is by changing the query into

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(%2BThomas+%2BMichael)+OR+%2Bxxxmatchesnothingxxx

But then the AND isn't implicit anymore... and I don't like to prefix all my search terms with a +.

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday 12 December 2012 05:46
To: solr-user@lucene.apache.org
Subject: Re: edismax: implicit AND changes into implicit OR

On 12/12/2012 5:51 AM, Burgmans, Tom wrote:
> I have some documents indexed; 3 of them contain "Thomas" and 4 of them contain "Michael", but none of them contain both. A search for
>
> http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)
>
> returns 0 results as expected, since there is an implicit AND between the two terms and there is no document that matches both. But a search for
>
> http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx
>
> returns 7 results. For some reason the implicit AND turns into an implicit OR when an explicit OR is added to the query expression. The parsedquery information confirms this behavior.

I'll give you my best guess, nothing to back this up but instinct.
The following statements (especially the second one) may be wrong: When you do not include any boolean operators, edismax is using its "mm" parameter, which defaults to 100%, meaning that all search terms must match (equivalent to a default operator of AND). When you DO include a boolean operator, mm goes out the window and edismax reverts to using the default operator for solr, your schema, or the request handler, which unless you have changed it, is OR. Thanks, Shawn
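One way to probe the guess above is to send the same query with and without an explicit mm or q.op and compare the parsed queries. The Python sketch below only builds the request URLs. `build_url` and the `variants` dictionary are illustrative names, the base URL is taken from the thread, and whether pinning mm=100% changes the result is exactly what the experiment would have to show; nothing in the thread confirms it.

```python
from urllib.parse import urlencode

# Base URL as used in the thread; treat host, core and handler as assumptions.
BASE = "http://localhost:8983/solr/collection1/browse"
QUERY = "(Thomas Michael) OR xxxmatchesnothingxxx"


def build_url(q, extra=None, base=BASE):
    """Hypothetical helper: edismax request with debug output switched on."""
    params = {"defType": "edismax", "q": q, "wt": "xml", "debugQuery": "true"}
    params.update(extra or {})  # per-variant parameters such as mm or q.op
    return base + "?" + urlencode(params)


# The same query under three settings, so the parsedquery sections can be
# compared side by side.
variants = {
    "defaults": build_url(QUERY),
    "mm=100%": build_url(QUERY, {"mm": "100%"}),   # '%' is sent as %25
    "q.op=AND": build_url(QUERY, {"q.op": "AND"}),
}

for label, url in variants.items():
    print(label, url)
```

If the parsed query for the mm=100% variant still shows the two terms as optional clauses, that would suggest mm is not applied once an explicit operator appears, which is what the second statement above speculates.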
edismax: implicit AND changes into implicit OR
Hi all,

I wonder if this is a bug or expected behavior:

I have some documents indexed; 3 of them contain "Thomas" and 4 of them contain "Michael", but none of them contain both. A search for

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)

returns 0 results as expected, since there is an implicit AND between the two terms and there is no document that matches both. But a search for

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx

returns 7 results. For some reason the implicit AND turns into an implicit OR when an explicit OR is added to the query expression. The parsedquery information confirms this behavior.

Why is edismax doing this? Tested on a Solr 4.0.0 instance.

Thanks, Tom

--
Tom Burgmans
Search Specialist
Tel: +31 (0)17 246 66 33
Mobile: +31 (0)6 306 821 78
Platform Technologies
Global Platform Organization
Zuidpoolsingel 2
2408 ZE, Alphen aan den Rijn
The Netherlands
tom.burgm...@wolterskluwer.com
www.wolterskluwer.com
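For anyone who wants to reproduce the two requests from this message, here is a minimal Python sketch that builds both URLs. The helper name `edismax_url` is invented for illustration and the base URL is the one quoted above. Note that `urlencode` percent-encodes the parentheses, so the URLs look slightly different from the hand-typed ones but carry the same query.

```python
from urllib.parse import urlencode

# Base URL quoted in the message; adjust host, core and handler as needed.
BASE = "http://localhost:8983/solr/collection1/browse"


def edismax_url(q, base=BASE):
    """Hypothetical helper: build an edismax request URL for query string q."""
    return base + "?" + urlencode({"defType": "edismax", "q": q})


# Query 1: two bare terms, reported to return 0 results.
implicit_and = edismax_url("(Thomas Michael)")

# Query 2: same terms plus an explicit OR clause, reported to return 7 results.
with_explicit_or = edismax_url("(Thomas Michael) OR xxxmatchesnothingxxx")

print(implicit_and)
print(with_explicit_or)
```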