from:"Burgmans, Tom"

mapping and tuning payloads in Solr 8

2020-02-12 Thread Burgmans, Tom

Hi all,

In our Solr 6 setup we use string payloads to boost certain tokens (URIs). 
These strings are mapped to floats via a schema parameter "PayloadMapping", 
which can be read out in our custom WKSimilarity class (extending 
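TFIDFSimilarity).

[The similarity configuration was XML; the archive stripped the markup and only the parameter values survive.]

0.4
0.4
0.5
0
0.0
10.0
3.0
1.0
isAbout=15.0,coversFiscalPeriod=10.0,type=5.0,hasTheme=5.0,subject=4.0,mentions=2.0,creator=2.0
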

The reason for this indirection is convenience: by storing payload strings 
instead of floats we can change and tune the boosts easily by updating the 
schema, without having to change the content set.
Inside WKSimilarity each payload string is mapped to its corresponding boost 
value and the final boost is applied via the scorePayload method (where we 
could tune the boost curve via some additional schema parameters). This works 
well in Solr 6.
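
To make the indirection concrete, here is a minimal Python sketch of the mapping step only. It is not the WKSimilarity code (which is Java and not shown in this thread); the function names and the default boost of 1.0 for unknown payloads are assumptions made for illustration.

```python
def parse_payload_mapping(spec):
    """Parse a 'name=boost,name=boost' string into a dict of floats."""
    mapping = {}
    for pair in spec.split(","):
        name, _, value = pair.strip().partition("=")
        mapping[name] = float(value)
    return mapping


def payload_boost(payload, mapping, default=1.0):
    """Look up the boost for a payload string.

    The default of 1.0 for unmapped payloads is an assumption.
    """
    return mapping.get(payload, default)


# The mapping string as it appears in the schema parameter above.
PAYLOAD_MAPPING = (
    "isAbout=15.0,coversFiscalPeriod=10.0,type=5.0,hasTheme=5.0,"
    "subject=4.0,mentions=2.0,creator=2.0"
)

mapping = parse_payload_mapping(PAYLOAD_MAPPING)
print(payload_boost("isAbout", mapping))
```

Because the content only stores the string, changing a boost means editing the mapping string and nothing else.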

The problem: we are about to migrate to Solr 8, and since LUCENE-8014 it is no 
longer possible to override the scorePayload method in WKSimilarity (it has 
been removed from TFIDFSimilarity). I wonder what alternatives there are for 
mapping string payloads to floats and using them in a tunable formula for 
boosting.

Thanks,
Tom Burgmans

RE: Multiplicative Boosts broken since 7.3 (LUCENE-8099)

2019-02-13 Thread Burgmans, Tom

I would like to bump this issue, since it is a showstopper for our upgrade 
from Solr 6. In https://issues.apache.org/jira/browse/SOLR-13126 I described a 
few more use cases in which this bug appears. We see different scores in 
the EXPLAIN compared to the actual scores, and our analysis is that the EXPLAIN 
is in fact correct. It happens when a multiplicative boost is used (via the 
"boost" parameter) in combination with some function queries, like "query" and 
"field". 

One example (tested on Solr 7.5.0), when running: 

http://localhost:8983/solr/test/select?defType=edismax&fl=id,score,[explain style=text]&q=*:*&boost=sum(field(price),4)

then the expectation is that a document that doesn't have the price field gets 
a score of 4. The result however is: 

{
"id": "docid123576",
"score": 1.0,
"[explain]": "4.0 = product of:\n  1.0 = boost\n  4.0 = product of:\n
1.0 = *:*\n4.0 = sum(float(price)=0.0,const(4))\n"
}

EXPLAIN and score are not consistent.
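
The expected value can be worked out by hand. The sketch below only reproduces the arithmetic that the EXPLAIN output shows (base score times the boost function); the function name is made up for illustration, and a missing price is treated as 0.0, as the explain line float(price)=0.0 indicates.

```python
def expected_boosted_score(base_score, price=None):
    """Multiplicative boost: base score times sum(field(price), 4)."""
    price_value = 0.0 if price is None else float(price)
    boost = price_value + 4
    return base_score * boost


# q=*:* gives every document a base score of 1.0.
print(expected_boosted_score(1.0))
```

The result matches the 4.0 that EXPLAIN reports, not the 1.0 returned in the score field.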

Best regards Tom


-Original Message-
From: Tobias Ibounig [mailto:t.ibou...@netconomy.net] 
Sent: dinsdag 22 januari 2019 10:14
To: solr-user@lucene.apache.org
Subject: Multiplicative Boosts broken since 7.3 (LUCENE-8099)

Hello,

As described in https://issues.apache.org/jira/browse/SOLR-13126, 
multiplicative boosts (in certain conditions) seem to be broken since 7.3.
The error seems to have been introduced in 
https://issues.apache.org/jira/browse/LUCENE-8099. 
Reverting the SOLR parts to the now deprecated BoostingQuery fixes the 
issue.
The filed issue contains a test case and a patch with the revert (for testing 
purposes, not really a clean fix).
We sadly couldn't find the actual issue, which seems to lie with the use of 
"FunctionScoreQuery" for boosting.

We were able to patch our 7.5 installation with the patch. As others might be 
affected as well, we hope this can be helpful in resolving this bug.

To all SOLR/Lucene developers, thank you for your work. Looking through the 
code base gave me a new appreciation of it.

Best Regards,
Tobias

PS: This issue was already posted by a colleague, "Inconsistent debugQuery 
score with multiplicative boost", but I wanted to create a new post with a 
clearer title.

Change in EXPLAIN info since Solr 5

2016-02-04 Thread Burgmans, Tom

Hi group, 

While exploring Solr 5.4.0, I noticed a subtle difference in the EXPLAIN debug 
information, compared to the version we currently use (4.10.1).

Solr 4.10.1:

2.0739748 = (MATCH) max plus 1.0 times others of:
  2.0739748 = (MATCH) weight(text:test in 30) [DefaultSimilarity], result of:
    2.0739748 = score(doc=30,freq=3.0), product of:
      0.3556181 = queryWeight, product of:
        3.3671236 = idf(docFreq=17, maxDocs=192)
        0.105614804 = queryNorm
      5.832029 = fieldWeight in 30, product of:
        1.7320508 = tf(freq=3.0), with freq of:
          3.0 = termFreq=3.0
        3.3671236 = idf(docFreq=17, maxDocs=192)
        1.0 = fieldNorm(doc=30)

Solr 5.4.0:

2.0739748 = max plus 1.0 times others of:
  2.0739748 = weight(text:test in 30) [ClassicSimilarity], result of:
    2.0739748 = score(doc=30,freq=3.0), product of:
      0.3556181 = queryWeight, product of:
        3.3671236 = idf(docFreq=17, maxDocs=192)
        0.105614804 = queryNorm
      5.832029 = fieldWeight in 30, product of:
        1.7320508 = tf(freq=3.0), with freq of:
          3.0 = termFreq=3.0
        3.3671236 = idf(docFreq=17, maxDocs=192)
        1.0 = fieldNorm(doc=30)

The difference is the removal of (MATCH) in some of the EXPLAIN lines. That 
causes issues for us, since we have developed an EXPLAIN parser that relies on 
the presence of (MATCH) in the EXPLAIN.
Does anyone have a suggestion for how to put (MATCH) back into the explain info 
(for example, which file we should patch)?
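
One alternative to patching Solr is to make the parser tolerant of both formats. The following is a minimal Python sketch under the assumption that each EXPLAIN line has the shape "value = description"; it is not the parser referred to above, and the names are hypothetical.

```python
import re

# "(MATCH) " is optional, so lines from 4.10.1 and 5.4.0 parse the same way.
EXPLAIN_LINE = re.compile(
    r"^(?P<indent>\s*)(?P<value>\S+) = (?:\(MATCH\) )?(?P<description>.*)$"
)


def parse_explain_line(line):
    """Return (indent width, numeric value, description), or None."""
    match = EXPLAIN_LINE.match(line)
    if match is None:
        return None
    return (
        len(match.group("indent")),
        float(match.group("value")),
        match.group("description"),
    )


old = "  2.0739748 = (MATCH) weight(text:test in 30) [DefaultSimilarity], result of:"
new = "  2.0739748 = weight(text:test in 30) [ClassicSimilarity], result of:"
print(parse_explain_line(old))
print(parse_explain_line(new))
```

With this approach only the similarity name differs between the two parsed lines, so the rest of the parser does not need to know which Solr version produced the output.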

Thanks, Tom

Score results by only the highest scoring term

2015-02-03 Thread Burgmans, Tom

Hi All,

I wonder if it's in some way possible to search for multiple terms like:

(term1 OR term2 OR term3 OR term4)

and, in case a document contains two or more of these terms, only the highest 
scoring term should contribute to the final relevancy score; lower scoring 
terms should be left out of the score calculation.

Ideally I'd like an operator like ANY:

(term1 ANY term2 ANY term3 ANY term4)

that has the purpose: return documents, sorted by the score of the highest 
scoring term.

Any thoughts about how to achieve this?
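
For reference, the scoring rule asked for here is the one Lucene's DisjunctionMaxQuery uses when its tie-breaker is 0: the score is the maximum sub-score plus the tie-breaker times the remaining sub-scores. The Python sketch below only illustrates that formula; whether it can be wired into this particular query is not settled by it, and the function name is made up.

```python
def dismax_score(term_scores, tie=0.0):
    """Max sub-score plus tie times the sum of the other sub-scores."""
    if not term_scores:
        return 0.0
    best = max(term_scores)
    return best + tie * (sum(term_scores) - best)


scores = [0.2, 1.5, 0.7]            # hypothetical per-term scores
print(dismax_score(scores))         # only the best term counts
print(dismax_score(scores, tie=1))  # behaves like a plain sum
```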

_
Tom Burgmans

incomplete proximity boost for fielded searches

2014-08-28 Thread Burgmans, Tom

Consider query:
http://10.208.152.231:8080/solr/wkustaldocsphc_A/search?q=title:(Michigan Corporate Income Tax)&debugQuery=true&pf=title&ps=255&defType=edismax

The intention is to perform a search in field title and to apply a proximity 
boost within a window of 255 words. If I look at the debug information, I see:


BoostedQuery(boost(+((title:michigan title:corporate title:income title:tax)~4) 
(title:"corporate income tax"~255)~1.0))


Note that the first search term (michigan) is missing in the proximity boost 
clause. I can't believe this is intended behavior. 

Why does edismax split (title:Michigan) from (Corporate Income Tax) when 
determining what to use for the proximity boost?

Thanks, Tom

RE: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)

2013-03-13 Thread Burgmans, Tom

The main reason for using stopwords is to speed up query performance, since we 
see that a large part of the query time is consumed by highlighting stopwords. 
Also, when reading the full highlighted document, we think a document is more 
readable when only meaningful words are highlighted.

For searching I would in fact like to keep stopwords...


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Wednesday 13 March 2013 04:43
To: solr-user@lucene.apache.org
Subject: [SPAM] Re: strange edismax parsing when searching in multiple fields 
(#TB)
Importance: Low

Or don't use stopwords. I haven't used stopwords for, oh, a dozen years or so.

Removing stopwords was a hack developed for 16-bit computers and 40 megabyte 
disks. We don't need to do that any more.

wunder

On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:

> I would merge stop_en.txt and stop_fr.txt. Use same set of stop words for all 
> fields that you search on.
>
> You might find this useful : 
> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>
> --- On Wed, 3/13/13, Burgmans, Tom  wrote:
>
>> From: Burgmans, Tom 
>> Subject: strange edismax parsing when searching in multiple fields (#TB)
>> To: "solr-user@lucene.apache.org" 
>> Date: Wednesday, March 13, 2013, 5:22 PM
>> Hi group,
>>
>> Background:
>> I have a collection containing English and French documents.
>> I made sure to index the English content in field "body"
>> (fieldType=text_en) and the French content in field
>> "body_fr" (fieldType=text_fr).
>>
>> The user could be either English or French so the goal is to
>> execute the queries against both fields simultaneously
>> without knowing the query language upfront. The query is
>> analyzed differently for each field. For both fields a
>> stopFilter is configured with each its own list of stopwords
>> (different per language).
>>
>> The issue:
>> When I search for 'a result' (without single quotes) in
>> field "body" and "body_fr" at the same time, then "a" is
>> considered a stopword in English and removed for field
>> "body", but not in French so both terms are still searched
>> inside "body_fr". What happens is that the query is parsed
>> (edismax) into this construction:
>>
>> ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)
>>
>> This query returns only French documents, although there are
>> many English documents in the index that contain the term
>> 'result' as well. How can that happen? I think it is related
>> to the way my query is parsed: there seems to be an
>> AND-relationship between (body_fr:a) and (body:result |
>> body_fr:result). There is no English document that has
>> (body_fr:a), so that's why they don't show up. For me a much
>> more logical parsed query would be:
>>
>> ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)
>>
>> How should I interpret this? Is it a bug in edismax? Is it
>> intended and if yes: why?
>>
>> Thanks for any hint,
>> Tom
>>
>> This email and any attachments may contain confidential or
>> privileged information
>> and is intended for the addressee only. If you are not the
>> intended recipient, please
>> immediately notify us by email or telephone and delete the
>> original email and attachments
>> without using, disseminating or reproducing its contents to
>> anyone other than the intended
>> recipient. Wolters Kluwer shall not be liable for the
>> incorrect or incomplete transmission of
>> of this email or any attachments, nor for unauthorized use
>> by its employees.
>>
>> Wolters Kluwer nv has its registered address in Alphen aan
>> den Rijn, The Netherlands, and is registered
>> with the Trade Registry of the Dutch Chamber of Commerce
>> under number 33202517.
>>

--
Walter Underwood
wun...@wunderwood.org





strange edismax parsing when searching in multiple fields (#TB)

2013-03-13 Thread Burgmans, Tom

Hi group,

Background:
I have a collection containing English and French documents. I made sure to 
index the English content in field "body" (fieldType=text_en) and the French 
content in field "body_fr" (fieldType=text_fr).

The user could be either English or French, so the goal is to execute the 
queries against both fields simultaneously without knowing the query language 
upfront. The query is analyzed differently for each field. For both fields a 
stopFilter is configured, each with its own list of stopwords (different per 
language).

The issue:
When I search for 'a result' (without single quotes) in field "body" and 
"body_fr" at the same time, then "a" is considered a stopword in English and 
removed for field "body", but not in French so both terms are still searched 
inside "body_fr". What happens is that the query is parsed (edismax) into this 
construction:

((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)

This query returns only French documents, although there are many English 
documents in the index that contain the term 'result' as well. How can that 
happen? I think it is related to the way my query is parsed: there seems to be 
an AND-relationship between (body_fr:a) and (body:result | body_fr:result). 
There is no English document that has (body_fr:a), so that's why they don't 
show up. For me a much more logical parsed query would be:

((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)

How should I interpret this? Is it a bug in edismax? Is it intended and if yes: 
why?
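
The observed structure is what you get when the query is split into terms first and each term is then expanded across the fields. The Python sketch below is a simplified model of that per-term expansion, not edismax's actual code; the stopword sets and function names are hypothetical.

```python
# Hypothetical stopword lists; "a" is a stopword only for the English field.
STOPWORDS = {"body": {"a", "the", "of"}, "body_fr": {"le", "la", "de"}}


def per_term_clauses(query, fields, stopwords):
    """One clause per query term; each clause lists the fields that keep it."""
    clauses = []
    for term in query.lower().split():
        alternatives = [
            f"{field}:{term}" for field in fields if term not in stopwords[field]
        ]
        if alternatives:
            clauses.append(alternatives)
    return clauses


def render(clauses):
    """Render the clauses in the notation used by the parsed query above."""
    return "(" + " ".join("(" + " | ".join(c) + ")~1.0" for c in clauses) + ")"


parsed = render(per_term_clauses("a result", ["body", "body_fr"], STOPWORDS))
print(parsed)
```

In this model the clause for "a" survives with only the French alternative. If every top-level clause is required, a document must match body_fr:a, which no English document can.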

Thanks for any hint,
Tom


RE: Search in String and Text_en fields simultaneously with edismax

2013-02-28 Thread Burgmans, Tom

Ah OK. I didn't have a good view of query parsing vs query generation. Thanks 
for clearing this up.

So it means that searching in a tokenized and non-tokenized field 
simultaneously is not possible when I want
- the expression parsed as phrase for the non-tokenized field
- the expression parsed as multiple tokens for the tokenized field
?

If possible, I'd like to avoid writing my own query parser.



-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday 28 February 2013 05:05
To: solr-user@lucene.apache.org
Subject: Re: Search in String and Text_en fields simultaneously with edismax

Query text is always "tokenized" (more properly, "parsed"), unless the text
is enclosed in quotes or spaces are escaped with backslash. Try:

q=valueadd:"test . test2"

or

q=valueadd:test\ .\ test2

Parentheses simply provide grouping, either to control boolean operator
evaluation order or to apply a field name to a sequence of query tokens (as
you have written.)

The analyzer or field type is only consulted when the query is generated,
not while it is being parsed. The same identical parsing rules apply to both
tokenized and non-tokenized fields. What a field type's analyzer does with
its value is irrelevant to query parsing.
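
The parsing rule described above (split on whitespace unless the text is quoted or the spaces are escaped) can be illustrated with Python's standard shlex module. This is only an analogy for the splitting behaviour; it is not Solr's query parser, and the helper name is made up.

```python
import shlex


def split_query_text(text):
    """Split on whitespace, honouring double quotes and backslash escapes."""
    return shlex.split(text)


print(split_query_text("test . test2"))       # three separate tokens
print(split_query_text('"test . test2"'))     # one quoted token
print(split_query_text(r"test\ .\ test2"))    # one token via escaped spaces
```

In each case the splitting happens before any field type is consulted, which is why a String field still sees three separate terms for the unquoted form.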

-- Jack Krupansky

-Original Message-
From: Burgmans, Tom
Sent: Thursday, February 28, 2013 10:48 AM
To: solr-user@lucene.apache.org
Subject: Search in String and Text_en fields simultaneously with edismax

I have a field "valueadd" of type String and field "body" of type text_en
(with tokenization and linguistic processing).

When I search with edismax against field valueadd like this:
q=valueadd:(test . test2)
I see that the parsed query is
(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2) ? It looks like the query is tokenized while
field type String doesn't have a tokenizer configured.

I know I could construct my query as:
q=valueadd:"test . test2"
in which case the phrase is searched as a whole against valueadd. But why
doesn't that happen without quotes?


The reason I ask:
For a simultaneous search in multiple fields I would like to include field
valueadd in the qf parameter which contains String and text_en fields, like:
&qf=valueadd body

How can I search both fields simultaneously without duplicating search
terms, while the query is (whitespace) tokenized for "body" but searched as a
phrase for "valueadd"?

Thanks,
Tom Burgmans


Search in String and Text_en fields simultaneously with edismax

2013-02-28 Thread Burgmans, Tom

I have a field "valueadd" of type String and field "body" of type text_en (with 
tokenization and linguistic processing).

When I search with edismax against field valueadd like this:
q=valueadd:(test . test2)
I see that the parsed query is
(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2) ? It looks like the query is tokenized while 
field type String doesn't have a tokenizer configured.

I know I could construct my query as:
q=valueadd:"test . test2"
in which case the phrase is searched as a whole against valueadd. But why 
doesn't that happen without quotes?


The reason I ask:
For a simultaneous search in multiple fields I would like to include field 
valueadd in the qf parameter which contains String and text_en fields, like:
&qf=valueadd body

How can I search both fields simultaneously without duplicating search terms, 
while the query is (whitespace) tokenized for "body" but searched as a phrase 
for "valueadd"?

Thanks,
Tom Burgmans


How do I let the FVH highlight individual terms instead of the complete phrase?

2012-12-21 Thread Burgmans, Tom

Hi group,

I'm trying to highlight my complete(!) XML document, which is indexed for that 
purpose in a special field called "wkxmlsource". I configured the "wkxmlsource" 
field like



And the text_xml fieldtype is almost equal to the text_en field, but with the 
 as the first class in 
the index analyzer. That prevents highlighting inside XML tags.

First I tried the simple highlighter and that almost worked: I get my document 
back with my search terms and phrases highlighted, and each individual term 
gets its own highlight tags. But the problem is that the complete value of 
field "wkxmlsource" is not returned; the bottom part is cut off, no matter how 
large I set hl.fragsize.

So my next try was to use the FVH (hl.useFastVectorHighlighter=true) instead. 
That helped: it now returns the complete value of "wkxmlsource" with all my 
search terms/phrases highlighted. But in case of a phrase search, it no longer 
highlights each individual term; it only puts highlight tags around 
the complete phrase. That could possibly lead to malformed XML. An example:

Search for phrase: "across the country Santa Fe" it highlights like this in the 
document:

...spread across the country.Santa Fe Pacific... 

How can I let the FVH highlight individual terms instead of the complete 
phrase? Ideally I like to have something like:

...spread across  the  
country.Santa  Fe 
Pacific... 

which is still valid XML.
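
To show what per-term highlighting means, here is a minimal Python sketch that wraps every matching word in its own pair of tags. It works on plain text only and is not how the FVH operates; the <em> tags and the function name are assumptions, since the original markup did not survive in this message.

```python
import re


def highlight_terms(text, terms, pre="<em>", post="</em>"):
    """Wrap each occurrence of each term in its own highlight tags."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: pre + m.group(0) + post, text)


print(highlight_terms("spread across the country", ["across", "the", "country"]))
```

Because no highlight span ever crosses a word boundary, it cannot cross an element boundary either, which is the property needed to keep the XML well-formed.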

My boundaryscanner is configured like:

[boundaryScanner XML stripped by the archive; surviving parameter values: WORD, en, US]


Thanks, Tom
--
Tom Burgmans


RE: score calculation

2012-12-12 Thread Burgmans, Tom

I am also working on getting this clear. Here are my notes so far (partly 
copied, partly written myself):



queryWeight = the impact of the query against the field
    implementation: boost(query)*idf*queryNorm


boost(query) = boost of the field at query time
    Implication: hits in fields with a higher boost get a higher score
    Rationale: a term in field A could be more relevant than the same term
    in field B


idf = inverse document frequency = measure of how often the term appears
across the index for this field
    implementation: log(numDocs/(docFreq+1))+1
    Implication: the more documents a term occurs in, the lower its score
    Rationale: common terms are less important than uncommon ones
    numDocs = the total number of documents in the index, not including those
    that are marked as deleted but have not yet been purged. This is a constant
    (the same value for all documents in the index).
    docFreq = the number of documents in the index which contain the term in
    this field. This is a constant (the same value for all documents in the
    index containing this field).


queryNorm = normalization factor so that queries can be compared
    implementation: 1/sqrt(sumOfSquaredWeights)
    Implication: doesn't impact the relevancy of this result
    Rationale: queryNorm is not related to the relevance of the document,
    but rather tries to make scores between different queries comparable. This
    value is equal for all results of the query.


fieldWeight = the score of a term matching the field
    implementation: tf*idf*fieldNorm


tf = term frequency in a field = measure of how often a term appears in the
field
    implementation: sqrt(freq)
    Implication: the more frequently a term occurs in a field, the greater
    its score
    Rationale: fields which contain more occurrences of a term are generally
    more relevant
    freq = termFreq = number of times the term occurs in the field for this
    document


fieldNorm = impact of a hit in this field
    implementation: lengthNorm*boost(index)
    lengthNorm = measure of the importance of a term according to the total
    number of terms in the field
        implementation: 1/sqrt(numTerms)
        Implication: a term matched in a field with fewer terms gets a higher
        score
        Rationale: a term in a field with fewer terms is more important than
        one in a field with more
        numTerms = number of terms in a field
    boost(index) = boost of the field at index time
        Implication: hits in fields with a higher boost get a higher score
        Rationale: a term in field A could be more relevant than the same term
        in field B


maxDocs = the number of documents in the index, including those that are
marked as deleted but have not yet been purged. This is a constant (the same
value for all documents in the index).
    Implication: it appears in the EXPLAIN output as an argument of
    idf(docFreq=..., maxDocs=...), so it does feed into the idf calculation


coord = fraction of the query terms that were found in the document
(omitted if equal to 1)
    implementation: overlap/maxOverlap
    Implication: of the terms in the query, a document that contains more
    terms will have a higher score
    Rationale: documents that match the most optional terms score highest
    overlap = the number of query terms matched in the document
    maxOverlap = the total number of terms in the query


FunctionQuery = could be any kind of custom ranking function, whose outcome
is added to, or multiplied with, the default rank score.
    Implication: various


Look at the EXPLAIN information to see how the final score is calculated.
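
The formulas in these notes can be checked with a few lines of Python. This is a sketch of the arithmetic for a single-term query only (no coord, no FunctionQuery); the function names are made up, and the sample inputs (freq=3, docFreq=17, 192 documents, queryNorm=0.105614804, fieldNorm=1.0) are illustrative values, not taken from the question below.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: log(numDocs/(docFreq+1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1


def tf(freq):
    """Term frequency factor: sqrt(freq)."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, num_docs, query_norm, field_norm, query_boost=1.0):
    """score = queryWeight * fieldWeight for one term in one field."""
    query_weight = query_boost * idf(doc_freq, num_docs) * query_norm
    field_weight = tf(freq) * idf(doc_freq, num_docs) * field_norm
    return query_weight * field_weight


score = term_score(
    freq=3, doc_freq=17, num_docs=192, query_norm=0.105614804, field_norm=1.0
)
print(round(score, 4))
```

Note that idf enters the product twice, once through queryWeight and once through fieldWeight.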

Tom


-Original Message-
From: Sangeetha [mailto:sangeetha...@gmail.com]
Sent: Thursday 13 December 2012 08:33
To: solr-user@lucene.apache.org
Subject: score calculation


I want to know how score is calculated?

what is fieldweight, fieldNorm, queryWeight and queryNorm. And what is the
formula to get the final score using fieldweight, fieldNorm, queryWeight
,queryNorm, idf and tf.

Can anyone explain or provide some links?

Thanks,
Sangeetha



--
View this message in context: 
http://lucene.472066.n3.nabble.com/score-calculation-tp4026669.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Can a field with defined synonym be searched without the synonym?

2012-12-12 Thread Burgmans, Tom

In our case it's the opposite. For our clients it is very important that every 
synonym gets an equal chance in the relevancy calculation. The fact that "nol" 
scores higher than "net operating loss", simply because its document frequency 
is lower, is unacceptable and a reason to look for ways to remove the IDF from 
the score calculation. But that is something I would rather not do, since 
IDF is such an elementary part of the algorithm (and very useful for 
non-synonym searches).

Pre-processing synonyms to apply 'reverse weighting' is also a strategy to 
consider, but I agree with Walter that this is very error-prone; things could 
easily get out of sync. Moreover, none of our Dev, QA, STG and PRD environments 
contains exactly the same content, so it would require a differently tuned 
synonym dictionary for each of them...meh...

In our previous search engine (FAST ESP) we basically switched off IDF, but I 
am still hoping that there is a more sophisticated solution with Solr.
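
The effect is easy to reproduce with the classic idf formula, log(numDocs/(docFreq+1)) + 1. The Python sketch below uses the "TV" / "television" pair mentioned further down the thread; the document counts are invented for illustration.

```python
import math


def idf(doc_freq, num_docs):
    """Classic idf: rarer terms get a larger value."""
    return math.log(num_docs / (doc_freq + 1)) + 1


NUM_DOCS = 100_000                              # hypothetical index size
DOC_FREQ = {"tv": 500, "television": 20_000}    # hypothetical frequencies

for term, doc_freq in DOC_FREQ.items():
    print(term, round(idf(doc_freq, NUM_DOCS), 3))
```

With these numbers the rarer synonym gets the larger idf, so a document containing it outranks an otherwise identical document containing the common variant.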


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Thursday 13 December 2012 02:30
To: solr-user@lucene.apache.org
Subject: Re: Can a field with defined synonym be searched without the synonym?

All of the applications I've seen with user control over synonym expansion 
were recall-oriented. The "give me all matches for X" kind of problem. So 
ranking is not as important.

wunder

On Dec 12, 2012, at 5:23 PM, Roman Chyla wrote:

> Well, this IDF problem has more sides. So, let's say your synonym file
> contains multi-token synonyms (it does, right? or perhaps you don't need
> it? well, some people do)
>
> "TV, TV set, TV foo, television"
>
> if you use the default synonym expansion, when you index 'television'
>
> you have increased frequency of also 'set', 'foo', so, the IDF of 'TV' is
> the same as that of 'television' - but IDF of 'foo' and 'set' has changed
> (their frequency increased, their IDF decreased) -- TV's have in fact made
> 'foo' term very frequent and undesirable
>
> So, you might be sure that IDF of 'TV' and 'television' are the same, but
> you are not aware it has 'screwed' other (desirable) terms - so it really
> depends. And I wouldn't argue these cases are esoteric.
>
> And finally: there are use cases out there, where people NEED to switch off
> synonym expansion at will (find only these documents, that contain the word
> 'TV' and not that bloody 'foo'). This cannot be done if the index contains
> all synonym terms (unless you have a way to mark the original and the
> synonym in the index).
>
> roman
>
>
> On Wed, Dec 12, 2012 at 12:50 PM, Walter Underwood 
> wrote:
>
>> Query parsers cannot fix the IDF problem or make query-time synonyms
>> faster. Query synonym expansion makes more search terms. More search terms
>> are more work at query time.
>>
>> The IDF problem is real; I've run up against it. The rarest variant of
>> the synonym has the highest score. This is probably the opposite of what you
>> want. For me, it was "TV" and "television". Documents with "TV" had higher
>> scores than those with "television".
>>
>> wunder
>>
>> On Dec 12, 2012, at 9:45 AM, Roman Chyla wrote:
>>
>>> @wunder
>>> It is a misconception (well, supported by that wiki description) that the
>>> query time synonym filter has these problems. It is actually the default
>>> parser that is causing these problems. Look at this if you still think
>>> that index time synonyms are a cure for all:
>>> https://issues.apache.org/jira/browse/LUCENE-4499
>>>
>>> @joe
>>> If you can use the flexible query parser (as linked in by @Swati) then all
>>> you need to do is to define a different field with a different tokenizer
>>> chain and then swap the field names before the analyzer processes the
>>> document (and then rewrite the field name back - for example, we have
>>> fields called "author" and "author_nosyn")
>>>
>>> roman
>>>
>>> On Wed, Dec 12, 2012 at 12:38 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>
>>>> Query time synonyms have known problems. They are slower, cause incorrect
>>>> IDF, and don't work for phrase synonyms.
>>>>
>>>> Apply synonyms at index time and you will have none of those problems.
>>>>
>>>> See:
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
>>>>
>>>> wunder
>>>>
>>>> On Dec 12, 2012, at 9:34 AM, Swati Swoboda wrote:
>>>>
>>>>> Query-time analyzers are still applied, even if you include a string in
>>>>> quotes. Would you expect "foo" to not match "Foo" just because it's
>>>>> enclosed in quotes?
>>>>>
>>>>> Also look at this, someone who had similar requirements:
>>>>> http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-td2919876.html
>>>>>
>>>>> -Original Message-
>>>>> From: joe.cohe...@gmail.com [mailto:joe.cohe...@gmail.com]
>>>>> Sent: Wednesday, December 12, 2012 12:09 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Can a field with defined synon

RE: edismax: implicit AND changes into implicit OR

2012-12-12 Thread Burgmans, Tom

Yes, /browse returns the Velocity output, but I mostly add &wt=xml to the query.
And yes, I looked at the parsedquery feedback that &debugQuery=true provides.
That basically confirms my idea that the implicit AND is indeed switched to an
implicit OR when an explicit OR is present elsewhere in the query. Even
the default operator set to AND seems to be overruled.

Thanks, I'll think about submitting a Jira.

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday 12 December 2012 06:43
To: solr-user@lucene.apache.org
Subject: Re: edismax: implicit AND changes into implicit OR

On 12/12/2012 10:27 AM, Burgmans, Tom wrote:
> I have set the default operator to AND in the schema (and
> restarted Solr), and tested again with
>
> http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx&q.op=AND
>
> Note the extra parameter. It still returns the 7 documents that match
> (Thomas OR Michael) rather than the 0 that match (Thomas AND Michael).
>
> The only way to enforce an implicit AND is by changing the query into
>
> http://localhost:8983/solr/collection1/browse?defType=edismax&q=(%2BThomas+%2BMichael)+OR+%2Bxxxmatchesnothingxxx
>
> But then the AND isn't implicit anymore...and I don't like to prefix all my 
> search terms with a +.

It smells like a bug to me, so you should probably file an issue in
Jira.  I will admit that this is getting somewhat outside my experience
level.

I noticed the /browse there ... is this just what you have named your
handler, or is this connected with the Velocity stuff?

Have you tried adding &debugQuery=true to your URL and seeing what your
different queries actually parse to?  It may also be a good idea to add
&echoParams=all so you can see all parameters that are going into the
request.
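
A minimal sketch in Python of a request that carries both debugging parameters suggested above. The host, core and /browse handler are copied from the URLs in this thread; wt=json and the helper names are assumptions made for illustration, and the "debug" / "parsedquery" keys are where Solr's JSON response reports the parsed query.

```python
from urllib.parse import urlencode

# Assumption: host, core name and handler are taken from the thread's URLs;
# adjust them for another installation.
BASE_URL = "http://localhost:8983/solr/collection1/browse"


def debug_url(query):
    """Build a request URL that also asks Solr for parser debug output."""
    params = {
        "defType": "edismax",
        "q": query,
        "wt": "json",          # assumption: JSON instead of the Velocity output
        "debugQuery": "true",  # adds a "debug" section with the parsed query
        "echoParams": "all",   # echoes every parameter used for the request
    }
    return BASE_URL + "?" + urlencode(params)


def parsed_query(response):
    """Pull the parsed query string out of a decoded JSON response."""
    return response["debug"]["parsedquery"]


url = debug_url("(Thomas Michael) OR xxxmatchesnothingxxx")
print(url)
```

Comparing parsed_query() for the query with and without the explicit OR shows which clauses the parser marks as mandatory.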

Thanks,
Shawn

This email and any attachments may contain confidential or privileged 
information
and is intended for the addressee only. If you are not the intended recipient, 
please
immediately notify us by email or telephone and delete the original email and 
attachments
without using, disseminating or reproducing its contents to anyone other than 
the intended
recipient. Wolters Kluwer shall not be liable for the incorrect or incomplete 
transmission of
of this email or any attachments, nor for unauthorized use by its employees.

Wolters Kluwer nv has its registered address in Alphen aan den Rijn, The 
Netherlands, and is registered
with the Trade Registry of the Dutch Chamber of Commerce under number 33202517.

RE: edismax: implicit AND changes into implicit OR

2012-12-12 Thread Burgmans, Tom

I have set the default operator to AND in the schema (and
restarted Solr), and tested again with

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx&q.op=AND

Note the extra parameter. It still returns the 7 documents that match (Thomas
OR Michael) rather than the 0 that match (Thomas AND Michael).

The only way to enforce an implicit AND is by changing the query into

http://localhost:8983/solr/collection1/browse?defType=edismax&q=(%2BThomas+%2BMichael)+OR+%2Bxxxmatchesnothingxxx

But then the AND isn't implicit anymore...and I don't like to prefix all my 
search terms with a +.
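
If the prefixing has to happen anyway, it could be done on the client side instead of by hand. The following is only a sketch in Python under simplifying assumptions: the helper name require_all_terms is hypothetical, terms are plain runs of word characters, and phrases, field prefixes (field:term) and already-prefixed terms are not handled.

```python
import re
from urllib.parse import quote_plus

# Words that act as boolean operators and must not be prefixed.
OPERATORS = {"AND", "OR", "NOT"}


def require_all_terms(query):
    """Prefix every bare search term with '+' so it becomes mandatory."""
    def mark(match):
        term = match.group(0)
        return term if term in OPERATORS else "+" + term

    # Assumption: every run of word characters is either a term or an operator.
    return re.sub(r"\w+", mark, query)


rewritten = require_all_terms("(Thomas Michael) OR xxxmatchesnothingxxx")
print(rewritten)                         # the query with '+' on each term
print(quote_plus(rewritten, safe="()"))  # URL-encoded form for the q parameter
```

The encoded string is the same q value as in the second URL above, so the user can keep typing unprefixed terms.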

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday 12 December 2012 05:46
To: solr-user@lucene.apache.org
Subject: Re: edismax: implicit AND changes into implicit OR

On 12/12/2012 5:51 AM, Burgmans, Tom wrote:
> I have some documents indexed; 3 of them contain "Thomas" and 4 of
> them contain "Michael", but none of them contain both. A search for
>
> http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)
> <http://localhost:8983/solr/collection1/browse?defType=edismax&q=%28Thomas+Michael%29>
>
> returns 0 results as expected since there is an implicit AND between
> the two terms and there is no document that matches both. But a search
> for
>
> http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx
> <http://localhost:8983/solr/collection1/browse?defType=edismax&q=%28Thomas+Michael%29+OR+xxxmatchesnothingxxx>
>
> returns 7 results. For some reason the implicit AND turns into an
> implicit OR when an explicit OR is added to the query expression.
> The parsedquery information confirms this behavior.
>
>

I'll give you my best guess, nothing to back this up but instinct. The
following statements (especially the second one) may be wrong:

When you do not include any boolean operators, edismax is using its "mm"
parameter, which defaults to 100%, meaning that all search terms must
match (equivalent to a default operator of AND).

When you DO include a boolean operator, mm goes out the window and
edismax reverts to using the default operator for solr, your schema, or
the request handler, which unless you have changed it, is OR.
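
To make the guess above concrete, here is a toy model in Python. It is not edismax's actual implementation; it only mimics a minimum-should-match percentage over the thread's example (3 documents with "Thomas", 4 with "Michael", none with both). The document contents and the function name count_hits are made up for illustration.

```python
# Toy index mirroring the thread: 3 documents contain "thomas", 4 contain
# "michael", none contain both. The contents are invented placeholders.
DOCS = [{"thomas"}] * 3 + [{"michael"}] * 4


def count_hits(terms, mm_percent):
    """Count documents matching at least mm_percent of the query terms."""
    # The percentage is rounded down; at least one term must always match.
    required = max(1, len(terms) * mm_percent // 100)
    return sum(1 for doc in DOCS if len(doc & set(terms)) >= required)


all_required = count_hits(["thomas", "michael"], 100)  # behaves like AND
any_suffices = count_hits(["thomas", "michael"], 0)    # behaves like OR
print(all_required, any_suffices)
```

With 100% every term is required and no document qualifies; with 0% a single matching term is enough and all 7 documents qualify, which matches the two result counts reported in the question.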

Thanks,
Shawn


edismax: implicit AND changes into implicit OR

2012-12-12 Thread Burgmans, Tom

Hi all,

I wonder if this is a bug or expected behavior:

I have some documents indexed; 3 of them contain "Thomas" and 4 of them contain
"Michael", but none of them contain both. A search for
http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)
returns 0 results as expected since there is an implicit AND between the two
terms and there is no document that matches both. But a search for
http://localhost:8983/solr/collection1/browse?defType=edismax&q=(Thomas+Michael)+OR+xxxmatchesnothingxxx
returns 7 results. For some reason the implicit AND turns into an implicit OR
when an explicit OR is added to the query expression. The parsedquery
information confirms this behavior.

Why is edismax doing this?

Tested on a Solr 4.0.0 instance.

Thanks, Tom

--
Tom Burgmans

Search Specialist

Tel: +31 (0)17 246 66 33
Mobile: +31 (0)6 306 821 78

Platform Technologies
Global Platform Organization

Zuidpoolsingel 2
2408 ZE, Alphen aan den Rijn The Netherlands

tom.burgm...@wolterskluwer.com

www.wolterskluwer.com


