Carlos, nice to hear that the approach helped you!
Could you show us what your query-request looks like after reworking?

Regards,
Em

On 20.02.2012 13:30, Carlos Gonzalez-Cadenas wrote:
> Hello all:
>
> We've done some tests with Em's approach of putting a BooleanQuery in front
> of our user query, that is:
>
> BooleanQuery
>   must (DismaxQuery)
>   should (FunctionQuery)
>
> The FunctionQuery obtains the SOLR IR score by means of a QueryValueSource,
> then takes the SQRT of this value, and then multiplies it by our custom
> "query_score" float, pulling it in by means of a FieldCacheSource.
>
> In particular, we've proceeded in the following way:
>
>    - we've loaded the whole index into the OS page cache to make sure
>    we don't have disk IO problems that might affect the benchmarks (our
>    machine has enough memory to hold the whole index in RAM)
>    - we've executed an out-of-benchmark query 10-20 times to make sure
>    that everything is JITted and that Lucene's FieldCache is properly
>    populated
>    - we've disabled all the caches (filter query cache, document cache,
>    query cache)
>    - we've executed 8 different user queries with and without
>    FunctionQueries, with early termination in both cases (our collector
>    stops after collecting 50 documents per shard)
>
> Em was correct: the query is much faster with the BooleanQuery in front,
> but it's still 30-40% slower than the query without FunctionQueries.
>
> Although one may think it's reasonable for the query response time to
> increase because of the extra computations, we believe the increase is
> too big, given that we're collecting just 500-600 documents due to the
> early query termination techniques we currently use.
>
> Any ideas on how to make it faster?
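The score combination described above (SQRT of the SOLR IR score multiplied by the stored "query_score" field) is, outside Lucene, just arithmetic. A minimal sketch, with made-up example values rather than real index data:

```java
// Sketch of the score combination described in the thread:
// final = sqrt(solrIrScore) * query_score (the latter read via a
// FieldCacheSource in the real setup). Numbers below are illustrative.
public class CombinedScore {
    static float combine(float irScore, float queryScore) {
        return (float) Math.sqrt(irScore) * queryScore;
    }

    public static void main(String[] args) {
        // e.g. IR score 4.0, stored query_score 0.5 -> 2.0 * 0.5
        System.out.println(combine(4.0f, 0.5f)); // prints 1.0
    }
}
```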
>
> Thanks a lot,
> Carlos
>
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
>
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>
>
> On Fri, Feb 17, 2012 at 11:07 AM, Carlos Gonzalez-Cadenas <
> c...@experienceon.com> wrote:
>
>> Thanks Em, Robert, Chris for your time and valuable advice. We'll run
>> some tests and will let you know soon.
>>
>>
>> On Thu, Feb 16, 2012 at 11:43 PM, Em <mailformailingli...@yahoo.de> wrote:
>>
>>> Hello Carlos,
>>>
>>> I think we misunderstood each other.
>>>
>>> As an example:
>>> BooleanQuery (
>>>   clauses: (
>>>     MustMatch(
>>>       DisjunctionMaxQuery(
>>>         TermQuery("stopword_field", "barcelona"),
>>>         TermQuery("stopword_field", "hoteles")
>>>       )
>>>     ),
>>>     ShouldMatch(
>>>       FunctionQuery(
>>>         *please insert your function here*
>>>       )
>>>     )
>>>   )
>>> )
>>>
>>> Explanation:
>>> You construct an artificial BooleanQuery which wraps your user's query
>>> as well as your function query.
>>> Your user's query - in that case - is just a DisjunctionMaxQuery
>>> consisting of two TermQueries.
>>> In the real world you might construct another BooleanQuery around your
>>> DisjunctionMaxQuery in order to have more flexibility.
>>> However, the interesting part of the given example is that we specify
>>> the user's query as a MustMatch clause of the BooleanQuery and the
>>> FunctionQuery just as a ShouldMatch.
>>> Constructed that way, I expect the FunctionQuery to score only those
>>> documents which fit the MustMatch condition.
>>>
>>> I conclude that from the fact that the FunctionQuery class also has a
>>> skipTo method, and I would expect the scorer to use it to score only
>>> matching documents (however, I did not check where and how it might
>>> get called).
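Em's point about clause types can be illustrated with a small standalone simulation (no Lucene dependency; the doc ids and maxDoc are invented): when the required clause drives iteration, the optional function source is only consulted for documents the required clause already matched, not for every document in the segment.

```java
// Toy illustration of Em's MustMatch + ShouldMatch structure.
// Not Lucene code: a sketch in which the required (MUST) user query
// drives iteration and the optional (SHOULD) function clause is only
// asked to score documents the required clause already matched.
public class BooleanWrapSketch {
    static int functionCalls = 0;

    // stands in for the FunctionQuery's per-document value source
    static float functionValue(int doc) {
        functionCalls++;
        return 1.0f; // constant placeholder score
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;                    // segment size (hypothetical)
        int[] mustMatches = {3, 250_000, 999_999}; // docs matching the DisMax clause
        float total = 0f;
        for (int doc : mustMatches) {      // iteration driven by the MUST clause
            total += functionValue(doc);   // SHOULD clause consulted only here
        }
        // 3 function evaluations instead of 1,000,000
        System.out.println(functionCalls); // prints 3
    }
}
```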
>>>
>>> If my conclusion is wrong, then hopefully Robert Muir (as far as I can
>>> see, the author of that class) can tell us what the intention was in
>>> constructing an every-time-match-all function query.
>>>
>>> Can you validate whether your QueryParser constructs a query in the form
>>> I drew above?
>>>
>>> Regards,
>>> Em
>>>
>>> On 16.02.2012 20:29, Carlos Gonzalez-Cadenas wrote:
>>>> Hello Em:
>>>>
>>>> 1) Here's a printout of an example DisMax query (as you can see, mostly
>>>> MUST terms, except for some SHOULD terms used for boosting scores for
>>>> stopwords):
>>>>
>>>> ((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
>>>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>>>> +stopword_phrase:barcelona stopword_phrase:en) |
>>>> (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
>>>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>>>> +stopword_phrase:barcelona stopword_phrase:en) |
>>>> (+stopword_shortened_phrase:hoteles
>>>> +wildcard_stopword_shortened_phrase:barcelona
>>>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>>>> +wildcard_stopword_phrase:barcelona stopword_phrase:en) |
>>>> (+stopword_shortened_phrase:hoteles
>>>> +wildcard_stopword_shortened_phrase:barcelona
>>>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>>>> +wildcard_stopword_phrase:barcelona stopword_phrase:en))
>>>>
>>>> 2) The collector is inserted in the SolrIndexSearcher (replacing the
>>>> TimeLimitingCollector). We trigger it through the SOLR interface by
>>>> passing the timeAllowed parameter. We know this is a hack, but AFAIK
>>>> there's no out-of-the-box way to specify custom collectors by now
>>>> (https://issues.apache.org/jira/browse/SOLR-1680). In any case, the
>>>> collector part works perfectly as of now, so clearly this is not the
>>>> problem.
>>>>
>>>> 3) Re: your sentence:
>>>>
>>>> *I* would expect that with a shrinking set of matching documents to
>>>> the overall query, the function query only checks those documents that
>>>> are guaranteed to be within the result set.
>>>>
>>>> Yes, I agree with this, but this snippet of code in FunctionQuery.java
>>>> seems to say otherwise:
>>>>
>>>> // instead of matching all docs, we could also embed a query.
>>>> // the score could either ignore the subscore, or boost it.
>>>> // Containment: floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
>>>> // Boost: foo:myTerm^floatline("myFloatField",1.0,0.0f)
>>>> @Override
>>>> public int nextDoc() throws IOException {
>>>>   for (;;) {
>>>>     ++doc;
>>>>     if (doc >= maxDoc) {
>>>>       return doc = NO_MORE_DOCS;
>>>>     }
>>>>     if (acceptDocs != null && !acceptDocs.get(doc)) continue;
>>>>     return doc;
>>>>   }
>>>> }
>>>>
>>>> It seems that the author also thought of maybe embedding a query in
>>>> order to restrict matches, but this doesn't seem to be in place as of
>>>> now (or maybe I'm not understanding how the whole thing works :) ).
>>>>
>>>> Thanks
>>>> Carlos
>>>>
>>>> Carlos Gonzalez-Cadenas
>>>> CEO, ExperienceOn - New generation search
>>>> http://www.experienceon.com
>>>>
>>>> Mobile: +34 652 911 201
>>>> Skype: carlosgonzalezcadenas
>>>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>>>>
>>>>
>>>> On Thu, Feb 16, 2012 at 8:09 PM, Em <mailformailingli...@yahoo.de> wrote:
>>>>
>>>>> Hello Carlos,
>>>>>
>>>>>> We have some more tests on that matter: now we're moving from issuing
>>>>>> this large query through the SOLR interface to creating our own
>>>>>> QueryParser. The initial tests we've done in our QParser (that
>>>>>> internally creates multiple queries and inserts them inside a
>>>>>> DisjunctionMaxQuery) are very good, we're getting very good response
>>>>>> times and high quality answers.
>>>>>> But when we've tried to wrap the DisjunctionMaxQuery within a
>>>>>> FunctionQuery (i.e. with a QueryValueSource that wraps the
>>>>>> DisMaxQuery), the times move from 10-20 msec to 200-300 msec.
>>>>>
>>>>> I reviewed the source code and yes, the FunctionQuery iterates over
>>>>> the whole index, however... let's see!
>>>>>
>>>>> In relation to the DisMaxQuery you create within your parser: what
>>>>> kind of clause is the FunctionQuery and what kind of clause are your
>>>>> other queries (MUST, SHOULD, MUST_NOT...)?
>>>>>
>>>>> *I* would expect that with a shrinking set of matching documents to
>>>>> the overall query, the function query only checks those documents
>>>>> that are guaranteed to be within the result set.
>>>>>
>>>>>> Note that we're using early termination of queries (via a custom
>>>>>> collector), and therefore (as shown by the numbers I included above)
>>>>>> even if the query is very complex, we're getting very fast answers.
>>>>>> The only situation where the response time explodes is when we
>>>>>> include a FunctionQuery.
>>>>>
>>>>> Could you give us some details about how/where you plugged in the
>>>>> Collector, please?
>>>>>
>>>>> Kind regards,
>>>>> Em
>>>>>
>>>>> On 16.02.2012 19:41, Carlos Gonzalez-Cadenas wrote:
>>>>>> Hello Em:
>>>>>>
>>>>>> Thanks for your answer.
>>>>>>
>>>>>> Yes, we initially also thought that the excessive increase in
>>>>>> response time was caused by the several queries being executed, and
>>>>>> we did another test. We executed one of the subqueries that I've
>>>>>> shown to you directly in the "q" parameter, and then we tested this
>>>>>> same subquery (only this one, without the others) with the function
>>>>>> query "query($q1)" in the "q" parameter.
>>>>>>
>>>>>> Theoretically the times for these two queries should be more or less
>>>>>> the same, but the second one is several times slower than the first
>>>>>> one.
>>>>>> After this observation we learned more about function queries, and
>>>>>> we learned from the code and from some comments in the forums [1]
>>>>>> that FunctionQueries are expected to match all documents.
>>>>>>
>>>>>> We have some more tests on that matter: now we're moving from issuing
>>>>>> this large query through the SOLR interface to creating our own
>>>>>> QueryParser. The initial tests we've done in our QParser (that
>>>>>> internally creates multiple queries and inserts them inside a
>>>>>> DisjunctionMaxQuery) are very good, we're getting very good response
>>>>>> times and high quality answers. But when we've tried to wrap the
>>>>>> DisjunctionMaxQuery within a FunctionQuery (i.e. with a
>>>>>> QueryValueSource that wraps the DisMaxQuery), the times move from
>>>>>> 10-20 msec to 200-300 msec.
>>>>>>
>>>>>> Note that we're using early termination of queries (via a custom
>>>>>> collector), and therefore (as shown by the numbers I included above)
>>>>>> even if the query is very complex, we're getting very fast answers.
>>>>>> The only situation where the response time explodes is when we
>>>>>> include a FunctionQuery.
>>>>>>
>>>>>> Re: your question of what we're trying to achieve... We're
>>>>>> implementing a powerful query autocomplete system, and we use
>>>>>> several fields to a) improve performance on wildcard queries and
>>>>>> b) have very precise control over the score.
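The FunctionQuery.nextDoc() snippet quoted earlier in the thread can be mimicked with plain Java to make the match-all behavior concrete. This is a toy re-implementation, not the actual Lucene class: every id in [0, maxDoc) is visited, and acceptDocs only skips deleted/filtered docs; the iterator is never restricted to a query's matches.

```java
import java.util.BitSet;

// Toy re-implementation of the quoted nextDoc() loop, with acceptDocs as
// a BitSet. Every doc id up to maxDoc is visited; acceptDocs only filters
// out deleted docs, it does not restrict iteration to a query's matches.
public class MatchAllIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    final int maxDoc;
    final BitSet acceptDocs; // null means "accept everything"
    int doc = -1;

    MatchAllIterator(int maxDoc, BitSet acceptDocs) {
        this.maxDoc = maxDoc;
        this.acceptDocs = acceptDocs;
    }

    int nextDoc() {
        for (;;) {
            ++doc;
            if (doc >= maxDoc) return doc = NO_MORE_DOCS;
            if (acceptDocs != null && !acceptDocs.get(doc)) continue;
            return doc;
        }
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0, 10);
        live.clear(3); // doc 3 "deleted"
        MatchAllIterator it = new MatchAllIterator(10, live);
        int count = 0;
        while (it.nextDoc() != NO_MORE_DOCS) count++;
        System.out.println(count); // prints 9: all live docs, not query matches
    }
}
```

The "embed a query" comment in the snippet would amount to replacing the `++doc` walk with delegation to an inner scorer's iterator, which is exactly the restriction the thread is asking about.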
>>>>>>
>>>>>> Thanks a lot for your help,
>>>>>> Carlos
>>>>>>
>>>>>> [1]:
>>>>>> http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
>>>>>>
>>>>>> Carlos Gonzalez-Cadenas
>>>>>> CEO, ExperienceOn - New generation search
>>>>>> http://www.experienceon.com
>>>>>>
>>>>>> Mobile: +34 652 911 201
>>>>>> Skype: carlosgonzalezcadenas
>>>>>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 16, 2012 at 7:09 PM, Em <mailformailingli...@yahoo.de> wrote:
>>>>>>
>>>>>>> Hello Carlos,
>>>>>>>
>>>>>>> well, you must take into account that you are executing up to 8
>>>>>>> queries per request instead of one query per request.
>>>>>>>
>>>>>>> I am not totally sure about the details of the implementation of the
>>>>>>> max function query, but I guess it first iterates over the results
>>>>>>> of the first max-query, afterwards over the results of the second
>>>>>>> max-query, and so on. This is a much higher complexity than in the
>>>>>>> case of a normal query.
>>>>>>>
>>>>>>> I would suggest that you optimize your request. I don't think that
>>>>>>> this particular function query is matching *all* docs. Instead, I
>>>>>>> think it just matches those docs specified by your inner query
>>>>>>> (although I might be wrong about that).
>>>>>>>
>>>>>>> What are you trying to achieve by your request?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Em
>>>>>>>
>>>>>>> On 16.02.2012 16:24, Carlos Gonzalez-Cadenas wrote:
>>>>>>>> Hello Em:
>>>>>>>>
>>>>>>>> The URL is quite large (w/ shards, ...), maybe it's best if I paste
>>>>>>>> the relevant parts.
>>>>>>>>
>>>>>>>> Our "q" parameter is:
>>>>>>>>
>>>>>>>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))\"",
>>>>>>>>
>>>>>>>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
>>>>>>>>
>>>>>>>> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
>>>>>>>> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
>>>>>>>> (stopword_phrase:las AND stopword_phrase:de)"
>>>>>>>>
>>>>>>>> We've executed the subqueries q3-q8 independently and they're very
>>>>>>>> fast, but when we introduce the function queries as described
>>>>>>>> below, it all goes 10X slower.
>>>>>>>>
>>>>>>>> Let me know if you need anything else.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Carlos
>>>>>>>>
>>>>>>>> Carlos Gonzalez-Cadenas
>>>>>>>> CEO, ExperienceOn - New generation search
>>>>>>>> http://www.experienceon.com
>>>>>>>>
>>>>>>>> Mobile: +34 652 911 201
>>>>>>>> Skype: carlosgonzalezcadenas
>>>>>>>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 16, 2012 at 4:02 PM, Em <mailformailingli...@yahoo.de> wrote:
>>>>>>>>
>>>>>>>>> Hello Carlos,
>>>>>>>>>
>>>>>>>>> could you show us what your Solr call looks like?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Em
>>>>>>>>>
>>>>>>>>> On 16.02.2012 14:34, Carlos Gonzalez-Cadenas wrote:
>>>>>>>>>> Hello all:
>>>>>>>>>>
>>>>>>>>>> We'd like to score the matching documents using a combination of
>>>>>>>>>> SOLR's IR score with another application-specific score that we
>>>>>>>>>> store within the documents themselves (i.e. a float field
>>>>>>>>>> containing the app-specific score). In particular, we'd like to
>>>>>>>>>> calculate the final score doing some operations with both numbers
>>>>>>>>>> (i.e. product, sqrt, ...).
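Spelled out as a full request, the nested-parameter layout quoted earlier in the thread would look roughly like this. Host, core path, and the q3/q4/q8 bodies are placeholders (only q7 is taken from the mail), so this is a sketch of the call shape, not the real deployment:

```shell
# Hypothetical request shape -- host/port and most subquery bodies are
# placeholders; each $qN in the _val_ expression is resolved against the
# request parameter of the same name.
curl 'http://localhost:8983/solr/select' \
  --data-urlencode 'q=_val_:"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))"' \
  --data-urlencode 'q7=stopword_phrase:colomba~1 AND stopword_phrase:santa AND wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR (stopword_phrase:las AND stopword_phrase:de)' \
  --data-urlencode 'q3=...' \
  --data-urlencode 'q4=...' \
  --data-urlencode 'q8=...'
```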
>>>>>>>>>>
>>>>>>>>>> According to what we know, there are two ways to do this in SOLR:
>>>>>>>>>>
>>>>>>>>>> A) Sort by function [1]: We've tested an expression like
>>>>>>>>>> "sort=product(score, query_score)" in the SOLR query, where
>>>>>>>>>> score is the common SOLR IR score and query_score is our own
>>>>>>>>>> precalculated score, but it seems that SOLR can only do this
>>>>>>>>>> with stored/indexed fields (and obviously "score" is not
>>>>>>>>>> stored/indexed).
>>>>>>>>>>
>>>>>>>>>> B) Function queries: We've used _val_ and function queries like
>>>>>>>>>> max, sqrt and query, and we've obtained the desired results from
>>>>>>>>>> a functional point of view. However, our index is quite large
>>>>>>>>>> (400M documents) and the performance degrades heavily, given
>>>>>>>>>> that function queries AFAIK match all the documents.
>>>>>>>>>>
>>>>>>>>>> I have two questions:
>>>>>>>>>>
>>>>>>>>>> 1) Apart from the two options I mentioned, is there any other
>>>>>>>>>> (simple) way to achieve this that we're not aware of?
>>>>>>>>>>
>>>>>>>>>> 2) If we have to choose the function queries path, would it be
>>>>>>>>>> very difficult to modify the actual implementation so that it
>>>>>>>>>> doesn't match all the documents, that is, to pass a query so
>>>>>>>>>> that it only operates over the documents matching the query?
>>>>>>>>>> Looking at the FunctionQuery.java source code, there's a comment
>>>>>>>>>> that says "// instead of matching all docs, we could also embed
>>>>>>>>>> a query. the score could either ignore the subscore, or boost
>>>>>>>>>> it", which gives us some hope that maybe it's possible and even
>>>>>>>>>> desirable to go in this direction. If you can give us some
>>>>>>>>>> directions about how to go about this, we may be able to do the
>>>>>>>>>> actual implementation.
>>>>>>>>>>
>>>>>>>>>> BTW, we're using Lucene/SOLR trunk.
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for your help.
>>>>>>>>>> Carlos
>>>>>>>>>>
>>>>>>>>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function