Re: custom scoring

Carlos Gonzalez-Cadenas Mon, 20 Feb 2012 05:06:17 -0800

Yeah Em, it helped a lot :)

Here it is (for the user query "hoteles"):


+(stopword_shortened_phrase:hoteles | stopword_phrase:hoteles |
wildcard_stopword_shortened_phrase:hoteles |
wildcard_stopword_phrase:hoteles)

product(pow(query((stopword_shortened_phrase:hoteles |
stopword_phrase:hoteles | wildcard_stopword_shortened_phrase:hoteles |
wildcard_stopword_phrase:hoteles),def=0.0),const(0.5)),float(query_score))
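
As a rough illustration of the arithmetic in that function clause, here is a
minimal sketch in plain Java. It does not use the Lucene/Solr API. The class
name, method names and sample values are hypothetical, and it assumes that the
scores of the required clause and the optional function clause are simply
added together, ignoring any normalization factors the engine may apply.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CombinedScoreSketch {

    // Value of the optional function clause:
    // product(pow(irScore, 0.5), queryScore), i.e. sqrt(IR score) * query_score.
    static double functionValue(double irScore, double queryScore) {
        return Math.pow(irScore, 0.5) * queryScore;
    }

    // Score for one document that matched the required clause, ASSUMING the
    // two clause scores are summed (normalization factors are ignored here).
    static double combinedScore(double irScore, double queryScore) {
        return irScore + functionValue(irScore, queryScore);
    }

    public static void main(String[] args) {
        // Hypothetical documents that matched the required (MUST) clause:
        // id -> { IR score of the required clause, stored query_score }.
        Map<String, double[]> matches = new LinkedHashMap<>();
        matches.put("doc1", new double[] {4.0, 0.5});
        matches.put("doc2", new double[] {1.0, 2.0});

        // Only matching documents are visited, so the function is evaluated
        // once per match rather than once per document in the index.
        for (Map.Entry<String, double[]> entry : matches.entrySet()) {
            double irScore = entry.getValue()[0];
            double queryScore = entry.getValue()[1];
            System.out.println(entry.getKey()
                    + " function=" + functionValue(irScore, queryScore)
                    + " combined=" + combinedScore(irScore, queryScore));
        }
    }
}
```

A document for which the inner query yields the default 0.0 would get a
function value of 0, since the square root of 0 is 0 regardless of query_score.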

Thanks a lot for your help.

Carlos
Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Mon, Feb 20, 2012 at 1:50 PM, Em <mailformailingli...@yahoo.de> wrote:

> Carlos,
>
> nice to hear that the approach helped you!
>
> Could you show us what your query request looks like after reworking?
>
> Regards,
> Em
>
> On 20.02.2012 13:30, Carlos Gonzalez-Cadenas wrote:
> > Hello all:
> >
> > We've done some tests with Em's approach of putting a BooleanQuery in
> > front of our user query, that is:
> >
> > BooleanQuery
> >     must (DismaxQuery)
> >     should (FunctionQuery)
> >
> > The FunctionQuery obtains the SOLR IR score by means of a
> > QueryValueSource, then takes the square root of this value, and then
> > multiplies it by our custom "query_score" float, which it reads by means
> > of a FieldCacheSource.
> >
> > In particular, we've proceeded in the following way:
> >
> >    - we've loaded the whole index in the page cache of the OS to make
> >    sure we don't have disk IO problems that might affect the benchmarks
> >    (our machine has enough memory to load all the index in RAM)
> >    - we've executed an out-of-benchmark query 10-20 times to make sure
> >    that everything is jitted and that Lucene's FieldCache is properly
> >    populated
> >    - we've disabled all the caches (filter query cache, document cache,
> >    query cache)
> >    - we've executed 8 different user queries with and without
> >    FunctionQueries, with early termination in both cases (our collector
> >    stops after collecting 50 documents per shard)
> >
> > Em was correct, the query is much faster with the BooleanQuery in front,
> > but it's still 30-40% slower than the query without FunctionQueries.
> >
> > Although one may think that it's reasonable that the query response time
> > increases because of the extra computations, we believe that the increase
> > is too big, given that we're collecting just 500-600 documents due to the
> > early query termination techniques we currently use.
> >
> > Any ideas on how to make it faster?
> >
> > Thanks a lot,
> > Carlos
> >
> >
> >
> > On Fri, Feb 17, 2012 at 11:07 AM, Carlos Gonzalez-Cadenas <
> > c...@experienceon.com> wrote:
> >
> >> Thanks Em, Robert, Chris for your time and valuable advice. We'll make
> >> some tests and will let you know soon.
> >>
> >>
> >>
> >> On Thu, Feb 16, 2012 at 11:43 PM, Em <mailformailingli...@yahoo.de> wrote:
> >>
> >>> Hello Carlos,
> >>>
> >>> I think we misunderstood each other.
> >>>
> >>> As an example:
> >>> BooleanQuery (
> >>>  clauses: (
> >>>     MustMatch(
> >>>               DisjunctionMaxQuery(
> >>>                   TermQuery("stopword_field", "barcelona"),
> >>>                   TermQuery("stopword_field", "hoteles")
> >>>               )
> >>>     ),
> >>>     ShouldMatch(
> >>>                  FunctionQuery(
> >>>                    *please insert your function here*
> >>>                 )
> >>>     )
> >>>  )
> >>> )
> >>>
> >>> Explanation:
> >>> You construct an artificial BooleanQuery which wraps your user's query
> >>> as well as your function query.
> >>> Your user's query - in that case - is just a DisjunctionMaxQuery
> >>> consisting of two TermQueries.
> >>> In the real world you might construct another BooleanQuery around your
> >>> DisjunctionMaxQuery in order to have more flexibility.
> >>> However, the interesting part of the given example is that we specify
> >>> the user's query as a MustMatch-condition of the BooleanQuery and the
> >>> FunctionQuery just as a ShouldMatch.
> >>> Constructed that way, I expect the FunctionQuery to score only those
> >>> documents which fit the MustMatch-condition.
> >>>
> >>> I conclude that from the fact that the FunctionQuery class also has a
> >>> skipTo method, and I would expect that the scorer will use it to score
> >>> only matching documents (however, I did not search where and how it
> >>> might get called).
> >>>
> >>> If my conclusion is wrong, then hopefully Robert Muir (as far as I can
> >>> see the author of that class) can tell us what the intention was in
> >>> constructing an every-time-match-all function query.
> >>>
> >>> Can you validate whether your QueryParser constructs a query in the
> >>> form I drew above?
> >>>
> >>> Regards,
> >>> Em
> >>>
> >>> On 16.02.2012 20:29, Carlos Gonzalez-Cadenas wrote:
> >>>> Hello Em:
> >>>>
> >>>> 1) Here's a printout of an example DisMax query (as you can see, mostly
> >>>> MUST terms except for some SHOULD terms used for boosting scores for
> >>>> stopwords):
> >>>>
> >>>> ((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
> >>>> | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
> >>>> | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
> >>>> | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
> >>>> | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
> >>>> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en)
> >>>> | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
> >>>> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))
> >>>>
> >>>> 2) The collector is inserted in the SolrIndexSearcher (replacing the
> >>>> TimeLimitingCollector). We trigger it through the SOLR interface by
> >>>> passing the timeAllowed parameter. We know this is a hack but AFAIK
> >>>> there's no out-of-the-box way to specify custom collectors by now
> >>>> (https://issues.apache.org/jira/browse/SOLR-1680). In any case the
> >>>> collector part works perfectly as of now, so clearly this is not the
> >>>> problem.
> >>>>
> >>>> 3) Re: your sentence:
> >>>>
> >>>> "*I* would expect that with a shrinking set of matching documents to
> >>>> the overall-query, the function query only checks those documents that
> >>>> are guaranteed to be within the result set."
> >>>>
> >>>> Yes, I agree with this, but this snippet of code in FunctionQuery.java
> >>>> seems to say otherwise:
> >>>>
> >>>>     // instead of matching all docs, we could also embed a query.
> >>>>     // the score could either ignore the subscore, or boost it.
> >>>>     // Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
> >>>>     // Boost:        foo:myTerm^floatline("myFloatField",1.0,0.0f)
> >>>>     @Override
> >>>>     public int nextDoc() throws IOException {
> >>>>       for(;;) {
> >>>>         ++doc;
> >>>>         if (doc>=maxDoc) {
> >>>>           return doc=NO_MORE_DOCS;
> >>>>         }
> >>>>         if (acceptDocs != null && !acceptDocs.get(doc)) continue;
> >>>>         return doc;
> >>>>       }
> >>>>     }
> >>>>
> >>>> It seems that the author also thought of maybe embedding a query in
> >>>> order to restrict matches, but this doesn't seem to be in place as of
> >>>> now (or maybe I'm not understanding how the whole thing works :) ).
> >>>>
> >>>> Thanks
> >>>> Carlos
> >>>>
> >>>>
> >>>> On Thu, Feb 16, 2012 at 8:09 PM, Em <mailformailingli...@yahoo.de> wrote:
> >>>>
> >>>>> Hello Carlos,
> >>>>>
> >>>>>> We have some more tests on that matter: now we're moving from
> >>>>>> issuing this large query through the SOLR interface to creating our
> >>>>>> own QueryParser. The initial tests we've done in our QParser (that
> >>>>>> internally creates multiple queries and inserts them inside a
> >>>>>> DisjunctionMaxQuery) are very good, we're getting very good response
> >>>>>> times and high quality answers. But when we've tried to wrap the
> >>>>>> DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> >>>>>> QueryValueSource that wraps the DisMaxQuery), then the times move
> >>>>>> from 10-20 msec to 200-300 msec.
> >>>>>
> >>>>> I reviewed the source code and yes, the FunctionQuery iterates over
> >>>>> the whole index, however... let's see!
> >>>>>
> >>>>> In relation to the DisMaxQuery you create within your parser: What
> >>>>> kind of clause is the FunctionQuery and what kind of clause are your
> >>>>> other queries (MUST, SHOULD, MUST_NOT...)?
> >>>>>
> >>>>> *I* would expect that with a shrinking set of matching documents to
> >>>>> the overall-query, the function query only checks those documents
> >>>>> that are guaranteed to be within the result set.
> >>>>>
> >>>>>> Note that we're using early termination of queries (via a custom
> >>>>>> collector), and therefore (as shown by the numbers I included above)
> >>>>>> even if the query is very complex, we're getting very fast answers.
> >>>>>> The only situation where the response time explodes is when we
> >>>>>> include a FunctionQuery.
> >>>>>
> >>>>> Could you give us some details about how/where you plugged in the
> >>>>> Collector, please?
> >>>>>
> >>>>> Kind regards,
> >>>>> Em
> >>>>>
> >>>>> On 16.02.2012 19:41, Carlos Gonzalez-Cadenas wrote:
> >>>>>> Hello Em:
> >>>>>>
> >>>>>> Thanks for your answer.
> >>>>>>
> >>>>>> Yes, we initially also thought that the excessive increase in
> >>>>>> response time was caused by the several queries being executed, and
> >>>>>> we did another test. We executed one of the subqueries that I've
> >>>>>> shown to you directly in the "q" parameter and then we tested this
> >>>>>> same subquery (only this one, without the others) with the function
> >>>>>> query "query($q1)" in the "q" parameter.
> >>>>>>
> >>>>>> Theoretically the times for these two queries should be more or less
> >>>>>> the same, but the second one is several times slower than the first
> >>>>>> one. After this observation we learned more about function queries
> >>>>>> and we learned from the code and from some comments in the forums [1]
> >>>>>> that the FunctionQueries are expected to match all documents.
> >>>>>>
> >>>>>> We have some more tests on that matter: now we're moving from
> >>>>>> issuing this large query through the SOLR interface to creating our
> >>>>>> own QueryParser. The initial tests we've done in our QParser (that
> >>>>>> internally creates multiple queries and inserts them inside a
> >>>>>> DisjunctionMaxQuery) are very good, we're getting very good response
> >>>>>> times and high quality answers. But when we've tried to wrap the
> >>>>>> DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> >>>>>> QueryValueSource that wraps the DisMaxQuery), then the times move
> >>>>>> from 10-20 msec to 200-300 msec.
> >>>>>>
> >>>>>> Note that we're using early termination of queries (via a custom
> >>>>>> collector), and therefore (as shown by the numbers I included above)
> >>>>>> even if the query is very complex, we're getting very fast answers.
> >>>>>> The only situation where the response time explodes is when we
> >>>>>> include a FunctionQuery.
> >>>>>>
> >>>>>> Re: your question of what we're trying to achieve ... We're
> >>>>>> implementing a powerful query autocomplete system, and we use several
> >>>>>> fields to a) improve performance on wildcard queries and b) have a
> >>>>>> very precise control over the score.
> >>>>>>
> >>>>>> Thanks a lot for your help,
> >>>>>> Carlos
> >>>>>>
> >>>>>> [1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Feb 16, 2012 at 7:09 PM, Em <mailformailingli...@yahoo.de> wrote:
> >>>>>>
> >>>>>>> Hello Carlos,
> >>>>>>>
> >>>>>>> well, you must take into account that you are executing up to 8
> >>>>>>> queries per request instead of one query per request.
> >>>>>>>
> >>>>>>> I am not totally sure about the details of the implementation of
> >>>>>>> the max-function-query, but I guess it first iterates over the
> >>>>>>> results of the first max-query, afterwards over the results of the
> >>>>>>> second max-query and so on. This is a much higher complexity than in
> >>>>>>> the case of a normal query.
> >>>>>>>
> >>>>>>> I would suggest that you optimize your request. I don't think that
> >>>>>>> this particular function query is matching *all* docs. Instead I
> >>>>>>> think it just matches those docs specified by your inner-query
> >>>>>>> (although I might be wrong about that).
> >>>>>>>
> >>>>>>> What are you trying to achieve by your request?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Em
> >>>>>>>
> >>>>>>> On 16.02.2012 16:24, Carlos Gonzalez-Cadenas wrote:
> >>>>>>>> Hello Em:
> >>>>>>>>
> >>>>>>>> The URL is quite large (w/ shards, ...), maybe it's best if I
> >>>>>>>> paste the relevant parts.
> >>>>>>>>
> >>>>>>>> Our "q" parameter is:
> >>>>>>>>
> >>>>>>>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))\"",
> >>>>>>>>
> >>>>>>>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
> >>>>>>>>
> >>>>>>>> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
> >>>>>>>> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
> >>>>>>>> (stopword_phrase:las AND stopword_phrase:de)"
> >>>>>>>>
> >>>>>>>> We've executed the subqueries q3-q8 independently and they're very
> >>>>>>>> fast, but when we introduce the function queries as described
> >>>>>>>> below, it all goes 10X slower.
> >>>>>>>>
> >>>>>>>> Let me know if you need anything else.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Carlos
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Feb 16, 2012 at 4:02 PM, Em <mailformailingli...@yahoo.de> wrote:
> >>>>>>>>
> >>>>>>>>> Hello Carlos,
> >>>>>>>>>
> >>>>>>>>> could you show us what your Solr call looks like?
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Em
> >>>>>>>>>
> >>>>>>>>> On 16.02.2012 14:34, Carlos Gonzalez-Cadenas wrote:
> >>>>>>>>>> Hello all:
> >>>>>>>>>>
> >>>>>>>>>> We'd like to score the matching documents using a combination of
> >>>>>>>>>> SOLR's IR score with another application-specific score that we
> >>>>>>>>>> store within the documents themselves (i.e. a float field
> >>>>>>>>>> containing the app-specific score). In particular, we'd like to
> >>>>>>>>>> calculate the final score doing some operations with both numbers
> >>>>>>>>>> (e.g. product, sqrt, ...).
> >>>>>>>>>>
> >>>>>>>>>> According to what we know, there are two ways to do this in SOLR:
> >>>>>>>>>>
> >>>>>>>>>> A) Sort by function [1]: We've tested an expression like
> >>>>>>>>>> "sort=product(score, query_score)" in the SOLR query, where score
> >>>>>>>>>> is the common SOLR IR score and query_score is our own
> >>>>>>>>>> precalculated score, but it seems that SOLR can only do this with
> >>>>>>>>>> stored/indexed fields (and obviously "score" is not
> >>>>>>>>>> stored/indexed).
> >>>>>>>>>>
> >>>>>>>>>> B) Function queries: We've used _val_ and function queries like
> >>>>>>>>>> max, sqrt and query, and we've obtained the desired results from a
> >>>>>>>>>> functional point of view. However, our index is quite large (400M
> >>>>>>>>>> documents) and the performance degrades heavily, given that
> >>>>>>>>>> function queries are AFAIK matching all the documents.
> >>>>>>>>>>
> >>>>>>>>>> I have two questions:
> >>>>>>>>>>
> >>>>>>>>>> 1) Apart from the two options I mentioned, is there any other
> >>>>>>>>>> (simple) way to achieve this that we're not aware of?
> >>>>>>>>>>
> >>>>>>>>>> 2) If we have to choose the function queries path, would it be
> >>>>>>>>>> very difficult to modify the current implementation so that it
> >>>>>>>>>> doesn't match all the documents, that is, to pass a query so that
> >>>>>>>>>> it only operates over the documents matching the query? Looking at
> >>>>>>>>>> the FunctionQuery.java source code, there's a comment that says
> >>>>>>>>>> "// instead of matching all docs, we could also embed a query. the
> >>>>>>>>>> score could either ignore the subscore, or boost it", which is
> >>>>>>>>>> giving us some hope that maybe it's possible and even desirable to
> >>>>>>>>>> go in this direction. If you can give us some directions about how
> >>>>>>>>>> to go about this, we may be able to do the actual implementation.
> >>>>>>>>>>
> >>>>>>>>>> BTW, we're using Lucene/SOLR trunk.
> >>>>>>>>>>
> >>>>>>>>>> Thanks a lot for your help.
> >>>>>>>>>> Carlos
> >>>>>>>>>>
> >>>>>>>>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
