Re: custom scoring

Carlos Gonzalez-Cadenas Thu, 16 Feb 2012 11:30:02 -0800

Hello Em:

1) Here's a printout of an example DisMax query (as you can see, mostly MUST
terms, except for some SHOULD terms used for boosting scores for stopwords):

    ((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
     | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
     | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
     | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
     | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
     | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en)
     | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
     | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))

2) The collector is inserted in the SolrIndexSearcher (replacing the
TimeLimitingCollector). We trigger it through the SOLR interface by passing
the timeAllowed parameter. We know this is a hack, but AFAIK there's currently
no out-of-the-box way to specify custom collectors (
https://issues.apache.org/jira/browse/SOLR-1680). In any case, the collector
part works perfectly as of now, so clearly this is not the problem.


3) Re: your sentence:

"*I* would expect that with a shrinking set of matching documents to
the overall-query, the function query only checks those documents that are
guaranteed to be within the result set."

Yes, I agree with this, but this snippet of code in FunctionQuery.java
seems to say otherwise:

    // instead of matching all docs, we could also embed a query.
    // the score could either ignore the subscore, or boost it.
    // Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
    // Boost:        foo:myTerm^floatline("myFloatField",1.0,0.0f)
    @Override
    public int nextDoc() throws IOException {
      for(;;) {
        ++doc;
        if (doc>=maxDoc) {
          return doc=NO_MORE_DOCS;
        }
        if (acceptDocs != null && !acceptDocs.get(doc)) continue;
        return doc;
      }
    }

It seems that the author also thought of maybe embedding a query in order
to restrict matches, but this doesn't seem to be in place as of now (or
maybe I'm not understanding how the whole thing works :) ).
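
To make the difference concrete, here is a toy model in plain Java. It is not
Lucene code: the class name, the method names and the numbers are all made up
for illustration, and the "function" is just a stand-in for reading a float
field. It only contrasts a loop shaped like the nextDoc() above, which walks
every doc id below maxDoc, with a hypothetical variant that is driven by the
doc ids of an embedded query.

```java
import java.util.function.IntToDoubleFunction;

/** Toy model of the two iteration strategies; this is NOT Lucene code. */
public class FunctionScoringSketch {

    /** Mirrors the quoted nextDoc(): every doc id below maxDoc is visited. */
    static int scoreAllDocs(int maxDoc, IntToDoubleFunction function) {
        int visited = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            function.applyAsDouble(doc); // function value computed for every document
            visited++;
        }
        return visited;
    }

    /** Hypothetical variant: an embedded query supplies the doc ids to visit. */
    static int scoreMatchingDocs(int[] matchingDocs, IntToDoubleFunction function) {
        int visited = 0;
        for (int doc : matchingDocs) {
            function.applyAsDouble(doc); // function value computed only for matches
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;          // made-up index size
        int[] matches = {3, 42, 99_999}; // made-up doc ids matched by the inner query
        IntToDoubleFunction queryScore = doc -> 1.0 / (1 + doc); // stand-in for a float field
        System.out.println("match-all visits:  " + scoreAllDocs(maxDoc, queryScore));
        System.out.println("restricted visits: " + scoreMatchingDocs(matches, queryScore));
    }
}
```

In this model the first loop does work proportional to the index size and the
second does work proportional to the number of matches, which I think is the
behaviour you were expecting.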

Thanks
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 8:09 PM, Em <mailformailingli...@yahoo.de> wrote:

> Hello Carlos,
>
> > We have some more tests on that matter: now we're moving from issuing this
> > large query through the SOLR interface to creating our own QueryParser. The
> > initial tests we've done in our QParser (that internally creates multiple
> > queries and inserts them inside a DisjunctionMaxQuery) are very good, we're
> > getting very good response times and high quality answers. But when we've
> > tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> > QueryValueSource that wraps the DisMaxQuery), then the times move from
> > 10-20 msec to 200-300msec.
> I reviewed the source code and yes, the FunctionQuery iterates over the
> whole index, however... let's see!
>
> In relation to the DisMaxQuery you create within your parser: What kind
> of clause is the FunctionQuery and what kind of clause are your other
> queries (MUST, SHOULD, MUST_NOT...)?
>
> *I* would expect that with a shrinking set of matching documents to the
> overall-query, the function query only checks those documents that are
> guaranteed to be within the result set.
>
> > Note that we're using early termination of queries (via a custom
> > collector), and therefore (as shown by the numbers I included above) even
> > if the query is very complex, we're getting very fast answers. The only
> > situation where the response time explodes is when we include a
> > FunctionQuery.
> Could you give us some details about how/where you plugged in the
> Collector, please?
>
> Kind regards,
> Em
>
> Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas:
> > Hello Em:
> >
> > Thanks for your answer.
> >
> > Yes, we initially also thought that the excessive increase in response time
> > was caused by the several queries being executed, and we did another test.
> > We executed one of the subqueries that I've shown to you directly in the
> > "q" parameter and then we tested this same subquery (only this one, without
> > the others) with the function query "query($q1)" in the "q" parameter.
> >
> > Theoretically the times for these two queries should be more or less the
> > same, but the second one is several times slower than the first one. After
> > this observation we learned more about function queries and we learned from
> > the code and from some comments in the forums [1] that the FunctionQueries
> > are expected to match all documents.
> >
> > We have some more tests on that matter: now we're moving from issuing this
> > large query through the SOLR interface to creating our own QueryParser. The
> > initial tests we've done in our QParser (that internally creates multiple
> > queries and inserts them inside a DisjunctionMaxQuery) are very good, we're
> > getting very good response times and high quality answers. But when we've
> > tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> > QueryValueSource that wraps the DisMaxQuery), then the times move from
> > 10-20 msec to 200-300msec.
> >
> > Note that we're using early termination of queries (via a custom
> > collector), and therefore (as shown by the numbers I included above) even
> > if the query is very complex, we're getting very fast answers. The only
> > situation where the response time explodes is when we include a
> > FunctionQuery.
> >
> > Re: your question of what we're trying to achieve ... We're implementing a
> > powerful query autocomplete system, and we use several fields to a) improve
> > performance on wildcard queries and b) have a very precise control over the
> > score.
> >
> > Thanks a lot for your help,
> > Carlos
> >
> > [1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
> >
> > On Thu, Feb 16, 2012 at 7:09 PM, Em <mailformailingli...@yahoo.de> wrote:
> >
> >> Hello Carlos,
> >>
> >> well, you must take into account that you are executing up to 8 queries
> >> per request instead of one query per request.
> >>
> >> I am not totally sure about the details of the implementation of the
> >> max-function-query, but I guess it first iterates over the results of
> >> the first max-query, afterwards over the results of the second max-query
> >> and so on. This is a much higher complexity than in the case of a normal
> >> query.
> >>
> >> I would suggest optimizing your request. I don't think that this
> >> particular function query is matching *all* docs. Instead I think it
> >> just matches those docs specified by your inner-query (although I might
> >> be wrong about that).
> >>
> >> What are you trying to achieve by your request?
> >>
> >> Regards,
> >> Em
> >>
> >> Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
> >>> Hello Em:
> >>>
> >>> The URL is quite large (w/ shards, ...), maybe it's best if I paste the
> >>> relevant parts.
> >>>
> >>> Our "q" parameter is:
> >>>
> >>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))\"",
> >>>
> >>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
> >>>
> >>> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
> >>> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
> >>> (stopword_phrase:las AND stopword_phrase:de)"
> >>>
> >>> We've executed the subqueries q3-q8 independently and they're very fast,
> >>> but when we introduce the function queries as described below, it all
> >>> goes 10X slower.
> >>>
> >>> Let me know if you need anything else.
> >>>
> >>> Thanks
> >>> Carlos
> >>>
> >>>
> >>> On Thu, Feb 16, 2012 at 4:02 PM, Em <mailformailingli...@yahoo.de> wrote:
> >>>
> >>>> Hello Carlos,
> >>>>
> >>>> could you show us what your Solr call looks like?
> >>>>
> >>>> Regards,
> >>>> Em
> >>>>
> >>>> Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
> >>>>> Hello all:
> >>>>>
> >>>>> We'd like to score the matching documents using a combination of
> >>>>> SOLR's IR score with another application-specific score that we store
> >>>>> within the documents themselves (i.e. a float field containing the
> >>>>> app-specific score). In particular, we'd like to calculate the final
> >>>>> score doing some operations with both numbers (i.e. product, sqrt, ...)
> >>>>>
> >>>>> According to what we know, there are two ways to do this in SOLR:
> >>>>>
> >>>>> A) Sort by function [1]: We've tested an expression like
> >>>>> "sort=product(score, query_score)" in the SOLR query, where score is
> >>>>> the common SOLR IR score and query_score is our own precalculated
> >>>>> score, but it seems that SOLR can only do this with stored/indexed
> >>>>> fields (and obviously "score" is not stored/indexed).
> >>>>>
> >>>>> B) Function queries: We've used _val_ and function queries like max,
> >>>>> sqrt and query, and we've obtained the desired results from a
> >>>>> functional point of view. However, our index is quite large (400M
> >>>>> documents) and the performance degrades heavily, given that function
> >>>>> queries are AFAIK matching all the documents.
> >>>>>
> >>>>> I have two questions:
> >>>>>
> >>>>> 1) Apart from the two options I mentioned, is there any other (simple)
> >>>>> way to achieve this that we're not aware of?
> >>>>>
> >>>>> 2) If we have to choose the function queries path, would it be very
> >>>>> difficult to modify the current implementation so that it doesn't
> >>>>> match all the documents, that is, to pass a query so that it only
> >>>>> operates over the documents matching the query? Looking at the
> >>>>> FunctionQuery.java source code, there's a comment that says "// instead
> >>>>> of matching all docs, we could also embed a query. the score could
> >>>>> either ignore the subscore, or boost it", which is giving us some hope
> >>>>> that maybe it's possible and even desirable to go in this direction. If
> >>>>> you can give us some directions about how to go about this, we may be
> >>>>> able to do the actual implementation.
> >>>>>
> >>>>> BTW, we're using Lucene/SOLR trunk.
> >>>>>
> >>>>> Thanks a lot for your help.
> >>>>> Carlos
> >>>>>
> >>>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
> >>>>>
> >>>>
> >>>
> >>
> >
>
