Carlos, nice to hear that the approach helped you!
Could you show us what your query-request looks like after reworking?

Regards,
Em

On 20.02.2012 13:30, Carlos Gonzalez-Cadenas wrote:
> Hello all:
>
> We've done some tests with Em's approach of putting a BooleanQuery in front
> of our user query, that is:
>
> BooleanQuery
>   must (DismaxQuery)
>   should (FunctionQuery)
>
> The FunctionQuery obtains the SOLR IR score by means of a QueryValueSource,
> then takes the SQRT of this value, and then multiplies it by our custom
> "query_score" float, pulling it in by means of a FieldCacheSource.
>
> In particular, we've proceeded in the following way:
>
>    - we've loaded the whole index into the OS page cache to make sure
>    we don't have disk IO problems that might affect the benchmarks (our
>    machine has enough memory to hold the whole index in RAM)
>    - we've executed an out-of-benchmark query 10-20 times to make sure
>    that everything is JITted and that Lucene's FieldCache is properly
>    populated
>    - we've disabled all the caches (filter query cache, document cache,
>    query cache)
>    - we've executed 8 different user queries with and without
>    FunctionQueries, with early termination in both cases (our collector
>    stops after collecting 50 documents per shard)
>
> Em was correct: the query is much faster with the BooleanQuery in front,
> but it's still 30-40% slower than the query without FunctionQueries.
>
> Although one may think it's reasonable for the query response time to
> increase because of the extra computations, we believe the increase is
> too big, given that we're collecting just 500-600 documents due to the
> early query termination techniques we currently use.
>
> Any ideas on how to make it faster?
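The score combination described above (SQRT of the SOLR IR score multiplied by the stored "query_score" field) is, outside Lucene, just arithmetic. A minimal sketch, with made-up example values rather than real index data:

```java
// Sketch of the score combination described in the thread:
// final = sqrt(solrIrScore) * query_score (the latter read via a
// FieldCacheSource in the real setup). Numbers below are illustrative.
public class CombinedScore {
    static float combine(float irScore, float queryScore) {
        return (float) Math.sqrt(irScore) * queryScore;
    }

    public static void main(String[] args) {
        // e.g. IR score 4.0, stored query_score 0.5 -> 2.0 * 0.5
        System.out.println(combine(4.0f, 0.5f)); // prints 1.0
    }
}
```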
>
> Thanks a lot,
> Carlos
>
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
>
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>
>
> On Fri, Feb 17, 2012 at 11:07 AM, Carlos Gonzalez-Cadenas <
> c...@experienceon.com> wrote:
>
>> Thanks Em, Robert, Chris for your time and valuable advice. We'll run
>> some tests and will let you know soon.
>>
>>
>> On Thu, Feb 16, 2012 at 11:43 PM, Em <mailformailingli...@yahoo.de> wrote:
>>
>>> Hello Carlos,
>>>
>>> I think we misunderstood each other.
>>>
>>> As an example:
>>> BooleanQuery (
>>>   clauses: (
>>>     MustMatch(
>>>       DisjunctionMaxQuery(
>>>         TermQuery("stopword_field", "barcelona"),
>>>         TermQuery("stopword_field", "hoteles")
>>>       )
>>>     ),
>>>     ShouldMatch(
>>>       FunctionQuery(
>>>         *please insert your function here*
>>>       )
>>>     )
>>>   )
>>> )
>>>
>>> Explanation:
>>> You construct an artificial BooleanQuery which wraps your user's query
>>> as well as your function query.
>>> Your user's query - in that case - is just a DisjunctionMaxQuery
>>> consisting of two TermQueries.
>>> In the real world you might construct another BooleanQuery around your
>>> DisjunctionMaxQuery in order to have more flexibility.
>>> However, the interesting part of the given example is that we specify
>>> the user's query as a MustMatch clause of the BooleanQuery and the
>>> FunctionQuery just as a ShouldMatch.
>>> Constructed that way, I expect the FunctionQuery to score only those
>>> documents which fit the MustMatch condition.
>>>
>>> I conclude that from the fact that the FunctionQuery class also has a
>>> skipTo method, and I would expect the scorer to use it to score only
>>> matching documents (however, I did not check where and how it might
>>> get called).
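Em's point about clause types can be illustrated with a small standalone simulation (no Lucene dependency; the doc ids and maxDoc are invented): when the required clause drives iteration, the optional function source is only consulted for documents the required clause already matched, not for every document in the segment.

```java
// Toy illustration of Em's MustMatch + ShouldMatch structure.
// Not Lucene code: a sketch in which the required (MUST) user query
// drives iteration and the optional (SHOULD) function clause is only
// asked to score documents the required clause already matched.
public class BooleanWrapSketch {
    static int functionCalls = 0;

    // stands in for the FunctionQuery's per-document value source
    static float functionValue(int doc) {
        functionCalls++;
        return 1.0f; // constant placeholder score
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;                    // segment size (hypothetical)
        int[] mustMatches = {3, 250_000, 999_999}; // docs matching the DisMax clause
        float total = 0f;
        for (int doc : mustMatches) {      // iteration driven by the MUST clause
            total += functionValue(doc);   // SHOULD clause consulted only here
        }
        // 3 function evaluations instead of 1,000,000
        System.out.println(functionCalls); // prints 3
    }
}
```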
>>>
>>> If my conclusion is wrong, then hopefully Robert Muir (as far as I can
>>> see, the author of that class) can tell us what the intention was in
>>> constructing an every-time-match-all function query.
>>>
>>> Can you validate whether your QueryParser constructs a query in the form
>>> I drew above?
>>>
>>> Regards,
>>> Em
>>>
>>> On 16.02.2012 20:29, Carlos Gonzalez-Cadenas wrote:
>>>> Hello Em:
>>>>
>>>> 1) Here's a printout of an example DisMax query (as you can see, mostly
>>>> MUST terms, except for some SHOULD terms used for boosting scores for
>>>> stopwords):
>>>>
>>>> ((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
>>>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>>>> +stopword_phrase:barcelona stopword_phrase:en) |
>>>> (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
>>>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>>>> +stopword_phrase:barcelona stopword_phrase:en) |
>>>> (+stopword_shortened_phrase:hoteles
>>>> +wildcard_stopword_shortened_phrase:barcelona
>>>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>>>> +wildcard_stopword_phrase:barcelona stopword_phrase:en) |
>>>> (+stopword_shortened_phrase:hoteles
>>>> +wildcard_stopword_shortened_phrase:barcelona
>>>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>>>> +wildcard_stopword_phrase:barcelona stopword_phrase:en))
>>>>
>>>> 2) The collector is inserted in the SolrIndexSearcher (replacing the
>>>> TimeLimitingCollector). We trigger it through the SOLR interface by
>>>> passing the timeAllowed parameter. We know this is a hack, but AFAIK
>>>> there's no out-of-the-box way to specify custom collectors by now
>>>> (https://issues.apache.org/jira/browse/SOLR-1680). In any case, the
>>>> collector part works perfectly as of now, so clearly this is not the
>>>> problem.
>>>>
>>>> 3) Re: your sentence:
>>>>
>>>> *I* would expect that with a shrinking set of matching documents to
>>>> the overall query, the function query only checks those documents that
>>>> are guaranteed to be within the result set.
>>>>
>>>> Yes, I agree with this, but this snippet of code in FunctionQuery.java
>>>> seems to say otherwise:
>>>>
>>>> // instead of matching all docs, we could also embed a query.
>>>> // the score could either ignore the subscore, or boost it.
>>>> // Containment: floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
>>>> // Boost: foo:myTerm^floatline("myFloatField",1.0,0.0f)
>>>> @Override
>>>> public int nextDoc() throws IOException {
>>>>   for (;;) {
>>>>     ++doc;
>>>>     if (doc >= maxDoc) {
>>>>       return doc = NO_MORE_DOCS;
>>>>     }
>>>>     if (acceptDocs != null && !acceptDocs.get(doc)) continue;
>>>>     return doc;
>>>>   }
>>>> }
>>>>
>>>> It seems that the author also thought of maybe embedding a query in
>>>> order to restrict matches, but this doesn't seem to be in place as of
>>>> now (or maybe I'm not understanding how the whole thing works :) ).
>>>>
>>>> Thanks
>>>> Carlos
>>>>
>>>> Carlos Gonzalez-Cadenas
>>>> CEO, ExperienceOn - New generation search
>>>> http://www.experienceon.com
>>>>
>>>> Mobile: +34 652 911 201
>>>> Skype: carlosgonzalezcadenas
>>>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>>>>
>>>>
>>>> On Thu, Feb 16, 2012 at 8:09 PM, Em <mailformailingli...@yahoo.de> wrote:
>>>>
>>>>> Hello Carlos,
>>>>>
>>>>>> We have some more tests on that matter: now we're moving from issuing
>>>>>> this large query through the SOLR interface to creating our own
>>>>>> QueryParser. The initial tests we've done in our QParser (that
>>>>>> internally creates multiple queries and inserts them inside a
>>>>>> DisjunctionMaxQuery) are very good, we're getting very good response
>>>>>> times and high quality answers.
>>>>>> But when we've tried to wrap the DisjunctionMaxQuery within a
>>>>>> FunctionQuery (i.e. with a QueryValueSource that wraps the
>>>>>> DisMaxQuery), the times move from 10-20 msec to 200-300 msec.
>>>>>
>>>>> I reviewed the source code and yes, the FunctionQuery iterates over
>>>>> the whole index, however... let's see!
>>>>>
>>>>> In relation to the DisMaxQuery you create within your parser: what
>>>>> kind of clause is the FunctionQuery and what kind of clause are your
>>>>> other queries (MUST, SHOULD, MUST_NOT...)?
>>>>>
>>>>> *I* would expect that with a shrinking set of matching documents to
>>>>> the overall query, the function query only checks those documents
>>>>> that are guaranteed to be within the result set.
>>>>>
>>>>>> Note that we're using early termination of queries (via a custom
>>>>>> collector), and therefore (as shown by the numbers I included above)
>>>>>> even if the query is very complex, we're getting very fast answers.
>>>>>> The only situation where the response time explodes is when we
>>>>>> include a FunctionQuery.
>>>>>
>>>>> Could you give us some details about how/where you plugged in the
>>>>> Collector, please?
>>>>>
>>>>> Kind regards,
>>>>> Em
>>>>>
>>>>> On 16.02.2012 19:41, Carlos Gonzalez-Cadenas wrote:
>>>>>> Hello Em:
>>>>>>
>>>>>> Thanks for your answer.
>>>>>>
>>>>>> Yes, we initially also thought that the excessive increase in
>>>>>> response time was caused by the several queries being executed, and
>>>>>> we did another test. We executed one of the subqueries that I've
>>>>>> shown to you directly in the "q" parameter, and then we tested this
>>>>>> same subquery (only this one, without the others) with the function
>>>>>> query "query($q1)" in the "q" parameter.
>>>>>>
>>>>>> Theoretically the times for these two queries should be more or less
>>>>>> the same, but the second one is several times slower than the first
>>>>>> one.
>>>>>> After this observation we learned more about function queries, and
>>>>>> we learned from the code and from some comments in the forums [1]
>>>>>> that FunctionQueries are expected to match all documents.
>>>>>>
>>>>>> We have some more tests on that matter: now we're moving from issuing
>>>>>> this large query through the SOLR interface to creating our own
>>>>>> QueryParser. The initial tests we've done in our QParser (that
>>>>>> internally creates multiple queries and inserts them inside a
>>>>>> DisjunctionMaxQuery) are very good, we're getting very good response
>>>>>> times and high quality answers. But when we've tried to wrap the
>>>>>> DisjunctionMaxQuery within a FunctionQuery (i.e. with a
>>>>>> QueryValueSource that wraps the DisMaxQuery), the times move from
>>>>>> 10-20 msec to 200-300 msec.
>>>>>>
>>>>>> Note that we're using early termination of queries (via a custom
>>>>>> collector), and therefore (as shown by the numbers I included above)
>>>>>> even if the query is very complex, we're getting very fast answers.
>>>>>> The only situation where the response time explodes is when we
>>>>>> include a FunctionQuery.
>>>>>>
>>>>>> Re: your question of what we're trying to achieve... We're
>>>>>> implementing a powerful query autocomplete system, and we use
>>>>>> several fields to a) improve performance on wildcard queries and
>>>>>> b) have very precise control over the score.
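The FunctionQuery.nextDoc() snippet quoted earlier in the thread can be mimicked with plain Java to make the match-all behavior concrete. This is a toy re-implementation, not the actual Lucene class: every id in [0, maxDoc) is visited, and acceptDocs only skips deleted/filtered docs; the iterator is never restricted to a query's matches.

```java
import java.util.BitSet;

// Toy re-implementation of the quoted nextDoc() loop, with acceptDocs as
// a BitSet. Every doc id up to maxDoc is visited; acceptDocs only filters
// out deleted docs, it does not restrict iteration to a query's matches.
public class MatchAllIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    final int maxDoc;
    final BitSet acceptDocs; // null means "accept everything"
    int doc = -1;

    MatchAllIterator(int maxDoc, BitSet acceptDocs) {
        this.maxDoc = maxDoc;
        this.acceptDocs = acceptDocs;
    }

    int nextDoc() {
        for (;;) {
            ++doc;
            if (doc >= maxDoc) return doc = NO_MORE_DOCS;
            if (acceptDocs != null && !acceptDocs.get(doc)) continue;
            return doc;
        }
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0, 10);
        live.clear(3); // doc 3 "deleted"
        MatchAllIterator it = new MatchAllIterator(10, live);
        int count = 0;
        while (it.nextDoc() != NO_MORE_DOCS) count++;
        System.out.println(count); // prints 9: all live docs, not query matches
    }
}
```

The "embed a query" comment in the snippet would amount to replacing the `++doc` walk with delegation to an inner scorer's iterator, which is exactly the restriction the thread is asking about.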
>>>>>>
>>>>>> Thanks a lot for your help,
>>>>>> Carlos
>>>>>>
>>>>>> [1]:
>>>>>> http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
>>>>>>
>>>>>> Carlos Gonzalez-Cadenas
>>>>>> CEO, ExperienceOn - New generation search
>>>>>> http://www.experienceon.com
>>>>>>
>>>>>> Mobile: +34 652 911 201
>>>>>> Skype: carlosgonzalezcadenas
>>>>>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 16, 2012 at 7:09 PM, Em <mailformailingli...@yahoo.de> wrote:
>>>>>>
>>>>>>> Hello Carlos,
>>>>>>>
>>>>>>> well, you must take into account that you are executing up to 8
>>>>>>> queries per request instead of one query per request.
>>>>>>>
>>>>>>> I am not totally sure about the details of the implementation of the
>>>>>>> max function query, but I guess it first iterates over the results
>>>>>>> of the first max-query, afterwards over the results of the second
>>>>>>> max-query, and so on. This is a much higher complexity than in the
>>>>>>> case of a normal query.
>>>>>>>
>>>>>>> I would suggest that you optimize your request. I don't think that
>>>>>>> this particular function query is matching *all* docs. Instead, I
>>>>>>> think it just matches those docs specified by your inner query
>>>>>>> (although I might be wrong about that).
>>>>>>>
>>>>>>> What are you trying to achieve by your request?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Em
>>>>>>>
>>>>>>> On 16.02.2012 16:24, Carlos Gonzalez-Cadenas wrote:
>>>>>>>> Hello Em:
>>>>>>>>
>>>>>>>> The URL is quite large (w/ shards, ...), maybe it's best if I paste
>>>>>>>> the relevant parts.
>>>>>>>>
>>>>>>>> Our "q" parameter is:
>>>>>>>>
>>>>>>>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))\"",
>>>>>>>>
>>>>>>>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
>>>>>>>>
>>>>>>>> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
>>>>>>>> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
>>>>>>>> (stopword_phrase:las AND stopword_phrase:de)"
>>>>>>>>
>>>>>>>> We've executed the subqueries q3-q8 independently and they're very
>>>>>>>> fast, but when we introduce the function queries as described
>>>>>>>> below, it all goes 10X slower.
>>>>>>>>
>>>>>>>> Let me know if you need anything else.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Carlos
>>>>>>>>
>>>>>>>> Carlos Gonzalez-Cadenas
>>>>>>>> CEO, ExperienceOn - New generation search
>>>>>>>> http://www.experienceon.com
>>>>>>>>
>>>>>>>> Mobile: +34 652 911 201
>>>>>>>> Skype: carlosgonzalezcadenas
>>>>>>>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 16, 2012 at 4:02 PM, Em <mailformailingli...@yahoo.de> wrote:
>>>>>>>>
>>>>>>>>> Hello Carlos,
>>>>>>>>>
>>>>>>>>> could you show us what your Solr call looks like?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Em
>>>>>>>>>
>>>>>>>>> On 16.02.2012 14:34, Carlos Gonzalez-Cadenas wrote:
>>>>>>>>>> Hello all:
>>>>>>>>>>
>>>>>>>>>> We'd like to score the matching documents using a combination of
>>>>>>>>>> SOLR's IR score with another application-specific score that we
>>>>>>>>>> store within the documents themselves (i.e. a float field
>>>>>>>>>> containing the app-specific score). In particular, we'd like to
>>>>>>>>>> calculate the final score doing some operations with both numbers
>>>>>>>>>> (i.e. product, sqrt, ...).
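Spelled out as a full request, the nested-parameter layout quoted earlier in the thread would look roughly like this. Host, core path, and the q3/q4/q8 bodies are placeholders (only q7 is taken from the mail), so this is a sketch of the call shape, not the real deployment:

```shell
# Hypothetical request shape -- host/port and most subquery bodies are
# placeholders; each $qN in the _val_ expression is resolved against the
# request parameter of the same name.
curl 'http://localhost:8983/solr/select' \
  --data-urlencode 'q=_val_:"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))"' \
  --data-urlencode 'q7=stopword_phrase:colomba~1 AND stopword_phrase:santa AND wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR (stopword_phrase:las AND stopword_phrase:de)' \
  --data-urlencode 'q3=...' \
  --data-urlencode 'q4=...' \
  --data-urlencode 'q8=...'
```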
>>>>>>>>>>
>>>>>>>>>> According to what we know, there are two ways to do this in SOLR:
>>>>>>>>>>
>>>>>>>>>> A) Sort by function [1]: We've tested an expression like
>>>>>>>>>> "sort=product(score, query_score)" in the SOLR query, where
>>>>>>>>>> score is the common SOLR IR score and query_score is our own
>>>>>>>>>> precalculated score, but it seems that SOLR can only do this
>>>>>>>>>> with stored/indexed fields (and obviously "score" is not
>>>>>>>>>> stored/indexed).
>>>>>>>>>>
>>>>>>>>>> B) Function queries: We've used _val_ and function queries like
>>>>>>>>>> max, sqrt and query, and we've obtained the desired results from
>>>>>>>>>> a functional point of view. However, our index is quite large
>>>>>>>>>> (400M documents) and the performance degrades heavily, given
>>>>>>>>>> that function queries AFAIK match all the documents.
>>>>>>>>>>
>>>>>>>>>> I have two questions:
>>>>>>>>>>
>>>>>>>>>> 1) Apart from the two options I mentioned, is there any other
>>>>>>>>>> (simple) way to achieve this that we're not aware of?
>>>>>>>>>>
>>>>>>>>>> 2) If we have to choose the function queries path, would it be
>>>>>>>>>> very difficult to modify the actual implementation so that it
>>>>>>>>>> doesn't match all the documents, that is, to pass a query so
>>>>>>>>>> that it only operates over the documents matching the query?
>>>>>>>>>> Looking at the FunctionQuery.java source code, there's a comment
>>>>>>>>>> that says "// instead of matching all docs, we could also embed
>>>>>>>>>> a query. the score could either ignore the subscore, or boost
>>>>>>>>>> it", which gives us some hope that maybe it's possible and even
>>>>>>>>>> desirable to go in this direction. If you can give us some
>>>>>>>>>> directions about how to go about this, we may be able to do the
>>>>>>>>>> actual implementation.
>>>>>>>>>>
>>>>>>>>>> BTW, we're using Lucene/SOLR trunk.
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for your help.
>>>>>>>>>> Carlos
>>>>>>>>>>
>>>>>>>>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function