Re: codecs for sorted indexes

2012-04-12 Thread Carlos Gonzalez-Cadenas
Hello Michael,

Yes, we are pre-sorting the documents before adding them to the index. We
have a score associated to every document (not an IR score but a
document-related score that reflects its "importance"). Therefore, the
document with the biggest score will have the lowest docid (we add it first
to the index). We do this in order to apply early termination effectively.
With the current codec, we haven't seen much of a difference in terms of
space between the sorted and the unsorted index.

So, the question would be: if we force the docids to be sorted, what is the
best way to encode them? We don't really care if the codec doesn't work
for cases where the documents are not sorted (i.e. if it throws an
exception when documents are out of order at index-creation time). Our idea
here is that it may be possible to trade away generality and achieve very
significant improvements for this specific case.

Would something along the lines of RLE encoding work? For example, if we have
to store docids 1 to 1500, we can represent them as "1::1499" (two ints to
represent 1,500 docids).
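
To make this concrete, here's a minimal sketch of the encoding side (plain
Java, independent of the Codec API; the (start, runLength) pair format is
just our assumption of how it could look):

import java.util.ArrayList;
import java.util.List;

class RunLengthDocIds {
  // Encode a strictly increasing docid sequence as (start, runLength)
  // pairs. A fully consecutive block collapses to two ints, no matter
  // how many docids it contains.
  static List<int[]> encode(int[] docids) {
    List<int[]> runs = new ArrayList<int[]>();
    if (docids.length == 0) return runs;
    int start = docids[0];
    int length = 1;
    for (int i = 1; i < docids.length; i++) {
      if (docids[i] == docids[i - 1] + 1) {
        length++;                              // extend the current run
      } else {
        runs.add(new int[] { start, length }); // close the run
        start = docids[i];
        length = 1;
      }
    }
    runs.add(new int[] { start, length });
    return runs;
  }
}

For docids 1 to 1500 this collapses the whole postings list into a single
(start, length) pair, i.e. two ints.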

Thanks a lot for your help,
Carlos

On Thu, Apr 12, 2012 at 6:19 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Do you mean you are pre-sorting the documents (by what criteria?)
> yourself, before adding them to the index?



> In which case... you should already be seeing some benefit (smaller
> index size) compared to adding them "randomly" (i.e. the vInts should
> take fewer bytes), I think.  (Probably the savings would be greater
> for better int-block codecs like PForDelta, SimpleX, but I'm not
> sure...).
>
> Or do you mean having a codec re-sort the documents (on flush/merge)?
> I think this should be possible w/ the Codec API... but nobody has
> tried it yet that I know of.
>
> Note that the bulkpostings branch is effectively dead (nobody is
> iterating on it, and we've removed the old bulk API from trunk), but
> there is likely a GSoC project to add a PForDelta codec to trunk:
>
> https://issues.apache.org/jira/browse/LUCENE-3892
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
>
> On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas
>  wrote:
> > Hello,
> >
> > We're using a sorted index in order to implement early termination
> > efficiently over an index of hundreds of millions of documents. As of
> now,
> > we're using the default codecs coming with Lucene 4, but we believe that
> > due to the fact that the docids are sorted, we should be able to do much
> > better in terms of storage and achieve much better performance,
> especially
> > decompression performance.
> >
> > In particular, Robert Muir is commenting on these lines here:
> >
> >
> https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411
> >
> > We're aware that in the bulkpostings branch there are different
> codecs
> > being implemented and different experiments being done. We don't know
> > whether we should implement our own codec (i.e. using some RLE-like
> > techniques) or we should use one of the codecs implemented there (PFOR,
> > Simple64, ...).
> >
> > Can you please give us some advice on this?
> >
> > Thanks
> > Carlos
> >
> > Carlos Gonzalez-Cadenas
> > CEO, ExperienceOn - New generation search
> > http://www.experienceon.com
> >
> > Mobile: +34 652 911 201
> > Skype: carlosgonzalezcadenas
> > LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>


codecs for sorted indexes

2012-04-12 Thread Carlos Gonzalez-Cadenas
Hello,

We're using a sorted index in order to implement early termination
efficiently over an index of hundreds of millions of documents. As of now,
we're using the default codecs coming with Lucene 4, but we believe that
due to the fact that the docids are sorted, we should be able to do much
better in terms of storage and achieve much better performance, especially
decompression performance.

In particular, Robert Muir is commenting on these lines here:

https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411

We're aware that in the bulkpostings branch there are different codecs
being implemented and different experiments being done. We don't know
whether we should implement our own codec (i.e. using some RLE-like
techniques) or we should use one of the codecs implemented there (PFOR,
Simple64, ...).

Can you please give us some advice on this?

Thanks
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


processing of merged tokens

2012-03-19 Thread Carlos Gonzalez-Cadenas
Hello,

For our search system we'd like to be able to process merged tokens (sorry,
I don't know the proper name for this), i.e. when a user enters a
query like "hotelsin barcelona", we'd like to know that the user means
"hotels in barcelona".

At some point in the past we implemented this kind of functionality with
shingles (using ShingleFilter), that is, if we were indexing the sentence
"hotels in barcelona" as a document, we'd be able to match at query time
merged tokens like "hotelsin" and "inbarcelona".

This solution has two problems:
1) The index size increases a lot.
2) We only catch a small % of the possibilities. Merged tokens derived from
different token positions in the user query, like "hotelsbarcelona" or
"barcelonahotels", cannot be processed.

Our intuition is that there should be a better solution. Maybe it's solved
in SOLR or Lucene and we haven't found it yet. If it's not solved, I can
imagine a simple solution that would use TermsEnum to identify whether a
token exists in the index or not, and then if it doesn't exist, use the
TermsEnum again to check whether it's a composition of two known tokens.
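
As a rough illustration of that idea (just a sketch: the field name "phrase"
is made up, the TermsEnum signatures follow the Lucene 4 trunk we're on and
may differ in other versions, and picking the first valid split is naive):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class MergedTokenSplitter {
  // If "token" is not a known term, look for a split point where both
  // halves are known terms, e.g. "hotelsin" -> "hotels" + "in".
  static String[] trySplit(IndexReader reader, String field, String token)
      throws IOException {
    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) return null;
    TermsEnum te = terms.iterator(null);
    if (te.seekExact(new BytesRef(token), false)) {
      return null;                           // already a known term
    }
    for (int i = 1; i < token.length(); i++) {
      String left = token.substring(0, i);
      String right = token.substring(i);
      if (te.seekExact(new BytesRef(left), false)
          && te.seekExact(new BytesRef(right), false)) {
        return new String[] { left, right }; // first plausible split wins
      }
    }
    return null;                             // no split found
  }
}

A smarter version would rank candidate splits, e.g. by the docFreq of the
two halves, instead of returning the first one.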

It's highly likely that there are much better solutions and algorithms for
this. It would be great if you can help us identify the best way to solve
this problem.

Thanks a lot for your help.

Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


Re: problems with DisjunctionMaxQuery and early-termination

2012-03-16 Thread Carlos Gonzalez-Cadenas
On Fri, Mar 16, 2012 at 9:26 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello Carlos,
>

Hello Mikhail:

Thanks for your answer.


>
> I have two concerns about your approach. The first-K (not top-K, honestly)
> collector approach impacts the recall of your search, and using disjunctive
> queries impacts precision. E.g., I want to find some fairly small and quiet,
> and therefore unpopular, "Lemond Hotel"; you parse my phrase into Lemond OR
> Hotel and return 1K popular hotels but not the Lemond one, because it's
> nearly a hapax. So, I don't believe that it's a great search.
>

Yes, I agree that OR queries combined with top-K (or first-K, as you say)
don't work very well (your results will be full of very popular yet not
very precise matches); this is also what I tried to explain in my email.


> And the other concern, from the end of your letter, is about joining
> separate query results. I'd like to remind you that absolute scores from
> different queries are not comparable at all; maybe the relative ones,
> scaled by max score, are comparable, but I'm not sure.

> I suppose you need conjunctive queries instead. And the great thing about
> them is "not-found for free": the cost of getting zero results is
> proportional to the number of query terms, i.e. negligible.
> So, search all terms with MUST first; you've got the best result in terms
> of precision and recall if you've got something. Otherwise you still have a
> lot of time: you can drop one of the words or switch some of them to
> SHOULD.


Agreed, this is precisely what we're trying to do (the idea of having
multiple queries, from narrow to broad). My question was more of a
practical nature, that is, how we can run these queries without really
having to issue independent SOLR queries. Now we use DisjunctionMaxQuery, but
it has the problems that I described in my former email w.r.t.
early termination.

This morning we found two potential directions that might work (we're
testing them as of now):

   1. Implement a custom RequestHandler and execute several queries within
   SOLR (https://issues.apache.org/jira/browse/SOLR-1093). This is better
   than executing them from outside and having all the network / HTTP / ...
   overhead, but still not very good.
   2. Modify DisjunctionMaxQuery. In particular, modify DisjunctionMaxScorer
   so that it doesn't use a min heap for the subscorers. We'll try several
   strategies to collect documents from the child subscorers, like round-robin
   or collecting the narrower subscorers first and then going broader until the
   upstream collector stops the collection. This looks like the most
   interesting option (a rough sketch follows below).
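
To sketch what we mean for option 2 (this is not the real
DisjunctionMaxScorer internals, just the collection strategy we want to try;
setScorer wiring and deduplication of documents matched by more than one
subscorer are omitted):

import java.io.IOException;
import java.util.List;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;

class NarrowFirstCollection {
  // Drain each subscorer independently, narrowest subquery first, up to a
  // per-subquery budget, instead of merging all subscorers in a min heap.
  static void collect(List<Scorer> subScorersNarrowToBroad,
                      Collector collector, int perSubqueryLimit)
      throws IOException {
    for (Scorer sub : subScorersNarrowToBroad) {
      int collected = 0;
      int doc;
      try {
        while (collected < perSubqueryLimit
            && (doc = sub.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
          collector.collect(doc);
          collected++;
        }
      } catch (CollectionTerminatedException e) {
        return; // the upstream collector has seen enough
      }
    }
  }
}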


Enumerating all combinations is an NP-complete task, I believe. But you have
> some good heuristics:
> * zero docFreq means that you can drop this term or pass it through
> spell correction
> * if you have an instant-suggest-like app and get zero results for some
> phrase, maybe dropping the last word gives you a phrase which had some
> results before and is present in the cache
> * otherwise, excluding the less frequent term from the conjunction
> probably gives non-zero results
>

This is not a problem in practice. We're using a bunch of heuristics in our
QueryParser (including a lot of info extracted from the TermsEnum, stopword
lists, etc ...) to severely cut the space.

Thanks
Carlos



>
> Regards
>
>
> On Thu, Mar 15, 2012 at 12:01 AM, Carlos Gonzalez-Cadenas <
> c...@experienceon.com> wrote:
>
>> Hello all,
>>
>> We have a SOLR index filled with user queries and we want to retrieve the
>> ones that are more similar to a given query entered by an end-user. It is
>> kind of a "related queries" system.
>>
>> The index is pretty big and we're using early-termination of queries (with
>> the index sorted so that the "more popular" queries have lower docids and
>> therefore the termination yields higher-quality results)
>>
>> Clearly, when the user enters a user-level query into the search box, i.e.
>> "cheap hotels barcelona offers", we don't know whether there exists a
>> document (query) in the index that contains these four words or not.
>>  Therefore, when we're building the SOLR query, the first intuition would
>> be to do a query like this "cheap OR hotels OR barcelona OR offers".
>>
>> If all the documents in the index were evaluated, the results of this
>> query would be good. For example, if there is no query in the index with
>> these four words but there's a query in the index with the text "cheap
>> hotels barcelona", it will probably be one of the top results, which is
>> precisely what we want.

problems with DisjunctionMaxQuery and early-termination

2012-03-15 Thread Carlos Gonzalez-Cadenas
Hello all,

We have a SOLR index filled with user queries and we want to retrieve the
ones that are more similar to a given query entered by an end-user. It is
kind of a "related queries" system.

The index is pretty big and we're using early-termination of queries (with
the index sorted so that the "more popular" queries have lower docids and
therefore the termination yields higher-quality results)

Clearly, when the user enters a user-level query into the search box, e.g.
"cheap hotels barcelona offers", we don't know whether there exists a
document (query) in the index that contains these four words or not.
Therefore, when we're building the SOLR query, the first intuition would
be to do a query like "cheap OR hotels OR barcelona OR offers".

If all the documents in the index were evaluated, the results of this
query would be good. For example, if there is no query in the index with
these four words but there's a query in the index with the text "cheap
hotels barcelona", it will probably be one of the top results, which is
precisely what we want.

The problem is that we're doing early termination, and therefore this query
will exhaust the top-K result limit very fast (our custom collector limits
the number of evaluated documents), given that queries like "hotels in
madrid" or "hotels in NYC" will match the OR expression described above
(because they all match "hotels").
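
(For reference, our collector is conceptually just a wrapper like the
following; a sketch against the Lucene 4.x Collector API, not our production
code:)

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

class EarlyTerminatingCollector extends Collector {
  private final Collector delegate;
  private final int maxDocsToCollect;
  private int collected;

  EarlyTerminatingCollector(Collector delegate, int maxDocsToCollect) {
    this.delegate = delegate;
    this.maxDocsToCollect = maxDocsToCollect;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    delegate.setScorer(scorer);
  }

  @Override
  public void collect(int doc) throws IOException {
    if (collected >= maxDocsToCollect) {
      // Sorted index: the docs we've already seen are the best ones.
      throw new CollectionTerminatedException();
    }
    delegate.collect(doc);
    collected++;
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    delegate.setNextReader(context);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return false; // docids must arrive in order for early termination
  }
}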

Our next step was to try a DisjunctionMaxQuery, writing a
query like this:

DisjunctionMaxQuery:
 1) +cheap +hotels +barcelona +offers
 2) +cheap +hotels +barcelona
 3) +cheap +hotels
 4) +hotels
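
(In code, that structure would be built roughly like this; a sketch, with the
field name "phrase" made up:)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.TermQuery;

class NarrowToBroadQuery {
  // One BooleanQuery per subquery, all words required (MUST).
  static BooleanQuery all(String field, String... words) {
    BooleanQuery q = new BooleanQuery();
    for (String w : words) {
      q.add(new TermQuery(new Term(field, w)), Occur.MUST);
    }
    return q;
  }

  static DisjunctionMaxQuery build(String field) {
    DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.0f); // no tie-break
    dmq.add(all(field, "cheap", "hotels", "barcelona", "offers"));
    dmq.add(all(field, "cheap", "hotels", "barcelona"));
    dmq.add(all(field, "cheap", "hotels"));
    dmq.add(all(field, "hotels"));
    return dmq;
  }
}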

We were thinking that perhaps the sub-queries within the
DisjunctionMaxQuery were going to get evaluated in "parallel" given that
they're separate queries, but in fact from a runtime perspective it
behaves in a similar way to the OR query that we described above.

Our desired behavior is to try to match documents with each subquery within
the DisjunctionMaxQuery (up to a per-subquery limit that we set) and then
score them and return them all together (we don't want all the
matches to come from a single sub-query, as is happening now).

Clearly, we could create a script external to SOLR that just runs the
several sub-queries as standalone queries and then joins all the results
together, but before going for this we'd like to know if you have any ideas
on how to solve this problem within SOLR. We do have our own QParser, and
therefore we'd be able to implement any arbitrary query construction that
you can come up with, or even create a new Query type if it's needed.

Thanks a lot for your help,
Carlos


Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


Re: custom scoring

2012-02-20 Thread Carlos Gonzalez-Cadenas
Hi Em:

The HTTP request is not gonna help you a lot because we use a custom
QParser (that builds the query that I've pasted before). In any case, here
it is:

http://localhost:8080/solr/core0/select?shards=…(shards
here)…&indent=on&wt=exon&timeAllowed=50&fl=resulting_phrase%2Cquery_id%2Ctype%2Chighlighting&start=0&rows=16&limit=20&q=%7B!exonautocomplete%7Dhoteles

We're implementing a query autocomplete system, therefore our Lucene
documents are queries. "query_score" is a field that is indexed and stored
with every document. It expresses how popular a given query is (e.g. common
queries like "hotels in barcelona" have a bigger query_score than less
common queries like "hotels in barcelona near the beach").
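
(For context, query_score is declared along these lines in our schema.xml; a
sketch, the exact float type name depends on the schema:)

<field name="query_score" type="float" indexed="true" stored="true"/>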

Let me know if you need something else.

Thanks,
Carlos





Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Mon, Feb 20, 2012 at 3:12 PM, Em  wrote:

> Could you please provide me the original request (the HTTP-request)?
> I am a little bit confused to what "query_score" refers.
> As far as I can see it isn't a magic-value.
>
> Kind regards,
> Em
>
> Am 20.02.2012 14:05, schrieb Carlos Gonzalez-Cadenas:
> > Yeah Em, it helped a lot :)
> >
> > Here it is (for the user query "hoteles"):
> >
> > *+(stopword_shortened_phrase:hoteles | stopword_phrase:hoteles |
> > wildcard_stopword_shortened_phrase:hoteles |
> > wildcard_stopword_phrase:hoteles) *
> >
> > *product(pow(query((stopword_shortened_phrase:hoteles |
> > stopword_phrase:hoteles | wildcard_stopword_shortened_phrase:hoteles |
> >
> wildcard_stopword_phrase:hoteles),def=0.0),const(0.5)),float(query_score))*
> >
> > Thanks a lot for your help.
> >
> > Carlos
> > Carlos Gonzalez-Cadenas
> > CEO, ExperienceOn - New generation search
> > http://www.experienceon.com
> >
> > Mobile: +34 652 911 201
> > Skype: carlosgonzalezcadenas
> > LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >
> >
> > On Mon, Feb 20, 2012 at 1:50 PM, Em 
> wrote:
> >
> >> Carlos,
> >>
> >> nice to hear that the approach helped you!
> >>
> >> Could you show us how your query-request looks like after reworking?
> >>
> >> Regards,
> >> Em
> >>
> >> Am 20.02.2012 13:30, schrieb Carlos Gonzalez-Cadenas:
> >>> Hello all:
> >>>
> >>> We've done some tests with Em's approach of putting a BooleanQuery in
> >> front
> >>> of our user query, that means:
> >>>
> >>> BooleanQuery
> >>> must (DismaxQuery)
> >>> should (FunctionQuery)
> >>>
> >>> The FunctionQuery obtains the SOLR IR score by means of a
> >> QueryValueSource,
> >>> then does the SQRT of this value, and then multiplies it by our custom
> >>> "query_score" float, pulling it by means of a FieldCacheSource.
> >>>
> >>> In particular, we've proceeded in the following way:
> >>>
> >>>- we've loaded the whole index in the page cache of the OS to make
> >> sure
> >>>we don't have disk IO problems that might affect the benchmarks (our
> >>>machine has enough memory to load all the index in RAM)
> >>>- we've executed an out-of-benchmark query 10-20 times to make sure
> >> that
> >>>everything is jitted and that Lucene's FieldCache is properly
> >> populated.
> >>>- we've disabled all the caches (filter query cache, document cache,
> >>>query cache)
> >>>    - we've executed 8 different user queries with and without
> >>>FunctionQueries, with early termination in both cases (our collector
> >> stops

Re: custom scoring

2012-02-20 Thread Carlos Gonzalez-Cadenas
Yeah Em, it helped a lot :)

Here it is (for the user query "hoteles"):

+(stopword_shortened_phrase:hoteles | stopword_phrase:hoteles |
wildcard_stopword_shortened_phrase:hoteles |
wildcard_stopword_phrase:hoteles)

product(pow(query((stopword_shortened_phrase:hoteles |
stopword_phrase:hoteles | wildcard_stopword_shortened_phrase:hoteles |
wildcard_stopword_phrase:hoteles),def=0.0),const(0.5)),float(query_score))

Thanks a lot for your help.

Carlos
Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Mon, Feb 20, 2012 at 1:50 PM, Em  wrote:

> Carlos,
>
> nice to hear that the approach helped you!
>
> Could you show us how your query-request looks like after reworking?
>
> Regards,
> Em
>
> Am 20.02.2012 13:30, schrieb Carlos Gonzalez-Cadenas:
> > Hello all:
> >
> > We've done some tests with Em's approach of putting a BooleanQuery in
> front
> > of our user query, that means:
> >
> > BooleanQuery
> > must (DismaxQuery)
> > should (FunctionQuery)
> >
> > The FunctionQuery obtains the SOLR IR score by means of a
> QueryValueSource,
> > then does the SQRT of this value, and then multiplies it by our custom
> > "query_score" float, pulling it by means of a FieldCacheSource.
> >
> > In particular, we've proceeded in the following way:
> >
> >- we've loaded the whole index in the page cache of the OS to make
> sure
> >we don't have disk IO problems that might affect the benchmarks (our
> >machine has enough memory to load all the index in RAM)
> >- we've executed an out-of-benchmark query 10-20 times to make sure
> that
> >everything is jitted and that Lucene's FieldCache is properly
> populated.
> >- we've disabled all the caches (filter query cache, document cache,
> >query cache)
> >- we've executed 8 different user queries with and without
> >FunctionQueries, with early termination in both cases (our collector
> stops
> >after collecting 50 documents per shard)
> >
> > Em was correct, the query is much faster with the BooleanQuery in front,
> > but it's still 30-40% slower than the query without FunctionQueries.
> >
> > Although one may think that it's reasonable that the query response time
> > increases because of the extra computations, we believe that the increase
> > is too big, given that we're collecting just 500-600 documents due to the
> > early query termination techniques we currently use.
> >
> > Any ideas on how to make it faster?.
> >
> > Thanks a lot,
> > Carlos
> >
> > Carlos Gonzalez-Cadenas
> > CEO, ExperienceOn - New generation search
> > http://www.experienceon.com
> >
> > Mobile: +34 652 911 201
> > Skype: carlosgonzalezcadenas
> > LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >
> >
> > On Fri, Feb 17, 2012 at 11:07 AM, Carlos Gonzalez-Cadenas <
> > c...@experienceon.com> wrote:
> >
> >> Thanks Em, Robert, Chris for your time and valuable advice. We'll make
> >> some tests and will let you know soon.
> >>
> >>
> >>
> >> On Thu, Feb 16, 2012 at 11:43 PM, Em 
> wrote:
> >>
> >>> Hello Carlos,
> >>>
> >>> I think we missunderstood eachother.
> >>>
> >>> As an example:
> >>> BooleanQuery (
> >>>  clauses: (
> >>> MustMatch(
> >>>   DisjunctionMaxQuery(
> >>>   TermQuery("stopword_field", "barcelona"),
> >>>   TermQuery("stopword_field", "hoteles")
> >>>   )
> >>> ),
> >>> ShouldMatch(
> >>>  FunctionQuery(
> >>>*please insert your function here*
> >>> )
> >>> )
> >>>  )
> >>> )
> >>>
> >>> Explanation:
> >>> You construct an artificial BooleanQuery which wraps your user's query
> >>> as well as your function query.
> >>> Your user's query - in that case - is just a DisjunctionMaxQuery
> >>> consisting of two TermQueries.
> >>> In the real world you might construct another BooleanQuery around your
> >>> DisjunctionMaxQuery in order to have more flexibility.
&g

Re: custom scoring

2012-02-20 Thread Carlos Gonzalez-Cadenas
Hello all:

We've done some tests with Em's approach of putting a BooleanQuery in front
of our user query, that is:

BooleanQuery
  must (DismaxQuery)
  should (FunctionQuery)

The FunctionQuery obtains the SOLR IR score by means of a QueryValueSource,
takes the SQRT of this value, and then multiplies it by our custom
"query_score" float, pulled in via a FieldCacheSource.

In particular, we've proceeded in the following way:

   - we've loaded the whole index in the page cache of the OS to make sure
   we don't have disk IO problems that might affect the benchmarks (our
   machine has enough memory to load all the index in RAM)
   - we've executed an out-of-benchmark query 10-20 times to make sure that
   everything is jitted and that Lucene's FieldCache is properly populated.
   - we've disabled all the caches (filter query cache, document cache,
   query cache)
   - we've executed 8 different user queries with and without
   FunctionQueries, with early termination in both cases (our collector stops
   after collecting 50 documents per shard)

Em was correct, the query is much faster with the BooleanQuery in front,
but it's still 30-40% slower than the query without FunctionQueries.

Although one may think that it's reasonable that the query response time
increases because of the extra computations, we believe that the increase
is too big, given that we're collecting just 500-600 documents due to the
early query termination techniques we currently use.

Any ideas on how to make it faster?

Thanks a lot,
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Fri, Feb 17, 2012 at 11:07 AM, Carlos Gonzalez-Cadenas <
c...@experienceon.com> wrote:

> Thanks Em, Robert, Chris for your time and valuable advice. We'll make
> some tests and will let you know soon.
>
>
>
> On Thu, Feb 16, 2012 at 11:43 PM, Em  wrote:
>
>> Hello Carlos,
>>
>> I think we missunderstood eachother.
>>
>> As an example:
>> BooleanQuery (
>>  clauses: (
>> MustMatch(
>>   DisjunctionMaxQuery(
>>   TermQuery("stopword_field", "barcelona"),
>>   TermQuery("stopword_field", "hoteles")
>>   )
>> ),
>> ShouldMatch(
>>  FunctionQuery(
>>*please insert your function here*
>> )
>> )
>>  )
>> )
>>
>> Explanation:
>> You construct an artificial BooleanQuery which wraps your user's query
>> as well as your function query.
>> Your user's query - in that case - is just a DisjunctionMaxQuery
>> consisting of two TermQueries.
>> In the real world you might construct another BooleanQuery around your
>> DisjunctionMaxQuery in order to have more flexibility.
>> However the interesting part of the given example is, that we specify
>> the user's query as a MustMatch-condition of the BooleanQuery and the
>> FunctionQuery just as a ShouldMatch.
>> Constructed that way, I am expecting the FunctionQuery only scores those
>> documents which fit the MustMatch-Condition.
>>
>> I conclude that from the fact that the FunctionQuery-class also has a
>> skipTo-method and I would expect that the scorer will use it to score
>> only matching documents (however I did not search where and how it might
>> get called).
>>
>> If my conclusion is wrong than hopefully Robert Muir (as far as I can
>> see the author of that class) can tell us what was the intention by
>> constructing an every-time-match-all-function-query.
>>
>> Can you validate whether your QueryParser constructs a query in the form
>> I drew above?
>>
>> Regards,
>> Em
>>
>> Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas:
>> > Hello Em:
>> >
>> > 1) Here's a printout of an example DisMax query (as you can see mostly
>> MUST
>> > terms except for some SHOULD terms used for boosting scores for
>> stopwords)
>> > *
>> > *
>> > *((+stopword_shortened_phrase:hoteles
>> +stopword_shortened_phrase:barcelona
>> > stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>> > +stopword_phrase:barcelona
>> > stopword_phrase:en) | (+stopword_shortened_phrase:hoteles
>> +stopword_short
>> > ened_phrase:barcelona stopword_shortened_phrase:en) |
>> (+stopword_phrase:hoteles
>> > +stopword_phrase:barcelona stopword_phrase:en) |

processing of merged tokens

2012-02-20 Thread Carlos Gonzalez-Cadenas
Hello all,

For our search system we'd like to be able to process merged tokens, i.e.
when a user enters a query like "hotelsin barcelona", we'd like to know
that the user means "hotels in barcelona".

At some point in the past we implemented this kind of functionality with
shingles (using ShingleFilter), that is, if we were indexing the sentence
"hotels in barcelona" as a document, we'd be able to match at query time
merged tokens like "hotelsin" and "inbarcelona".

This solution has two problems:
1) The index size increases a lot.
2) We only catch a small % of the possibilities. Merged tokens like
"hotelsbarcelona" or "barcelonahotels" cannot be processed.

Our intuition is that there should be a better solution. Maybe it's solved
in SOLR or Lucene and we haven't found it yet. If it's not solved, I can
imagine a naive solution that would use TermsEnum to identify whether a
token exists in the index or not, and then if it doesn't exist, use the
TermsEnum again to check whether it's a composition of two known tokens.

It's highly likely that there are much better solutions and algorithms for
this. It would be great if you can help us identify the best way to solve
this problem.

Thanks a lot for your help.

Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


Re: custom scoring

2012-02-17 Thread Carlos Gonzalez-Cadenas
Thanks Em, Robert, Chris for your time and valuable advice. We'll make some
tests and will let you know soon.



On Thu, Feb 16, 2012 at 11:43 PM, Em  wrote:

> Hello Carlos,
>
> I think we missunderstood eachother.
>
> As an example:
> BooleanQuery (
>  clauses: (
> MustMatch(
>   DisjunctionMaxQuery(
>   TermQuery("stopword_field", "barcelona"),
>   TermQuery("stopword_field", "hoteles")
>   )
> ),
> ShouldMatch(
>  FunctionQuery(
>*please insert your function here*
> )
> )
>  )
> )
>
> Explanation:
> You construct an artificial BooleanQuery which wraps your user's query
> as well as your function query.
> Your user's query - in that case - is just a DisjunctionMaxQuery
> consisting of two TermQueries.
> In the real world you might construct another BooleanQuery around your
> DisjunctionMaxQuery in order to have more flexibility.
> However the interesting part of the given example is, that we specify
> the user's query as a MustMatch-condition of the BooleanQuery and the
> FunctionQuery just as a ShouldMatch.
> Constructed that way, I am expecting the FunctionQuery only scores those
> documents which fit the MustMatch-Condition.
>
> I conclude that from the fact that the FunctionQuery-class also has a
> skipTo-method and I would expect that the scorer will use it to score
> only matching documents (however I did not search where and how it might
> get called).
>
> If my conclusion is wrong than hopefully Robert Muir (as far as I can
> see the author of that class) can tell us what was the intention by
> constructing an every-time-match-all-function-query.
>
> Can you validate whether your QueryParser constructs a query in the form
> I drew above?
>
> Regards,
> Em
>
> Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas:
> > Hello Em:
> >
> > 1) Here's a printout of an example DisMax query (as you can see mostly
> MUST
> > terms except for some SHOULD terms used for boosting scores for
> stopwords)
> > *
> > *
> > *((+stopword_shortened_phrase:hoteles
> +stopword_shortened_phrase:barcelona
> > stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
> > +stopword_phrase:barcelona
> > stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_short
> > ened_phrase:barcelona stopword_shortened_phrase:en) |
> (+stopword_phrase:hoteles
> > +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shor
> > tened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona
> > stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopw
> > ord_phrase:barcelona stopword_phrase:en) |
> (+stopword_shortened_phrase:hoteles
> > +wildcard_stopword_shortened_phrase:barcelona
> stopword_shortened_phrase:en)
> > | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona
> > stopword_phrase:en))*
> > *
> > *
> > 2)* *The collector is inserted in the SolrIndexSearcher (replacing the
> > TimeLimitingCollector). We trigger it through the SOLR interface by
> passing
> > the timeAllowed parameter. We know this is a hack but AFAIK there's no
> > out-of-the-box way to specify custom collectors by now (
> > https://issues.apache.org/jira/browse/SOLR-1680). In any case the
> collector
> > part works perfectly as of now, so clearly this is not the problem.
> >
> > 3) Re: your sentence:
> > *
> > *
> > **I* would expect that with a shrinking set of matching documents to
> > the overall-query, the function query only checks those documents that
> are
> > guaranteed to be within the result set.*
> > *
> > *
> > Yes, I agree with this, but this snippet of code in FunctionQuery.java
> > seems to say otherwise:
> >
> > // instead of matching all docs, we could also embed a query.
> > // the score could either ignore the subscore, or boost it.
> > // Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
> > // Boost:foo:myTerm^floatline("myFloatField",1.0,0.0f)
> > @Override
> > public int nextDoc() throws IOException {
> >   for(;;) {
> > ++doc;
> > if (doc>=maxDoc) {
> >   return doc=NO_MORE_DOCS;
> > }
> > if (acceptDocs != null && !acceptDocs.get(doc)) continue;
> > return doc;
> >   }
> > }
> >
> > It seems that the author also thought of maybe embedding a query in order
> >

Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

1) Here's a printout of an example DisMax query (as you can see mostly MUST
terms except for some SHOULD terms used for boosting scores for stopwords)

((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
 | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
 | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
 | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
 | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
 | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en)
 | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
 | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))

2) The collector is inserted in the SolrIndexSearcher (replacing the
TimeLimitingCollector). We trigger it through the SOLR interface by passing
the timeAllowed parameter. We know this is a hack, but AFAIK there's no
out-of-the-box way to specify custom collectors yet (
https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
part works perfectly as of now, so clearly this is not the problem.

3) Re: your sentence:

"I would expect that with a shrinking set of matching documents to
the overall-query, the function query only checks those documents that are
guaranteed to be within the result set."

Yes, I agree with this, but this snippet of code in FunctionQuery.java
seems to say otherwise:

// instead of matching all docs, we could also embed a query.
// the score could either ignore the subscore, or boost it.
// Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
// Boost:        foo:myTerm^floatline("myFloatField",1.0,0.0f)
@Override
public int nextDoc() throws IOException {
  for (;;) {
    ++doc;
    if (doc >= maxDoc) {
      return doc = NO_MORE_DOCS;
    }
    if (acceptDocs != null && !acceptDocs.get(doc)) continue;
    return doc;
  }
}

It seems that the author also thought of maybe embedding a query in order
to restrict matches, but this doesn't seem to be in place as of now (or
maybe I'm not understanding how the whole thing works :) ).

Thanks
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 8:09 PM, Em  wrote:

> Hello Carlos,
>
> > We have some more tests on that matter: now we're moving from issuing
> this
> > large query through the SOLR interface to creating our own
> QueryParser. The
> > initial tests we've done in our QParser (that internally creates multiple
> > queries and inserts them inside a DisjunctionMaxQuery) are very good,
> we're
> > getting very good response times and high quality answers. But when we've
> > tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> > QueryValueSource that wraps the DisMaxQuery), then the times move from
> > 10-20 msec to 200-300msec.
> I reviewed the sourcecode and yes, the FunctionQuery iterates over the
> whole index, however... let's see!
>
> In relation to the DisMaxQuery you create within your parser: What kind
> of clause is the FunctionQuery and what kind of clause are your other
> queries (MUST, SHOULD, MUST_NOT...)?
>
> *I* would expect that with a shrinking set of matching documents to the
> overall-query, the function query only checks those documents that are
> guaranteed to be within the result set.
>
> > Note that we're using early termination of queries (via a custom
> > collector), and therefore (as shown by the numbers I included above) even
> > if the query is very complex, we're getting very fast answers. The only
> > situation where the response time explodes is when we include a
> > FunctionQuery.
> Could you give us some details about how/where did you plugin the
> Collector, please?
>
> Kind regards,
> Em
>
> Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas:
> > Hello Em:
> >
> > Thanks for your answer.
> >
> > Yes, we initially also thought that the excessive increase in response
> time
> > was caused by the several queries being executed, and we did another
> test.
> > We executed one of the subqueries that I've shown to you directly in the
> > "q" parameter and then we tested this same subquery (only this one,
> without
> > the others) with the function query "query($q1)" in 

Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

Thanks for your answer.

Yes, we initially also thought that the excessive increase in response time
was caused by the several queries being executed, so we ran another test.
We executed one of the subqueries that I've shown you directly in the
"q" parameter, and then we tested this same subquery (only this one, without
the others) wrapped in the function query "query($q1)" in the "q" parameter.

Theoretically the times for these two queries should be more or less the
same, but the second one is several times slower than the first one. After
this observation we read up on function queries and learned from
the code and from some comments in the forums [1] that FunctionQueries
are expected to match all documents.

We have some more tests on that matter: now we're moving from issuing this
large query through the SOLR interface to creating our own QueryParser. The
initial tests we've done in our QParser (that internally creates multiple
queries and inserts them inside a DisjunctionMaxQuery) are very good: we're
getting very good response times and high-quality answers. But when we
tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
QueryValueSource that wraps the DisMaxQuery), the times moved from
10-20 msec to 200-300 msec.

Note that we're using early termination of queries (via a custom
collector), and therefore (as shown by the numbers I included above) even
if the query is very complex, we're getting very fast answers. The only
situation where the response time explodes is when we include a
FunctionQuery.

Re: your question of what we're trying to achieve ... We're implementing a
powerful query autocomplete system, and we use several fields to a) improve
performance on wildcard queries and b) have a very precise control over the
score.

Thanks a lot for your help,
Carlos

[1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 7:09 PM, Em  wrote:

> Hello Carlos,
>
> well, you must take into account that you are executing up to 8 queries
> per request instead of one query per request.
>
> I am not totally sure about the details of the implementation of the
> max-function-query, but I guess it first iterates over the results of
> the first max-query, afterwards over the results of the second max-query
> and so on. This is a much higher complexity than in the case of a normal
> query.
>
> I would suggest you to optimize your request. I don't think that this
> particular function query is matching *all* docs. Instead I think it
> just matches those docs specified by your inner-query (although I might
> be wrong about that).
>
> What are you trying to achieve by your request?
>
> Regards,
> Em
>
> Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
> > Hello Em:
> >
> > The URL is quite large (w/ shards, ...), maybe it's best if I paste the
> > relevant parts.
> >
> > Our "q" parameter is:
> >
> >
> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\"",
> >
> > The subqueries q8, q7, q4 and q3 are regular queries, for example:
> >
> > "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
> > wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
> > (stopword_phrase:las AND stopword_phrase:de)"
> >
> > We've executed the subqueries q3-q8 independently and they're very fast,
> > but when we introduce the function queries as described below, it all
> goes
> > 10X slower.
> >
> > Let me know if you need anything else.
> >
> > Thanks
> > Carlos
> >
> >
> > Carlos Gonzalez-Cadenas
> > CEO, ExperienceOn - New generation search
> > http://www.experienceon.com
> >
> > Mobile: +34 652 911 201
> > Skype: carlosgonzalezcadenas
> > LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >
> >
> > On Thu, Feb 16, 2012 at 4:02 PM, Em 
> wrote:
> >
> >> Hello carlos,
> >>
> >> could you show us how your Solr-call looks like?
> >>
> >> Regards,
> >> Em
> >>
> >> Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
> >>> Hello all:
> >>>
> >>> We'd like to score the matching documents using a combination of SOLR's
> >> IR
> >>> score with another application-specific score that we sto

Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

The URL is quite large (w/ shards, ...), maybe it's best if I paste the
relevant parts.

Our "q" parameter is:

  
"q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\"",

The subqueries q8, q7, q4 and q3 are regular queries, for example:

"q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
(stopword_phrase:las AND stopword_phrase:de)"

We've executed the subqueries q3-q8 independently and they're very fast,
but when we introduce the function queries as described below, it all goes
10X slower.

Let me know if you need anything else.

Thanks
Carlos


Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 4:02 PM, Em  wrote:

> Hello carlos,
>
> could you show us how your Solr-call looks like?
>
> Regards,
> Em
>
> Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
> > Hello all:
> >
> > We'd like to score the matching documents using a combination of SOLR's
> IR
> > score with another application-specific score that we store within the
> > documents themselves (i.e. a float field containing the app-specific
> > score). In particular, we'd like to calculate the final score doing some
> > operations with both numbers (i.e product, sqrt, ...)
> >
> > According to what we know, there are two ways to do this in SOLR:
> >
> > A) Sort by function [1]: We've tested an expression like
> > "sort=product(score, query_score)" in the SOLR query, where score is the
> > common SOLR IR score and query_score is our own precalculated score, but
> it
> > seems that SOLR can only do this with stored/indexed fields (and
> obviously
> > "score" is not stored/indexed).
> >
> > B) Function queries: We've used _val_ and function queries like max, sqrt
> > and query, and we've obtained the desired results from a functional point
> > of view. However, our index is quite large (400M documents) and the
> > performance degrades heavily, given that function queries are AFAIK
> > matching all the documents.
> >
> > I have two questions:
> >
> > 1) Apart from the two options I mentioned, is there any other (simple)
> way
> > to achieve this that we're not aware of?
> >
> > 2) If we have to choose the function queries path, would it be very
> > difficult to modify the actual implementation so that it doesn't match
> all
> > the documents, that is, to pass a query so that it only operates over the
> > documents matching the query?. Looking at the FunctionQuery.java source
> > code, there's a comment that says "// instead of matching all docs, we
> > could also embed a query. the score could either ignore the subscore, or
> > boost it", which is giving us some hope that maybe it's possible and even
> > desirable to go in this direction. If you can give us some directions
> about
> > how to go about this, we may be able to do the actual implementation.
> >
> > BTW, we're using Lucene/SOLR trunk.
> >
> > Thanks a lot for your help.
> > Carlos
> >
> > [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
> >
>


custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello all:

We'd like to score the matching documents using a combination of SOLR's IR
score with another application-specific score that we store within the
documents themselves (i.e. a float field containing the app-specific
score). In particular, we'd like to calculate the final score by doing some
operations with both numbers (e.g. product, sqrt, ...).

According to what we know, there are two ways to do this in SOLR:

A) Sort by function [1]: We've tested an expression like
"sort=product(score, query_score)" in the SOLR query, where score is the
common SOLR IR score and query_score is our own precalculated score, but it
seems that SOLR can only do this with stored/indexed fields (and obviously
"score" is not stored/indexed).

B) Function queries: We've used _val_ and function queries like max, sqrt
and query, and we've obtained the desired results from a functional point
of view. However, our index is quite large (400M documents) and the
performance degrades heavily, given that function queries are AFAIK
matching all the documents.
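
(As an illustration, the kind of request we've been testing looks like this,
with $q1 standing for the user query passed as a separate parameter; a
sketch, not the exact request:)

q=_val_:"product(query_score, sqrt(query($q1)))"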

I have two questions:

1) Apart from the two options I mentioned, is there any other (simple) way
to achieve this that we're not aware of?

2) If we have to choose the function queries path, would it be very
difficult to modify the current implementation so that it doesn't match all
the documents, that is, to pass a query so that it only operates over the
documents matching the query? Looking at the FunctionQuery.java source
code, there's a comment that says "// instead of matching all docs, we
could also embed a query. the score could either ignore the subscore, or
boost it", which is giving us some hope that maybe it's possible and even
desirable to go in this direction. If you can give us some directions about
how to go about this, we may be able to do the actual implementation.
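
(To make the direction concrete: the change we imagine to the nextDoc() loop
in FunctionQuery.java might look roughly like this; a sketch only, where
innerScorer is a hypothetical Scorer over the embedded query:)

@Override
public int nextDoc() throws IOException {
  for (;;) {
    // Step through the docs matched by the embedded query instead of
    // incrementing through every docid up to maxDoc.
    doc = innerScorer.nextDoc();
    if (doc == NO_MORE_DOCS) return doc;
    if (acceptDocs == null || acceptDocs.get(doc)) return doc;
    // otherwise the doc is deleted/filtered: keep advancing
  }
}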

BTW, we're using Lucene/SOLR trunk.

Thanks a lot for your help.
Carlos

[1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function

