Hi Scott,

I don't know your reasons for splitting your index up, but assuming you want to 
do that and then merge the search results back together, I think you could 
re-unify the term document frequencies across all your indexes, then extend 
IndexSearcher and override the termStatistics and collectionStatistics methods 
to return statistics from the re-unified data.  Our index is partitioned for 
performance reasons, and during our indexing workflow we re-combine the 
document frequencies from all the partitions into a file so that we can add new 
partitions as needed without suffering from the low-statistics problem Erick 
described.  We use an FST (see org.apache.lucene.util.fst.Builder) to hold the 
stats in memory so that lookups are fast.
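A minimal sketch of that re-unification step, using a plain HashMap in place of the FST we actually use, and leaving out the file round-trip. The class and method names here are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: combine per-partition document frequencies into one global table.
// In practice this table would be persisted during indexing and loaded into
// an FST for compact, fast lookups; a HashMap keeps the idea visible here.
public class GlobalTermStats {

    private final Map<String, Long> globalDocFreq = new HashMap<>();
    private long globalDocCount = 0;

    // Each partition contributes its term -> docFreq map and its doc count.
    public void addPartition(Map<String, Long> partitionDocFreq, long docCount) {
        globalDocCount += docCount;
        partitionDocFreq.forEach(
            (term, df) -> globalDocFreq.merge(term, df, Long::sum));
    }

    // A custom IndexSearcher's termStatistics override would read docFreq
    // from here instead of from the local partition, so every partition
    // scores against the same global numbers.
    public long docFreq(String term) {
        return globalDocFreq.getOrDefault(term, 0L);
    }

    public long docCount() {
        return globalDocCount;
    }
}
```

The IndexSearcher subclass would then build its TermStatistics and CollectionStatistics objects from these merged counts rather than the per-partition ones.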

Jim
________________________________________
From: Erick Erickson <erickerick...@gmail.com>
Sent: 22 October 2015 15:15
To: java-user
Subject: Re: Scoring over Multiple Indexes

bq: Given that the content loaded for these indexes
represents individually curated terminologies, I think we can argue to our
users that what comes from combined queries over the latter is as
meaningful in its own right as those run over the monolithic index

If one assumes that the individually curated terminologies are that way
for a reason, putting these all into a single index in some sense undid the
reason for curating them. Presumably an index specialized for
pharmaceuticals has a much different set of characteristics than for
an index specific to financials. I doubt that "leveraged buyout" appears
very often in a pharmaceutical index...

But let's say that two documents in your pharmaceutical index do
mention this phrase. The score in that index will be high since
the terms are so rare. How does one even theoretically relate
a score coming from the financial index to one coming from the
pharmaceutical index?
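To put rough numbers on it: with the classic idf weight ln(N/df) (the exact formula depends on the Similarity in use, and these counts are invented), the same phrase gets a tiny weight in one index and a huge one in the other:

```java
// Rough illustration: the same term receives wildly different idf weights
// in two indexes. Uses the classic idf = ln(N / df) form; real Lucene
// Similarity implementations differ in detail, and the counts are made up.
public class IdfMismatch {

    static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // "leveraged buyout" is everywhere in the financial index...
        double financialIdf = idf(1_000_000, 200_000); // common term, small weight
        // ...but nearly unique in the pharmaceutical index.
        double pharmaIdf = idf(1_000_000, 2);          // rare term, huge weight
        System.out.printf("financial idf=%.2f, pharma idf=%.2f%n",
                financialIdf, pharmaIdf);
    }
}
```

The pharmaceutical match dominates for no reason a user can see, which is exactly why scores from the two indexes don't compare.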

None of which you can explain to an end user ;). Often the most use
to the _user_ is achieved by giving them some way to indicate which
sources they're most interested in and presenting those first.

FWIW,
Erick

On Thu, Oct 22, 2015 at 11:29 AM, Bauer, Herbert S. (Scott)
<bauer.sc...@mayo.edu> wrote:
> Thanks for your reply.  We've recently moved from a single large index to
> multiple indexes. Given that the content loaded for these indexes
> represents individually curated terminologies, I think we can argue to our
> users that what comes from combined queries over the latter is as
> meaningful in its own right as those run over the monolithic index. We
> had to consider that our changes to the back end of our application might
> change sorting orders for results, which is what we normally want to avoid.
>
>
> On 10/22/15, 10:43 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>>In a word, no. At least not that I've heard of. "normalizing scores"
>>is one of those things
>>that sounds reasonable on the surface, but is really meaningless.
>>Scores don't really
>>_tell_ you anything about the abstract "goodness" of a doc, they just
>>tell you that
>>doc1 is likely better than doc2 _within a single query_. You can't even
>>compare
>>scores in the _same_ index across two different queries.
>>
>>At its lowest level, say one index has 1,000,000 occurrences of
>>"erick", while index 2 has
>>exactly 1. Term frequency is one of the numbers that is used to
>>calculate the score.
>>How does one normalize the part of the calculation resulting from
>>matching "erick"
>>between the two indexes? Anything you do is wrong.
>>
>>Similarly, expecting documents to be returned in a particular order
>>because of boosting
>>is not going to be satisfactory. Boosting will influence the final
>>score and thus the
>>position of the document, but not absolutely order them unless you put
>>in insane boosts.
>>Tests based on boosting and doc ordering will be very fragile, I'd guess.
>>
>>Best,
>>Erick
>>
>>On Thu, Oct 22, 2015 at 8:34 AM, Bauer, Herbert S. (Scott)
>><bauer.sc...@mayo.edu> wrote:
>>> We have a test case that boosts a set of terms.  Something along the
>>>lines of "term1^2 AND term2^3 AND term3^4", and this query runs over two
>>>content-distinct indexes.  Our expectation is that the terms would be
>>>returned to us as term3, term2 and term1.  Instead we get something
>>>along the lines of term3, term1 and term2.  I realize from a number of
>>>postings that this is the result of the scoring method's action taking
>>>place within an individual index rather than against several indexes.
>>>At the same time I don't see a lot of solutions offered. Is there an out
>>>of the box solution to normalize scoring over diverse indexes?  If not,
>>>is there a strategy for rolling your own normalizing solution?  I'm
>>>assuming this has to be a common problem.    -scott
>>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>
>
