You are right, but here's my null hypothesis for studying the impact on
relevance: hash the query to deterministically seed a random number
generator, then pick one of column A or column B at random.
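
Something like this Java sketch is what I had in mind - the class name,
corpus labels, and choice of hash are just placeholders of mine, not
anything Solr provides:

    import java.util.Random;

    public class CorpusPicker {
        // Null-hypothesis assignment: hash the query to deterministically
        // seed the RNG, then flip a coin between corpus "A" and corpus "B".
        // The same query always lands on the same corpus.
        public static String pickCorpus(String query) {
            long seed = query.toLowerCase().hashCode();  // assumption: hashCode() is stable enough here
            Random rng = new Random(seed);
            return rng.nextBoolean() ? "A" : "B";
        }
    }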

This is of course wrong - a query might find two non-relevant results in
corpus A and lots of relevant results in corpus B, leading to poor
precision because the two non-relevant documents are likely to show up on
the first page.  You could weight by the size of each corpus, but that
weighting is probably still wrong for any specific query.
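
If you did weight by corpus size, the pick would look something like this
(again only a sketch; the document counts would have to come from
somewhere, and the proportional split is my own assumption):

    import java.util.Random;

    public class WeightedCorpusPicker {
        // Same deterministic seeding, but pick "A" or "B" in proportion to
        // how many documents each corpus holds.
        public static String pickCorpus(String query, long docsInA, long docsInB) {
            Random rng = new Random(query.hashCode());
            double shareOfA = (double) docsInA / (docsInA + docsInB);
            return rng.nextDouble() < shareOfA ? "A" : "B";
        }
    }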

It was an interesting thought experiment though.

Erik,

Since LucidWorks was dinged in the 2013 Magic Quadrant on Enterprise Search
for lacking "Federated Search", the for-profit Enterprise Search
companies must be doing it somehow.  Maybe relevance suffers (a lot),
but you can do it if you want to.

I have read very little of the IR literature - enough to sound like I know
a little, but only a very little.  If there is literature on this, it
would be an interesting read.
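
To make concrete what I meant below by "storing both results in the same
priority queue and *re-scoring* before *re-ranking*", here is a rough Java
sketch.  The max-score normalization is only a stand-in I picked for an
IDF-free re-score, not a claim about how Solr actually merges shards:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class BlendedMerge {
        // One hit from either corpus, carrying whatever raw score its shard gave it.
        static class Hit {
            final String docId;
            final double rawScore;
            double blendedScore;
            Hit(String docId, double rawScore) { this.docId = docId; this.rawScore = rawScore; }
        }

        // Re-score both result sets without IDF by normalizing each list against
        // its own maximum score, then re-rank everything in one priority queue.
        public static List<Hit> merge(List<Hit> fromA, List<Hit> fromB, int n) {
            PriorityQueue<Hit> queue = new PriorityQueue<>(
                Comparator.comparingDouble((Hit h) -> h.blendedScore).reversed());
            for (List<Hit> hits : Arrays.asList(fromA, fromB)) {
                double max = hits.stream().mapToDouble(h -> h.rawScore).max().orElse(1.0);
                for (Hit h : hits) {
                    h.blendedScore = h.rawScore / max;  // stand-in re-score, an assumption on my part
                    queue.add(h);
                }
            }
            List<Hit> top = new ArrayList<>();
            while (!queue.isEmpty() && top.size() < n) {
                top.add(queue.poll());
            }
            return top;
        }
    }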


On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> The lack of global TF/IDF has been answered in the past,
> in the sharded case, by "usually you have similar enough
> stats that it doesn't matter". This pre-supposes a fairly
> evenly distributed set of documents.
>
> But if you're talking about federated search across different
> types of documents, then what would you "rescore" with?
> How would you even consider scoring docs that are somewhat/
> totally different? Think magazine articles and meta-data associated
> with pictures.
>
> What I've usually found is that one can use grouping to show
> the top N of a variety of results. Or show tabs with different
> types. Or have the app intelligently combine the different types
> of documents in a way that "makes sense". But I don't know
> how you'd just get "the right thing" to happen with some kind
> of scoring magic.
>
> Best
> Erick
>
>
> On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis <dansm...@gmail.com> wrote:
>
>> I've thought about it, and I have no time to really do a meta-search
>> during evaluation.  What I need to do is to create a single core that
>> contains both of my data sets, and then describe the architecture that
>> would be required to do blended results, with liberal estimates.
>>
>> From the perspective of evaluation, I need to understand whether any of
>> the solutions to better ranking in the absence of global IDF have been
>> explored.  I suspect that one could retrieve a much larger than N set of
>> results from a set of shards and re-score in some way that doesn't
>> require IDF, e.g. storing both result sets in the same priority queue
>> and *re-scoring* before *re-ranking*.
>>
>> The other way to do this would be to have a custom SearchHandler that
>> works differently - it performs the query, retrieves all results deemed
>> relevant by another engine, adds them to the Lucene index, and then
>> performs the query again in the standard way.  This would be quite slow,
>> but perhaps useful as a way to evaluate my method.
>>
>> I still welcome any suggestions on how such a SearchHandler could be
>> implemented.
>>
>
>
