Great response that I haven't had enough time to fully digest yet.

A couple of preliminary questions, though:

> So long as we're going to support TF/IDF, its complexity can only be hidden,
> not eliminated.  Many alternative weighting and matching schemes (BM25,
> TF/ICF, LSA, etc.) also require corpus-wide statistics.

BM25 is pretty clear as such things go: http://en.wikipedia.org/wiki/Okapi_BM25
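
To check my own reading of that page, here is a rough Python sketch of a
BM25 score for one document. The function and parameter names are mine,
not from any existing codebase, and I'm assuming the variant that adds 1
inside the log so the IDF term never goes negative. The point is mostly
to show which inputs are corpus-wide (num_docs, doc_freq, avg_doc_len)
and which are per-document.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with BM25.

    Corpus-wide inputs (hypothetical names):
      doc_freq    -- dict mapping term -> number of documents containing it
      num_docs    -- total number of documents in the corpus
      avg_doc_len -- average document length in tokens
    Per-document input:
      doc_terms   -- list of tokens in this document
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency within this document
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        # Assumed variant: the +1 inside the log keeps idf non-negative.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Length normalization: longer-than-average documents are penalized.
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score
```

With the same term frequency and document length, a term that appears in
fewer documents contributes more to the score, which is the part that
cannot be computed without the corpus-wide counts.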

I hadn't seen TF/ICF before: http://aser.ornl.gov/publications/ICMLA06.pdf
I don't yet understand what it's doing differently than TF/IDF.  Is it
that it's counting the number of documents that use a term rather than
the number of term occurrences?

I think I understand Latent Semantic Analysis, and how it could be
used for search in place of an inverted index.
I'm not sure how it could be used for scoring though.

Are there other scoring methods that you anticipate being useful? What
other corpus-wide data would they require?  What other corpus-wide
data exists?

> When weighting an arbitrarily complex query, we have to allow the scoring
> model the option of having member variables and methods which perform the
> weighting, and we have to allow for the possibility that it will proceed in an
> arbitrary number of stages, requiring gradual modifications to complex
> internal states before collapsing down to a final "weight" -- if it ever does.

Does your "if ever" imply that we indeed should try to support scorers
that might return additional information beyond a single float, such
as field name, position data, or matched string?  I'd like to be able
to do this, but don't see an easy framework.
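
To make the question concrete, this is roughly the shape of result I have
in mind, sketched in Python. Everything here is hypothetical: the Match
type, its fields, and score_term are illustrations only, and the "score"
is just a raw occurrence count standing in for a real weighting model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Match:
    """A scoring result that carries more than a single float (hypothetical)."""
    score: float
    field_name: str
    positions: List[int] = field(default_factory=list)
    matched_text: Optional[str] = None

def score_term(term, field_name, tokens):
    """Return a Match for `term` in one field's tokens, or None if absent.

    The score is a placeholder (occurrence count), not a real model.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    if not positions:
        return None
    return Match(score=float(len(positions)), field_name=field_name,
                 positions=positions, matched_text=term)
```

A caller that only wants the float can read .score and ignore the rest;
what I don't see yet is how compound queries should merge the extra
fields from their children.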

Also, do you feel a Scorer needs to be able to do "incremental"
scoring, or is it OK if scoring is only possible after a Matcher has
finished?  Essentially, will it ever be necessary to score a subquery
so that a Matcher can decide whether to skip to the next document?
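
Here is a small Python sketch of the kind of "incremental" scoring I
mean. It assumes each subquery can report an upper bound on its own
score, which is my assumption and not something already agreed on; the
names (top_k, subscorers) are made up. The loop scores subqueries one at
a time and abandons a document as soon as the partial score plus the best
possible remainder cannot beat the current k-th best hit.

```python
import heapq

def top_k(doc_ids, subscorers, k):
    """Collect the k best documents, skipping hopeless ones early.

    subscorers: list of (score_fn, max_score) pairs, where score_fn(doc_id)
    returns a float in [0, max_score]. Returns (hits, skipped) with hits
    sorted best-first as (score, doc_id) tuples.
    """
    n = len(subscorers)
    # suffix[i] = best possible total from the subscorers after index i.
    suffix = [0.0] * n
    for i in range(n - 2, -1, -1):
        suffix[i] = suffix[i + 1] + subscorers[i + 1][1]

    heap = []      # min-heap holding the best k (score, doc_id) so far
    skipped = 0
    for doc_id in doc_ids:
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        partial = 0.0
        pruned = False
        for i, (score_fn, _max_score) in enumerate(subscorers):
            partial += score_fn(doc_id)
            # Even a perfect score on the remaining subqueries cannot win.
            if partial + suffix[i] <= threshold:
                pruned = True
                break
        if pruned:
            skipped += 1
            continue
        if len(heap) < k:
            heapq.heappush(heap, (partial, doc_id))
        else:
            heapq.heapreplace(heap, (partial, doc_id))
    return sorted(heap, reverse=True), skipped
```

If scoring is only possible after the Matcher has finished with a
document, this kind of early exit is not available, which is why I'm
asking.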

More coherent replies to follow in a few days,

--nate
