On Tue, Mar 09, 2010 at 01:18:12PM -0500, Michael McCandless wrote:
>
> >> You said "of course" before but... how in your proposal could one
> >> store all stats for a given field during indexing, but then sometimes
> >> use match-only and sometimes full-scoring when querying against that
> >> field?
> >
> > The same way that Lucene knows that sometimes it needs a docs-only-enum and
> > sometimes it needs a docs-and-positions enum. Sometimes you need scores,
> > sometimes you don't.
>
> But if user had specified BM25Sim when indexing... can they later just
> change that to MatchOnlySim at search time?

The user won't be able to modify the Schema by reaching into a FieldType object
and replacing its Similarity instance.

However, internally, match-only iteration of a posting list would work just
fine. I mean, the doc id data is there in one form or another. Under a field
spec'd to use MatchSimilarity, the default would be to write only one file,
holding nothing but delta-encoded doc ids. Under LuceneSimilarity, freq would
probably be embedded in the doc id file, but iterating that with match-only
just means throwing away freq. Slightly less efficient, but still pretty good.

So there would be polymorphism in the decoding phase while we're supplying
information the Similarity object needs to make its similarity judgments.
However, that polymorphism would be handled internally -- it wouldn't be the
responsibility of the user to determine whether a codec supported a particular
scoring model.

What Lucy users absolutely wouldn't be able to do is change up BM25 weighting
to standard Lucene weighting at search time, because we'll be writing
pre-calculated boost bytes at index time. Re-indexing will be required. I
think that's a nice feature for Lucene to provide, but Lucy will have to skip
it because of our cheap-searcher requirement.

> > What users will be able to tell us is how they want the field to be used,
> > and we can use that information to help us optimize. For example, when a
> > user declares that they want a field to be "match-only", we know we don't
> > have to write boost bytes, freq or positions, saving space.
>
> Yeah.... so, I don't like that in Lucene you call "Field.setOmitTFAP"
> instead of saying "Field.matchOnly" (or something). So I do agree
> that it'd be better if the API made it clear what the *search* time
> impact is of using this advanced Field API.

In my opinion, it makes sense to communicate "match only" by way of the
Similarity object as opposed to a boolean.
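
A minimal sketch of that idea, in Python for brevity. Only the class names
MatchSimilarity and LuceneSimilarity come from this thread; every method name,
the in-memory list "format", and the sample data are assumptions made for
illustration, not Lucy's or Lucene's actual API. It shows a schema that maps
fields to Similarity objects instead of boolean flags, and match-only
iteration over a doc-id-plus-freq posting list that simply discards freq.

```python
from typing import Iterator, List, Tuple

Posting = Tuple[int, int]  # (doc_id, freq); doc ids are assumed ascending


class MatchSimilarity:
    """Hypothetical: the field is only ever matched, never scored."""

    def encode(self, postings: List[Posting]) -> List[int]:
        # Write nothing but delta-encoded doc ids.
        out, prev = [], 0
        for doc_id, _freq in postings:
            out.append(doc_id - prev)
            prev = doc_id
        return out

    def iter_docs(self, data: List[int]) -> Iterator[int]:
        doc_id = 0
        for delta in data:
            doc_id += delta
            yield doc_id


class LuceneSimilarity:
    """Hypothetical: freq is interleaved with the doc id deltas."""

    def encode(self, postings: List[Posting]) -> List[int]:
        out, prev = [], 0
        for doc_id, freq in postings:
            out.extend((doc_id - prev, freq))
            prev = doc_id
        return out

    def iter_docs_and_freqs(self, data: List[int]) -> Iterator[Posting]:
        doc_id = 0
        for i in range(0, len(data), 2):
            doc_id += data[i]
            yield doc_id, data[i + 1]

    def iter_docs(self, data: List[int]) -> Iterator[int]:
        # Match-only iteration over the richer format: decode, discard freq.
        for doc_id, _freq in self.iter_docs_and_freqs(data):
            yield doc_id


# The schema maps each field to a Similarity rather than carrying a flag.
schema = {"tags": MatchSimilarity(), "body": LuceneSimilarity()}
postings = [(3, 2), (7, 1), (12, 5)]

for field, sim in schema.items():
    data = sim.encode(postings)
    print(field, data, list(sim.iter_docs(data)))
```

Both fields yield the same doc ids under match-only iteration; the "body"
field just pays for decoding a freq value it never uses. The caller never has
to ask whether the stored format supports a given scoring model.
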
I think it's a good way to introduce the Similarity class and get people
comfortable with it, and I also think that it's good to keep stuff out of the
FieldType API when we can.

> We get users who are baffled that their phrase queries no longer work
> after setting omitTFAP.

This is still a weakness of MatchSimilarity. The default behavior of the
KinoSearch QueryParser, which I expect Lucy to follow, is to expand all
TermQueries and PhraseQueries out to cover all indexed fields. If we include
MatchSimilarity fields in that expansion, we'll match terms but not phrases.
Maybe that would be a little hard for users to understand -- shouldn't a
MatchSimilarity field allow phrases to match without contributing to scores?

On the other hand, typical candidates for MatchSimilarity...

  * unique_id
  * category
  * tags

... either won't contain multiple tokens, or won't generally return sensible
results for phrase queries.

> (Today it silently returns no results... with flex you'll get an exception).

Mmm, tough call.

> > They could use better codecs under the format-follows-Similarity model, too.
> > They'd just have to subclass and override the factory methods that spawn
> > posting encoders/decoders.
>
> Ahh, OK so that's how they'd do it.
>
> So... I think we're making a mountain out of a molehill.

Well, I don't see it that way, because I place great value on designing good
public APIs, and I think it's important that we avoid forcing users to know
about codecs.

> In format-follows-Sim, it sounds like that simply means the Sim has a
> default codec, but you can override it if you want (and it's the Sim
> that "owns" (has the method for) handing out the Codec you'll use).

Yes.

> Whereas in Lucene the same defaulting will take place. It's just that
> Sim won't "own" picking the Codec.

However, *something* down in Lucene besides the codec itself will be
influencing decoder polymorphism. If there was only one decoding function,
you'd always iterate positions.
:) Under format-follows-Sim, it would be the Similarity object that knows all
supported decoding configurations for the field.

> > You don't want to use the stronger, more constrictive check, right?
>
> You mean single inheritance? No. Because then we hardwire the attrs
> to the Codec. Standard codec should encode whatever attrs the app
> hands us... I think.

I might approach things the same way if Clownfish supported interface method
dispatch. :) As it is, though, I'm not sure that the single inheritance
requirement is an important liability.

Marvin Humphrey