On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:

> It won't encounter an unknown posting format. It's the codec. It
> knows all posting formats by the time it sees it.

OK, so you're not going to handle this the way Lucene handles field types and
accept a new codec spec reference with each field in each Document. There
will be per-index associations between field names and codecs and it will be
invalid to change those associations.

> Well, Codec is intentionally generic -- currently it "only" serves up
> readers & writers for postings, but over time I expect it'll
> be the class Lucene uses to get reader/writer for other parts of the
> index.

Huh? What does the posting format specifier have to do with e.g. stored
fields? What you're describing sounds more like the Architecture class in
KinoSearch.

> I'm a little confused: if I indexed a field with full postings data,
> shouldn't I still be allowed score with match only scoring?

Of course.

> When a movie is encoded to a file, the codec(s) determine all sorts of
> interesting details. Then when you watch the movie you're free to do
> whatever you want -- watch as hidef, as normal def, cropped, sound
> only, listen to different languages, pick subtitles, etc. How it's
> specifically encoded is strongly decoupled from how you use it.

I see what you're getting at. However, Similarity *already* affects the
contents of the index, via encodeNorm()/decodeNorm() and lengthNorm(). So if
you want to divorce Similarity from index format, you'll need to remove those
methods.

In my opinion, it makes more sense to go the opposite direction, and have
Similarity objects spawn PostingEncoder objects which define the index format.
The ability of a search-time Similarity object to make relevance judgements
and assign scores is intimately tied to the information prepared for it in
advance and written at index-time.

I also like the idea of novice/intermediate users being able to express the
intent for how a field gets scored by choosing a Similarity subclass, without
having to worry about the underlying details of posting format.

> > What's the flex API for specifying a custom posting format?
>
> You implement a Codecs class, which within it knows about any number
> of Codec impls that it can retrieve by name.

So you have both a class named "Codec" and a class named "Codecs"? :( Tell
me, is this an array of Codecs or a Codecs?

    return codecs;

> Here's the default
> Codecs on flex now:
>
>   class DefaultCodecs extends Codecs {
>     DefaultCodecs() {
>       register(new StandardCodec());
>       register(new IntBlockCodec());
>       register(new PreFlexCodec());
>       register(new PulsingCodec());
>       register(new SepCodec());
>     }
>
>     @Override
>     public Codec getWriter(SegmentWriteState state) {
>       return lookup("Standard");
>       //return lookup("Pulsing");
>       //return lookup("Sep");
>       //return lookup("IntBlock");
>     }
>   }
>
> getWriter returns the Codec that will write the current segment.

So...

  * The user needs to know about SegmentWriteState?
  * The "codec" is per-index, not per-field? Presumably this will change?
  * The "codec" is a writer in this case, but since the name "codec" implies
    both coding and decoding, it must also be capable of functioning as a
    reader?

> > Right. But what happens when you want a custom codec to use BM25 weighting
> > *and* inline a part-of-speech ID *and* use PFOR?
>
> You'd use the PForCodec, and make an attr that injects POS.

OK. I don't think we're likely to do things that way in Lucy. The functions
which decode postings will operate directly on raw mmap'd memory, and they
typically won't make any external calls to either methods or non-inline
functions. If you wanted to use an esoteric custom format, you'd write your
own decoder function. There won't be a lot of code reuse at this inner-loop
level -- unrolling will be the rule rather than the exception.

> > I think we have to supply a class object or class name when asking for the
> > enumerator, like you do with AttributeSource.
> >
> >   PostingList plist = null;
> >   PostingListReader pListReader = segReader.fetch(PostingListReader);
> >   if (pListReader != null) {
> >     PostingsReader pReader = pListReader.fetch(field);
> >     if (pReader != null) {
> >       plist = pReader.makePostingList(klass); // e.g. PartOfSpeechPostingList
> >     }
> >   }
>
> But is plist a "normal" postings iterator (ie, subclasses it) that has
> also exposed a dedicated POS API?

It's definitely a "normal" postings iterator. As to whether we expose the
part-of-speech via an attribute or via a method, that's up in the air.

Hmm. From a class-design perspective, it would probably be best to go with an
attribute, since Lucy has only single-inheritance and no interfaces. A rigid
class hierarchy is going to cause problems when you need an iterator that
combines unrelated concepts like BM25 weighting and part-of-speech tagging.

> In flex you'd get a "normal" DocsAndPositionsEnum, pull the POS attr
> up front, and as you're next'ing your way through it, optionally look
> up the POS of each position you step through, using the POS attr.

Just a thought: why not make positions an attribute on a DocsEnum?

Marvin Humphrey
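
To make the attribute idea discussed above a little more concrete, here is a
rough sketch in Java (the language of the flex snippets quoted in this
thread). Every name in it (PostingAttribute, PartOfSpeechAttribute,
PositionsIterator, AttributeDemo) is hypothetical and made up for
illustration; none of them are real Lucene or Lucy classes, and the in-memory
arrays merely stand in for whatever a real decoder would read from the index.
The point is only the shape: the iterator stays a "normal" positions
iterator, and the part-of-speech ID rides along as an optional attribute that
the caller pulls once, up front.

```java
import java.util.HashMap;
import java.util.Map;

// All names below are hypothetical; this is not the Lucene or Lucy API.

/** Marker type for any optional per-position attribute. */
interface PostingAttribute {}

/** Part-of-speech ID for the position the iterator currently sits on. */
final class PartOfSpeechAttribute implements PostingAttribute {
    int posId = -1;
}

/** A plain positions iterator that can also carry optional attributes. */
class PositionsIterator {
    static final int NO_MORE = -1;

    private final int[] positions;
    private final int[] posIds; // parallel to positions; null if no POS data
    private final Map<Class<?>, PostingAttribute> attrs = new HashMap<>();
    private final PartOfSpeechAttribute posAttr;
    private int cursor = -1;

    PositionsIterator(int[] positions, int[] posIds) {
        this.positions = positions;
        this.posIds = posIds;
        if (posIds != null) {
            // Only register the attribute when the data is actually present.
            posAttr = new PartOfSpeechAttribute();
            attrs.put(PartOfSpeechAttribute.class, posAttr);
        } else {
            posAttr = null;
        }
    }

    /** Returns the attribute of the requested class, or null if absent. */
    <A extends PostingAttribute> A getAttribute(Class<A> klass) {
        return klass.cast(attrs.get(klass));
    }

    /** Advances to the next position and refreshes any attributes. */
    int nextPosition() {
        if (cursor + 1 >= positions.length) {
            return NO_MORE;
        }
        cursor++;
        if (posAttr != null) {
            posAttr.posId = posIds[cursor];
        }
        return positions[cursor];
    }
}

final class AttributeDemo {
    /** Counts positions whose part-of-speech ID equals wantedPosId. */
    static int countWithPos(PositionsIterator it, int wantedPosId) {
        // Pull the attribute once, before iterating.
        PartOfSpeechAttribute pos = it.getAttribute(PartOfSpeechAttribute.class);
        if (pos == null) {
            return 0; // this posting format carries no POS data
        }
        int count = 0;
        while (it.nextPosition() != PositionsIterator.NO_MORE) {
            if (pos.posId == wantedPosId) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        PositionsIterator it =
            new PositionsIterator(new int[] {3, 7, 12}, new int[] {1, 2, 1});
        System.out.println(countWithPos(it, 1)); // prints 2
    }
}
```

A consumer that doesn't care about part-of-speech just calls nextPosition()
and never asks for the attribute, so nothing in the class hierarchy has to
know about POS tagging. The cost is an extra field write per position, which
is exactly the kind of overhead the unrolled-decoder approach would avoid.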