On Sat, Mar 14, 2009 at 5:41 PM, Marvin Humphrey <[email protected]> wrote:
> That sounds like a fun exercise. Let's start from a clean slate, and try to
> build up interfaces for Indexer and Searchable. Just so that it's clear that
> nothing final is being decided, let's call our experimental project "Luser".

Thanks for playing along! I think this could be a very useful thought
experiment, at least for me.

> Since Lucy's internal
> doc numbers will be ephemeral, they wouldn't work. We'd need to add a primary
> key field.

While I understand this is mostly true, I had assumed that it was possible to
create some synthetic primary key, say, 'segment - doc'. If not, how does one
know which record to mark for deletion with a delete/add style of update?
Does one save some unique identifier as a field value, and then search for it?

> The very nature of an inverted index is that a single document gets broken up
> and listed in many posting lists. Updating all those posting lists is
> basically impossible without rebuilding the index.

While I routinely mangle everything related to the Lucy object hierarchy, the
physical layout of the inverted index is one area where I feel comfortable.
Yes, I'm suggesting that updating each and every one of those posting lists is
both feasible and possibly the best approach. It only becomes impossible if
the update rate exceeds the available memory bandwidth, which for a modern
processor is mighty high. Given a reasonable read/write ratio, I think it's a
win, at least for the cases I'm interested in.

I'm not suggesting you move to such an approach now, but it's a route I
strongly want left open until conclusively proven to be unworkable.

> Getting back to the point... I think IndexReader has to have Doc_Max() and
> Doc_Freq(). How do we avoid those and still support TF/IDF?

You are right that we need these, but it's just a question of where the
support should go. As a non-OO thinker, I view these as being data members
within some standardized Posting struct. Thus I'd guess that these should be
part of the Posting object, accessible only during the Scoring phase, rather
than on the IndexReader. But likely I'm not understanding you here.

> It's also hard to imagine life without Lexicon(), or support for sort caches.

Again, it's just a question of where these should go. While you view the
Lexicon as being part of the Index (and commented appropriately on the
overloading of this term), I view the two as independent. I can imagine cases
where different inverted indexes might share a common lexicon in the same way
that segments currently share one, perhaps even with the Lexicon being
generated and stored by an unrelated app (say, for user-applied tags).

> If you were to write an Architecture class to support pluggability, what would
> it look like?

I guess that's what I'm hoping to answer. I don't currently know. My
instinct is that it might be best to treat the problem as one of data exchange
between levels rather than pluggability, but that might just be my poor OO
intuition.

Nathan Kurz
[email protected]
