On Sat, Sep 27, 2008 at 3:14 PM, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>> The downside is that each Scorer remains tied to particular index
>> format. Long-term I still think this is disastrous, but in the short
>> term it's not that bad.
>
> Can you please elaborate on what you see as the downsides?

I could be wrong about this, but I'll start here, as this question relates closely to my motivations.

In the time I was trying to make use of KinoSearch, I found it very difficult to experiment with new scoring systems and new index formats. Each time I wanted to try something new (positional scoring instead of TF/IDF, or reading posting lists from SQLite) it felt like I had to reinvent the whole wheel.

Despite the layers of abstraction, there are still a lot of cross-dependencies. Currently, each posting class is tied to the internal binary format of the index in use, and the low-level scorers (like PhraseScorer) presume a binary layout of the Posting. Creating a new index format involves either creating a whole bunch of classes or understanding the interactions of the existing classes well enough to maintain full compatibility. Despite considerable time spent, I still don't feel like I understand these interactions.

Worse, my own uses of KinoSearch are likely to include custom scorers interacting with custom indexes. It's highly unlikely that anyone else is going to have exactly the same needs, but it seems reasonably likely that others would be interested in just the scoring approach, or just the index format. I think development might go much faster if these two could be decoupled.

The goal would be to make it possible to write a new index format as a single class and to have the existing scorers just keep working. Conversely, I want my (theoretical) custom scorers to keep working even when the underlying index format changes. I want it to be possible to use others' components piece by piece without having to replace the whole Scorer/Posting/InStream/etc. complex, and to make it possible for others to use and test my components without having to use my whole system.

I don't think this is possible with the current approach, and I fear this will hinder future development and developers. This could be just evidence of my own limitations, though.

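
To make the kind of decoupling described above more concrete, here is a rough sketch in Python. It is not KinoSearch's API: every name in it (Posting, PostingSource, DictIndex, DeltaIndex, phrase_scores) is hypothetical, and the two "index formats" are toy in-memory stand-ins. The only point is that the positional scorer sees a format-neutral posting interface, so each format is a single class and the scorer does not change.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Posting:
    """Format-neutral view of one (term, document) entry (hypothetical)."""
    doc_id: int
    positions: Tuple[int, ...]


class PostingSource(ABC):
    """Hypothetical interface: all a scorer is allowed to know about an index."""

    @abstractmethod
    def postings(self, term: str) -> Iterator[Posting]:
        """Yield postings for `term` in increasing doc_id order."""


class DictIndex(PostingSource):
    """Toy format 1: term -> {doc_id: [absolute positions]}."""

    def __init__(self, data: Dict[str, Dict[int, List[int]]]):
        self._data = data

    def postings(self, term: str) -> Iterator[Posting]:
        docs = self._data.get(term, {})
        for doc_id in sorted(docs):
            yield Posting(doc_id, tuple(docs[doc_id]))


class DeltaIndex(PostingSource):
    """Toy format 2: term -> [(doc_id gap, [position gaps])], delta-encoded."""

    def __init__(self, data: Dict[str, List[Tuple[int, List[int]]]]):
        self._data = data

    def postings(self, term: str) -> Iterator[Posting]:
        doc_id = 0
        for doc_gap, pos_gaps in self._data.get(term, []):
            doc_id += doc_gap  # undo the doc id delta
            pos, positions = 0, []
            for gap in pos_gaps:
                pos += gap  # undo the position delta
                positions.append(pos)
            yield Posting(doc_id, tuple(positions))


def phrase_scores(source: PostingSource, first: str, second: str) -> Dict[int, int]:
    """Positional scorer: per document, count where `second` directly follows `first`."""
    second_by_doc = {p.doc_id: set(p.positions) for p in source.postings(second)}
    scores: Dict[int, int] = {}
    for p in source.postings(first):
        following = second_by_doc.get(p.doc_id, set())
        hits = sum(1 for pos in p.positions if pos + 1 in following)
        if hits:
            scores[p.doc_id] = hits
    return scores
```

Under these assumptions, swapping DictIndex for DeltaIndex requires no change to phrase_scores, and a different scorer could be written against PostingSource without knowing which format sits underneath. A real design would also have to address the cost of going through such an interface, which this sketch ignores.
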
Perhaps I just need better examples of how to accomplish these things with the current system. Hence my suggestions for adding parallel support for P4Delta compression, reading Lucene indexes directly, and non-TF/IDF positional scoring. I'm hoping that you'll either show the way to do this effectively or realize the need for architectural changes to allow this.

Nathan Kurz
[EMAIL PROTECTED]
