I'd like to try some experiments to see if I can improve search performance by changing analysis (e.g. adding/removing word bigrams or commongrams), or by changing how I map my source records into Lucene documents. The problem is that my index currently is about 1TB in size and takes about 2-3 weeks to build, so if I have to rebuild the entire index in order to test each potential improvement, then I'm going to be waiting around a lot.
One option is to test potential performance improvements by building indexes not for the full dataset, but rather for, say, a 1% sample of the full dataset. (That is, I'll just index 1% of the source records.) I would build one small control index, and then n small test indexes, one for each intervention I wish to try. The hope would be that, if an indexing intervention significantly improves performance for the small indexes, then it would also significantly improve performance of the full dataset. (Similarly, you'd hope that if an intervention *didn't* significantly improve performance on the small indexes, then it would *not* significantly improve performance of the full dataset.) This would allow me to quickly accept and reject interventions (as least provisionally), and only try applying the most obviously promising ones to the full dataset. Any thoughts on how naive this is? Does it sound more like a way to save time, or like a way to waste time misleading myself? Cheers, Chris --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org