[ http://issues.apache.org/jira/browse/LUCENE-505?page=all ]
Steven Tamm updated LUCENE-505: ------------------------------- Attachment: NormFactors.patch Sorry, I didn't remove whitespace in the previous patch. This one's easier to read. svn diff --diff-cmd diff -x "-b -u" works better than svn diff --diff-cmd diff -x -b -x -u > MultiReader.norm() takes up too much memory: norms byte[] should be made into > an Object > --------------------------------------------------------------------------------------- > > Key: LUCENE-505 > URL: http://issues.apache.org/jira/browse/LUCENE-505 > Project: Lucene - Java > Type: Improvement > Components: Index > Versions: 1.9 > Environment: Patch is against Lucene 1.9 trunk (as of Mar 1 06) > Reporter: Steven Tamm > Attachments: NormFactors.patch, NormFactors.patch > > MultiReader.norms() is very inefficient: it has to construct a byte array > that's as long as all the documents in every segment. This doubles the > memory requirement for scoring MultiReaders vs. Segment Readers. Although > this is cached, it's still a baseline of memory that is unnecessary. > The problem is that the Normalization Factors are passed around as a byte[]. > If it were instead replaced with an Object, you could perform a whole host of > optimizations > a. When reading, you wouldn't have to construct a "fakeNorms" array of all > 1.0fs. You could instead return a singleton object that would just return > 1.0f. > b. MultiReader could use an object that could delegate to NormFactors of the > subreaders > c. You could write an implementation that could use mmap to access the norm > factors. Or if the index isn't long lived, you could use an implementation > that reads directly from the disk. > The patch provided here replaces the use of byte[] with a new abstract class > called NormFactors. > NormFactors has two methods on it > public abstract byte getByte(int doc) throws IOException; // Returns the > byte[doc] > public float getFactor(int doc) throws IOException; // Calls > Similarity.decodeNorm(getByte(doc)) > There are four implementations of this abstract class > 1. NormFactors.EmptyNormFactors - This replaces the fakeNorms with a > singleton that only returns 1.0 > 2. NormFactors.ByteNormFactors - Converts a byte[] to a NormFactors for > backwards compatibility in constructors. > 3. MultiNormFactors - Multiplexes the NormFactors in MultiReader to prevent > the need to construct the gigantic norms array. > 4. SegmentReader.Norm - Same class, but now extends NormFactors to provide > the same access. > In addition, Many of the Query and Scorer classes were changes to pass around > NormFactors instead of byte[], and to call getFactor() instead of using the > byte[]. I have kept around IndexReader.norms(String) for backwards > compatibiltiy, but marked it as deprecated. I believe that the use of > ByteNormFactors in IndexReader.getNormFactors() will keep backward > compatibility with other IndexReader implementations, but I don't know how to > test that. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]