[jira] Commented: (LUCENE-505) MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object

Steven Tamm (JIRA) Wed, 01 Mar 2006 15:02:07 -0800

    [ http://issues.apache.org/jira/browse/LUCENE-505?page=comments#action_12368389 ]


Steven Tamm commented on LUCENE-505:
------------------------------------

> I also worry about performance with this change. Have you benchmarked this 
> while searching large indexes?
Yes, see below.

> For example, in TermScorer.score(HitCollector, int), Lucene's innermost loop, 
> you change two array accesses into a call to an interface. That could make a 
> substantial difference. Small changes to that method can cause significant 
> performance changes. 

To be precise about "you change two array accesses into a call to an 
interface": I have changed two byte array references (one of which is static) 
into a method call on an abstract class, not an interface.  I'm using JDK 
1.5.0_06.  HotSpot inlines both calls, and performance was about the same with 
a 1M-document index (we have a low term/doc ratio, so we have about 8.5M 
terms).  HPROF doesn't even see the call to Similarity.decodeNorm.  If I were 
using JDK 1.3, I'd probably agree with you, but HotSpot is very good at 
figuring this out and inlining the calls automatically.
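
To make the change under discussion concrete, here is a minimal sketch of the 
two forms of the inner loop.  Every name in it (NormSource, ArrayNormSource, 
InnerLoopSketch, DECODE_TABLE) is a hypothetical stand-in, not code from the 
patch or from Lucene, and the decode table holds made-up values rather than 
Lucene's real norm encoding.

```java
// Hypothetical stand-ins for illustration only; not the classes from the patch.
abstract class NormSource {
    // Stand-in for a static norm decode table. The values are made up.
    static final float[] DECODE_TABLE = new float[256];
    static {
        for (int i = 0; i < 256; i++) {
            DECODE_TABLE[i] = i / 255.0f;
        }
    }

    abstract byte getByte(int doc);

    // Decodes the stored byte; one virtual call replaces two array reads at the call site.
    float getFactor(int doc) {
        return DECODE_TABLE[getByte(doc) & 0xFF];
    }
}

final class ArrayNormSource extends NormSource {
    private final byte[] norms;

    ArrayNormSource(byte[] norms) {
        this.norms = norms;
    }

    @Override
    byte getByte(int doc) {
        return norms[doc];
    }
}

final class InnerLoopSketch {
    // Before: two array reads per document (the norms array and the static table).
    static float sumWithArrays(byte[] norms, float weight) {
        float sum = 0f;
        for (int doc = 0; doc < norms.length; doc++) {
            sum += weight * NormSource.DECODE_TABLE[norms[doc] & 0xFF];
        }
        return sum;
    }

    // After: one method call on an abstract class per document.
    static float sumWithObject(NormSource norms, int maxDoc, float weight) {
        float sum = 0f;
        for (int doc = 0; doc < maxDoc; doc++) {
            sum += weight * norms.getFactor(doc);
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] raw = {0, 51, (byte) 255, 102};
        float before = sumWithArrays(raw, 2.0f);
        float after = sumWithObject(new ArrayNormSource(raw), raw.length, 2.0f);
        System.out.println(before + " " + after);
    }
}
```

Both loops perform the same arithmetic in the same order, so they produce the 
same sum.  Whether the method call costs anything extra depends on the JIT 
inlining it, which is the point being argued here.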

As for the numbers: a request returning 5000 hits from our 0.5G index averaged 
~485ms on my box before the change and ~480ms after it (50 runs each).  Most 
of that time is overhead, granted.  

Any performance gain from this change may be obscured by my other change in 
TermScorer (LUCENE-502).  I'm not sure of the history of TermScorer, but it 
seems heavily optimized for a large number of terms per document.  We have a 
low number of terms per document, so performance suffers greatly.  Performance 
improved dramatically once things were no longer cached unnecessarily.  
TermScorer also seems heavily optimized for a non-modern VM (for example, 
inlining next() into score(), caching the result of Math.sqrt for each term 
being queried, and keeping a doc/freq cache that provides no benefit unless 
iterating backwards).  Together, the TermScorer changes brought the average 
down from ~580ms. 

Since we use a lot of large indexes and don't keep them in memory all that 
often, our performance increases dramatically due to the reduction in GC 
overhead.  As we move to reading the norms from disk instead of storing the 
norms array in memory, this change will have an even greater benefit.  I'm in 
the process of preparing a set of patches that will help people who don't have 
long-lived indexes, and this is just one part.

> MultiReader.norm() takes up too much memory: norms byte[] should be made into 
> an Object
> ---------------------------------------------------------------------------------------
>
>          Key: LUCENE-505
>          URL: http://issues.apache.org/jira/browse/LUCENE-505
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Index
>     Versions: 2.0
>  Environment: Patch is against Lucene 1.9 trunk (as of Mar 1 06)
>     Reporter: Steven Tamm
>  Attachments: NormFactors.patch, NormFactors.patch
>
> MultiReader.norms() is very inefficient: it has to construct a byte array 
> that's as long as all the documents in every segment.  This doubles the 
> memory requirement for scoring MultiReaders vs. Segment Readers.  Although 
> this is cached, it's still a baseline of memory that is unnecessary.
> The problem is that the Normalization Factors are passed around as a byte[].  
> If it were instead replaced with an Object, you could perform a whole host of 
> optimizations
> a.  When reading, you wouldn't have to construct a "fakeNorms" array of all 
> 1.0fs.  You could instead return a singleton object that would just return 
> 1.0f.
> b.  MultiReader could use an object that could delegate to NormFactors of the 
> subreaders
> c.  You could write an implementation that could use mmap to access the norm 
> factors.  Or if the index isn't long lived, you could use an implementation 
> that reads directly from the disk.
> The patch provided here replaces the use of byte[] with a new abstract class 
> called NormFactors.  
> NormFactors has two methods on it
>     public abstract byte getByte(int doc) throws IOException;  // Returns the byte[doc]
>     public float getFactor(int doc) throws IOException;        // Calls Similarity.decodeNorm(getByte(doc))
> There are four implementations of this abstract class
> 1.  NormFactors.EmptyNormFactors - This replaces the fakeNorms with a 
> singleton that only returns 1.0
> 2.  NormFactors.ByteNormFactors - Converts a byte[] to a NormFactors for 
> backwards compatibility in constructors.
> 3.  MultiNormFactors - Multiplexes the NormFactors in MultiReader to prevent 
> the need to construct the gigantic norms array.
> 4.  SegmentReader.Norm - Same class, but now extends NormFactors to provide 
> the same access.
> In addition, many of the Query and Scorer classes were changed to pass around 
> NormFactors instead of byte[], and to call getFactor() instead of using the 
> byte[].  I have kept around IndexReader.norms(String) for backwards 
> compatibility, but marked it as deprecated.  I believe that the use of 
> ByteNormFactors in IndexReader.getNormFactors() will keep backward 
> compatibility with other IndexReader implementations, but I don't know how to 
> test that.
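
The class hierarchy described in the issue can be sketched roughly as follows.  
The class and method names (NormFactors, EmptyNormFactors, ByteNormFactors, 
MultiNormFactors, getByte, getFactor) come from the description; the method 
bodies, constructors, the `starts` offsets array, the INSTANCE field, and the 
decode()/ENCODED_ONE stand-ins are assumptions made for illustration and are 
not taken from the patch.  The sketch uses top-level classes, drops the 
`throws IOException` clauses for brevity, and uses a made-up byte encoding in 
place of Similarity.decodeNorm.

```java
// Illustrative sketch only; bodies are assumptions, not the code in NormFactors.patch.
abstract class NormFactors {
    // Hypothetical encoding standing in for Similarity.decodeNorm: 64 decodes to 1.0f.
    static final byte ENCODED_ONE = 64;

    static float decode(byte b) {
        return (b & 0xFF) / 64.0f;
    }

    public abstract byte getByte(int doc);

    public float getFactor(int doc) {
        return decode(getByte(doc));
    }
}

// Replaces an all-1.0 "fakeNorms" array with a single shared object.
final class EmptyNormFactors extends NormFactors {
    static final EmptyNormFactors INSTANCE = new EmptyNormFactors();

    private EmptyNormFactors() {}

    @Override
    public byte getByte(int doc) {
        return ENCODED_ONE;
    }

    @Override
    public float getFactor(int doc) {
        return 1.0f;
    }
}

// Wraps an existing byte[] so older callers can still supply one.
final class ByteNormFactors extends NormFactors {
    private final byte[] norms;

    ByteNormFactors(byte[] norms) {
        this.norms = norms;
    }

    @Override
    public byte getByte(int doc) {
        return norms[doc];
    }
}

// Delegates to per-segment NormFactors instead of copying them into one big array.
final class MultiNormFactors extends NormFactors {
    private final NormFactors[] subs;
    private final int[] starts; // starts[i] is the first global doc id of subs[i]

    MultiNormFactors(NormFactors[] subs, int[] starts) {
        this.subs = subs;
        this.starts = starts;
    }

    // Finds the sub-reader whose doc id range contains the global doc id.
    private int subIndex(int doc) {
        int i = subs.length - 1;
        while (i > 0 && starts[i] > doc) {
            i--;
        }
        return i;
    }

    @Override
    public byte getByte(int doc) {
        int i = subIndex(doc);
        return subs[i].getByte(doc - starts[i]);
    }

    @Override
    public float getFactor(int doc) {
        int i = subIndex(doc);
        return subs[i].getFactor(doc - starts[i]);
    }
}
```

With this shape, a reader over several segments holds only a small array of 
references and offsets, so no norms array spanning every document has to be 
built.  A linear scan is used for the sub-reader lookup to keep the sketch 
short; a binary search over `starts` would be the natural choice for many 
segments.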

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


