I'd like to work on LUCENE-2335 as a GSoC project.
This is my proposal. Some parts reference Toke Eskildsen's blog. Please feel 
free to comment. Thanks.

Background knowledge:
Given an ordinal, the term is returned by querying the index. This is just a 
logical mapping and requires practically no memory.
The ordinals are sorted, typically with respect to a locale, and the sorted 
list is called the indirects list. If one entry's position in the indirects 
list is lower than another's, its term comes before the other entry's term 
with respect to sorting. We always need to sort in order to build the 
indirects, even if the terms in the segments are already in order.
For each document id, a list of corresponding indirects is kept. By 
following the indirects through the ordinals, the corresponding terms can be 
resolved. Memory-wise, this requires a list of integers as long as the number 
of documents, plus a list of integers as long as the total number of indirects 
for all documents.
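
To make the mapping concrete, here is a minimal sketch of the three lists 
described above. It is not Lucene code: the class and method names are made up 
for illustration, a plain String[] stands in for querying the index by 
ordinal, and for simplicity the per-document offset list has one extra 
trailing entry.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration of ordinals and indirects; not Lucene code.
final class IndirectsSketch {
    // Stand-in for the index: the array position is the ordinal.
    static final String[] TERMS_BY_ORDINAL = {"pear", "apple", "Zebra", "banana"};

    /** Returns indirects, where indirects[i] is the ordinal of the i-th term in collator order. */
    static int[] buildIndirects(String[] termsByOrdinal, Collator collator) {
        Integer[] boxed = new Integer[termsByOrdinal.length];
        for (int i = 0; i < boxed.length; i++) {
            boxed[i] = i;
        }
        // Sort the ordinals by the locale-aware order of their terms.
        Arrays.sort(boxed, (a, b) -> collator.compare(termsByOrdinal[a], termsByOrdinal[b]));
        int[] indirects = new int[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            indirects[i] = boxed[i];
        }
        return indirects;
    }

    /** Resolves a document's terms: doc -> indirect positions -> ordinals -> terms. */
    static String[] termsForDoc(int doc, int[] docStart, int[] docIndirects,
                                int[] indirects, String[] termsByOrdinal) {
        String[] out = new String[docStart[doc + 1] - docStart[doc]];
        for (int i = 0; i < out.length; i++) {
            int indirect = docIndirects[docStart[doc] + i];
            out[i] = termsByOrdinal[indirects[indirect]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] indirects = buildIndirects(TERMS_BY_ORDINAL, Collator.getInstance(Locale.ENGLISH));
        int[] docStart = {0, 2, 2, 3};   // doc 0 has two entries, doc 1 none, doc 2 one
        int[] docIndirects = {0, 2, 3};  // positions in the indirects list
        System.out.println(Arrays.toString(indirects));
        System.out.println(Arrays.toString(
            termsForDoc(0, docStart, docIndirects, indirects, TERMS_BY_ORDINAL)));
    }
}
```

Comparing two positions in the indirects list is enough to order two terms, 
so the strings themselves are only touched when a term has to be shown.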

Problem:
Currently, Lucene loads ordinals and strings together, whether or not they are 
in the cache. However, there is one circumstance where the strings are not 
needed: the index has only one segment, and the search does not require fields 
to be filled with term strings. It would save some memory if Lucene loaded the 
strings only when necessary, without losing the ordinal information already 
obtained.

Classes involved:
FieldCacheImpl implements the FieldCache interface. 
StringIndexCache.createValue(...) is where StringIndex objects are created.
StringIndex contains two related fields, String[] lookup and int[] order. The 
former contains "All the term values, in natural order". The latter is "For 
each document, an index into the lookup array."
StringOrdValComparator.setNextReader(...) calls 
FieldCache.getStringIndex(...).
When StringOrdValComparator is initialized, it allocates space for ords and 
values.
FieldValueHitQueue is a priority queue that should hold values.
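
As a rough picture of how the two StringIndex fields relate, here is a 
simplified stand-in. StringIndexSketch and its methods are hypothetical names 
for illustration; only the lookup and order fields mirror the description 
above.

```java
// Hypothetical, simplified stand-in for StringIndex; not the Lucene class.
final class StringIndexSketch {
    final String[] lookup; // all the term values, in natural order
    final int[] order;     // for each document, an index into the lookup array

    StringIndexSketch(String[] lookup, int[] order) {
        this.lookup = lookup;
        this.order = order;
    }

    /** Orders two documents of the same segment using ords only; lookup is never read. */
    int compareByOrd(int docA, int docB) {
        return Integer.compare(order[docA], order[docB]);
    }

    /** Resolves the term string of a document; the only place lookup is read. */
    String term(int doc) {
        return lookup[order[doc]];
    }
}
```

In this sketch, sorting documents within one segment goes through 
compareByOrd alone, which is why the lookup array looks avoidable in that case.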

Modification:
So the idea is to add a condition before values are copied into the cache. A 
StringIndex could be created in two ways: one that resolves term.text and one 
that does not. It might work. This is my initial thought. I haven't yet worked 
out how to share ords across invocations.
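
One possible shape for this, purely as a sketch under my own assumptions: keep 
the ords eagerly and resolve the strings on first use. LazyStringIndexSketch 
and the termLoader callback are hypothetical and do not exist in Lucene; the 
callback stands in for whatever code would read term.text from the index. The 
sketch does not address sharing ords across invocations.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a StringIndex-like holder that loads strings lazily.
final class LazyStringIndexSketch {
    final int[] order;                           // always loaded
    private final Supplier<String[]> termLoader; // assumed callback that reads the terms
    private String[] lookup;                     // null until a string is requested

    LazyStringIndexSketch(int[] order, Supplier<String[]> termLoader) {
        this.order = order;
        this.termLoader = termLoader;
    }

    /** Ord-only comparison; never triggers loading of the strings. */
    int compareByOrd(int docA, int docB) {
        return Integer.compare(order[docA], order[docB]);
    }

    /** Loads the strings on first call, then reuses them; the ords are kept as they are. */
    synchronized String term(int doc) {
        if (lookup == null) {
            lookup = termLoader.get();
        }
        return lookup[order[doc]];
    }

    synchronized boolean termsLoaded() {
        return lookup != null;
    }
}
```

A search that only sorts within one segment would call compareByOrd and never 
pay for the strings, while a search that fills fields would call term and 
trigger the load once.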
