I'm trying to work on LUCENE-2335 as a GSoC project. This is my proposal. Some parts reference Toke Eskildsen's blog. Please feel free to comment. Thanks.
Background knowledge: Given an ordinal, the term is returned by querying the index. This is just a logical mapping and requires practically no memory. The ordinals are sorted, typically with respect to a locale, and the sorted list is called the indirects list. If one entry's position in the indirects list is lower than another's, its term sorts before the other entry's term. We always need to sort in order to build the indirects, even if the terms in the segments are already in order. For each document id, a list of corresponding indirects is kept. By following the indirects through the ordinals, the corresponding terms can be resolved. Memory-wise, this requires a list of integers as long as the number of documents, plus a list of integers as long as the total number of indirects for all documents.

Problem: Currently, Lucene loads ordinals and strings together, whether or not they are in the cache. However, there is one case where the strings are not needed: the index has only one segment, and the search does not require fields to be filled in with term strings. It would save some memory if Lucene loaded the strings only when necessary, without losing the ordinal information already obtained.

Classes involved: FieldCacheImpl implements the FieldCache interface. StringIndexCache.createValue(...) is where StringIndex objects are created. StringIndex contains two related fields, String[] lookup and int[] order. The former contains "All the term values, in natural order". The latter is "For each document, an index into the lookup array." StringOrdValComparator.setNextReader(...) calls FieldCache.getStringIndex(...). When StringOrdValComparator is initialized, it allocates space for ords and values. FieldValueHitQueue is a priority queue that should hold the values.

Modification: The idea is to add a condition before the values are copied into the cache. A StringIndex could then be created in two ways: one that resolves term.text and one that does not.
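To make the modification more concrete, here is a rough sketch of what I have in mind. It is plain Java and does not touch Lucene's real classes: LazyStringIndex, termResolver and compareDocs are names I made up for illustration, and the resolver only stands in for whatever would actually read the term text from the index. The point is just that the order array is always available, while the lookup array is filled in on first request.

```java
import java.util.function.IntFunction;

/** Hypothetical stand-in for StringIndex; not part of Lucene. */
final class LazyStringIndex {
    /** For each document, an ordinal into the (possibly unresolved) term list. */
    final int[] order;

    private final int termCount;
    // Assumption: some callback that can fetch the term text for an ordinal.
    private final IntFunction<String> termResolver;
    // Stays null until a caller actually asks for the strings.
    private String[] lookup;

    LazyStringIndex(int[] order, int termCount, IntFunction<String> termResolver) {
        this.order = order;
        this.termCount = termCount;
        this.termResolver = termResolver;
    }

    /** True once the term strings have been materialized. */
    synchronized boolean isResolved() {
        return lookup != null;
    }

    /** Resolves all term strings on the first call; the ordinals are left untouched. */
    synchronized String[] lookup() {
        if (lookup == null) {
            String[] resolved = new String[termCount];
            for (int ord = 0; ord < termCount; ord++) {
                resolved[ord] = termResolver.apply(ord);
            }
            lookup = resolved;
        }
        return lookup;
    }

    /** Compares two documents by ordinal only, so no strings are needed. */
    int compareDocs(int docA, int docB) {
        return Integer.compare(order[docA], order[docB]);
    }
}
```

With something like this, a sort over a single segment could call compareDocs alone and never pay for the strings, and a caller that needs to fill fields would call lookup() and get them at that point. Comparing by ordinal is only meaningful within one segment, which is the case described above.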
This might work, but it is only my initial thought. I have not yet worked out how to handle sharing ords across invocations.
