Pawel Rog created LUCENE-7253:
---------------------------------

             Summary: Sparse data in doc values and segments merging 
                 Key: LUCENE-7253
                 URL: https://issues.apache.org/jira/browse/LUCENE-7253
             Project: Lucene - Core
          Issue Type: Improvement
    Affects Versions: 5.5, 6.0
            Reporter: Pawel Rog


Doc Values were recently optimized to store sparse data efficiently. 
Unfortunately, there is still a big problem with Doc Values merges for sparse 
fields. In an index of 1 billion documents, it does not seem to matter whether 
every document has a value for a field or only 1 document does: segment merge 
time is the same in both cases. Most of the time this is not a problem, but 
there are several use cases in which one can expect to have many fields with 
sparse doc values.

Here is an example. During performance tests of a system with a large 
number of sparse fields, I realized that Doc Values merges were a bottleneck. I 
had hundreds of different numeric fields. Each document contained only a small 
subset of all fields; an average document contained 5-7 different numeric 
values. As you can see, the data in these fields was very sparse. It turned 
out that the ingestion process was CPU-bound. Most of the CPU time was spent in 
DocValues related methods (SingletonSortedNumericDocValues#setDocument, 
DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued, 
DocValuesConsumer$4$1#setNext, ...), mostly while merging segments.

Adrien Grand suggested reducing the number of sparse fields and replacing them 
with a smaller number of denser fields. This helped a lot but complicated field 
naming. 

I am not very familiar with the Doc Values source code, but I have a small 
suggestion for improving Doc Values merges for sparse fields. I noticed that 
Doc Values producers and consumers use Iterators. Take numeric Doc Values as an 
example. Would it be possible to replace the Iterator that "travels" through 
all documents with an Iterator over the collection of non-empty values only? Of 
course, this would require storing an object (instead of a plain number) that 
contains both the value and the document ID. Such an iterator could 
significantly improve merge time for sparse Doc Values fields. IMHO this would 
not cause much overhead for dense structures, but it could be a game changer 
for sparse structures.
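
To make the idea concrete, here is a minimal, self-contained sketch that 
compares the two iteration styles. It does not use any Lucene classes: 
SparseNumericSketch, DocValue, denseIterator, sparseIterator and countSteps are 
hypothetical names invented for this illustration, and the parallel 
docIDs/values arrays are only an assumed way of holding the non-empty entries.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch; none of these names exist in Lucene. */
class SparseNumericSketch {

  /** One (docID, value) pair for a document that actually has a value. */
  static final class DocValue {
    final int docID;
    final long value;

    DocValue(int docID, long value) {
      this.docID = docID;
      this.value = value;
    }
  }

  /** Dense style: one step per document, null where the field is missing. */
  static Iterator<Long> denseIterator(final int maxDoc, final int[] docIDs, final long[] values) {
    return new Iterator<Long>() {
      private int doc = 0;   // next document to visit
      private int upto = 0;  // next entry in docIDs/values (docIDs must be sorted)

      @Override
      public boolean hasNext() {
        return doc < maxDoc;
      }

      @Override
      public Long next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        Long result = null;
        if (upto < docIDs.length && docIDs[upto] == doc) {
          result = values[upto++];
        }
        doc++;
        return result;
      }

      @Override
      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }

  /** Sparse style: one step per document that has a value. */
  static Iterator<DocValue> sparseIterator(final int[] docIDs, final long[] values) {
    return new Iterator<DocValue>() {
      private int upto = 0;

      @Override
      public boolean hasNext() {
        return upto < docIDs.length;
      }

      @Override
      public DocValue next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        DocValue dv = new DocValue(docIDs[upto], values[upto]);
        upto++;
        return dv;
      }

      @Override
      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }

  /** Counts how many times next() must be called to exhaust the iterator. */
  static int countSteps(Iterator<?> it) {
    int steps = 0;
    while (it.hasNext()) {
      it.next();
      steps++;
    }
    return steps;
  }

  public static void main(String[] args) {
    int[] docIDs = {3, 999_999};
    long[] values = {42L, 7L};
    System.out.println("dense steps: " + countSteps(denseIterator(1_000_000, docIDs, values)));
    System.out.println("sparse steps: " + countSteps(sparseIterator(docIDs, values)));
  }
}
```

With two values spread over one million documents, the dense iterator needs 
one million steps while the sparse one needs two. The price is one small 
object per value, which is the overhead for dense fields mentioned above.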

This is what happens in NumericDocValuesWriter on flush:

{code}
    dvConsumer.addNumericField(fieldInfo,
                               new Iterable<Number>() {
                                 @Override
                                 public Iterator<Number> iterator() {
                                   return new NumericIterator(maxDoc, values, docsWithField);
                                 }
                               });
{code}

Before this happens, during addValue, the following loop is executed to fill holes:

{code}
    // Fill in any holes:
    for (int i = (int)pending.size(); i < docID; ++i) {
      pending.add(MISSING);
    }
{code}
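
The cost of this loop for a sparse field can be illustrated with a small 
stand-alone sketch. This is not the Lucene code: HoleFillingSketch is a 
hypothetical class, a plain ArrayList stands in for the real pending buffer, 
and MISSING = 0 is an assumed placeholder value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the buffering in NumericDocValuesWriter; not Lucene code. */
class HoleFillingSketch {
  // Assumed placeholder written for documents without a value.
  static final long MISSING = 0L;

  // Stand-in for the real pending buffer: one slot per document seen so far.
  private final List<Long> pending = new ArrayList<>();
  private int holesFilled = 0;

  /** Adds a value for docID; docIDs must arrive in increasing order. */
  void addValue(int docID, long value) {
    if (docID < pending.size()) {
      throw new IllegalArgumentException("docID " + docID + " was already added or is out of order");
    }
    // Fill in any holes: every skipped document still gets an entry.
    for (int i = pending.size(); i < docID; ++i) {
      pending.add(MISSING);
      holesFilled++;
    }
    pending.add(value);
  }

  /** Total number of buffered entries, placeholders included. */
  int size() {
    return pending.size();
  }

  /** Number of placeholder entries written for documents without a value. */
  int holesFilled() {
    return holesFilled;
  }

  public static void main(String[] args) {
    HoleFillingSketch writer = new HoleFillingSketch();
    writer.addValue(0, 5L);
    writer.addValue(9_999, 7L);
    System.out.println("entries: " + writer.size() + ", holes: " + writer.holesFilled());
  }
}
```

Two real values at docIDs 0 and 9,999 produce 10,000 buffered entries, 9,998 
of which are placeholders. Everything downstream then iterates over all of 
them.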

It turns out that the variable called pending is used only internally in 
NumericDocValuesWriter. I know pending is a PackedLongValues, and it would not 
be good to replace it with a different class (some kind of list) because this 
may hurt DV performance for dense fields. I hope someone can suggest 
interesting solutions to this problem :).

It would be great if a discussion about sparse Doc Values merge performance 
could start here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
