[jira] [Commented] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

Varun Thacker (JIRA) Fri, 30 May 2014 01:26:19 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013414#comment-14013414 ]


Varun Thacker commented on LUCENE-5688:
---------------------------------------

Hi Shai,

Thanks for reviewing.

bq. Perhaps you can also experiment with a tiny hash-map, using plain 
int[]+long[] or a pair of packed arrays, instead of the binary search tree. I 
am writing one now because I am experimenting with improvements to updatable 
DocValues. It's based on Solr's HashDocSet which I modify to act as an 
int-to-long map. I can share the code here if you want

Sure, this approach also looks promising. It trades more memory for faster 
access. Perhaps we could provide both options in the same codec.
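
To make the hash-map idea concrete, here is a minimal sketch of an int-to-long 
map backed by plain int[]+long[] arrays, using open addressing with linear 
probing. It is not the HashDocSet-based code mentioned above, which has not 
been shared here. The class name IntToLongMap, the -1 empty-slot sentinel, the 
0.5 load factor and the hash mixing step are all assumptions made for 
illustration.

```java
import java.util.Arrays;

/** Hypothetical fixed-capacity docID-to-value map (illustration only). */
final class IntToLongMap {
    // Assumption: doc IDs are non-negative, so -1 can mark a free slot.
    private static final int EMPTY = -1;

    private final int[] keys;
    private final long[] values;
    private final int mask;
    private int size;

    IntToLongMap(int expectedSize) {
        if (expectedSize < 0 || expectedSize > (1 << 29)) {
            throw new IllegalArgumentException("unsupported size: " + expectedSize);
        }
        // Power-of-two capacity of at least 2x the expected size (load factor <= 0.5).
        int capacity = 2;
        while (capacity < 2L * expectedSize) {
            capacity <<= 1;
        }
        keys = new int[capacity];
        values = new long[capacity];
        mask = capacity - 1;
        Arrays.fill(keys, EMPTY);
    }

    private int slot(int doc) {
        int h = doc * 0x9E3779B9; // cheap multiplicative mixing of the doc ID
        return (h ^ (h >>> 16)) & mask;
    }

    /** Adds or overwrites the value stored for a doc ID. */
    void put(int doc, long value) {
        if (doc < 0) {
            throw new IllegalArgumentException("doc must be >= 0");
        }
        int slot = slot(doc);
        while (keys[slot] != EMPTY) {
            if (keys[slot] == doc) {
                values[slot] = value; // existing key: overwrite in place
                return;
            }
            slot = (slot + 1) & mask; // linear probing
        }
        if (size * 2 >= keys.length) {
            throw new IllegalStateException("map is full");
        }
        keys[slot] = doc;
        values[slot] = value;
        size++;
    }

    /** Returns the stored value, or missingValue if the doc has none. */
    long get(int doc, long missingValue) {
        if (doc < 0) {
            return missingValue;
        }
        int slot = slot(doc);
        while (keys[slot] != EMPTY) {
            if (keys[slot] == doc) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return missingValue;
    }

    int size() {
        return size;
    }
}
```

A lookup such as get(docID, 0L) touches a handful of adjacent slots instead of 
walking a binary search, at the cost of keeping at least half of the slots 
empty. The arrays could be swapped for packed arrays to reduce that overhead.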

bq. Another thing, maybe this codec should wrap another and delegate to it in case 
the number of docs-with-values exceeds some threshold? For instance, ignoring 
packing, the default DV encodes 8 bytes per document, while this codec encodes 
12 bytes (doc+value) per document which has a value. So I'm thinking that 
unless the field is really sparse, we might prefer the default encoding. We 
should fold that as well into the benchmark.

I thought about it, but since we are writing a codec dedicated to sparse values 
and not adding it as an optimization for the default codec, I did not include it 
in my patch. If you feel that we should, I will add it.
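
For reference, the threshold in the quoted suggestion can be worked out from 
the two per-entry costs it mentions. The sketch below is only that arithmetic, 
ignoring packing as the quote does. The class and method names are 
hypothetical and are not part of the patch or of any Lucene API.

```java
/** Hypothetical size estimate for dense vs. sparse numeric doc values. */
final class SparseDvCost {
    // Figures taken from the quoted comment, ignoring packing.
    static final long DENSE_BYTES_PER_DOC = 8;     // one value for every document
    static final long SPARSE_BYTES_PER_VALUE = 12; // doc ID + value, only for docs with a value

    static long denseBytes(int maxDoc) {
        return DENSE_BYTES_PER_DOC * maxDoc;
    }

    static long sparseBytes(int docsWithValue) {
        return SPARSE_BYTES_PER_VALUE * docsWithValue;
    }

    /** True when the sparse encoding is strictly smaller than the dense one. */
    static boolean preferSparse(int maxDoc, int docsWithValue) {
        return sparseBytes(docsWithValue) < denseBytes(maxDoc);
    }
}
```

With these numbers the break-even point is 12 * docsWithValue = 8 * maxDoc, so 
the sparse encoding only wins while fewer than two thirds of the documents 
have a value. A wrapping codec could use a check like preferSparse per field 
to decide when to delegate.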

A couple of other general questions that I had:
- Currently only addNumericField is implemented. Looking at 
Lucene45DocValuesConsumer, addBinaryField does not write missing values, so 
can the same code be reused?
- For addSortedField and addSortedSetField, would addTermsDict be the only 
method that needs to change?

> NumericDocValues fields with sparse data can be compressed better 
> ------------------------------------------------------------------
>
>                 Key: LUCENE-5688
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5688
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Varun Thacker
>            Priority: Minor
>         Attachments: LUCENE-5688.patch, LUCENE-5688.patch
>
>
> I ran into this problem where I had a dynamic field in Solr and indexed data 
> into lots of fields. For each field only a few documents had actual values, 
> and for the remaining documents the default value (0) got indexed. Now when I 
> merge segments, the index size jumps up.
> For example, I have 10 segments, each with 1 DV field. When I merge them 
> into 1 segment, that segment will contain all 10 DV fields with lots of 0s. 
> This was the motivation behind trying to come up with a compression scheme 
> for a use case like this.



