[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012177#comment-14012177 ]

Shai Erera commented on LUCENE-5688:
------------------------------------

bq. It does a binary search on the position data which is read using 
MonotonicBlockPackedReader.

Perhaps you can also experiment with a tiny hash map, using plain int[]+long[] 
arrays or a pair of packed arrays, instead of the binary search. I am writing 
one now because I am experimenting with improvements to updatable DocValues. 
It's based on Solr's {{HashDocSet}}, which I modified to act as an int-to-long 
map. I can share the code here if you want.
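For illustration, here is a minimal sketch of such a map backed by plain parallel int[]+long[] arrays with linear probing. This is just the general idea, not the actual {{HashDocSet}} code; the class name and sizing policy are made up for the example:

```java
// Minimal int-to-long open-addressing hash map over parallel plain arrays.
// Illustrative sketch only; not Solr's HashDocSet. Assumes doc IDs >= 0 and
// no removals, and that the map is sized up front for the expected entries.
final class IntToLongMap {
    private static final int EMPTY = -1;  // sentinel: slot holds no doc ID
    private final int[] keys;             // doc IDs
    private final long[] values;          // the numeric values
    private final int mask;

    IntToLongMap(int expectedSize) {
        // Next power of two >= 2 * expectedSize, to keep load factor <= 0.5.
        int capacity = Integer.highestOneBit(Math.max(2, expectedSize * 2) - 1) << 1;
        keys = new int[capacity];
        values = new long[capacity];
        java.util.Arrays.fill(keys, EMPTY);
        mask = capacity - 1;
    }

    void put(int doc, long value) {
        int slot = doc & mask;
        while (keys[slot] != EMPTY && keys[slot] != doc) {
            slot = (slot + 1) & mask;     // linear probing
        }
        keys[slot] = doc;
        values[slot] = value;
    }

    /** Returns the stored value, or 0 (the NumericDocValues default) if absent. */
    long get(int doc) {
        int slot = doc & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == doc) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return 0L;
    }

    boolean containsKey(int doc) {
        int slot = doc & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == doc) {
                return true;
            }
            slot = (slot + 1) & mask;
        }
        return false;
    }
}
```

Lookups are O(1) on average rather than O(log n) for a binary search, at the cost of roughly 2x the array space for the empty slots.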

bq. Also I am not too familiar with lucene-util but is there a test which 
benchmarks DocValue read times? Should be interesting to see the read time 
difference.

Luceneutil has a search task benchmark (searchBench.py) which you can use. I 
recently augmented it (while benchmarking updatable DV) with a 
sort-by-DocValues, so I think you can use that to exercise the sparse DV? Once 
you're ready to run the benchmark let me know, I can share the tasks file with 
you. You will also need to modify the indexer to create sparse DVs (make it 
configurable), as currently when DV is turned on, each document is indexed 
with a set of DV fields.

About the patch, I see you always encode a bitset + the values (sparse). I 
wonder whether, if you used a hashtable approach as I described above, you 
could just encode the docs that have a value. Then in the producer you can 
load them into memory (it's supposed to be small) and implement both 
getDocsWithField and getNumeric from the same structure. It will impact 
docsWithField performance, but I think it's worth benchmarking.
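A sketch of what that producer could look like, serving both methods from one in-memory map. For brevity it uses java.util.HashMap and two hand-rolled interfaces that merely stand in for Lucene's NumericDocValues and Bits; the class and interface names here are invented for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-ins for Lucene's NumericDocValues and Bits (illustrative only).
interface NumericValues { long get(int docID); }
interface DocsWithField { boolean has(int docID); }

// Sketch: one doc->value map, loaded once at open time, answers both
// getNumeric and getDocsWithField for a sparse field.
final class SparseNumericProducer {
    private final Map<Integer, Long> docToValue;

    SparseNumericProducer(Map<Integer, Long> docToValue) {
        this.docToValue = docToValue;
    }

    NumericValues getNumeric() {
        // Docs without a value report 0, matching the dense default.
        return doc -> docToValue.getOrDefault(doc, 0L);
    }

    DocsWithField getDocsWithField() {
        // Membership in the map *is* the docs-with-field set.
        return docToValue::containsKey;
    }
}
```

The point is that no separate bitset needs to be encoded: the set of keys in the map already answers docsWithField, at the cost of a hash lookup per call instead of a bitset test.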

Another thing: maybe this codec should wrap another codec and delegate to it 
in case the number of docs-with-values exceeds some threshold? For instance, 
ignoring packing, the default DV encoding spends 8 bytes per document, while 
this codec spends 12 bytes (doc + value) per document that has a value. So I'm 
thinking that unless the field is really sparse, we might prefer the default 
encoding. We should fold that into the benchmark as well.
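The back-of-envelope arithmetic, ignoring packing (the method name and exact byte costs are illustrative, not from the patch): dense costs 8 bytes for every doc in the segment, sparse costs 12 bytes only for docs with a value, so sparse wins while 12 * docsWithValue < 8 * maxDoc, i.e. below roughly 2/3 density:

```java
final class SparseThreshold {
    // Decide which encoding is smaller, ignoring packing:
    // dense  = 8 bytes * maxDoc          (a value slot for every doc)
    // sparse = 12 bytes * docsWithValue  (4-byte doc + 8-byte value per entry)
    static boolean preferSparse(int maxDoc, int docsWithValue) {
        long denseBytes = 8L * maxDoc;
        long sparseBytes = 12L * docsWithValue;
        return sparseBytes < denseBytes;
    }
}
```

In practice packing shifts the crossover point, which is exactly why it should be measured in the benchmark rather than hard-coded.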

> NumericDocValues fields with sparse data can be compressed better 
> ------------------------------------------------------------------
>
>                 Key: LUCENE-5688
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5688
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Varun Thacker
>            Priority: Minor
>         Attachments: LUCENE-5688.patch, LUCENE-5688.patch
>
>
> I ran into this problem where I had a dynamic field in Solr and indexed data 
> into lots of fields. For each field only a few documents had actual values, 
> and for the remaining documents the default value (0) got indexed. Now when 
> I merge segments, the index size jumps up.
> For example, I have 10 segments, each with 1 DV field. When I merge them 
> into 1, that segment will contain all 10 DV fields with lots of 0s. 
> This was the motivation behind trying to come up with a compression scheme 
> for a use case like this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
