[ https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Varun Thacker updated LUCENE-5688:
----------------------------------
    Attachment: LUCENE-5688.patch

bq. Can you perhaps make the Consumer/Producer package-private? I think only the Format needs to be public?

Done. The Lucene45DocValues producer and consumer are public. We could fix that in some other issue?

In SparseDocValuesProducer I have implemented 2 ways to get numerics:
- getNumericUsingBinarySearch -> uses the binary search approach
- getNumericUsingHashMap -> uses a hash map based approach

The test passes for both, so I think we should benchmark both approaches. Based on the results we could pick one approach, or even keep both of them and pick the right strategy using data from the benchmark results.

bq. It's based on Solr's HashDocSet which I modify to act as an int-to-long map. I can share the code here if you want.

That would be great. We can replace the HashMap in getNumericUsingHashMap with this.

From what I understand, this is how we can run the luceneutil benchmark tests:
- python setup.py -prepareTrunk
- svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk patch
- Apply the patch in this checkout
- We need a task file, and then call searchBench.py with -index and -search

> NumericDocValues fields with sparse data can be compressed better
> ------------------------------------------------------------------
>
>                 Key: LUCENE-5688
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5688
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Varun Thacker
>            Priority: Minor
>         Attachments: LUCENE-5688.patch, LUCENE-5688.patch, LUCENE-5688.patch
>
>
> I ran into this problem where I had a dynamic field in Solr and indexed data
> into lots of fields. For each field only a few documents had actual values,
> and for the remaining documents the default value (0) got indexed. Now when I
> merge segments, the index size jumps up.
> For example, I have 10 segments, each with 1 DV field.
> When I merge the segments into 1, that segment will contain all 10 DV fields
> with lots of 0s.
> This was the motivation behind trying to come up with a compression for a use
> case like this.
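
For readers following the binary search versus hash map discussion in the comment above, here is a minimal sketch of the two lookup strategies. It is not the code from the attached patch: the class name SparseNumericSketch, its methods, and the in-memory arrays are hypothetical, and it assumes that only documents with a non-default value are stored and that every other document reads back as 0.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the SparseDocValuesProducer from the patch. */
final class SparseNumericSketch {
    // Only documents that carry a non-default value are stored.
    private final int[] docIds;   // sorted ascending
    private final long[] values;  // values[i] belongs to docIds[i]
    private final Map<Integer, Long> byDoc = new HashMap<>();

    SparseNumericSketch(int[] sortedDocIds, long[] docValues) {
        if (sortedDocIds.length != docValues.length) {
            throw new IllegalArgumentException("docIds and values must have the same length");
        }
        this.docIds = sortedDocIds.clone();
        this.values = docValues.clone();
        for (int i = 0; i < docIds.length; i++) {
            byDoc.put(docIds[i], values[i]);
        }
    }

    /** O(log n) lookup over the sorted doc ID array; missing docs return the default 0. */
    long getUsingBinarySearch(int docId) {
        int idx = Arrays.binarySearch(docIds, docId);
        return idx >= 0 ? values[idx] : 0L;
    }

    /** Expected O(1) lookup, at the cost of boxing and extra memory per entry. */
    long getUsingHashMap(int docId) {
        return byDoc.getOrDefault(docId, 0L);
    }
}
```

The trade-off a benchmark would need to settle is roughly this: the binary search variant keeps two compact primitive arrays, while the hash map variant trades memory for faster point lookups. A primitive int-to-long map, such as the HashDocSet-based one mentioned above, would avoid the boxing overhead of java.util.HashMap.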