bruno-roustant commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r565404249



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -736,49 +736,92 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
   private void addTermsDict(SortedSetDocValues values) throws IOException {
     final long size = values.getValueCount();
     meta.writeVLong(size);
-    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
+    boolean compress =
+        Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode

Review comment:
       @zhaih could you be more specific about this? Could you post something in the Jira issue discussion to explain how this compression is harmful for BinaryDocValues? It needs to be more visible than just in this review.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -790,6 +833,28 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private int compressAndGetTermsDictBlockLength(
+      ByteArrayDataOutput bufferedOutput, LZ4.FastCompressionHashTable ht) throws IOException {
+    int uncompressedLength = bufferedOutput.getPosition();
+    data.writeVInt(uncompressedLength);
+    long before = data.getFilePointer();
+    LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+    int compressedLength = (int) (data.getFilePointer() - before);
+    // Block length will be used for creating buffer for decompression, one corner case is that
+    // compressed length might be bigger than un-compressed length, so just return the bigger one.
+    return Math.max(uncompressedLength, compressedLength);
+  }
+
+  private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
+    int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
+    if (pos + termLength >= originalLength - 1) {
+      int targetLength = (originalLength + termLength) << 1;

Review comment:
       The idea is to remove this line 851 (the manual doubling) and let ArrayUtil.oversize(), called inside ArrayUtil.grow(), decide the appropriate size growth (exponential growth with less memory waste than the classic x2).
   The following line would become something like:
   termsDictBuffer = ArrayUtil.grow(termsDictBuffer, pos + termLength)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


