[ 
https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453847#comment-17453847
 ] 

ASF GitHub Bot commented on PARQUET-2106:
-----------------------------------------

gszadovszky commented on a change in pull request #940:
URL: https://github.com/apache/parquet-mr/pull/940#discussion_r762784053



##########
File path: 
parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java
##########
@@ -183,10 +183,10 @@ public String toString() {
   private static abstract class BinaryComparator extends 
PrimitiveComparator<Binary> {
     @Override
     int compareNotNulls(Binary o1, Binary o2) {
-      return compare(o1.toByteBuffer(), o2.toByteBuffer());
+      return compareBinary(o1, o2);
     }
 
-    abstract int compare(ByteBuffer b1, ByteBuffer b2);
+    abstract int compareBinary(Binary b1, Binary b2);

Review comment:
       nit: Maybe `compareBinaries` would be a better naming

##########
File path: parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
##########
@@ -613,4 +681,60 @@ private static final boolean equals(byte[] array1, int 
offset1, int length1, byt
     }
     return true;
   }
+
+  // TODO java-doc

Review comment:
       Please, do not leave TODOs in the final code. (javadoc comments wouldn't 
be required for private methods anyway.)

##########
File path: parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
##########
@@ -328,7 +350,23 @@ public int compareTo(Binary other) {
       return 
PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR.compare(this, 
other);
     }
 
-   @Override
+    @Override
+    public int compareToLexicographic(Binary other) {
+      // NOTE: We have to flip the sign, since we swap operands sides
+      return -other.compareToLexicographic(value, 0, value.length);
+    }
+
+    @Override
+    public int compareToLexicographic(byte[] other, int otherOffset, int 
otherLength) {
+      return Binary.compareToLexicographic(this.value, 0, value.length, other, 
otherOffset, otherLength);
+    }
+
+    @Override
+    public int compareToLexicographic(ByteBuffer other, int otherOffset, int 
otherLength) {
+      return Binary.compareToLexicographic(this.value, 0, value.length, other, 
otherOffset, otherLength);
+    }

Review comment:
       These methods should be package private.

##########
File path: parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
##########
@@ -613,4 +681,60 @@ private static final boolean equals(byte[] array1, int 
offset1, int length1, byt
     }
     return true;
   }
+
+  // TODO java-doc
+  private static final int compareToLexicographic(byte[] array1, int offset1, 
int length1, byte[] array2, int offset2, int length2) {
+    if (array1 == null && array2 == null) return 0;
+    if (array1 == null || array2 == null) return array1 != null ? 1 : -1;
+
+    int minLen = Math.min(length1, length2);
+    for (int i = 0; i < minLen; i++) {
+      int res = unsignedCompare(array1[i + offset1], array2[i + offset2]);
+      if (res != 0) {
+        return res;
+      }
+    }
+
+    return length1 - length2;
+  }
+
+  // TODO java-doc
+  private static final int compareToLexicographic(byte[] array1, int offset1, 
int length1, ByteBuffer array2, int offset2, int length2) {
+    if (array1 == null && array2 == null) return 0;
+    if (array1 == null || array2 == null) return array1 != null ? 1 : -1;
+
+    int minLen = Math.min(length1, length2);
+    for (int i = 0; i < minLen; i++) {
+      int res = unsignedCompare(array1[i + offset1], array2.get(i + offset2));
+      if (res != 0) {
+        return res;
+      }
+    }
+
+    return length1 - length2;
+  }
+
+  // TODO java-doc
+  private static final int compareToLexicographic(ByteBuffer array1, int 
offset1, int length1, ByteBuffer array2, int offset2, int length2) {

Review comment:
       nit: I would suggest renaming `array1` to `buffer1` and `array2` to 
`buffer2`.

##########
File path: parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
##########
@@ -77,6 +77,12 @@ private Binary() { }
   @Deprecated
   abstract public int compareTo(Binary other);
 
+  abstract int compareToLexicographic(Binary other);
+
+  abstract int compareToLexicographic(byte[] other, int otherOffset, int 
otherLength);
+
+  abstract int compareToLexicographic(ByteBuffer other, int otherOffset, int 
otherLength);
+

Review comment:
       See the static method for better naming.

##########
File path: parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
##########
@@ -191,6 +197,22 @@ public int compareTo(Binary other) {
       return 
PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR.compare(this, 
other);
     }
 
+    @Override
+    public int compareToLexicographic(Binary other) {
+      // NOTE: We have to flip the sign, since we swap operands sides
+      return -other.compareToLexicographic(value, offset, length);
+    }
+
+    @Override
+    public int compareToLexicographic(byte[] other, int otherOffset, int 
otherLength) {
+      return Binary.compareToLexicographic(value, offset, length, other, 
otherOffset, otherLength);
+    }
+
+    @Override
+    public int compareToLexicographic(ByteBuffer other, int otherOffset, int 
otherLength) {
+      return Binary.compareToLexicographic(value, offset, length, other, 
otherOffset, otherLength);
+    }
+

Review comment:
       These methods should be package private.

##########
File path: parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
##########
@@ -542,6 +606,10 @@ public static Binary fromCharSequence(CharSequence value) {
     return new FromCharSequenceBinary(value);
   }
 
+  public static int compareToLexicographic(Binary one, Binary other) {

Review comment:
       I think, `lexicographicCompare` would be a better naming.

##########
File path: parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
##########
@@ -613,4 +681,60 @@ private static final boolean equals(byte[] array1, int 
offset1, int length1, byt
     }
     return true;
   }
+
+  // TODO java-doc
+  private static final int compareToLexicographic(byte[] array1, int offset1, 
int length1, byte[] array2, int offset2, int length2) {
+    if (array1 == null && array2 == null) return 0;
+    if (array1 == null || array2 == null) return array1 != null ? 1 : -1;
+
+    int minLen = Math.min(length1, length2);
+    for (int i = 0; i < minLen; i++) {
+      int res = unsignedCompare(array1[i + offset1], array2[i + offset2]);
+      if (res != 0) {
+        return res;
+      }
+    }
+
+    return length1 - length2;
+  }
+
+  // TODO java-doc
+  private static final int compareToLexicographic(byte[] array1, int offset1, 
int length1, ByteBuffer array2, int offset2, int length2) {

Review comment:
       nit: I would suggest renaming `array2` to `buffer` and `array1` to 
`array`.

##########
File path: parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java
##########
@@ -475,6 +513,32 @@ public int compareTo(Binary other) {
       return 
PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR.compare(this, 
other);
     }
 
+    @Override
+    public int compareToLexicographic(Binary other) {
+      if (value.hasArray()) {
+        // NOTE: We have to flip the sign, since we swap operands sides
+        return -other.compareToLexicographic(value.array(), 
value.arrayOffset() + offset, length);
+      } else {
+        // NOTE: We have to flip the sign, since we swap operands sides
+        return -other.compareToLexicographic(value, offset, length);
+      }
+    }
+
+    @Override
+    public int compareToLexicographic(byte[] other, int otherOffset, int 
otherLength) {
+      if (value.hasArray()) {
+        return Binary.compareToLexicographic(value.array(), 
value.arrayOffset() + offset, length, other, otherOffset, otherLength);
+      } else {
+        // NOTE: We have to flip the sign, since we swap operands sides
+        return -Binary.compareToLexicographic(other, otherOffset, otherLength, 
value, offset, length);
+      }
+    }
+
+    @Override
+    public int compareToLexicographic(ByteBuffer other, int otherOffset, int 
otherLength) {
+      return Binary.compareToLexicographic(value, offset, length, other, 
otherOffset, otherLength);
+    }

Review comment:
       These methods should be package private.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
> -------------------------------------------------------------------
>
>                 Key: PARQUET-2106
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2106
>             Project: Parquet
>          Issue Type: Task
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Alexey Kudinkin
>            Priority: Major
>         Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, 
> profile_48449_alloc_1638494450_sort_by.html
>
>
> *Background*
> While writing out large Parquet tables using Spark, we've noticed that 
> BinaryComparator is the source of substantial churn of extremely short-lived 
> `HeapByteBuffer` objects – It's taking up to *16%* of total amount of 
> allocations in our benchmarks, putting substantial pressure on a Garbage 
> Collector:
> !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
> [^profile_48449_alloc_1638494450_sort_by.html]
>  
> *Proposal*
> We're proposing to adjust lexicographical comparison (at least) to avoid 
> doing any allocations, since this code lies on the hot-path of every Parquet 
> write, therefore causing substantial churn amplification.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to