Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

via GitHub Thu, 09 Nov 2023 18:45:03 -0800


aokolnychyi commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1388756877



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
    */
   F copyWithoutStats();
 
+  /**
+   * Copies this file with only specific column stats. Manifest readers can 
reuse file instances;

Review Comment:
   Optional: I think `with only specific column stats` may be interpreted as 
with some types of metrics, rather than with all types of metrics but for 
specific columns.
   
   I'd consider something like this but it is up to you.
   
   ```
   Copies this file with column stats only for specific columns. Manifest 
readers can reuse file
   instances; use this method to copy data with stats only for specific columns 
when collecting
   files.
   ```



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
    */
   F copyWithoutStats();
 
+  /**
+   * Copies this file with only specific column stats. Manifest readers can 
reuse file instances;
+   * use this method to copy data and only copy specific stats when collecting 
files.
+   *
+   * @param requestedColumnIds column ids for which to keep stats. If 
<code>null</code> then every

Review Comment:
   Minor: `column ids` -> `column IDs` to follow the existing Javadoc in this 
class.



##########
core/src/main/java/org/apache/iceberg/BaseScan.java:
##########
@@ -165,6 +170,19 @@ public ThisT includeColumnStats() {
     return newRefinedScan(table, schema, 
context.shouldReturnColumnStats(true));
   }
 
+  @Override
+  public ThisT includeColumnStats(Collection<String> requestedColumns) {
+    return newRefinedScan(
+        table,
+        schema,
+        context
+            .shouldReturnColumnStats(true)

Review Comment:
   Supporting both `shouldReturnColumnStats()` and `columnsToKeepStats()` will 
only be needed if we decide to go with the utility for copying. If `null` is a 
valid value in `ContentFile$copyWithStats`, I think having just a set of field 
IDs to keep stats for would be sufficient.



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -185,13 +190,25 @@ public PartitionData copy() {
     this.partitionType = toCopy.partitionType;
     this.recordCount = toCopy.recordCount;
     this.fileSizeInBytes = toCopy.fileSizeInBytes;
-    if (fullCopy) {
-      this.columnSizes = SerializableMap.copyOf(toCopy.columnSizes);
-      this.valueCounts = SerializableMap.copyOf(toCopy.valueCounts);
-      this.nullValueCounts = SerializableMap.copyOf(toCopy.nullValueCounts);
-      this.nanValueCounts = SerializableMap.copyOf(toCopy.nanValueCounts);
-      this.lowerBounds = 
SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.lowerBounds));
-      this.upperBounds = 
SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.upperBounds));
+    if (copyStats) {

Review Comment:
   In my view, this piece should not use streams, closures and must do as few 
extra copies as possible. It is called in a tight loop and our previous 
profiling showed that all this cost adds up at scale.
   
   I would suggest extending `SerializableMap` with `filteredCopyOf`.
   
   ```
   public static <K, V> SerializableMap<K, V> filteredCopyOf(Map<K, V> map, 
Set<K> keys) {
     return map == null ? null : new SerializableMap<>(map, keys);
   }
   
   private SerializableMap(Map<K, V> map, Set<K> keys) {
     Map<K, V> filteredMap = Maps.newHashMapWithExpectedSize(keys.size());
   
     for (K key : keys) {
       V value = map.get(key);
       if (value != null) {
         filteredMap.put(key, value);
       }
     }
   
     this.copiedMap = filteredMap;
   }
   ```
   
   Then we can define some extra helper methods in `BaseFile` to have 
reasonable formatting.
   
   ```
   private static <K, V> Map<K, V> copyMap(Map<K, V> map, Set<K> keys) {
     if (keys == null) {
       return SerializableMap.copyOf(map);
     } else {
       return SerializableMap.filteredCopyOf(map, keys);
     }
   }
   
   private static Map<Integer, ByteBuffer> copyByteBufferMap(
       Map<Integer, ByteBuffer> map, Set<Integer> keys) {
     return SerializableByteBufferMap.wrap(copyMap(map, keys));
   }
   ```
   
   I understand we are trying to be generic here but I think the performance is 
critical.



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
    */
   F copyWithoutStats();
 
+  /**
+   * Copies this file with only specific column stats. Manifest readers can 
reuse file instances;
+   * use this method to copy data and only copy specific stats when collecting 
files.
+   *
+   * @param requestedColumnIds column ids for which to keep stats. If 
<code>null</code> then every
+   *     column stat is kept.
+   * @return a copy of this data file, with stats lower bounds, upper bounds, 
value counts, null

Review Comment:
   Minor: `a copy of this data file` -> `a copy of this file`, it is not 
necessary a data file anymore. We are gradually fixing the docs.



##########
api/src/main/java/org/apache/iceberg/Scan.java:
##########
@@ -77,6 +77,21 @@ public interface Scan<ThisT, T extends ScanTask, G extends 
ScanTaskGroup<T>> {
    */
   ThisT includeColumnStats();
 
+  /**
+   * Create a new scan from this that loads the column stats for the specific 
columns with each data
+   * file.
+   *
+   * <p>Column stats include: value count, null value count, lower bounds, and 
upper bounds.
+   *
+   * @param requestedColumns column names for which to keep the stats. If 
<code>null</code> then all

Review Comment:
   I think we should not support null here as a valid value independently of 
what we will do in `ContentFile`. I would drop the doc. 



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
    */
   F copyWithoutStats();
 
+  /**
+   * Copies this file with only specific column stats. Manifest readers can 
reuse file instances;
+   * use this method to copy data and only copy specific stats when collecting 
files.
+   *
+   * @param requestedColumnIds column ids for which to keep stats. If 
<code>null</code> then every
+   *     column stat is kept.
+   * @return a copy of this data file, with stats lower bounds, upper bounds, 
value counts, null

Review Comment:
   Question: Should `with stats lower bounds, ...` be `with lower bounds, ...`?



##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
    */
   F copyWithoutStats();
 
+  /**
+   * Copies this file with only specific column stats. Manifest readers can 
reuse file instances;
+   * use this method to copy data and only copy specific stats when collecting 
files.
+   *
+   * @param requestedColumnIds column ids for which to keep stats. If 
<code>null</code> then every
+   *     column stat is kept.
+   * @return a copy of this data file, with stats lower bounds, upper bounds, 
value counts, null
+   *     value counts, and nan value counts for only specific columns.
+   */
+  default F copyWithStats(Set<Integer> requestedColumnIds) {

Review Comment:
   I spend a bit more time on this one. I do think we either drop support for 
`null` as a valid value here or drop the utility to copy. It is up to you which 
approach to pick.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

Reply via email to