aokolnychyi commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1388756877
##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
*/
F copyWithoutStats();
+ /**
+ * Copies this file with only specific column stats. Manifest readers can
reuse file instances;
Review Comment:
Optional: I think `with only specific column stats` may be interpreted as
with some types of metrics, rather than with all types of metrics but for
specific columns.
I'd consider something like this but it is up to you.
```
Copies this file with column stats only for specific columns. Manifest
readers can reuse file
instances; use this method to copy data with stats only for specific columns
when collecting
files.
```
##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
*/
F copyWithoutStats();
+ /**
+ * Copies this file with only specific column stats. Manifest readers can
reuse file instances;
+ * use this method to copy data and only copy specific stats when collecting
files.
+ *
+ * @param requestedColumnIds column ids for which to keep stats. If
<code>null</code> then every
Review Comment:
Minor: `column ids` -> `column IDs` to follow the existing Javadoc in this
class.
##########
core/src/main/java/org/apache/iceberg/BaseScan.java:
##########
@@ -165,6 +170,19 @@ public ThisT includeColumnStats() {
return newRefinedScan(table, schema,
context.shouldReturnColumnStats(true));
}
+ @Override
+ public ThisT includeColumnStats(Collection<String> requestedColumns) {
+ return newRefinedScan(
+ table,
+ schema,
+ context
+ .shouldReturnColumnStats(true)
Review Comment:
Supporting both `shouldReturnColumnStats()` and `columnsToKeepStats()` will
only be needed if we decide to go with the utility for copying. If `null` is a
valid value in `ContentFile$copyWithStats`, I think having just a set of field
IDs to keep stats for would be sufficient.
##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -185,13 +190,25 @@ public PartitionData copy() {
this.partitionType = toCopy.partitionType;
this.recordCount = toCopy.recordCount;
this.fileSizeInBytes = toCopy.fileSizeInBytes;
- if (fullCopy) {
- this.columnSizes = SerializableMap.copyOf(toCopy.columnSizes);
- this.valueCounts = SerializableMap.copyOf(toCopy.valueCounts);
- this.nullValueCounts = SerializableMap.copyOf(toCopy.nullValueCounts);
- this.nanValueCounts = SerializableMap.copyOf(toCopy.nanValueCounts);
- this.lowerBounds =
SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.lowerBounds));
- this.upperBounds =
SerializableByteBufferMap.wrap(SerializableMap.copyOf(toCopy.upperBounds));
+ if (copyStats) {
Review Comment:
In my view, this piece should not use streams, closures and must do as few
extra copies as possible. It is called in a tight loop and our previous
profiling showed that all this cost adds up at scale.
I would suggest extending `SerializableMap` with `filteredCopyOf`.
```
public static <K, V> SerializableMap<K, V> filteredCopyOf(Map<K, V> map,
Set<K> keys) {
return map == null ? null : new SerializableMap<>(map, keys);
}
private SerializableMap(Map<K, V> map, Set<K> keys) {
Map<K, V> filteredMap = Maps.newHashMapWithExpectedSize(keys.size());
for (K key : keys) {
V value = map.get(key);
if (value != null) {
filteredMap.put(key, value);
}
}
this.copiedMap = filteredMap;
}
```
Then we can define some extra helper methods in `BaseFile` to have
reasonable formatting.
```
private static <K, V> Map<K, V> copyMap(Map<K, V> map, Set<K> keys) {
if (keys == null) {
return SerializableMap.copyOf(map);
} else {
return SerializableMap.filteredCopyOf(map, keys);
}
}
private static Map<Integer, ByteBuffer> copyByteBufferMap(
Map<Integer, ByteBuffer> map, Set<Integer> keys) {
return SerializableByteBufferMap.wrap(copyMap(map, keys));
}
```
I understand we are trying to be generic here but I think the performance is
critical.
##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
*/
F copyWithoutStats();
+ /**
+ * Copies this file with only specific column stats. Manifest readers can
reuse file instances;
+ * use this method to copy data and only copy specific stats when collecting
files.
+ *
+ * @param requestedColumnIds column ids for which to keep stats. If
<code>null</code> then every
+ * column stat is kept.
+ * @return a copy of this data file, with stats lower bounds, upper bounds,
value counts, null
Review Comment:
Minor: `a copy of this data file` -> `a copy of this file`, it is not
necessary a data file anymore. We are gradually fixing the docs.
##########
api/src/main/java/org/apache/iceberg/Scan.java:
##########
@@ -77,6 +77,21 @@ public interface Scan<ThisT, T extends ScanTask, G extends
ScanTaskGroup<T>> {
*/
ThisT includeColumnStats();
+ /**
+ * Create a new scan from this that loads the column stats for the specific
columns with each data
+ * file.
+ *
+ * <p>Column stats include: value count, null value count, lower bounds, and
upper bounds.
+ *
+ * @param requestedColumns column names for which to keep the stats. If
<code>null</code> then all
Review Comment:
I think we should not support null here as a valid value independently of
what we will do in `ContentFile`. I would drop the doc.
##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
*/
F copyWithoutStats();
+ /**
+ * Copies this file with only specific column stats. Manifest readers can
reuse file instances;
+ * use this method to copy data and only copy specific stats when collecting
files.
+ *
+ * @param requestedColumnIds column ids for which to keep stats. If
<code>null</code> then every
+ * column stat is kept.
+ * @return a copy of this data file, with stats lower bounds, upper bounds,
value counts, null
Review Comment:
Question: Should `with stats lower bounds, ...` be `with lower bounds, ...`?
##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -165,6 +166,20 @@ default Long fileSequenceNumber() {
*/
F copyWithoutStats();
+ /**
+ * Copies this file with only specific column stats. Manifest readers can
reuse file instances;
+ * use this method to copy data and only copy specific stats when collecting
files.
+ *
+ * @param requestedColumnIds column ids for which to keep stats. If
<code>null</code> then every
+ * column stat is kept.
+ * @return a copy of this data file, with stats lower bounds, upper bounds,
value counts, null
+ * value counts, and nan value counts for only specific columns.
+ */
+ default F copyWithStats(Set<Integer> requestedColumnIds) {
Review Comment:
I spend a bit more time on this one. I do think we either drop support for
`null` as a valid value here or drop the utility to copy. It is up to you which
approach to pick.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]