szehon-ho commented on a change in pull request #3273:
URL: https://github.com/apache/iceberg/pull/3273#discussion_r732122829
##########
File path: core/src/main/java/org/apache/iceberg/DataFiles.java
##########
@@ -285,7 +285,12 @@ public DataFile build() {
}
Preconditions.checkArgument(format != null, "File format is required");
Preconditions.checkArgument(fileSizeInBytes >= 0, "File size is required");
- Preconditions.checkArgument(recordCount >= 0, "Record count is required");
+ Preconditions.checkArgument(recordCount != null, "Record count is required");
+ // MetricsEvaluator skips using other metrics, if record count is -1
+ Preconditions.checkArgument(recordCount >= 0 ||
+     (recordCount == -1 && valueCounts == null && columnSizes == null && nanValueCounts == null &&
+     lowerBounds == null && upperBounds == null),
+     "Metrics cannot be set if record count is -1.");
Review comment:
I took @rdblue's suggestion and made an attempt to use the AvroIO method
to get the row count, which internally just visits each block once. A potential
follow-up would be to make this (and even the Parquet/ORC footer reading) into
distributed Spark jobs. Added a test.
Need to rebase following the Spark directory refactor.
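For reviewers skimming the diff, the new validation rule can be sketched as a standalone check. This is a minimal sketch, not the actual patch: the class and method names are hypothetical, and it throws a plain `IllegalArgumentException` where the real code uses Guava's `Preconditions`. The rule being illustrated: a record count of -1 means "unknown", and `MetricsEvaluator` then skips the other metrics, so none of them may be set in that case.

```java
// Hypothetical standalone sketch of the validation rule in the diff above.
final class RecordCountCheck {
    static void validate(Long recordCount, Object valueCounts, Object columnSizes,
                         Object nanValueCounts, Object lowerBounds, Object upperBounds) {
        if (recordCount == null) {
            throw new IllegalArgumentException("Record count is required");
        }
        // -1 signals "row count unknown"; MetricsEvaluator then ignores the other
        // metrics, so accepting them here would silently hide them from pruning.
        boolean noMetrics = valueCounts == null && columnSizes == null
                && nanValueCounts == null && lowerBounds == null && upperBounds == null;
        if (!(recordCount >= 0 || (recordCount == -1 && noMetrics))) {
            throw new IllegalArgumentException("Metrics cannot be set if record count is -1.");
        }
    }
}
```

A non-negative count passes with or without metrics; -1 passes only when every metric field is null.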
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]