nastra commented on code in PR #13695:
URL: https://github.com/apache/iceberg/pull/13695#discussion_r2244705118


##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -151,6 +152,7 @@ public static class Builder {
     private Map<Integer, Long> nanValueCounts = null;
     private Map<Integer, ByteBuffer> lowerBounds = null;
     private Map<Integer, ByteBuffer> upperBounds = null;
+    private Map<Integer, Type> originalTypes = null;

Review Comment:
   > So I think the principle here is that anywhere we need to build new 
`DataFile` or `DeleteFile` metadata we need to make sure that the metrics 
original types information is copied over so that it can be used when 
converting to the future structure. The only place that really needs read 
exposure to the `originalTypes` is the Metrics class itself.
   
   Yes that's absolutely correct. The main use case is that we want to convert 
from a `Metrics` instance to a Stats instance when e.g. an appender/writer 
creates a new `Metrics` instance and before those metrics are written to disk. 
Also we don't want to store `originalTypes` on disk, since that would require a 
spec change and we wouldn't have that information anyway when reading existing 
data files. 
   
   > But I think I agree with @pvary, I do think the field does need to be 
copied over (we don't need to expose it neccesarily). I think the field needs 
to be copied over for the DeleteFile case since DeleteFile extends BaseFile and 
we could have stats for those that we would want to convert
   
   I took another look at this but I'm still not sure why we would want to 
carry forward the `originalTypes` in `BaseFile`. There's really only a single 
use case I could find in the codebase that would require exposing 
`originalTypes()` in order to carry it forward and that is in 
`RewriteTablePathUtil` when a delete manifest is rewritten. The metrics of that 
`DeleteFile` are carried forward in 
https://github.com/apache/iceberg/blob/62d9ff5d043a5571efe020b9177998ae763a41a0/core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java#L491-L492
 which might indicate that we would need to preserve `originalTypes` there as 
well, but the `DeleteFile` was read from the delete manifest at that point and 
we don't actually store `originalTypes` on disk, meaning that it would be 
`null` anyway.
   
   
   That being said, the `Metrics` class is our main surface area where we want 
to preserve `originalTypes`. Once we keep individual metrics fields and don't 
create a new `Metrics` instance again from those fields (which is the case with 
`BaseFile` - except `RewriteTablePathUtil`), we shouldn't need to preserve 
`originalTypes` anymore.
   
   @pvary @amogh-jahagirdar let me know if that makes sense or whether I'm 
missing something obvious
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to