nastra commented on code in PR #13695:
URL: https://github.com/apache/iceberg/pull/13695#discussion_r2244705118
##########
core/src/main/java/org/apache/iceberg/DataFiles.java:
##########
@@ -151,6 +152,7 @@ public static class Builder {
private Map<Integer, Long> nanValueCounts = null;
private Map<Integer, ByteBuffer> lowerBounds = null;
private Map<Integer, ByteBuffer> upperBounds = null;
+ private Map<Integer, Type> originalTypes = null;
Review Comment:
> So I think the principle here is that anywhere we need to build new
`DataFile` or `DeleteFile` metadata we need to make sure that the metrics
original types information is copied over so that it can be used when
converting to the future structure. The only place that really needs read
exposure to the `originalTypes` is the Metrics class itself.
Yes that's absolutely correct. The main use case is that we want to convert
from a `Metrics` instance to a Stats instance when e.g. an appender/writer
creates a new `Metrics` instance and before those metrics are written to disk.
Also we don't want to store `originalTypes` on disk, since that would require a
spec change and we wouldn't have that information anyway when reading existing
data files.
> But I think I agree with @pvary, I do think the field does need to be
copied over (we don't need to expose it neccesarily). I think the field needs
to be copied over for the DeleteFile case since DeleteFile extends BaseFile and
we could have stats for those that we would want to convert
I took another look at this but I'm still not sure why we would want to
carry forward the `originalTypes` in `BaseFile`. There's really only a single
use case I could find in the codebase that would require exposing
`originalTypes()` in order to carry it forward and that is in
`RewriteTablePathUtil` when a delete manifest is rewritten. The metrics of that
`DeleteFile` are carried forward in
https://github.com/apache/iceberg/blob/62d9ff5d043a5571efe020b9177998ae763a41a0/core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java#L491-L492
which might indicate that we would need to preserve `originalTypes` there as
well, but the `DeleteFile` was read from the delete manifest at that point and
we don't actually store `originalTypes` on disk, meaning that it would be
`null` anyway.
That being said, the `Metrics` class is our main surface area where we want
to preserve `originalTypes`. Once we keep individual metrics fields and don't
create a new `Metrics` instance again from those fields (which is the case with
`BaseFile` - except `RewriteTablePathUtil`), we shouldn't need to preserve
`originalTypes` anymore.
@pvary @amogh-jahagirdar let me know if that makes sense or whether I'm
missing something obvious
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]