HonahX commented on code in PR #790:
URL: https://github.com/apache/iceberg-python/pull/790#discussion_r1632797941
##########
pyiceberg/io/pyarrow.py:
##########
@@ -912,6 +916,9 @@ def primitive(self, primitive: pa.DataType) ->
PrimitiveType:
return TimestamptzType()
elif primitive.tz is None:
return TimestampType()
+ if primitive.unit == "ns":
+ if primitive.tz == "UTC":
+ return TimestamptzType()
Review Comment:
nanosecond timestamp is added in [version
3](https://iceberg.apache.org/spec/#version-3), which is still under
development and not formally adopted. Pyiceberg does not support nanosecond yet
so I think we should not add the conversion here. (and it should be converted
to a separate `TimestampNanoType` in the future:
https://iceberg.apache.org/spec/#primitive-types )
I think you add this because [arrow reads ORC timestamp as
nanoseconds](https://arrow.apache.org/docs/cpp/orc.html) since ORC's timestamp
types always contain nanosecond information, but we want to read this as
microsecond.
It seems Java side currently just [treat ORC's timestamp type as `us` unit
one](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/orc/src/main/java/org/apache/iceberg/orc/OrcToIcebergVisitor.java#L170-L177)
We could probably fix this at arrow schema level. For example, we can add an
additional conversion for `physical_schema` here to change the unit of arrow
timestamp from `ns` to `us`:
https://github.com/apache/iceberg-python/blob/f782c548df31386ba19a15a33ee53a5094886506/pyiceberg/io/pyarrow.py#L987-L989
Changing the physical schema also ensures that the actual timestamp data is
read with `us` unit as required by `TimestampType`
https://github.com/apache/iceberg-python/blob/f782c548df31386ba19a15a33ee53a5094886506/pyiceberg/io/pyarrow.py#L1002-L1008
However, I also feel it is not the ultimate solution because we assume the
unit is microsecond. When `TimestampNanoType` is in, we may need to do some
additional steps to ensure we reads the data using the correct unit.
@MehulBatra @Fokko Would love to hear your thoughts on this. Please correct
me if I make any mistakes about the ORC's behavior.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]