rdblue commented on code in PR #11775:
URL: https://github.com/apache/iceberg/pull/11775#discussion_r2114857389
##########
api/src/main/java/org/apache/iceberg/expressions/Literals.java:
##########
@@ -300,8 +300,7 @@ public <T> Literal<T> to(Type type) {
       case TIMESTAMP:
         return (Literal<T>) new TimestampLiteral(value());
       case TIMESTAMP_NANO:
-        // assume micros and convert to nanos to match the behavior in the timestamp case above
-        return new TimestampLiteral(value()).to(type);
+        return (Literal<T>) new TimestampNanoLiteral(value());
Review Comment:
> I understand it was done for compatibility with Spark

I think this is a little inaccurate. The cast from long to a microsecond
timestamp was done for compatibility with Spark. But making the
long-to-nanosecond-timestamp cast interpret the value as microseconds was done
for consistency within the Iceberg API.
The problem is: how should the API interpret a value when we don't know what
unit the value is in? Originally, we didn't convert long to timestamp because
of this problem. There is no way for the API to distinguish between these cases:
```java
Expressions.lessThan("ts", System.currentTimeMillis())
```
```java
Expressions.lessThan("ts", System.currentTimeMillis() * MILLIS_TO_MICROS)
```
In order to support Spark, we decided that long values should be interpreted
as microseconds. Then, when adding a `timestamp(9)` type, we followed that
decision for consistency. What I mean by consistency is that **an expression
behaves the same way regardless of binding type**.
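To make that concrete, here is a minimal sketch using the `Literal` API
(assuming the `Types.TimestampNanoType` added by the nanos work); the
`timestamp(9)` conversion shown is the consistent, microseconds-in behavior
this comment argues for:
```java
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;

// 2022-07-26T12:13:14.123 UTC expressed as microseconds since the epoch
long micros = 1_658_837_594_123_000L;

// binding to timestamp(6): the long is interpreted as microseconds
Literal<Long> tsMicros = Literal.of(micros).to(Types.TimestampType.withoutZone());

// binding to timestamp(9) under the consistent interpretation: the long is
// still read as microseconds and scaled up, so the value becomes micros * 1000
Literal<Long> tsNanos = Literal.of(micros).to(Types.TimestampNanoType.withoutZone());
```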
Think of an expression that uses a string:
```java
Expressions.lessThan("ts", "2022-07-26T12:13:14.123") // ts is timestamp(6)
```
```java
Expressions.lessThan("ts", "2022-07-26T12:13:14.123") // ts is timestamp(9)
```
Both of those expressions select the same rows whether the literal is
converted to micros or to nanos. What I mean is that they both filter based on
the same point in time, regardless of the precision of the column the
expression is eventually bound to. If you used that expression to filter two
different tables with different `ts` precisions, you would get essentially the
same results.
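A sketch of that string case (values assume UTC and the string-to-timestamp
conversions in `Literal`):
```java
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;

Literal<CharSequence> ts = Literal.of("2022-07-26T12:13:14.123");

// bound to timestamp(6): 1_658_837_594_123_000 microseconds since the epoch
Literal<Long> micros = ts.to(Types.TimestampType.withoutZone());

// bound to timestamp(9): 1_658_837_594_123_000_000 nanoseconds, the same instant
Literal<Long> nanos = ts.to(Types.TimestampNanoType.withoutZone());
```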
Then the question is whether you should get the same result when passing a
long value instead:
```java
long boundaryValue = parseTimestampMicros("2022-07-26T12:13:14.123");
Expressions.lessThan("ts", boundaryValue) // ts is timestamp(6)
```
```java
Expressions.lessThan("ts", boundaryValue) // ts is timestamp(9)
```
I think that you should get the same behavior from an identical expression,
just like you do when binding a string timestamp to a different type. The
alternative is surprising, at least to me: when selecting from two tables that
differ only in timestamp precision (the nanosecond values simply end in 000),
the results of the same expression are completely different. That means that,
in some cases, you need to know the actual precision of the column just to
construct an expression.
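To put numbers on how different the results would be if a bare long bound to
`timestamp(9)` were read as nanoseconds:
```java
// micros for 2022-07-26T12:13:14.123 UTC
long boundaryValue = 1_658_837_594_123_000L;

// bound to timestamp(6), read as micros: the boundary is 2022-07-26T12:13:14.123
// bound to timestamp(9), read as nanos:  the same long is only ~19 days after
// the epoch (about 1970-01-20T04:47:17.594), so lessThan("ts", boundaryValue)
// selects a completely different set of rows from each table
```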
I think that the right solution is to add public methods to construct
timestamp literals explicitly from nanos and from micros. It also looks like
we need to dig into expression binding, because the stack trace above is
concerning.
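As a sketch only, those factory methods could look something like this if
added inside `Literals` (the names are illustrative, not an agreed API):
```java
// hypothetical additions to org.apache.iceberg.expressions.Literals
static Literal<Long> timestampFromMicros(long micros) {
  // the caller states the unit explicitly: the long is microseconds
  return new TimestampLiteral(micros);
}

static Literal<Long> timestampFromNanos(long nanos) {
  // the long is nanoseconds, with no implicit microsecond assumption
  return new TimestampNanoLiteral(nanos);
}
```
Callers would then state the unit in the expression itself, so binding to
either precision is unambiguous.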