rdblue commented on code in PR #11775:
URL: https://github.com/apache/iceberg/pull/11775#discussion_r2114857389
##########
api/src/main/java/org/apache/iceberg/expressions/Literals.java:
##########
@@ -300,8 +300,7 @@ public <T> Literal<T> to(Type type) {
       case TIMESTAMP:
         return (Literal<T>) new TimestampLiteral(value());
       case TIMESTAMP_NANO:
-        // assume micros and convert to nanos to match the behavior in the timestamp case above
-        return new TimestampLiteral(value()).to(type);
+        return (Literal<T>) new TimestampNanoLiteral(value());
Review Comment:
> I understand it was done for compatibility with Spark

I think this is a little inaccurate. The cast from long to a microsecond
timestamp was done for compatibility with Spark. But making the
long-to-nanosecond-timestamp cast interpret the value as microseconds was done
for consistency within the Iceberg API.
The problem is: how should the API interpret a value when we don't know what
unit the value is in? Originally, we didn't convert long to timestamp because
of this problem. There is no way for the API to distinguish between these cases:
```java
Expressions.lessThan("ts", System.currentTimeMillis())
```
```java
Expressions.lessThan("ts", System.currentTimeMillis() * MILLIS_TO_MICROS)
```
In order to support Spark, we decided that long values should be interpreted
as microseconds. Then, when adding a `timestamp(9)` type, we followed that
decision for consistency. What I mean by consistency is that **an expression
behaves the same way regardless of binding type**.
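To make that concrete, here is a minimal sketch using the `Literal` API
(assuming the `Types.TimestampNanoType` added by the nanos work); the
`timestamp(9)` conversion shown is the consistent, microseconds-in behavior
this comment argues for:
```java
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;

// 2022-07-26T12:13:14.123 UTC expressed as microseconds since the epoch
long micros = 1_658_837_594_123_000L;

// binding to timestamp(6): the long is interpreted as microseconds
Literal<Long> tsMicros = Literal.of(micros).to(Types.TimestampType.withoutZone());

// binding to timestamp(9) under the consistent interpretation: the long is
// still read as microseconds and scaled up, so the value becomes micros * 1000
Literal<Long> tsNanos = Literal.of(micros).to(Types.TimestampNanoType.withoutZone());
```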
Think of an expression that uses a string:
```java
Expressions.lessThan("ts", "2022-07-26T12:13:14.123") // ts is timestamp(6)
```
```java
Expressions.lessThan("ts", "2022-07-26T12:13:14.123") // ts is timestamp(9)
```
Both of those expressions select the same rows whether the literal is
converted to micros or to nanos. What I mean is that they both filter based on
the same point in time, regardless of the precision of the column the
expression is eventually bound to. If you used that expression to filter two
different tables with different `ts` precisions, you would get essentially the
same results.
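A sketch of that string case (values assume UTC and the string-to-timestamp
conversions in `Literal`):
```java
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;

Literal<CharSequence> ts = Literal.of("2022-07-26T12:13:14.123");

// bound to timestamp(6): 1_658_837_594_123_000 microseconds since the epoch
Literal<Long> micros = ts.to(Types.TimestampType.withoutZone());

// bound to timestamp(9): 1_658_837_594_123_000_000 nanoseconds, the same instant
Literal<Long> nanos = ts.to(Types.TimestampNanoType.withoutZone());
```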
Then the question is whether you should get the same result when passing a
long value instead:
```java
long boundaryValue = parseTimestampMicros("2022-07-26T12:13:14.123");
Expressions.lessThan("ts", boundaryValue) // ts is timestamp(6)
```
```java
Expressions.lessThan("ts", boundaryValue) // ts is timestamp(9)
```
I think that you should get the same behavior from an identical expression,
just like you do when binding a string timestamp to a different type. The
alternative is surprising, at least to me: when selecting from two tables that
differ only in timestamp precision (the nanosecond values simply end in 000),
the results of the same expression are completely different. That means that,
in some cases, you need to know the actual precision of the column just to
construct an expression.
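To put numbers on how different the results would be if a bare long bound to
`timestamp(9)` were read as nanoseconds:
```java
// micros for 2022-07-26T12:13:14.123 UTC
long boundaryValue = 1_658_837_594_123_000L;

// bound to timestamp(6), read as micros: the boundary is 2022-07-26T12:13:14.123
// bound to timestamp(9), read as nanos:  the same long is only ~19 days after
// the epoch (about 1970-01-20T04:47:17.594), so lessThan("ts", boundaryValue)
// selects a completely different set of rows from each table
```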
I think that the right solution is to add public methods to construct
timestamp literals explicitly from nanos and from micros. It also looks like
we need to dig into expression binding, because the stack trace above is
concerning.
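As a sketch only, those factory methods could look something like this if
added inside `Literals` (the names are illustrative, not an agreed API):
```java
// hypothetical additions to org.apache.iceberg.expressions.Literals
static Literal<Long> timestampFromMicros(long micros) {
  // the caller states the unit explicitly: the long is microseconds
  return new TimestampLiteral(micros);
}

static Literal<Long> timestampFromNanos(long nanos) {
  // the long is nanoseconds, with no implicit microsecond assumption
  return new TimestampNanoLiteral(nanos);
}
```
Callers would then state the unit in the expression itself, so binding to
either precision is unambiguous.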