wooyeong commented on code in PR #9455:
URL: https://github.com/apache/iceberg/pull/9455#discussion_r1475886118
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -405,15 +407,18 @@ public boolean equals(Object other) {
return false;
}
- // use only name in order to correctly invalidate Spark cache
+ // use name only unless branch/snapshotId is given in order to correctly
invalidate Spark cache
+ // when branch or snapshotId is given, it's time travel
SparkTable that = (SparkTable) other;
- return icebergTable.name().equals(that.icebergTable.name());
+ return icebergTable.name().equals(that.icebergTable.name())
+ && Objects.equals(snapshotId, that.snapshotId);
Review Comment:
I've organized my thoughts.
Currently, we are working on an issue that arises when comparing two
different `SparkTable` objects.
A `SparkTable` can be initialized in three ways:
- `main` branch (when neither `branch` nor `snapshotId` is provided)
- Specific branch (when a `branch` is provided)
- Specific snapshot (when a `snapshotId` is provided)
When a snapshotId is provided, since the snapshotId is immutable, we can
simply compare the two snapshots. In other cases, it's when a branch is
provided. (`main` (where `branch` is null) or specific branch is given)
If the branches of the two SparkTables are different, I believe it is
correct to consider them as different objects. This is because if a branch is
specified, the user has intentionally created distinct table objects, and that
intention should be respected. Therefore, using the name of the branch to
compare is reasonable.
I've looked into the impact of CachingCatalog and refreshEagerly.
There is no issue when using CachingCatalog. When a branch is used, we
obtain the `main` branch SparkTable and then use the `copyWithBranch` method to
create a separate SparkTable object. The same applies when a snapshotId is used.
```sql
-- test sql
SELECT * FROM iceberg_except_test
VERSION AS OF '2023-01-01'
UNION ALL
SELECT * FROM iceberg_except_test;
-- injected logs
SparkCatalog.loadTable(ident, version): identifier: iceberg_except_test,
ver: 2023-01-01
SparkCatalog.loadTable(ident): identifier: iceberg_except_test
SparkCatalog.load: identifier: iceberg_except_test
CachingCatalog.load: identifier: iceberg_except_test
-> SparkTable -> SparkTable#copyWithSnapshotId
SparkCatalog.loadTable(ident): identifier: iceberg_except_test
SparkCatalog.load: identifier: iceberg_except_test
CachingCatalog.load: identifier: iceberg_except_test
-> SparkTable
```
Lastly, `refreshEagerly` is only valid when using Cache. If Cache is not
used, `refreshEagerly` becomes false, and a refresh occurs. I don't think it's
necessary to check whether cache is being used when verifying the equality of
two SparkTables. This is because even in the case of simply performing a
SELECT, the results may vary depending on the timing, which is an expected
outcome. Furthermore, the current Iceberg code does not consider refreshEagerly
when comparing two SparkTables, so it won't be worse than it currently is.
Therefore, to summarize my argument:
- It is sufficient to compare the name, branch, and snapshotId.
- Remove the current implementation of extracting snapshotId from the
branch in this PR.
- When using CachingCatalog, there is no impact because a separate instance
is created using the copyWithXX method.
- We don't need to consider the impact of the refreshEagerly option
I look forward to hearing your feedback. Thanks.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]