gaborkaszab commented on PR #14398: URL: https://github.com/apache/iceberg/pull/14398#issuecomment-3432253624
Let me give an initial insight into the different decisions I made during the implementation: **1) Client-side table cache** Entire table objects are cached on the client side. The table objects hold user level configs and credentials, so in order to not share these accross users, the client-side table cache is arranged on a per-sessionID basis. For simplicity I chose to use a composite key: (SessionID, TableIdentifier) for the cache. Later if there is a need, the cache can be enhanced to have 2 separate levels for the sessionID and the TableIdentifier instead of a composite key. With this we can have different cache eviction policies for the sessions (max number if entries, TTL) and for the tables. Also we could separate sessions not to evict tables related to other sessions. For a reference implementation see RESTTableCache [here](https://github.com/gaborkaszab/iceberg/commit/c82cb30dd25acfae6b8d6625431c7f1df9665d4c#diff-10e1b4c71384742d710e9d8a022b6344a799c830cdf8946f175e6ecb46e650b2). **2) Table objects returned by loadTable** Currently, each `RESTCatalog.loadTable()` call returns a different Table object. Changing this and returning 'shared' table objects similarly to CachingCatalog is a breaking change in my opinion. Take this use-case: One query could load the table for query planning, then another table also loads the table and commits some changes. In this case the table being held by the first query is changed in the meantime, and as it's a shared object it could mess up planning. This is why I chose not to return the cached Table objects directly, but to do a 'clone' object instead and return that. As a result, in case the client commits to the table, it won't be reflected in the cache and the next loadTable request to the server will result in a full table load, because the ETag stored on the client side is not the latest one. **3) Table commits** As a direct result of the above point the table commits through `RESTTableOperations` aren't reflected in the cache. As a future improvement we can enhance this, but I haven't found a very clean solution for this. What I experimented with is to provide a `loadTableCallback` to the `RESTTableOperations` that is called after getting the `LoadTableResult`. Setting up such a callback properly can update the cache in `RESTSessionCatalog`. For a reference implementation see `RESTTableOperations` and `RESTSessionCatalog.loadTableCallback()` [here](https://github.com/gaborkaszab/iceberg/commit/c82cb30dd25acfae6b8d6625431c7f1df9665d4c#diff-86450612dbe323d6d06cbc3846aa1913f042eaedadc0ca027c36bfbe08d3a46cR787). **4) SnapshotMode = REFS** When lazy snapshot loading is configured, loadTable will populate the cache with a table object that has partially loaded snapshot list. If the client calls snapshots() and loads the rest of the snapshots, this won't be reflected in the cache. As a future improvement the snapshotsSupplier provided to TableMetadata could be clever enough to refresh the cache too. I found thes too complicated to implement for the initial version. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
