gaborkaszab commented on PR #14398:
URL: https://github.com/apache/iceberg/pull/14398#issuecomment-3432253624

   Let me give an initial insight into the different decisions I made during 
the implementation:
   
   **1) Client-side table cache**
   
   Entire table objects are cached on the client side. The table objects hold 
user level configs and credentials, so in order to not share these accross 
users, the client-side table cache is arranged on a per-sessionID basis. For 
simplicity I chose to use a composite key: (SessionID, TableIdentifier) for the 
cache.
   
   Later if there is a need, the cache can be enhanced to have 2 separate 
levels for the sessionID and the TableIdentifier instead of a composite key. 
With this we can have different cache eviction policies for the sessions (max 
number if entries, TTL) and for the tables. Also we could separate sessions not 
to evict tables related to other sessions.
   For a reference implementation see RESTTableCache 
[here](https://github.com/gaborkaszab/iceberg/commit/c82cb30dd25acfae6b8d6625431c7f1df9665d4c#diff-10e1b4c71384742d710e9d8a022b6344a799c830cdf8946f175e6ecb46e650b2).
   
   **2) Table objects returned by loadTable**
   
   Currently, each `RESTCatalog.loadTable()` call returns a different Table 
object. Changing this and returning 'shared' table objects similarly to 
CachingCatalog is a breaking change in my opinion. Take this use-case: One 
query could load the table for query planning, then another table also loads 
the table and commits some changes. In this case the table being held by the 
first query is changed in the meantime, and as it's a shared object it could 
mess up planning.
   This is why I chose not to return the cached Table objects directly, but to 
do a 'clone' object instead and return that. As a result, in case the client 
commits to the table, it won't be reflected in the cache and the next loadTable 
request to the server will result in a full table load, because the ETag stored 
on the client side is not the latest one.
   
   **3) Table commits** 
   
   As a direct result of the above point the table commits through 
`RESTTableOperations` aren't reflected in the cache. As a future improvement we 
can enhance this, but I haven't found a very clean solution for this. What I 
experimented with is to provide a `loadTableCallback` to the 
`RESTTableOperations` that is called after getting the `LoadTableResult`. 
Setting up such a callback properly can update the cache in 
`RESTSessionCatalog`.
   For a reference implementation see `RESTTableOperations` and 
`RESTSessionCatalog.loadTableCallback()` 
[here](https://github.com/gaborkaszab/iceberg/commit/c82cb30dd25acfae6b8d6625431c7f1df9665d4c#diff-86450612dbe323d6d06cbc3846aa1913f042eaedadc0ca027c36bfbe08d3a46cR787).
   
   **4) SnapshotMode = REFS**
   
   When lazy snapshot loading is configured, loadTable will populate the cache 
with a table object that has partially loaded snapshot list. If the client 
calls snapshots() and loads the rest of the snapshots, this won't be reflected 
in the cache. As a future improvement the snapshotsSupplier provided to 
TableMetadata could be clever enough to refresh the cache too. I found thes too 
complicated to implement for the initial version.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to