fapifta commented on PR #6992: URL: https://github.com/apache/ozone/pull/6992#issuecomment-2256243165
Hi @tanvipenumudy thank you for your work on this. I have some general doubts, let me share them. When I moved off the getBucket call in [HDDS-11235](https://issues.apache.org/jira/browse/HDDS-11235) and [HDDS-11232](https://issues.apache.org/jira/browse/HDDS-11232), most of the tests failed, because the semantics changed. Here I don't see any test failures, however the semantics are changing as well, and that makes me nervous about the cache itself... Let's walk through a few cases:

- If the client initializes a FileSystem object, makes a call, and closes the object, then nothing changes, other than that we have one cache entry that is used only once before the cache is removed. (This is completely fine.)
- If the client initializes a FileSystem object and issues subsequent calls that should succeed, and those calls do not manipulate the bucket(s) after they are created and cached, then things are fine.
- If the client initializes a FileSystem object and issues subsequent calls that should succeed, but one of them modifies a bucket (e.g. changes a property), then the change goes undetected from the client's point of view unless it causes a subsequent call to fail. (I am not sure there is such a scenario at the moment, though.)
- If the client initializes a FileSystem object and issues subsequent calls that should fail in a particular way after a call removes the bucket, they might fail in a very different way. That might affect retries as well, since in this case it is the server side that fails, and the failure handling in getBucket won't be effective: we get the bucket from the cache, so we never see the error there.

As far as I can tell, this last case is not handled in the current code, and it can lead to unexpected behaviour, or behaviour we have no tests for. There is another thing we need to put more thought into when it comes to caching buckets.
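To illustrate the last case, here is a minimal sketch of one possible mitigation: drop the cached entry when an operation fails with a bucket-level error, so the next access refetches from the server instead of serving stale state. All names here (`BucketCache`, the fetcher) are hypothetical and do not match the Ozone client API; this is a sketch of the idea, not the implementation under review.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical bucket-info cache that can be invalidated when a call
 *  fails with a bucket-level error, so the next get() refetches and
 *  surfaces the real server-side state instead of a stale entry. */
class BucketCache {
  private final Map<String, String> cache = new ConcurrentHashMap<>();
  private final Function<String, String> fetcher; // stands in for the getBucket RPC

  BucketCache(Function<String, String> fetcher) {
    this.fetcher = fetcher;
  }

  String get(String bucketName) {
    // computeIfAbsent: only the first access per bucket triggers the fetch
    return cache.computeIfAbsent(bucketName, fetcher);
  }

  /** Call when an operation fails with a bucket-level error (e.g. the
   *  bucket was removed server-side): drop the entry so the stale info
   *  is not served again on the next access. */
  void invalidate(String bucketName) {
    cache.remove(bucketName);
  }
}

public class InvalidationDemo {
  public static void main(String[] args) {
    int[] rpcCalls = {0};
    BucketCache cache = new BucketCache(name -> {
      rpcCalls[0]++; // count simulated RPCs
      return "info-of-" + name + "-v" + rpcCalls[0];
    });
    System.out.println(cache.get("b1")); // fetched: info-of-b1-v1
    System.out.println(cache.get("b1")); // cached, no RPC: info-of-b1-v1
    cache.invalidate("b1");              // simulate a failed call on b1
    System.out.println(cache.get("b1")); // refetched: info-of-b1-v2
  }
}
```

The open question this sketch does not answer is *where* the `invalidate` call would live: without a single, well-defined place that performs the RPC and handles its errors, every call site touching the cached bucket would need this handling.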
We need to consider what happens to the client if the bucket was recreated with different properties, or removed, after the client cached it: the client uses stale information, and I don't think the behaviour in these cases is fully defined or fully correct yet... I do not see any handling that would clear the cached bucket info, so if my assumptions are right, the client may behave incorrectly until the cache evicts the bucket info at the end of the 10-minute cache lifetime.

Let me add one more generic thought... I believe our first goal should be to change our code to use at most one VolumeInfo call, at most one BucketInfo call, and then at most the RPC calls required for the actual operation: for example 3 for CreateKey (OpenKey, CommitKey, and a RenameKey from x.COPYING to x), 1 for RenameKey, 1 for DeleteKey, and so on. Once that is done, we can introduce the cache and cache VolumeInfo and BucketInfo objects, in order to reduce the overall volume of RPC calls caused by long-running clients.

Why? Currently the code calls getBucket here and there; by introducing caching, we have to do error handling everywhere we use the bucket object, and at the moment that handling is a bit hectic and error-prone. If all our code that touches RPC is simplified, and it is clear how and when we issue an RPC, that helps greatly with handling errors properly.

I know this is arguable, and I am also sure that this is just a single point of view, based on assumptions about parts of the code I have not examined much, but this is how I think about the problem. Please argue with me if I am wrong; I would be happy to introduce the cache first and get rid of unnecessary InfoBucket calls, but at the moment I don't think it is that simple, even for just the delete/rename flow...
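The stale-read window described above can be made concrete with a small sketch: an entry cached with a fixed lifetime keeps serving the old bucket properties until it expires, even if the bucket was recreated with different properties in the meantime. This is not the Ozone implementation; the class, the property strings, and the injected clock are all illustrative assumptions (the clock is injected so the time-based behaviour is deterministic).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache demonstrating the stale-read window: until an
 *  entry's lifetime elapses, the client sees the properties that were
 *  current when the entry was cached, not the server's current state. */
class TtlCache {
  private record Entry(String value, long cachedAtMillis) {}

  private final Map<String, Entry> entries = new HashMap<>();
  private final Function<String, String> fetcher; // stands in for the getBucket RPC
  private final long ttlMillis;
  private final LongSupplier clock;               // injected for deterministic tests

  TtlCache(Function<String, String> fetcher, long ttlMillis, LongSupplier clock) {
    this.fetcher = fetcher;
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  String get(String bucket) {
    Entry e = entries.get(bucket);
    if (e == null || clock.getAsLong() - e.cachedAtMillis() >= ttlMillis) {
      e = new Entry(fetcher.apply(bucket), clock.getAsLong());
      entries.put(bucket, e); // refetch on miss or after expiry
    }
    return e.value(); // may be stale for up to ttlMillis
  }
}

public class StaleWindowDemo {
  public static void main(String[] args) {
    long[] now = {0};
    Map<String, String> server = new HashMap<>();
    server.put("b1", "replication=THREE");
    // 10-minute TTL, matching the cache lifetime discussed above
    TtlCache cache = new TtlCache(server::get, 600_000, () -> now[0]);

    System.out.println(cache.get("b1")); // replication=THREE
    server.put("b1", "replication=ONE"); // bucket recreated with different properties
    now[0] = 300_000;                    // 5 minutes later
    System.out.println(cache.get("b1")); // still replication=THREE (stale)
    now[0] = 600_000;                    // TTL elapsed
    System.out.println(cache.get("b1")); // replication=ONE
  }
}
```

In this sketch the client acts on the old replication setting for up to the full TTL; whether that window is acceptable, or needs explicit invalidation, is exactly the question raised above.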
