ivandika3 commented on PR #7988:
URL: https://github.com/apache/ozone/pull/7988#issuecomment-2809174040

   @errose28 Thanks for the initial review.
   
   > Thanks for working on this @ivandika3. I only briefly looked at the code, 
but can we implement this such that the client/server API required for this 
change can be reused if/when we add general purpose Ratis based read from 
follower?
   
   It's possible, but it would first require consolidating the design for 
general-purpose Ratis-based read from follower. Additionally, the use case for 
this patch is not general-purpose reads, but the ability to continuously read 
from a single OM node, as mentioned in the design doc of "Cross-Region Bucket 
Replication" (https://issues.apache.org/jira/browse/HDDS-12307).
   
   > For example, with Ratis read-from-follower, I imagine the implementation 
would look something like this:
   > 
   > * Client gives request to leader OM with an option that says it is ok with 
reading from a follower for this request.
   > * OM leader either services the request itself, or load balances by 
returning options for other followers which have applied all requests.
   > * If given a list of other OMs, client will try to have them service its 
request by contacting them directly.
   > 
   > In this implementation, since we are not yet using Ratis to determine who 
has the snapshots, the same request flow could be used:
   > 
   > * Client gives request to leader OM with an option that says it is ok with 
reading from a follower for this request.
   >   
   >   * Leader OM only acknowledges this option for snapshot read requests.
   > * OM leader returns addresses of both followers for the client to try, 
since we don't have read-from-follower implemented.
   > * Client tries each OM returned until it finds one that has the snapshot, 
and then does the read there.
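
   If it helps, that client-side flow could be sketched roughly like this 
(illustrative names only, not existing Ozone APIs):

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Illustrative sketch only -- these are hypothetical names, not Ozone APIs.
public class SnapshotReadFailover {

  /** Asks one OM for the read; empty means it does not have the snapshot yet. */
  interface SnapshotReader extends Function<String, Optional<String>> { }

  /** Try each candidate OM returned by the leader, in order,
   *  until one of them can serve the snapshot read. */
  static Optional<String> readFromAny(List<String> omAddresses, SnapshotReader reader) {
    for (String om : omAddresses) {
      Optional<String> response = reader.apply(om);
      if (response.isPresent()) {
        return response; // this OM has the snapshot applied
      }
    }
    return Optional.empty(); // none had it; caller may fall back to the leader
  }
}
```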
   
   There have been some previous discussions regarding read-from-follower:
   - OM HA: support read from followers 
(https://issues.apache.org/jira/browse/HDDS-9279 and 
https://github.com/apache/ozone/pull/5288): This should be the simplest option, 
since only very small changes are needed on the OM (server side). However, our 
current OM setup does not guarantee linearizability (AFAIK we only support 
read-after-write consistency). Therefore, when linearizable reads were tested, 
the overall throughput (read + write) dropped by quite a lot. Refer to 
https://docs.google.com/document/d/1xVkaQYDXJmztETJVZQHkkij_j8j6MGQ4XB8ehathhG8/edit?tab=t.0#heading=h.o61uifuxltgn
 for the benchmark results.
   - Allow slightly stale reads: @whbing's team implemented a read-only client 
that periodically probes the OM nodes for their appliedIndex and latency. 
AFAIK, based on the latest probe information, the read-only client reads from 
the "fastest" OM follower (the one with the largest appliedIndex). This might 
cause stale reads, but it can increase read throughput. I suppose this requires 
additional heartbeat protocols and possibly changes to the OM request and 
response.
   - We can have a mechanism similar to HDFS Observer Read 
(https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html).
 AFAIK, if applied to Ozone, the client would request the lastAppliedIndex of 
the OM leader (similar to an msync call) and then include that appliedIndex 
when sending requests to an OM follower. This requires adding appliedIndex to 
the OM request and adding an msync protocol. We would also need to change the 
OM failover proxy provider to allow reads from followers.
   - I recorded some of my thoughts while brainstorming possible 
implementations in a comment on https://issues.apache.org/jira/browse/HDDS-9279; 
these are just loose thoughts to highlight the elements to take into account.
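
   To make the last two ideas above a bit more concrete, a rough sketch (all 
names hypothetical, not existing Ozone APIs): the probe-based client reads from 
the follower with the largest last-probed appliedIndex, while in the Observer 
Read style the follower only serves a read once it has caught up to the index 
the client observed at the leader:

```java
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only -- hypothetical names, not existing Ozone APIs.
public class FollowerReadSketch {

  /** Probe-based selection: pick the follower whose
   *  last-probed appliedIndex is the largest. */
  static Optional<String> pickFreshest(Map<String, Long> appliedIndexByOm) {
    return appliedIndexByOm.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey);
  }

  /** Observer Read style admission check on the follower: serve the read
   *  only once this follower has applied at least up to the index the
   *  client observed at the leader (via the msync-like call). */
  static boolean canServe(long followerAppliedIndex, long clientObservedIndex) {
    return followerAppliedIndex >= clientObservedIndex;
  }
}
```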
   
   Therefore, if we want to avoid changing the protobuf fields in the future, 
we have to be pretty sure about the general-purpose read-from-follower 
implementation, which requires additional time and effort.
   
   > The only snapshot specific aspect here would be the ability to read 
snapdiff from a specific OM. I would assume the snapdiff request outputs the 
host that the job is running on so that the user can feed this back into 
requests for their snapdiff. I don't think general snapshot list/info CLIs 
would need to be updated since they could use the same general flow as above. 
Since snapdiff is already OM specific, we should probably implement this part 
of the change on its own (to account for leader changes) and then come back to 
this PR as an enhancement on top of that.
   
   The issue is that snapdiff requires that both snapshots exist, which might 
not be the case on a slow follower even if the create-snapshot operation has 
already been applied on the leader. Therefore, we also need to support the list 
snapshots and get snapshot info operations. During incremental replication in 
"Cross-Region Bucket Replication", the Syncer will wait until both the previous 
snapshot and the new snapshot have been created on the OM listener, and then 
send a SnapDiff request to that OM to get the SnapDiff to replicate to the 
target bucket.
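
   The precondition the Syncer would wait on could look roughly like this 
(hypothetical names, not the actual Syncer code):

```java
import java.util.Set;

// Illustrative sketch only -- hypothetical names, not the actual Syncer code.
public class SnapDiffPrecondition {

  /** Both endpoints of the diff must exist on the OM before SnapDiff can run;
   *  the Syncer would wait until this returns true before sending the request. */
  static boolean readyForSnapDiff(Set<String> snapshotsOnOm,
                                  String previousSnapshot, String newSnapshot) {
    return snapshotsOnOm.contains(previousSnapshot)
        && snapshotsOnOm.contains(newSnapshot);
  }
}
```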
   
   > It would be good to avoid adding anything to the client/server code that 
would just get deprecated later, because it adds complexity to our cross 
compatibility support. I imagine general read-from-follower would be able to 
replace a lot of the snapshot specific changes here, so we should probably 
think in that direction.
   
   I agree. If we are still not sure, it might be better to defer this patch 
for now or create a feature branch until we can commit to the client and server 
changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

