ivandika3 commented on PR #7988: URL: https://github.com/apache/ozone/pull/7988#issuecomment-2809174040
@errose28 Thanks for the initial review.

> Thanks for working on this @ivandika3. I only briefly looked at the code, but can we implement this such that the client/server API required for this change can be reused if/when we add general purpose Ratis based read from follower?

It's possible, but it first requires consolidating the designs for general-purpose Ratis-based read from follower. Additionally, the use case for this patch is not general-purpose reads but the ability to continuously read from a single OM node, as mentioned in the design doc of "Cross-Region Bucket Replication" (https://issues.apache.org/jira/browse/HDDS-12307).

> For example, with Ratis read-from-follower, I imagine the implementation would look something like this:
>
> * Client gives request to leader OM with an option that says it is ok with reading from a follower for this request.
> * OM leader either services the request itself, or load balances by returning options for other followers which have applied all requests.
> * If given a list of other OMs, client will try to have them service its request by contacting them directly.
>
> In this implementation, since we are not yet using Ratis to determine who has the snapshots, the same request flow could be used:
>
> * Client gives request to leader OM with an option that says it is ok with reading from a follower for this request.
> * Leader OM only acknowledges this option for snapshot read requests.
> * OM leader returns addresses of both followers for the client to try, since we don't have read-from-follower implemented.
> * Client tries each OM returned until it finds one that has the snapshot, and then does the read there.

There have been some previous discussions regarding read-from-follower:

- OM HA: support read from followers (https://issues.apache.org/jira/browse/HDDS-9279 and https://github.com/apache/ozone/pull/5288): This should be the simplest one, since very few changes are needed on the OM (server side).
However, our current OM setup does not guarantee linearizability (AFAIK we only support read-after-write consistency). Therefore, when linearizable read was tested, the overall throughput (read + write) dropped by quite a lot. Refer to https://docs.google.com/document/d/1xVkaQYDXJmztETJVZQHkkij_j8j6MGQ4XB8ehathhG8/edit?tab=t.0#heading=h.o61uifuxltgn for the benchmark results.
- Allow slightly stale reads: @whbing's team implemented a read-only client that periodically probes the OM nodes for their appliedIndex and latency. AFAIK, based on the latest probe information, the read-only client reads from the "fastest" OM follower (the one with the largest appliedIndex). This might cause stale reads, but it can increase the read throughput. I suppose this requires additional heartbeat protocols and possibly changes to the OM request and response.
- We can have a mechanism similar to HDFS Observer Read (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html). AFAIK, if it's applied to Ozone, the client will request the lastAppliedIndex of the OM leader (similar to the msync call) and then use that appliedIndex when sending requests to the OM follower. This requires adding appliedIndex to the OM request and adding an msync protocol. We also need to change the OM failover proxy provider to allow reads from followers.
- I recorded some of my thoughts while brainstorming the possible implementation in a comment on https://issues.apache.org/jira/browse/HDDS-9279; these are just loose thoughts to highlight the elements to take into account.

Therefore, if we do not want to change the protobuf fields in the future, we have to be pretty sure about the general-purpose read-from-follower implementation, which requires additional time and effort.

> The only snapshot specific aspect here would be the ability to read snapdiff from a specific OM.
> I would assume the snapdiff request outputs the host that the job is running on so that the user can feed this back into requests for their snapdiff. I don't think general snapshot list/info CLIs would need to be updated since they could use the same general flow as above. Since snapdiff is already OM specific, we should probably implement this part of the change on its own (to account for leader changes) and then come back to this PR as an enhancement on top of that.

The issue is that snapdiff requires both snapshots to exist, which might not be the case on a slow follower even if the snapshot creation has already been applied on the leader. Therefore, we also need to support the list snapshots and get snapshot info operations. During incremental replication in "Cross-Region Bucket Replication", the Syncer waits until the previous snapshot and the new snapshot have been created in the OM listener, and then sends a SnapDiff request to that OM to get the snapshot diff to replicate to the target bucket.

> It would be good to avoid adding anything to the client/server code that would just get deprecated later, because it adds complexity to our cross compatibility support. I imagine general read-from-follower would be able to replace a lot of the snapshot specific changes here, so we should probably think in that direction.

I agree. If we are still not sure, it might be better to defer this patch to the future or create a feature branch until we can commit to the client and server changes.
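
To make the Syncer flow described above more concrete, here is a minimal sketch of "wait until both snapshots are visible on one OM node, then ask that same node for the diff". It is only an illustration under stated assumptions: `SnapshotView`, `diffWhenReady`, and `LaggingOm` are hypothetical names invented for this example and are not existing Ozone classes or APIs, and the in-memory fake merely simulates a follower that lags behind the leader.

```java
import java.util.HashMap;
import java.util.Map;

public class SyncerSketch {

  /** Hypothetical view of the single OM node (e.g. a listener) the Syncer reads from. */
  interface SnapshotView {
    boolean hasSnapshot(String snapshotName);
    String snapshotDiff(String fromSnapshot, String toSnapshot);
  }

  /**
   * Polls the same OM node until both snapshots are visible there, then asks
   * that node for the diff. Returns null if they do not appear within maxAttempts.
   */
  static String diffWhenReady(SnapshotView om, String from, String to,
      int maxAttempts, long sleepMillis) throws InterruptedException {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      if (om.hasSnapshot(from) && om.hasSnapshot(to)) {
        return om.snapshotDiff(from, to);
      }
      Thread.sleep(sleepMillis); // back off before probing the node again
    }
    return null;
  }

  /** In-memory stand-in for a lagging OM: a snapshot becomes visible after N failed probes. */
  static class LaggingOm implements SnapshotView {
    private final Map<String, Integer> probesUntilVisible = new HashMap<>();

    void addSnapshot(String name, int probesBeforeVisible) {
      probesUntilVisible.put(name, probesBeforeVisible);
    }

    @Override
    public boolean hasSnapshot(String name) {
      Integer remaining = probesUntilVisible.get(name);
      if (remaining == null) {
        return false; // snapshot was never created
      }
      if (remaining <= 0) {
        return true; // the node has caught up for this snapshot
      }
      probesUntilVisible.put(name, remaining - 1);
      return false;
    }

    @Override
    public String snapshotDiff(String from, String to) {
      return "diff(" + from + " -> " + to + ")";
    }
  }

  public static void main(String[] args) throws InterruptedException {
    LaggingOm om = new LaggingOm();
    om.addSnapshot("snap1", 0); // already applied on this node
    om.addSnapshot("snap2", 2); // applied on the leader, not yet visible here
    System.out.println(diffWhenReady(om, "snap1", "snap2", 5, 10L));
  }
}
```

The point of the sketch is that the existence check and the diff request go to the same node, which is why list snapshots / get snapshot info need to be readable from that node as well.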
