Erik Krogen created HDFS-14211:
----------------------------------

             Summary: [Consistent Observer Reads] Allow for configurable "always msync" mode
                 Key: HDFS-14211
                 URL: https://issues.apache.org/jira/browse/HDFS-14211
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: hdfs-client
            Reporter: Erik Krogen


To allow for reads to be serviced from an ObserverNode (see HDFS-12943) in a 
consistent way, an {{msync}} API was introduced (HDFS-13688) to allow for a 
client to fetch the latest transaction ID from the Active NN, thereby ensuring 
that subsequent reads from the ObserverNode will be up-to-date with the current 
state of the Active.

Using this properly, however, requires application-side changes. For example, 
a NodeManager should call {{msync}} before localizing the resources for a 
client: it learns that those resources exist via communication that is 
out-of-band to HDFS, so it could attempt to localize them before they are 
available on the ObserverNode.
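
To make the out-of-band scenario concrete, here is a minimal sketch in Java. It is a toy model, not HDFS code: the {{Active}} and {{Observer}} classes, their methods, and the path are all hypothetical stand-ins, and the Observer "catching up" is simulated by replaying the Active's edit list in-process rather than by tailing a real edit log.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of msync-based state alignment; NOT the HDFS implementation. */
final class MsyncSketch {

    /** Hypothetical stand-in for the Active NN: applies writes, reports its latest txid. */
    static final class Active {
        final List<String> editLog = new ArrayList<>();

        void create(String path) {
            editLog.add(path); // each create counts as one transaction in this model
        }

        /** Models msync: returns the latest transaction ID known to the Active. */
        long msync() {
            return editLog.size();
        }
    }

    /** Hypothetical stand-in for an ObserverNode that may lag behind the Active. */
    static final class Observer {
        private final Set<String> namespace = new HashSet<>();
        private long appliedTxId = 0;

        /** Replays edits until the Observer has applied targetTxId (or the log ends). */
        void tailEdits(Active active, long targetTxId) {
            long target = Math.min(targetTxId, active.editLog.size());
            while (appliedTxId < target) {
                namespace.add(active.editLog.get((int) appliedTxId));
                appliedTxId++;
            }
        }

        /** Serves a read only after reaching the transaction ID the client has seen. */
        boolean exists(String path, long clientSeenTxId, Active active) {
            // Stand-in for the Observer holding the call until it has caught up.
            tailEdits(active, clientSeenTxId);
            return namespace.contains(path);
        }
    }

    public static void main(String[] args) {
        Active active = new Active();
        Observer observer = new Observer();

        // A job client uploads a resource; the NodeManager hears about it out-of-band.
        active.create("/job/resource.jar");

        // Without msync, the NodeManager's last seen txid says nothing about that write.
        System.out.println(observer.exists("/job/resource.jar", 0L, active));

        // With msync, it first learns the Active's latest txid, so the read is consistent.
        long seen = active.msync();
        System.out.println(observer.exists("/job/resource.jar", seen, active));
    }
}
```

In this model the first read misses the file because the reader's last seen transaction ID predates the write, while the read issued after {{msync}} finds it.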

Until such application-side changes can be made, which will be a longer-term 
effort, we need to provide a mechanism for unchanged clients to utilize the 
ObserverNode without exposing such a client to inconsistencies. This is 
essentially phase 3 of the roadmap outlined in the [design 
document|https://issues.apache.org/jira/secure/attachment/12915990/ConsistentReadsFromStandbyNode.pdf]
 for HDFS-12943.

The design document proposes some heuristics based on understanding of how 
common applications (e.g. MR) use HDFS for resources. As an initial pass, we 
can simply have a flag which tells a client to call {{msync}} before _every 
single_ read operation. This may seem counterintuitive, as it turns every read 
operation into two RPCs: {{msync}} to the Active followed by an actual read 
operation to the Observer. However, the {{msync}} operation is extremely 
lightweight, as it does not acquire the {{FSNamesystemLock}}, and in 
experiments we have found that this approach can easily scale to well over 
100,000 {{msync}} operations per second on the Active (while still servicing 
approx. 10,000 write op/s). Combined with the fast-path edit log tailing for 
standby/observer nodes (HDFS-13150), this "always msync" approach should 
introduce only a few ms of extra latency to each read call.
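
As a rough illustration of the "always msync" flag, the sketch below wraps each read in an optional {{msync}} and counts the resulting RPCs. Everything here is an assumption made for illustration: the class name, the boolean flag (this issue does not define a configuration key), and the two functional interfaces standing in for the RPC to the Active and the read served by the Observer.

```java
import java.util.function.LongFunction;
import java.util.function.LongSupplier;

/** Hypothetical client wrapper illustrating an "always msync" mode; not HDFS code. */
final class AlwaysMsyncClient {
    private final boolean alwaysMsync;      // hypothetical flag, not a real config key
    private final LongSupplier activeMsync; // stands in for the msync RPC to the Active
    private long lastSeenTxId = 0;
    private int rpcCount = 0;

    AlwaysMsyncClient(boolean alwaysMsync, LongSupplier activeMsync) {
        this.alwaysMsync = alwaysMsync;
        this.activeMsync = activeMsync;
    }

    /**
     * Runs one read. The observerRead function receives the transaction ID
     * that the Observer must have reached before it may answer.
     */
    <T> T read(LongFunction<T> observerRead) {
        if (alwaysMsync) {
            // First RPC: learn the Active's latest transaction ID.
            lastSeenTxId = Math.max(lastSeenTxId, activeMsync.getAsLong());
            rpcCount++;
        }
        // Second RPC (or the only one when the flag is off): the read itself.
        rpcCount++;
        return observerRead.apply(lastSeenTxId);
    }

    int rpcCount() {
        return rpcCount;
    }

    long lastSeenTxId() {
        return lastSeenTxId;
    }
}
```

With the flag on, each call to {{read}} costs two RPCs and always carries the Active's current transaction ID; with it off, a read costs one RPC but carries only whatever transaction ID the client had already seen.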

Below are results from experiments that convert a normal RPC workload into one 
in which every read operation is replaced by an {{msync}}. The baseline is a 
workload of 1.5k write op/s and 25k read op/s.

||Rate Multiplier||2||4||6||8||
||RPC Queue Avg Time (ms)|14.2|53.2|110.4|125.3|
||RPC Queue NumOps Avg (k)|51.4|102.3|147.8|177.9|
||RPC Queue NumOps Max (k)|148.8|269.5|306.3|312.4|

Results are promising up to somewhere between 4x and 6x the baseline workload, 
which corresponds to approx. 100-150k read op/s.
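
The conversion from rate multiplier to absolute load is simple arithmetic, sketched below. The baseline constants come from the figures above; the class and method names are hypothetical, and applying the multiplier to the write rate as well as the read rate is an assumption about how the workload was scaled.

```java
/** Scales the stated baseline workload by a rate multiplier (illustrative only). */
final class WorkloadScaling {
    static final double BASELINE_WRITE_KOPS = 1.5; // baseline write op/s, in thousands
    static final double BASELINE_READ_KOPS = 25.0; // baseline read op/s, in thousands

    /** Read (msync) operations per second, in thousands, at the given multiplier. */
    static double readKops(int multiplier) {
        return multiplier * BASELINE_READ_KOPS;
    }

    /** Write operations per second, in thousands, assuming writes scale the same way. */
    static double writeKops(int multiplier) {
        return multiplier * BASELINE_WRITE_KOPS;
    }

    public static void main(String[] args) {
        for (int m : new int[] {2, 4, 6, 8}) {
            System.out.println(m + "x: " + readKops(m) + "k read op/s, "
                    + writeKops(m) + "k write op/s");
        }
    }
}
```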



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
