[
https://issues.apache.org/jira/browse/HDFS-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792903#comment-16792903
]
Erik Krogen commented on HDFS-14211:
------------------------------------
The {{TestConsistentReadsObserver}} failure is interesting. In my environment,
before this patch, it succeeds when run through Maven but fails when I run it
in my IDE to try to debug it. After my patch, it fails in both. Digging into
it, I cannot understand why this test would ever succeed, due to the bug I
just filed as HADOOP-16192, and since it also fails in my IDE before the
patch, I can't easily work out why it was passing under Maven. Once
HADOOP-16192 is fixed, this patch does not affect the success of
{{testRequeueCall}}, so I think we can safely ignore this failure for now
during reviews.
> [Consistent Observer Reads] Allow for configurable "always msync" mode
> ----------------------------------------------------------------------
>
> Key: HDFS-14211
> URL: https://issues.apache.org/jira/browse/HDFS-14211
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Reporter: Erik Krogen
> Assignee: Erik Krogen
> Priority: Major
> Attachments: HDFS-14211.000.patch, HDFS-14211.001.patch
>
>
> To allow for reads to be serviced from an ObserverNode (see HDFS-12943) in a
> consistent way, an {{msync}} API was introduced (HDFS-13688) to allow for a
> client to fetch the latest transaction ID from the Active NN, thereby
> ensuring that subsequent reads from the ObserverNode will be up-to-date with
> the current state of the Active.
> Using this properly, however, requires application-side changes: for
> example, a NodeManager should call {{msync}} before localizing the resources
> for a client, since it received notification of the existence of those
> resources via communication that is out-of-band to HDFS and thus could
> attempt to localize them before they are available on the ObserverNode.
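> Below is a minimal sketch of that application-side pattern. The class name is
> made up, and the {{FileSystem#msync()}} entry point is an assumption based on
> recent Hadoop releases; on some branches the call lives on {{DFSClient}}
> instead.
> {code:java}
> import java.io.IOException;
> 
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> 
> public class LocalizeAfterMsync {
>   public static void main(String[] args) throws IOException {
>     FileSystem fs = FileSystem.get(new Configuration());
> 
>     // The path was learned out-of-band (e.g. from an application submitting
>     // a container), so the ObserverNode may not yet have seen the
>     // transaction that created it.
>     Path resource = new Path(args[0]);
> 
>     // Advance the client's state ID to the Active's latest transaction ID so
>     // that the read below is up-to-date even when served by an Observer.
>     fs.msync();
> 
>     try (FSDataInputStream in = fs.open(resource)) {
>       // ... localize the resource ...
>     }
>   }
> }
> {code}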
> Until such application-side changes can be made, which will be a longer-term
> effort, we need to provide a mechanism for unchanged clients to utilize the
> ObserverNode without exposing such a client to inconsistencies. This is
> essentially phase 3 of the roadmap outlined in the [design
> document|https://issues.apache.org/jira/secure/attachment/12915990/ConsistentReadsFromStandbyNode.pdf]
> for HDFS-12943.
> The design document proposes some heuristics based on understanding of how
> common applications (e.g. MR) use HDFS for resources. As an initial pass, we
> can simply have a flag which tells a client to call {{msync}} before _every
> single_ read operation. This may seem counterintuitive, as it turns every
> read operation into two RPCs: {{msync}} to the Active followed by an actual
> read operation to the Observer. However, the {{msync}} operation is extremely
> lightweight, as it does not acquire the {{FSNamesystemLock}}, and in
> experiments we have found that this approach can easily scale to well over
> 100,000 {{msync}} operations per second on the Active (while still servicing
> approx. 10,000 write op/s). Combined with the fast-path edit log tailing for
> standby/observer nodes (HDFS-13150), this "always msync" approach should
> introduce only a few ms of extra latency to each read call.
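> To make the flag concrete, here is a rough client-side sketch of what "always
> msync" amounts to: each read becomes an {{msync}} to the Active followed by
> the actual read, which the failover proxy provider can route to the Observer.
> The wrapper class is purely illustrative (the real change belongs in the
> client/proxy-provider layer, not in application code), and the
> {{FileSystem#msync()}} entry point is again an assumption.
> {code:java}
> import java.io.IOException;
> 
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> 
> /** Issues an msync to the Active before every read to keep Observer reads consistent. */
> public class AlwaysMsyncReader {
>   private final FileSystem fs;
> 
>   public AlwaysMsyncReader(FileSystem fs) {
>     this.fs = fs;
>   }
> 
>   public FileStatus getFileStatus(Path p) throws IOException {
>     fs.msync();                  // lightweight: no FSNamesystemLock on the Active
>     return fs.getFileStatus(p);  // may be served by the ObserverNode
>   }
> 
>   public FSDataInputStream open(Path p) throws IOException {
>     fs.msync();
>     return fs.open(p);
>   }
> }
> {code}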
> Below are some results from experiments that convert a normal RPC workload
> into one in which every read operation is turned into an {{msync}}. The
> baseline is a workload of 1.5k write op/s and 25k read op/s.
> ||Rate Multiplier||2||4||6||8||
> |RPC Queue Avg Time (ms)|14|53|110|125|
> |RPC Queue NumOps Avg (k)|51|102|147|177|
> |RPC Queue NumOps Max (k)|148|269|306|312|
> _(numbers are approximate and should be viewed primarily for their trends)_
> Results are promising up to between 4x and 6x of the baseline workload, which
> is approx. 100-150k read op/s.