[ https://issues.apache.org/jira/browse/HDFS-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795221#comment-16795221 ]
Erik Krogen commented on HDFS-14211:
------------------------------------

Hey [~ayushtkn], thanks for raising your concerns; they are very valid.

Regarding documentation, you bring up a great point. I am uploading a new v002 patch now which adds documentation of this feature to {{ObserverNameNode.md}}.

Regarding explicit vs. implicit calls, any call to the Active will update {{lastMsyncTimeMs}}. See {{ObserverReadProxyProvider L415}}. This should cover the scenario you have described; please let me know if you see further issues.

> [Consistent Observer Reads] Allow for configurable "always msync" mode
> ----------------------------------------------------------------------
>
>                 Key: HDFS-14211
>                 URL: https://issues.apache.org/jira/browse/HDFS-14211
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>         Attachments: HDFS-14211.000.patch, HDFS-14211.001.patch
>
>
> To allow for reads to be serviced from an ObserverNode (see HDFS-12943) in a
> consistent way, an {{msync}} API was introduced (HDFS-13688) to allow for a
> client to fetch the latest transaction ID from the Active NN, thereby
> ensuring that subsequent reads from the ObserverNode will be up-to-date with
> the current state of the Active.
> Using this properly, however, requires application-side changes: for
> example, a NodeManager should call {{msync}} before localizing the resources
> for a client, since it received notification of the existence of those
> resources via communication which is out-of-band to HDFS and thus could
> potentially attempt to localize them prior to the availability of those
> resources on the ObserverNode.
> Until such application-side changes can be made, which will be a longer-term
> effort, we need to provide a mechanism for unchanged clients to utilize the
> ObserverNode without exposing such a client to inconsistencies.
> This is essentially phase 3 of the roadmap outlined in the [design
> document|https://issues.apache.org/jira/secure/attachment/12915990/ConsistentReadsFromStandbyNode.pdf]
> for HDFS-12943.
> The design document proposes some heuristics based on understanding of how
> common applications (e.g. MR) use HDFS for resources. As an initial pass, we
> can simply have a flag which tells a client to call {{msync}} before _every
> single_ read operation. This may seem counterintuitive, as it turns every
> read operation into two RPCs: an {{msync}} to the Active followed by an actual
> read operation to the Observer. However, the {{msync}} operation is extremely
> lightweight, as it does not acquire the {{FSNamesystemLock}}, and in
> experiments we have found that this approach can easily scale to well over
> 100,000 {{msync}} operations per second on the Active (while still servicing
> approx. 10,000 write op/s). Combined with the fast-path edit log tailing for
> standby/observer nodes (HDFS-13150), this "always msync" approach should
> introduce only a few ms of extra latency to each read call.
> Below are some results collected from experiments which convert a normal RPC
> workload into one in which all read operations are turned into an {{msync}}.
> The baseline is a workload of 1.5k write op/s and 25k read op/s.
>
> ||Rate Multiplier|2|4|6|8||
> ||RPC Queue Avg Time (ms)|14|53|110|125||
> ||RPC Queue NumOps Avg (k)|51|102|147|177||
> ||RPC Queue NumOps Max (k)|148|269|306|312||
>
> _(numbers are approximate and should be viewed primarily for their trends)_
> Results are promising up to between 4x and 6x of the baseline workload, which
> is approx. 100-150k read op/s.
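
For readers who want to see the "always msync" idea from the issue description in miniature, here is a toy, in-memory sketch. It is not the HDFS client code: the classes {{Active}}, {{Observer}} and {{Client}}, the {{alwaysMsync}} flag and the {{rpcCount}} counter are all hypothetical names invented for illustration, and the Observer "catching up" is modelled by simply replaying a shared edit list rather than by real edit log tailing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "always msync"; hypothetical names, not the real HDFS classes. */
final class AlwaysMsyncSketch {

    /** Stand-in for the Active NameNode: records edits and hands out transaction IDs. */
    static final class Active {
        private final List<String[]> editLog = new ArrayList<>();

        /** Applies a write and returns its transaction ID (1-based position in the log). */
        long write(String path, String value) {
            editLog.add(new String[] {path, value});
            return editLog.size();
        }

        /** Models msync: returns the latest transaction ID and does nothing else. */
        long msync() {
            return editLog.size();
        }

        String[] edit(long txId) {
            return editLog.get((int) (txId - 1));
        }
    }

    /** Stand-in for an ObserverNode that only applies edits when a client requires them. */
    static final class Observer {
        private final Active active;
        private final Map<String, String> namespace = new HashMap<>();
        private long appliedTxId = 0;

        Observer(Active active) {
            this.active = active;
        }

        /** Serves a read only after catching up to the transaction ID the client has seen. */
        String read(String path, long clientTxId) {
            while (appliedTxId < clientTxId) {
                String[] e = active.edit(++appliedTxId);
                namespace.put(e[0], e[1]);
            }
            return namespace.get(path);
        }
    }

    /** A reader that optionally issues an msync to the Active before every read. */
    static final class Client {
        private final Active active;
        private final Observer observer;
        private final boolean alwaysMsync;
        private long lastSeenTxId = 0;
        int rpcCount = 0;

        Client(Active active, Observer observer, boolean alwaysMsync) {
            this.active = active;
            this.observer = observer;
            this.alwaysMsync = alwaysMsync;
        }

        String read(String path) {
            if (alwaysMsync) {
                // First RPC: fetch the latest transaction ID from the Active.
                lastSeenTxId = Math.max(lastSeenTxId, active.msync());
                rpcCount++;
            }
            // Second RPC (or the only one): the actual read, served by the Observer.
            rpcCount++;
            return observer.read(path, lastSeenTxId);
        }
    }

    public static void main(String[] args) {
        Active active = new Active();
        Observer observer = new Observer(active);
        // Another party writes the file; the readers learn of it out-of-band.
        active.write("/job/resource.jar", "v1");

        Client unchanged = new Client(active, observer, false);
        Client synced = new Client(active, observer, true);

        System.out.println("no msync:     " + unchanged.read("/job/resource.jar"));
        System.out.println("always msync: " + synced.read("/job/resource.jar")
            + " in " + synced.rpcCount + " RPCs");
    }
}
```

In this toy model the unchanged client reads a stale view (the path is not yet visible on the Observer), while the client with the flag set sees the file at the cost of two RPCs per read, which is the trade-off the description discusses.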