[ https://issues.apache.org/jira/browse/HDFS-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-3899:
------------------------------

    Attachment: hdfs-3899.txt

Attached patch implements writer-side metrics.

Here is a readout from a running cluster:

{code}
{
    "name" : "Hadoop:service=NameNode,name=IPCLoggerChannel-127.0.0.1-13001",
    "modelerType" : "IPCLoggerChannel-127.0.0.1-13001",
    "tag.Context" : "dfs",
    "tag.IsOutOfSync" : "false",
    "tag.Hostname" : "todd-w510",
    "WritesE2E30sNumOps" : 20024,
    "WritesE2E30s50thPercentileLatencyMicros" : 601,
    "WritesE2E30s75thPercentileLatencyMicros" : 686,
    "WritesE2E30s90thPercentileLatencyMicros" : 804,
    "WritesE2E30s95thPercentileLatencyMicros" : 1033,
    "WritesE2E30s99thPercentileLatencyMicros" : 2020,
    "WritesRpc30sNumOps" : 20024,
    "WritesRpc30s50thPercentileLatencyMicros" : 565,
    "WritesRpc30s75thPercentileLatencyMicros" : 641,
    "WritesRpc30s90thPercentileLatencyMicros" : 749,
    "WritesRpc30s95thPercentileLatencyMicros" : 929,
    "WritesRpc30s99thPercentileLatencyMicros" : 1925,
    "QueuedEditsSize" : 0,
    "LagTimeMillis" : 0,
    "CurrentLagTxns" : 0
  }
{code}
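
For anyone who wants to pull these numbers off a live NN, they come out through the usual metrics/JMX plumbing. The sketch below is a hypothetical helper (not part of the attached patch) that fetches just the IPCLoggerChannel beans via the NameNode's /jmx JSON servlet, assuming the default HTTP port 50070 and using the servlet's {{qry}} filter:

{code}
// Hypothetical helper (not part of the attached patch): dump the
// per-JN IPCLoggerChannel beans from the NameNode's /jmx JSON servlet.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class DumpLoggerChannelMetrics {
  public static void main(String[] args) throws Exception {
    // Assumes the default NameNode HTTP address; pass host:port to override.
    String nn = args.length > 0 ? args[0] : "localhost:50070";
    // Restrict the output to the logger channel beans only.
    String qry = URLEncoder.encode(
        "Hadoop:service=NameNode,name=IPCLoggerChannel-*", "UTF-8");
    URL url = new URL("http://" + nn + "/jmx?qry=" + qry);
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
{code}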

In the same cluster, I set up a shell script that alternately sends kill -STOP and 
kill -CONT to one of the JNs every 100ms. This simulates that JN being heavily 
loaded so that it "stutters".
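
The script itself is trivial; purely for illustration, here is a rough Java equivalent of what it does (hypothetical, assuming the target JN's pid is passed as the only argument):

{code}
// Illustrative only: alternately SIGSTOP/SIGCONT the given pid every 100ms
// so the process "stutters", roughly like the shell script described above.
public class StutterJournalNode {
  public static void main(String[] args) throws Exception {
    String pid = args[0];
    while (true) {
      new ProcessBuilder("kill", "-STOP", pid).inheritIO().start().waitFor();
      Thread.sleep(100);
      new ProcessBuilder("kill", "-CONT", pid).inheritIO().start().waitFor();
      Thread.sleep(100);
    }
  }
}
{code}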

The bean for that connection shows:
{code}
{
    "name" : "Hadoop:service=NameNode,name=IPCLoggerChannel-127.0.0.1-13002",
    "modelerType" : "IPCLoggerChannel-127.0.0.1-13002",
    "tag.Context" : "dfs",
    "tag.IsOutOfSync" : "false",
    "tag.Hostname" : "todd-w510",
    "WritesE2E30sNumOps" : 20035,
    "WritesE2E30s50thPercentileLatencyMicros" : 30315,
    "WritesE2E30s75thPercentileLatencyMicros" : 65647,
    "WritesE2E30s90thPercentileLatencyMicros" : 88103,
    "WritesE2E30s95thPercentileLatencyMicros" : 95793,
    "WritesE2E30s99thPercentileLatencyMicros" : 101629,
    "WritesRpc30sNumOps" : 20035,
    "WritesRpc30s50thPercentileLatencyMicros" : 302,
    "WritesRpc30s75thPercentileLatencyMicros" : 563,
    "WritesRpc30s90thPercentileLatencyMicros" : 701,
    "WritesRpc30s95thPercentileLatencyMicros" : 800,
    "WritesRpc30s99thPercentileLatencyMicros" : 2520,
    "QueuedEditsSize" : 13568,
    "LagTimeMillis" : 64,
    "CurrentLagTxns" : 251
  }
{code}
This illustrates the difference between the "end-to-end" latency metric and the 
"rpc" latency metric. Because the JN is only stuttering, almost all of the RPCs 
are individually very fast. But occasionally an RPC blocks for ~100ms, which 
pushes the end-to-end latencies much higher, because a batch of edits queues up 
behind the slow RPC, much like a pipeline stall.
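
To make the queueing effect concrete, here is a toy sketch (not the actual IPCLoggerChannel code, just an illustration of the same structure): edits are handed to a single-threaded sender, so when one send stalls for ~100ms, everything queued behind it picks up that wait in its end-to-end time even though its own RPC completes quickly.

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LatencyIllustration {
  public static void main(String[] args) throws Exception {
    // Single-threaded sender, analogous to one logger channel's send queue.
    ExecutorService sender = Executors.newSingleThreadExecutor();
    for (int i = 0; i < 5; i++) {
      final long enqueuedAt = System.nanoTime();
      final boolean slow = (i == 0);  // one stalled send at the head of the queue
      sender.submit(() -> {
        long rpcStart = System.nanoTime();
        try {
          Thread.sleep(slow ? 100 : 1);  // stand-in for the actual journal RPC
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        long now = System.nanoTime();
        // "rpc" covers only the send itself; "e2e" also includes queue time.
        System.out.printf("rpc=%dus  e2e=%dus%n",
            (now - rpcStart) / 1000, (now - enqueuedAt) / 1000);
      });
    }
    sender.shutdown();
  }
}
{code}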


                
> QJM: Writer-side metrics
> ------------------------
>
>                 Key: HDFS-3899
>                 URL: https://issues.apache.org/jira/browse/HDFS-3899
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3899.txt
>
>
> We already have some metrics on the server side (JournalNode), but it's useful 
> to also gather metrics from the client side (NameNode). This is important in 
> order to monitor whether the NN is seeing good performance from the individual 
> JNs, and so that administrators can set up alerts if any of the JNs becomes 
> inaccessible to the NN.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
