[ https://issues.apache.org/jira/browse/HDFS-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941542#comment-14941542 ]

Colin Patrick McCabe commented on HDFS-9184:
--------------------------------------------

Is it documented anywhere that the audit log is key/value?  I didn't see any 
specification for the format... did I miss some docs somewhere?  I don't think 
this is similar to protobuf because there is a clearly defined and documented 
way to extend PB.
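
For reference, here is roughly what the current audit logger emits (an 
illustrative line with made-up values; as far as I know the layout is an 
implementation detail, not a documented contract the way the PB extension 
rules are):

{noformat}
2015-10-02 12:34:56,789 INFO FSNamesystem.audit: allowed=true ugi=alice (auth:KERBEROS) ip=/10.0.0.1 cmd=delete src=/tmp/foo dst=null perm=null
{noformat}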

Many modern Hadoop systems access HDFS through a proxy.  For example, some 
people use Tachyon to get read and write caching.  RecordService provides 
row-level security and deserialization services.  Hive itself usually does its 
work on behalf of some other process, like Tableau or Spark.  How will this 
solution work in those cases?

For me, a lot of this discussion gets back to the reasons why htrace is a 
separate system rather than just part of HDFS or HBase.  You need something 
that can span multiple projects and create a coherent narrative about what's 
going on.  I agree that HTrace should not be run at 100% sampling, but I am 
also not convinced by the arguments that we need 100% sampling at all.

If this is to diagnose performance issues, then 1% or so sampling should be 
fine.  If this is about security issues, then it seems flawed, since it doesn't 
actually stop anyone from accessing anything.  Can you be a little clearer 
about the specific use-cases for this?

> Logging HDFS operation's caller context into audit logs
> -------------------------------------------------------
>
>                 Key: HDFS-9184
>                 URL: https://issues.apache.org/jira/browse/HDFS-9184
>             Project: Hadoop HDFS
>          Issue Type: Task
>            Reporter: Mingliang Liu
>            Assignee: Mingliang Liu
>         Attachments: HDFS-9184.000.patch
>
>
> For a given HDFS operation (e.g. delete file), it's very helpful to track 
> which upper-level job issued it. The upper-level callers may be specific 
> Oozie tasks, MR jobs, and Hive queries. One scenario is that the namenode 
> (NN) is being abused/spammed; the operator may want to know immediately 
> which MR job is to blame so that she can kill it. To this end, the caller 
> context contains at least an application-dependent "tracking id".
> There are several existing techniques that may be related to this problem.
> 1. Currently the HDFS audit log tracks the user of the operation, which is 
> obviously not enough. It's common for the same user to issue multiple jobs 
> at the same time. Even for a single top-level task, tracking back to a 
> specific caller in a chain of operations of the whole workflow (e.g. Oozie -> 
> Hive -> Yarn) is hard, if not impossible.
> 2. HDFS integrates {{htrace}} support for providing tracing information 
> across multiple layers. Spans are created in many places and interconnected 
> in a tree structure, which relies on offline analysis across RPC boundaries. 
> For this use case, {{htrace}} would have to be enabled at a 100% sampling 
> rate, which introduces significant overhead. Moreover, passing additional 
> information (via annotations) other than the span id from the root of the 
> tree to the leaves is significant additional work.
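> To make the annotation point concrete, here is a rough sketch of what every 
> caller on the path would have to do (this assumes the HTrace 3.x API; the 
> exact method signatures are an assumption and may differ between HTrace 
> versions):
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> import org.apache.htrace.Sampler;
> import org.apache.htrace.Trace;
> import org.apache.htrace.TraceScope;
>
> public class TrackedDelete {
>   public void deleteWithTracking(String queryId) {
>     // 100% sampling is required here, or the tracking id is simply lost.
>     TraceScope scope = Trace.startSpan("deleteFile", Sampler.ALWAYS);
>     try {
>       // The tracking id must be re-attached as a KV annotation at every
>       // layer (Oozie, Hive, YARN, HDFS client) to survive to the trace sink.
>       scope.getSpan().addKVAnnotation(
>           "queryId".getBytes(StandardCharsets.UTF_8),
>           queryId.getBytes(StandardCharsets.UTF_8));
>       // ... perform the actual HDFS operation here ...
>     } finally {
>       scope.close();
>     }
>   }
> }
> {code}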
> 3. In [HDFS-4680 | https://issues.apache.org/jira/browse/HDFS-4680], there 
> is some related discussion on this topic. The final patch implemented the 
> tracking id as a part of the delegation token, which protects the tracking 
> information from being changed or impersonated. However, Kerberos-authenticated 
> connections and insecure connections don't have tokens. [HADOOP-8779] proposes 
> to use tokens in all scenarios, but that might mean changes to several upstream 
> projects and a major change to their security implementations.
> We propose another approach to address this problem. We treat the HDFS audit 
> log as a good place for after-the-fact root cause analysis. We propose to put 
> the caller id (e.g. the Hive query id) in threadlocals. Specifically, on the 
> client side the threadlocal object is passed to the NN as an (optional) part 
> of the RPC header, while on the server side the NN retrieves it from the 
> header and puts it into the {{Handler}}'s threadlocals. Finally, in 
> {{FSNamesystem}}, the HDFS audit logger will record the caller context for 
> each operation. In this way, the existing code is not affected.
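> A minimal sketch of the client-side holder follows (the class shape and all 
> names here are hypothetical, not the actual patch; the real change would 
> wire this into the RPC request header):
> {code:java}
> // Hypothetical sketch of the proposed thread-local caller context.
> public final class CallerContext {
>   private static final ThreadLocal<CallerContext> CURRENT = new ThreadLocal<>();
>
>   private final String context;   // e.g. the Hive query id
>   private final byte[] signature; // optional, validated offline
>
>   public CallerContext(String context, byte[] signature) {
>     this.context = context;
>     this.signature = signature;
>   }
>
>   public String getContext() { return context; }
>   public byte[] getSignature() { return signature; }
>
>   // Set by the upper-level application (Oozie, Hive, MR) before issuing
>   // HDFS calls; read by the RPC client when building the request header.
>   public static void setCurrent(CallerContext ctx) { CURRENT.set(ctx); }
>   public static CallerContext getCurrent() { return CURRENT.get(); }
> }
> {code}
> On the server side the handler does the inverse: it copies the field out of 
> the RPC header into the handler thread's threadlocal before the call reaches 
> {{FSNamesystem}}, where the audit logger reads it.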
> It is still challenging to keep a "lying" client from abusing the caller 
> context. Our proposal is to add a {{signature}} field to the caller context. 
> The client may choose to provide its signature along with the caller id. The 
> operator may need to validate the signature at the time of offline analysis; 
> the NN is not responsible for validating the signature online.
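> As an illustration of how such offline validation could work (this signing 
> scheme is an assumption for illustration, not part of the patch), the 
> signature could be an HMAC over the caller id under a per-application 
> secret, which the operator recomputes during analysis:
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
> import javax.crypto.Mac;
> import javax.crypto.spec.SecretKeySpec;
>
> // Illustrative only: the NN just stores the bytes; the operator verifies.
> public final class CallerContextSigner {
>   public static byte[] sign(String callerId, byte[] secret) throws Exception {
>     Mac mac = Mac.getInstance("HmacSHA256");
>     mac.init(new SecretKeySpec(secret, "HmacSHA256"));
>     return mac.doFinal(callerId.getBytes(StandardCharsets.UTF_8));
>   }
>
>   // Offline check; constant-time comparison avoids timing leaks.
>   public static boolean verify(String callerId, byte[] sig, byte[] secret)
>       throws Exception {
>     return MessageDigest.isEqual(sign(callerId, secret), sig);
>   }
> }
> {code}
> A client that fabricates a caller id without the matching secret will fail 
> this check during the offline audit, even though the NN accepted the context.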


