[ 
https://issues.apache.org/jira/browse/HDFS-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941613#comment-14941613
 ] 

Allen Wittenauer commented on HDFS-9184:
----------------------------------------

bq. Is it documented anywhere that the audit log is key/value? I didn't see any 
specification for the format...

It's a) not documented and b) not a kvp.

Story time. This is going to be the shorter version.  

I have few regrets about things I helped design in Hadoop, but this does happen 
to be one of them, especially given all of the misunderstanding around what 
its purpose in life is and how people actually use it.  When [~chris.douglas] 
and I did the design work on the audit log back in 2008 (IIRC), I specifically 
wanted a fixed-field log file format.  We were going to be writing ops tools to 
answer questions that we on the ops team otherwise simply could not. It was 
important that the format stay fixed for a variety of reasons:

* The ops team at Y! was tiny, with a mix of junior and senior folks. The junior 
folks were likely going to be the ones writing the code, since the senior folks 
were busy dealing with the continual fallout from the weekly Hadoop upgrades 
and just getting a working infrastructure in place while we moved away from 
YST.  (... and getting ops-specific tooling out of dev was regularly blocked by 
management ...)

* We needed to make sure that no matter what the devs added to Hadoop, the log 
file wouldn't change.  At that point in time, the logs for things like the NN 
were wildly fluctuating and were pretty much impossible to use for any sort of 
metrics or monitoring.  We needed a safe space away from the turmoil 
happening in the rest of the system.  If the format had been open-ended, it 
would have been absolute hell to work with.  Forcing a format that at that 
point covered 100% of the foreseeable use cases solved that problem.

* The content was modeled after Solaris BSM, with a few key differences.  BSM 
wrote in binary, which just wasn't a real option without us pulling out more 
advanced techniques; it would fail the 'quick and dirty' tests that the ops 
team had to pass in order to fulfill user needs. BSM also supported a heck of a 
lot more than Hadoop did.  So a straight logfile it was.

Now, one of the things I wanted to avoid was the "tab problem", e.g., fields 
that are empty end up looking like field<tab><tab>field. So we settled on a 
<column label>=<value> format where every label would always be present, so that 
we could then use spaces to break up the columns.  [Thus why I say it is *not* 
kvp.  In most key-value stores that I've worked with, it's rare to see 
key=(null).] 
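The payoff of that decision is that parsing needs nothing beyond a whitespace 
split. A minimal sketch (the sample line below is illustrative, not a real 
audit entry, and the exact label set may differ between Hadoop versions):

```python
# Parse a fixed-field "label=value" audit line of the kind described above.
# Because every label is always emitted (with a null/placeholder value when
# empty), splitting on whitespace is unambiguous -- there is no "tab problem".
sample = ("allowed=true ugi=alice ip=/10.0.0.1 cmd=delete "
          "src=/tmp/foo dst=null perm=null")

def parse_audit_line(line):
    """Return the line as a dict of label -> value."""
    fields = {}
    for token in line.split():
        label, _, value = token.partition("=")
        fields[label] = value
    return fields

print(parse_audit_line(sample)["cmd"])  # delete
```

Note the sketch assumes values themselves contain no whitespace; handling 
paths with embedded spaces would need more care.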

I've also heard that the file is a "weird form of JSON".  No, it's not.  In 
fact, I vetoed JSON because of the extra parsing overhead, with very little 
gain to be had vs. just fixing all the fields.

Now, what would I do differently?  #1 would be documentation with a clear 
explanation of this history, covering the whys and the hows.  #2 would probably 
be to make it officially key value with some fields being required.  But that's 
a different problem altogether....



> Logging HDFS operation's caller context into audit logs
> -------------------------------------------------------
>
>                 Key: HDFS-9184
>                 URL: https://issues.apache.org/jira/browse/HDFS-9184
>             Project: Hadoop HDFS
>          Issue Type: Task
>            Reporter: Mingliang Liu
>            Assignee: Mingliang Liu
>         Attachments: HDFS-9184.000.patch
>
>
> For a given HDFS operation (e.g. delete file), it's very helpful to track 
> which upper level job issues it. The upper level callers may be specific 
> Oozie tasks, MR jobs, and hive queries. One scenario is that the namenode 
> (NN) is abused/spammed, the operator may want to know immediately which MR 
> job should be blamed so that she can kill it. To this end, the caller context 
> contains at least the application-dependent "tracking id".
> There are several existing techniques that may be related to this problem.
> 1. Currently the HDFS audit log tracks the user of the operation, which 
> is obviously not enough. It's common that the same user issues multiple jobs 
> at the same time. Even for a single top-level task, tracking back to a 
> specific caller in a chain of operations of the whole workflow (e.g. Oozie -> 
> Hive -> Yarn) is hard, if not impossible.
> 2. HDFS integrated {{htrace}} support for providing tracing information 
> across multiple layers. Spans are created in many places and interconnected 
> like a tree structure, which relies on offline analysis across RPC 
> boundaries. For this use case, {{htrace}} has to be enabled at a 100% 
> sampling rate, which introduces significant overhead. Moreover, passing 
> additional information (via annotations) other than the span id from the 
> root of the tree to a leaf is significant additional work.
> 3. In [HDFS-4680 | https://issues.apache.org/jira/browse/HDFS-4680], there 
> is some related discussion on this topic. The final patch implemented the 
> tracking id as a part of the delegation token. This protects the tracking 
> information from being changed or impersonated. However, 
> Kerberos-authenticated connections and insecure connections don't have 
> tokens. [HADOOP-8779] proposes to use tokens in all scenarios, but that 
> might mean changes to several upstream projects and a major change to their 
> security implementation.
> We propose another approach to address this problem. We also treat the HDFS 
> audit log as a good place for after-the-fact root cause analysis. We propose 
> to put the caller id (e.g. Hive query id) in threadlocals. Specifically, on 
> the client side the threadlocal object is passed to the NN as an (optional) 
> part of the RPC header, while on the server side the NN retrieves it from the 
> header and puts it in the {{Handler}}'s threadlocals. Finally, in 
> {{FSNamesystem}}, the HDFS audit logger will record the caller context for 
> each operation. In this way, the existing code is not affected.
> It is still challenging to keep a "lying" client from abusing the caller 
> context. Our proposal is to add a {{signature}} field to the caller context. 
> The client may choose to provide its signature along with the caller id. The 
> operator may need to validate the signature at the time of offline analysis; 
> the NN is not responsible for validating the signature online.
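The thread-local flow the description above proposes can be sketched roughly as 
follows. This is a hedged, purely illustrative Python sketch, not the actual 
HDFS implementation (which is Java); all function and field names here are 
hypothetical, and the real proposal also carries the optional signature field:

```python
# Sketch of the caller-context idea: the client stashes a caller id in a
# thread-local, the RPC layer copies it into an optional header field, and
# the server-side handler restores it so the audit logger can record it.
import threading

_context = threading.local()

def set_caller_context(caller_id):
    """Stash the caller id (e.g. a Hive query id) in a thread-local."""
    _context.caller_id = caller_id

def get_caller_context():
    return getattr(_context, "caller_id", None)

def build_rpc_header():
    """Client side: attach the context to the RPC header if one is set."""
    header = {}
    if get_caller_context() is not None:
        header["callerContext"] = get_caller_context()
    return header

def handle_rpc(header, cmd, src):
    """Server side: restore the context into the handler's thread-local,
    then emit an audit record that includes it."""
    set_caller_context(header.get("callerContext"))
    return {"cmd": cmd, "src": src, "callerContext": get_caller_context()}
```

In a real system the client and handler run in different processes and 
threads; the sketch collapses both sides into one thread purely to show the 
hand-off through the header.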



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
