[ 
https://issues.apache.org/jira/browse/HDFS-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725292#comment-13725292
 ] 

Daryn Sharp commented on HDFS-4680:
-----------------------------------

For the record: this is an offline compromise on decoding the identifier.  I 
don't like this approach and find it only marginally useful.  I like the 
concept of this feature for chargeback, but I don't want a feeble dependency 
on the job conf and history files, or even something tailored to MR.

Other dislikes: it's not a forward mapping, and it's not a descriptive string, 
e.g. a jobId for MR or a query id for Hive.  If the NN is being spammed/abused, 
I want to immediately know the responsible job so it can be killed.  
Traceability with this patch requires joining the audit logs with the job 
history files after the job completes.

A "lying" client will be able to abuse any solution.  This approach allows any 
task or sub-job to strip out or alter the tracking info in the job conf and 
break the traceability.  Embedding a tracking id in the token identifier 
provides only one point at which to lie: initial token acquisition during 
submission.  If you are using Oozie to control job submission, then you can 
"trust" that Oozie isn't going to lie, and jobs are incapable of lying.

That said, comments on the patch:
# _Make it a configurable option defaulting to off_.  I don't want the penalty 
for something of little to no value to us.
# {{UGI.getCurrentUser}} isn't cheap.  The caller of {{logAuditEvent}} has the 
UGI, but it's passing just the username.  Pass the actual UGI down instead to 
avoid an unnecessary lookup.
# Don't look for a token if the user isn't authenticated with a token.
# There should be one and only one token identifier in the UGI, but I suppose 
paranoia is good.  However, it should look specifically for a 
{{DelegationTokenIdentifier}}, not the abstract class.
# It's costly to compute the md5sum for every single client connection.  Store 
it in the {{DelegationTokenInformation}} when the token is created and query 
the dtsm during logging.
# I'd prefer it be called something generic like "trackingId" so it can be 
reused when we actually make it a useful forward mapping.

Completely untested code that illustrates some of the above points:
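{code}
// someConfValue: placeholder for the new config option (default off).
if (someConfValue &&
    ugi.getAuthenticationMethod() == AuthenticationMethod.TOKEN) {
  // Only token-authenticated users can have a delegation token to report.
  for (TokenIdentifier tokenId : ugi.getTokenIdentifiers()) {
    if (tokenId instanceof DelegationTokenIdentifier) {
      // Ask the dtsm for the precomputed id instead of hashing here.
      sb.append("\ttrackingId=")
        .append(dtsm.getTrackingId((DelegationTokenIdentifier) tokenId));
      break;
    }
  }
}
{code}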
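
The md5sum point (compute once at token creation, look it up during logging) 
could look roughly like the standalone sketch below.  It has no Hadoop 
dependencies, and everything in it is an assumption made for illustration: 
{{TrackingIdCache}}, {{register}}, and keying by a token sequence number are 
hypothetical stand-ins for the dtsm and {{DelegationTokenInformation}}, not the 
actual API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a secret manager that stores a tracking id per token. */
class TrackingIdCache {
  // Token sequence number -> tracking id, filled in when the token is issued.
  private final Map<Integer, String> trackingIds = new ConcurrentHashMap<>();

  /** Called once at token creation, so the hash is not paid for on every RPC. */
  String register(int sequenceNumber, byte[] identifierBytes) {
    String id = md5Hex(identifierBytes);
    trackingIds.put(sequenceNumber, id);
    return id;
  }

  /** Cheap lookup for the audit-logging path; returns null for an unknown token. */
  String getTrackingId(int sequenceNumber) {
    return trackingIds.get(sequenceNumber);
  }

  /** Hex-encoded MD5 digest of the given bytes. */
  static String md5Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(data)) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // MD5 is a required algorithm on every Java platform.
      throw new IllegalStateException("MD5 unavailable", e);
    }
  }

  public static void main(String[] args) {
    TrackingIdCache cache = new TrackingIdCache();
    // The identifier bytes here are made up; a real token would be serialized.
    cache.register(1, "owner=alice,seq=1".getBytes(StandardCharsets.UTF_8));

    // Audit-log path: a map lookup only, no hashing.
    StringBuilder sb = new StringBuilder("cmd=open");
    sb.append("\ttrackingId=").append(cache.getTrackingId(1));
    System.out.println(sb);
  }
}
```

The trade-off is a small amount of extra memory per live token in exchange for 
keeping the hash off the per-operation logging path.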
                
> Audit logging of delegation tokens for MR tracing
> -------------------------------------------------
>
>                 Key: HDFS-4680
>                 URL: https://issues.apache.org/jira/browse/HDFS-4680
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode, security
>    Affects Versions: 2.0.3-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-4680-1.patch
>
>
> HDFS audit logging tracks HDFS operations made by different users, e.g. 
> creation and deletion of files. This is useful for after-the-fact root cause 
> analysis and security. However, logging merely the username is insufficient 
> for many use cases. For instance, it is common for a single user to run 
> multiple MapReduce jobs (I believe this is the case with Hive). In this 
> scenario, given a particular audit log entry, it is difficult to trace it 
> back to the MR job or task that generated that entry.
> I see a number of potential options for implementing this.
> 1. Make an optional "client name" field part of the NN RPC format. We already 
> pass a {{clientName}} as a parameter in many RPC calls, so this would 
> essentially make it standardized. MR tasks could then set this field to the 
> job and task ID.
> 2. This could be generalized to a set of optional key-value *tags* in the NN 
> RPC format, which would then be audit logged. This has standalone benefits 
> outside of just verifying MR task ids.
> 3. Neither of the above two options actually securely verifies that MR 
> clients are who they claim to be. Doing this securely requires the JobTracker 
> to sign MR task attempts and the NN to verify this signature. However, 
> this is substantially more work, and could be built on after idea #2.
> Thoughts welcomed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
