Yes, it is a string. Jira SPARK-15857 <https://issues.apache.org/jira/browse/SPARK-15857> has been created.
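For reference, a minimal sketch (in Scala) of how the string could be set. The helper name setCallerContext is only illustrative; looking the Hadoop class up reflectively is one way to address Reynold's concern, since the call degrades to a silent no-op outside of Hadoop, or on Hadoop versions older than 2.8.0 where org.apache.hadoop.ipc.CallerContext (added by HDFS-9184) is not on the classpath:

    import scala.util.Try

    // Sketch only: set Hadoop's caller context (a plain string) via reflection.
    // org.apache.hadoop.ipc.CallerContext and its Builder exist in Hadoop
    // 2.8.0+ (HDFS-9184); if the class is absent, Try swallows the
    // ClassNotFoundException and this simply returns false.
    def setCallerContext(context: String): Boolean = Try {
      val contextCls = Class.forName("org.apache.hadoop.ipc.CallerContext")
      val builderCls = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder")
      // Equivalent to: new CallerContext.Builder(context).build()
      val builder = builderCls.getConstructor(classOf[String]).newInstance(context)
      val callerContext = builderCls.getMethod("build").invoke(builder)
      // Equivalent to: CallerContext.setCurrent(...) -- a static method, hence null
      contextCls.getMethod("setCurrent", contextCls).invoke(null, callerContext)
    }.isSuccess

Spark would then build context strings like the ones in the quoted audit log below, e.g. setCallerContext("SparkKMeans application_1464728991691_0009").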
Thanks,
WQ

On Thu, Jun 9, 2016 at 4:40 PM, Reynold Xin <r...@databricks.com> wrote:

> Is this just to set some string? That makes sense. One thing you would
> need to make sure is that Spark should still work outside of Hadoop, and
> also in older versions of Hadoop.
>
> On Thu, Jun 9, 2016 at 4:37 PM, Weiqing Yang <yangweiqing...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Hadoop has implemented a log-tracing feature, the caller context (Jira:
>> HDFS-9184 <https://issues.apache.org/jira/browse/HDFS-9184> and YARN-4349
>> <https://issues.apache.org/jira/browse/YARN-4349>). The motivation is to
>> better diagnose and understand how specific applications impact parts of
>> the Hadoop system and what potential problems they may be creating (e.g.
>> overloading the NN). As mentioned in HDFS-9184
>> <https://issues.apache.org/jira/browse/HDFS-9184>, for a given HDFS
>> operation, it is very helpful to track which upper-level job issued it.
>> The upper-level callers may be specific Oozie tasks, MR jobs, Hive
>> queries, or Spark jobs.
>>
>> Hadoop ecosystem projects such as MapReduce, Tez (TEZ-2851
>> <https://issues.apache.org/jira/browse/TEZ-2851>), Hive (HIVE-12249
>> <https://issues.apache.org/jira/browse/HIVE-12249>, HIVE-12254
>> <https://issues.apache.org/jira/browse/HIVE-12254>), and Pig (PIG-4714
>> <https://issues.apache.org/jira/browse/PIG-4714>) have implemented their
>> own caller contexts. Those systems invoke the HDFS client API and the
>> Yarn client API to set up the caller context, and they also expose an
>> API for passing a caller context in.
>>
>> Lots of Spark applications run on Yarn/HDFS. Spark can likewise
>> implement its caller context by invoking the HDFS/Yarn API, and also
>> expose an API so that its upstream applications can set up their own
>> caller contexts. In the end, the Spark caller context written into the
>> Yarn log / HDFS log can be associated with the task id, stage id, job
>> id, and app id. That also makes it much easier for Spark users to
>> identify tasks, especially if Spark supports a multi-tenant environment
>> in the future.
>>
>> e.g. Run SparkKMeans on Spark.
>>
>> In the HDFS log:
>> …
>> 2016-05-25 15:36:23,748 INFO FSNamesystem.audit: allowed=true
>> ugi=yang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo
>> src=/data/mllib/kmeans_data.txt/_spark_metadata dst=null
>> perm=null proto=rpc callerContext=SparkKMeans
>> application_1464728991691_0009 running on Spark
>>
>> 2016-05-25 15:36:27,893 INFO FSNamesystem.audit: allowed=true
>> ugi=yang (auth:SIMPLE) ip=/127.0.0.1 cmd=open
>> src=/data/mllib/kmeans_data.txt dst=null perm=null
>> proto=rpc
>> callerContext=JobID_0_stageID_0_stageAttemptId_0_taskID_0_attemptNumber_0 on
>> Spark
>> …
>>
>> "application_1464728991691_0009" is the application id.
>>
>> I have code that works with the Spark master branch. I am going to
>> create a Jira. Please feel free to let me know if you have any concerns
>> or comments.
>>
>> Thanks,
>> Qing
>>
>