Yes, it is a string. Jira SPARK-15857 <https://issues.apache.org/jira/browse/SPARK-15857> has been created.
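For reference, a minimal sketch (in Scala) of how the string could be set. The helper name setCallerContext is only illustrative; looking the Hadoop class up reflectively is one way to address Reynold's concern, since the call degrades to a silent no-op outside of Hadoop, or on Hadoop versions older than 2.8.0 where org.apache.hadoop.ipc.CallerContext (added by HDFS-9184) is not on the classpath:

    import scala.util.Try

    // Sketch only: set Hadoop's caller context (a plain string) via reflection.
    // org.apache.hadoop.ipc.CallerContext and its Builder exist in Hadoop
    // 2.8.0+ (HDFS-9184); if the class is absent, Try swallows the
    // ClassNotFoundException and this simply returns false.
    def setCallerContext(context: String): Boolean = Try {
      val contextCls = Class.forName("org.apache.hadoop.ipc.CallerContext")
      val builderCls = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder")
      // Equivalent to: new CallerContext.Builder(context).build()
      val builder = builderCls.getConstructor(classOf[String]).newInstance(context)
      val callerContext = builderCls.getMethod("build").invoke(builder)
      // Equivalent to: CallerContext.setCurrent(...) -- a static method, hence null
      contextCls.getMethod("setCurrent", contextCls).invoke(null, callerContext)
    }.isSuccess

Spark would then build context strings like the ones in the quoted audit log below, e.g. setCallerContext("SparkKMeans application_1464728991691_0009").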
Thanks,
WQ

On Thu, Jun 9, 2016 at 4:40 PM, Reynold Xin <r...@databricks.com> wrote:

> Is this just to set some string? That makes sense. One thing you would
> need to make sure is that Spark should still work outside of Hadoop, and
> also in older versions of Hadoop.
>
> On Thu, Jun 9, 2016 at 4:37 PM, Weiqing Yang <yangweiqing...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Hadoop has implemented a log-tracing feature, the caller context (Jira:
>> HDFS-9184 <https://issues.apache.org/jira/browse/HDFS-9184> and YARN-4349
>> <https://issues.apache.org/jira/browse/YARN-4349>). The motivation is to
>> better diagnose and understand how specific applications impact parts of
>> the Hadoop system and what potential problems they may be creating (e.g.
>> overloading the NN). As mentioned in HDFS-9184
>> <https://issues.apache.org/jira/browse/HDFS-9184>, for a given HDFS
>> operation, it is very helpful to track which upper-level job issued it.
>> The upper-level callers may be specific Oozie tasks, MR jobs, Hive
>> queries, or Spark jobs.
>>
>> Hadoop ecosystem projects such as MapReduce, Tez (TEZ-2851
>> <https://issues.apache.org/jira/browse/TEZ-2851>), Hive (HIVE-12249
>> <https://issues.apache.org/jira/browse/HIVE-12249>, HIVE-12254
>> <https://issues.apache.org/jira/browse/HIVE-12254>), and Pig (PIG-4714
>> <https://issues.apache.org/jira/browse/PIG-4714>) have implemented their
>> own caller contexts. Those systems invoke the HDFS client API and the
>> Yarn client API to set up the caller context, and they also expose an
>> API for passing a caller context in.
>>
>> Lots of Spark applications run on Yarn/HDFS. Spark can likewise
>> implement its caller context by invoking the HDFS/Yarn API, and also
>> expose an API so that its upstream applications can set up their own
>> caller contexts. In the end, the Spark caller context written into the
>> Yarn log / HDFS log can be associated with the task id, stage id, job
>> id, and app id. That also makes it much easier for Spark users to
>> identify tasks, especially if Spark supports a multi-tenant environment
>> in the future.
>>
>> e.g. Run SparkKMeans on Spark.
>>
>> In the HDFS log:
>> …
>> 2016-05-25 15:36:23,748 INFO FSNamesystem.audit: allowed=true
>> ugi=yang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo
>> src=/data/mllib/kmeans_data.txt/_spark_metadata dst=null
>> perm=null proto=rpc callerContext=SparkKMeans
>> application_1464728991691_0009 running on Spark
>>
>> 2016-05-25 15:36:27,893 INFO FSNamesystem.audit: allowed=true
>> ugi=yang (auth:SIMPLE) ip=/127.0.0.1 cmd=open
>> src=/data/mllib/kmeans_data.txt dst=null perm=null
>> proto=rpc
>> callerContext=JobID_0_stageID_0_stageAttemptId_0_taskID_0_attemptNumber_0 on
>> Spark
>> …
>>
>> "application_1464728991691_0009" is the application id.
>>
>> I have code that works with the Spark master branch. I am going to
>> create a Jira. Please feel free to let me know if you have any concerns
>> or comments.
>>
>> Thanks,
>> Qing
>>
>