Re: Add Caller Context in Spark

2016-06-09 Thread Weiqing Yang
Yes, it is a string. JIRA SPARK-15857 has been created.

Thanks,
WQ

On Thu, Jun 9, 2016 at 4:40 PM, Reynold Xin  wrote:

> Is this just to set some string? That makes sense. One thing you would
> need to make sure of is that Spark still works outside of Hadoop, and also
> with older versions of Hadoop.
>
> On Thu, Jun 9, 2016 at 4:37 PM, Weiqing Yang 
> wrote:
>
>> Hi,
>>
>> Hadoop has implemented a log-tracing feature called caller context (JIRAs:
>> HDFS-9184 and YARN-4349). The motivation is to better diagnose and
>> understand how specific applications impact parts of the Hadoop system and
>> what potential problems they may be creating (e.g. overloading the NN). As
>> mentioned in HDFS-9184, for a given HDFS operation, it is very helpful to
>> track which upper-level job issued it. The upper-level callers may be
>> specific Oozie tasks, MR jobs, Hive queries, or Spark jobs.
>>
>> Hadoop ecosystem projects like MapReduce, Tez (TEZ-2851), Hive (HIVE-12249,
>> HIVE-12254) and Pig (PIG-4714) have implemented their caller contexts.
>> Those systems invoke the HDFS client API and the YARN client API to set up
>> the caller context, and also expose an API so callers can pass a caller
>> context in.
>>
>> Lots of Spark applications run on YARN/HDFS. Spark can also implement its
>> caller context by invoking the HDFS/YARN APIs, and expose an API so its
>> upstream applications can set up their own caller contexts. In the end, the
>> Spark caller context written into the YARN/HDFS logs can be associated with
>> the task id, stage id, job id and application id. That also makes it much
>> easier for Spark users to identify tasks, especially if Spark supports
>> multi-tenant environments in the future.
>>
>> For example, run SparkKMeans on Spark.
>>
>> In the HDFS audit log:
>> …
>> 2016-05-25 15:36:23,748 INFO FSNamesystem.audit: allowed=true
>> ugi=yang (auth:SIMPLE)  ip=/127.0.0.1  cmd=getfileinfo
>> src=/data/mllib/kmeans_data.txt/_spark_metadata  dst=null  perm=null
>> proto=rpc  callerContext=SparkKMeans application_1464728991691_0009
>> running on Spark
>>
>> 2016-05-25 15:36:27,893 INFO FSNamesystem.audit: allowed=true
>> ugi=yang (auth:SIMPLE)  ip=/127.0.0.1  cmd=open
>> src=/data/mllib/kmeans_data.txt  dst=null  perm=null  proto=rpc
>> callerContext=JobID_0_stageID_0_stageAttemptId_0_taskID_0_attemptNumber_0
>> on Spark
>> …
>>
>> “application_1464728991691_0009” is the application id.
>>
>> I have code that works with the Spark master branch. I am going to create
>> a JIRA. Please feel free to let me know if you have any concerns or
>> comments.
>>
>> Thanks,
>> Qing
>>
>
>


Add Caller Context in Spark

2016-06-09 Thread Weiqing Yang
Hi,

Hadoop has implemented a log-tracing feature called caller context (JIRAs:
HDFS-9184 and YARN-4349). The motivation is to better diagnose and
understand how specific applications impact parts of the Hadoop system and
what potential problems they may be creating (e.g. overloading the NN). As
mentioned in HDFS-9184, for a given HDFS operation, it is very helpful to
track which upper-level job issued it. The upper-level callers may be
specific Oozie tasks, MR jobs, Hive queries, or Spark jobs.

Hadoop ecosystem projects like MapReduce, Tez (TEZ-2851), Hive (HIVE-12249,
HIVE-12254) and Pig (PIG-4714) have implemented their caller contexts.
Those systems invoke the HDFS client API and the YARN client API to set up
the caller context, and also expose an API so callers can pass a caller
context in.

Lots of Spark applications run on YARN/HDFS. Spark can also implement its
caller context by invoking the HDFS/YARN APIs, and expose an API so its
upstream applications can set up their own caller contexts. In the end, the
Spark caller context written into the YARN/HDFS logs can be associated with
the task id, stage id, job id and application id. That also makes it much
easier for Spark users to identify tasks, especially if Spark supports
multi-tenant environments in the future.
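
A rough sketch of what this could look like, using reflection so that Spark
keeps working outside of Hadoop and against Hadoop versions that do not ship
org.apache.hadoop.ipc.CallerContext (the wrapper object below is just for
illustration, not the final API):

// Sketch only: set the Hadoop caller context if the class is available.
// Falls back to a no-op on older Hadoop or non-Hadoop deployments.
object CallerContextSketch {
  def setCurrentContext(context: String): Unit = {
    try {
      val contextClass = Class.forName("org.apache.hadoop.ipc.CallerContext")
      val builderClass = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder")
      // new CallerContext.Builder(context).build()
      val builder = builderClass.getConstructor(classOf[String]).newInstance(context)
      val callerContext = builderClass.getMethod("build").invoke(builder)
      // CallerContext.setCurrent(callerContext)
      contextClass.getMethod("setCurrent", contextClass)
        .invoke(null, callerContext.asInstanceOf[AnyRef])
    } catch {
      case _: ClassNotFoundException | _: NoSuchMethodException =>
        // Caller context not supported by this Hadoop version: skip silently.
    }
  }
}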

For example, run SparkKMeans on Spark.

In the HDFS audit log:
…
2016-05-25 15:36:23,748 INFO FSNamesystem.audit: allowed=true
ugi=yang (auth:SIMPLE)  ip=/127.0.0.1  cmd=getfileinfo
src=/data/mllib/kmeans_data.txt/_spark_metadata  dst=null  perm=null
proto=rpc  callerContext=SparkKMeans application_1464728991691_0009
running on Spark

2016-05-25 15:36:27,893 INFO FSNamesystem.audit: allowed=true
ugi=yang (auth:SIMPLE)  ip=/127.0.0.1  cmd=open
src=/data/mllib/kmeans_data.txt  dst=null  perm=null  proto=rpc
callerContext=JobID_0_stageID_0_stageAttemptId_0_taskID_0_attemptNumber_0
on Spark
…

“application_1464728991691_0009” is the application id.
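
The second audit entry's callerContext encodes the task's identifiers. A
sketch of how such a string could be composed (the case class and helper are
hypothetical; in real code the ids would come from the scheduler /
TaskContext):

// Illustrative only: build a caller-context string in the format shown in
// the audit log above.
case class TaskIds(jobId: Int, stageId: Int, stageAttemptId: Int,
                   taskId: Long, attemptNumber: Int)

def taskCallerContext(ids: TaskIds): String =
  s"JobID_${ids.jobId}_stageID_${ids.stageId}" +
    s"_stageAttemptId_${ids.stageAttemptId}" +
    s"_taskID_${ids.taskId}_attemptNumber_${ids.attemptNumber}"

// taskCallerContext(TaskIds(0, 0, 0, 0L, 0)) ==
//   "JobID_0_stageID_0_stageAttemptId_0_taskID_0_attemptNumber_0"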

I have code that works with the Spark master branch. I am going to create a
JIRA. Please feel free to let me know if you have any concerns or comments.

Thanks,
Qing


Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I'd do some searching and see if there is a JIRA related to this problem on
S3, and if you don't find one, go ahead and make one. Even if it is an
intrinsic problem with S3 (and I'm not super sure, since I'm just reading
this on mobile), it would maybe be a good thing for us to document.
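
For reference, a minimal sketch of the spark.speculation setting Sunil
mentions below (standard config key; the app name is illustrative, and the
same flag can be passed as --conf spark.speculation=false to spark-submit):

// Sketch: disabling speculative execution via the standard
// spark.speculation flag.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("no-speculation-example")  // illustrative app name
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)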

On Thursday, June 9, 2016, Sunil Kumar  wrote:

> Holden
> Thanks for your prompt reply... Any suggestions on the next step? Does
> this call for a new Spark JIRA ticket, or is this an issue for S3?
> Thx
>
>
> I think your error could possibly be different: looking at the original
> JIRA, the issue was happening on HDFS, and you seem to be experiencing the
> issue on s3n. While I don't have a full view of the problem, I could see
> this being S3-specific (read-after-write on S3 is trickier than
> read-after-write on HDFS).
>
> On Thursday, June 9, 2016, Sunil Kumar 
> wrote:
>
>> Hi,
>>
>> I am running into SPARK-2984 while running my Spark 1.6.1 jobs over YARN
>> in AWS. I have tried spark.speculation=false but still see the same
>> failure with the _temporary file missing for task_xxx... This ticket is in
>> a resolved state. How can it be reopened? Is there a workaround?
>>
>> thanks
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I think your error could possibly be different: looking at the original
JIRA, the issue was happening on HDFS, and you seem to be experiencing the
issue on s3n. While I don't have a full view of the problem, I could see
this being S3-specific (read-after-write on S3 is trickier than
read-after-write on HDFS).

On Thursday, June 9, 2016, Sunil Kumar 
wrote:

> Hi,
>
> I am running into SPARK-2984 while running my Spark 1.6.1 jobs over YARN
> in AWS. I have tried spark.speculation=false but still see the same
> failure with the _temporary file missing for task_xxx... This ticket is in
> a resolved state. How can it be reopened? Is there a workaround?
>
> thanks
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Strange exception while reading Parquet files

2016-06-09 Thread Rostyslav Sotnychenko
Hello!

I have run into a very strange exception (stack trace at the end of this
email) while trying to read a Parquet file using HiveContext from Spark
1.3.1 with Hive 0.13.

This issue appears only on YARN (standalone and local work fine) and only
when HiveContext is used (with SQLContext everything works fine).
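
A minimal sketch of the two read paths being compared (Spark 1.3.x API; the
file path below is hypothetical):

// Sketch of the comparison: the SQLContext read works on YARN, while the
// HiveContext read hits the EOFException in the stack trace below.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object ParquetReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetReadSketch"))
    val path = "/tmp/some_table.parquet"  // hypothetical path

    val sqlContext = new SQLContext(sc)
    println(sqlContext.parquetFile(path).count())   // fine on YARN

    val hiveContext = new HiveContext(sc)
    println(hiveContext.parquetFile(path).count())  // fails on YARN
  }
}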

After researching for a whole week, I was able to find only two mentions of
this problem: one is an unanswered email to the spark-user mailing list, the
other is an issue on some company's JIRA.

The only workaround I currently have is upgrading to Spark 1.4.1, but that
isn't a real solution.


Does anyone know how to deal with it?


Thanks in advance,
Rostyslav Sotnychenko



-- STACK TRACE 

16/06/10 00:13:58 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times;
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage
2.0 (TID 14, 2-op.cluster): java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at parquet.hadoop.ParquetInputSplit.readArray(ParquetInputSplit.java:240)
at parquet.hadoop.ParquetInputSplit.readUTF8(ParquetInputSplit.java:230)
at parquet.hadoop.ParquetInputSplit.readFields(ParquetInputSplit.java:197)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
at
org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1138)
at
org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)


Re: Spark 2.0.0 preview docs uploaded

2016-06-09 Thread Sean Owen
Available but mostly as JIRA output:
https://spark.apache.org/news/spark-2.0.0-preview.html

On Thu, Jun 9, 2016 at 7:33 AM, Pete Robbins  wrote:
> It would be nice to have a "what's new in 2.0.0" equivalent to
> https://spark.apache.org/releases/spark-release-1-6-0.html available or am I
> just missing it?
>
> On Wed, 8 Jun 2016 at 13:15 Sean Owen  wrote:
>>
>> OK, this is done:
>>
>> http://spark.apache.org/documentation.html
>> http://spark.apache.org/docs/2.0.0-preview/
>> http://spark.apache.org/docs/preview/
>>
>> On Tue, Jun 7, 2016 at 4:59 PM, Shivaram Venkataraman
>>  wrote:
>> > As far as I know the process is just to copy docs/_site from the build
>> > to the appropriate location in the SVN repo (i.e.
>> > site/docs/2.0.0-preview).
>> >
>> > Thanks
>> > Shivaram
>> >
>> > On Tue, Jun 7, 2016 at 8:14 AM, Sean Owen  wrote:
>> >> As a stop-gap, I can edit that page to have a small section about
>> >> preview releases and point to the nightly docs.
>> >>
>> >> Not sure who has the power to push 2.0.0-preview to site/docs, but, if
>> >> that's done then we can symlink "preview" in that dir to it and be
>> >> done, and update this section about preview docs accordingly.
>> >>
>>
>>
>




Re: Spark 2.0.0 preview docs uploaded

2016-06-09 Thread Pete Robbins
It would be nice to have a "what's new in 2.0.0" equivalent to
https://spark.apache.org/releases/spark-release-1-6-0.html available or am
I just missing it?

On Wed, 8 Jun 2016 at 13:15 Sean Owen  wrote:

> OK, this is done:
>
> http://spark.apache.org/documentation.html
> http://spark.apache.org/docs/2.0.0-preview/
> http://spark.apache.org/docs/preview/
>
> On Tue, Jun 7, 2016 at 4:59 PM, Shivaram Venkataraman
>  wrote:
> > As far as I know the process is just to copy docs/_site from the build
> > to the appropriate location in the SVN repo (i.e.
> > site/docs/2.0.0-preview).
> >
> > Thanks
> > Shivaram
> >
> > On Tue, Jun 7, 2016 at 8:14 AM, Sean Owen  wrote:
> >> As a stop-gap, I can edit that page to have a small section about
> >> preview releases and point to the nightly docs.
> >>
> >> Not sure who has the power to push 2.0.0-preview to site/docs, but, if
> >> that's done then we can symlink "preview" in that dir to it and be
> >> done, and update this section about preview docs accordingly.
> >>
>
>
>