Re: Spark on Yarn - A small issue !

2014-05-13 Thread Tom Graves
You need to look at the log files for YARN. Generally this can be done with 
"yarn logs -applicationId <app ID>". That only works if you have log 
aggregation enabled, though. You should be able to see at least the application 
master logs through the YARN ResourceManager web UI. I would try that first. 

If that doesn't work you can turn on debug in the nodemanager:

To review per-container launch environment, increase 
yarn.nodemanager.delete.debug-delay-sec to a large value (e.g. 36000), and then 
access the application cache through yarn.nodemanager.local-dirs on the nodes 
on which containers are launched. This directory contains the launch script, 
jars, and all environment variables used for launching each container. This 
process is useful for debugging classpath problems in particular. (Note that 
enabling this requires admin privileges on cluster settings and a restart of 
all node managers. Thus, this is not applicable to hosted clusters).



Tom


On Monday, May 12, 2014 9:38 AM, Sai Prasanna  wrote:
 
Hi All, 

I wanted to launch Spark on YARN in interactive yarn-client mode.

With the default settings of yarn-site.xml and spark-env.sh, I followed the 
given link:
http://spark.apache.org/docs/0.8.1/running-on-yarn.html

I get the correct Pi value when I run without launching the shell.

When I launch the shell with the following command,

SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar \
SPARK_YARN_APP_JAR=examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar \
MASTER=yarn-client ./spark-shell

and try to create RDDs and do some action on them, I get nothing. After some 
time, the tasks fail.

Log file of Spark:
519095 14/05/12 13:30:40 INFO YarnClientClusterScheduler: 
YarnClientClusterScheduler.postStartHook done
519096 14/05/12 13:30:40 INFO BlockManagerMasterActor$BlockManagerInfo: 
Registering block manager s1:38355 with 324.4 MB RAM
519097 14/05/12 13:31:38 INFO MemoryStore: ensureFreeSpace(202584) called with 
curMem=0, maxMem=340147568
519098 14/05/12 13:31:38 INFO MemoryStore: Block broadcast_0 stored as values 
to memory (estimated size 197.8 KB, free 324.2 MB)
519099 14/05/12 13:31:49 INFO FileInputFormat: Total input paths to process : 1
519100 14/05/12 13:31:49 INFO NetworkTopology: Adding a new node: 
/default-rack/192.168.1.100:50010
519101 14/05/12 13:31:49 INFO SparkContext: Starting job: top at :15
519102 14/05/12 13:31:49 INFO DAGScheduler: Got job 0 (top at :15) 
with 4 output partitions (allowLocal=false)
519103 14/05/12 13:31:49 INFO DAGScheduler: Final stage: Stage 0 (top at 
:15)
519104 14/05/12 13:31:49 INFO DAGScheduler: Parents of final stage: List()
519105 14/05/12 13:31:49 INFO DAGScheduler: Missing parents: List()
519106 14/05/12 13:31:49 INFO DAGScheduler: Submitting Stage 0 
(MapPartitionsRDD[2] at top at :15), which has no missing parents
519107 14/05/12 13:31:49 INFO DAGScheduler: Submitting 4 missing tasks from 
Stage 0 (MapPartitionsRDD[2] at top at :15)
519108 14/05/12 13:31:49 INFO YarnClientClusterScheduler: Adding task set 0.0 
with 4 tasks
519109 14/05/12 13:31:49 INFO RackResolver: Resolved s1 to /default-rack
519110 14/05/12 13:31:49 INFO ClusterTaskSetManager: Starting task 0.0:3 as TID 
0 on executor 1: s1 (PROCESS_LOCAL)
519111 14/05/12 13:31:49 INFO ClusterTaskSetManager: Serialized task 0.0:3 as 
1811 bytes in 4 ms
519112 14/05/12 13:31:49 INFO ClusterTaskSetManager: Starting task 0.0:0 as TID 
1 on executor 1: s1 (NODE_LOCAL)
519113 14/05/12 13:31:49 INFO ClusterTaskSetManager: Serialized task 0.0:0 as 
1811 bytes in 1 ms
519114 14/05/12 13:32:18 INFO YarnClientSchedulerBackend: Executor 1 
disconnected, so removing it
519115 14/05/12 13:32:18 ERROR YarnClientClusterScheduler: Lost executor 1 on 
s1: remote Akka client shutdown
519116 14/05/12 13:32:18 INFO ClusterTaskSetManager: Re-queueing tasks for 1 
from TaskSet 0.0
519117 14/05/12 13:32:18 WARN ClusterTaskSetManager: Lost TID 1 (task 0.0:0)
519118 14/05/12 13:32:18 WARN ClusterTaskSetManager: Lost TID 0 (task 0.0:3)
519119 14/05/12 13:32:18 INFO DAGScheduler: Executor lost: 1 (epoch 0)
519120 14/05/12 13:32:18 INFO BlockManagerMasterActor: Trying to remove 
executor 1 from BlockManagerMaster.
519121 14/05/12 13:32:18 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor


Do I need to set any other environment variable specifically for Spark on 
YARN? What could be the issue?


Can anyone please help me in this regard.

Thanks in Advance !!

Re: streaming on hdfs can detect all new files, but the sum of all the rdd.count() does not equal what was detected

2014-05-13 Thread zzzzzqf12345
Thanks for the reply~~

I have solved the problem and found the reason: I was using the Master
node to upload files to HDFS, and this can take up a lot of the Master's
network resources. When I changed to uploading the files from another
computer that is not part of the cluster, I got the correct result.

QingFeng



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-on-hdfs-can-detected-all-new-file-but-the-sum-of-all-the-rdd-count-not-equals-which-had-ded-tp5572p5635.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


1.0.0 Release Date?

2014-05-13 Thread bhusted
Can anyone comment on the anticipated date, or a worst-case timeframe, for
when Spark 1.0.0 will be released?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/1-0-0-Release-Date-tp5664.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


How to use Mahout VectorWritable in Spark.

2014-05-13 Thread Stuti Awasthi
Hi All,
I am very new to Spark and trying to play around with MLlib, hence apologies 
for the basic question.

I am trying to run the KMeans algorithm using Mahout and Spark MLlib to 
compare the performance. The initial data size was 10 GB. Mahout converts the 
data into a Sequence File <Text, VectorWritable>, which is used for KMeans 
clustering. The Sequence File created was ~6 GB in size.

Now I wanted to know if I can use the Mahout Sequence File in Spark MLlib for 
KMeans. I have read that SparkContext.sequenceFile may be used here. Hence I 
tried to read my Sequence File as below, but I am getting this error:

Command on the Spark shell:
scala> val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-0",String,VectorWritable)
<console>:12: error: not found: type VectorWritable
       val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-0",String,VectorWritable)
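
For reference, sc.sequenceFile takes the Hadoop Writable classes as runtime
Class arguments via classOf rather than as bare type names, and the Mahout
jar must be on the classpath. A minimal sketch of a call of that shape
(assuming mahout-math is available; the map step is illustrative) would be:

import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable

// pass the key/value Writable classes with classOf, then convert the
// Writables into plain serializable types right away
val data = sc.sequenceFile("/KMeans_dataset_seq/part-r-0",
  classOf[Text], classOf[VectorWritable])
val pairs = data.map { case (k, v) => (k.toString, v.get) }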

Here I have 2 questions:
1. Mahout has "Text" as the key, but Spark was printing "not found: type Text", 
hence I changed it to String. Is this correct?
2. How will VectorWritable be found in Spark? Do I need to include the Mahout 
jar in the classpath, or is there some other option?

Please Suggest

Regards
Stuti Awasthi






Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Chanwit Kaewkasi
Great to know that! Thank you, Matei.

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Tue, May 13, 2014 at 2:14 AM, Matei Zaharia  wrote:
> That API is something the HDFS administrator uses outside of any application 
> to tell HDFS to cache certain files or directories. But once you've done 
> that, any existing HDFS client accesses them directly from the cache.
>
> Matei
>
> On May 12, 2014, at 11:10 AM, Marcelo Vanzin  wrote:
>
>> Is that true? I believe that API Chanwit is talking about requires
>> explicitly asking for files to be cached in HDFS.
>>
>> Spark automatically benefits from the kernel's page cache (i.e. if
>> some block is in the kernel's page cache, it will be read more
>> quickly). But the explicit HDFS cache is a different thing; Spark
>> applications that want to use it would have to explicitly call the
>> respective HDFS APIs.
>>
>> On Sun, May 11, 2014 at 11:04 PM, Matei Zaharia  
>> wrote:
>>> Yes, Spark goes through the standard HDFS client and will automatically 
>>> benefit from this.
>>>
>>> Matei
>>>
>>> On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi  wrote:
>>>
 Hi all,

 Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
 sc.textFile() and other HDFS-related APIs?

 http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit
>>>
>>
>>
>>
>> --
>> Marcelo
>


Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Mayur Rustagi
Count causes the overall performance to drop drastically. In fact, beyond 50
files it starts to hang if I force materialization.
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Tue, May 13, 2014 at 9:34 PM, Xiangrui Meng  wrote:

> After checkPoint, call count directly to materialize it. -Xiangrui
>
> On Tue, May 13, 2014 at 4:20 AM, Mayur Rustagi 
> wrote:
> > We are running into same issue. After 700 or so files the stack
> overflows,
> > cache, persist & checkpointing dont help.
> > Basically checkpointing only saves the RDD when it is materialized & it
> only
> > materializes in the end, then it runs out of stack.
> >
> > Regards
> > Mayur
> >
> > Mayur Rustagi
> > Ph: +1 (760) 203 3257
> > http://www.sigmoidanalytics.com
> > @mayur_rustagi
> >
> >
> >
> > On Tue, May 13, 2014 at 11:40 AM, Xiangrui Meng 
> wrote:
> >>
> >> You have a long lineage that causes the StackOverflow error. Try
> >> rdd.checkPoint() and rdd.count() for every 20~30 iterations.
> >> checkPoint can cut the lineage. -Xiangrui
> >>
> >> On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan  wrote:
> >> > Dear Sparkers:
> >> >
> >> > I am using Python spark of version 0.9.0 to implement some iterative
> >> > algorithm. I got some errors shown at the end of this email. It seems
> >> > that
> >> > it's due to the Java Stack Overflow error. The same error has been
> >> > duplicated on a mac desktop and a linux workstation, both running the
> >> > same
> >> > version of Spark.
> >> >
> >> > The same line of code works correctly after quite some iterations. At
> >> > the
> >> > line of error, rdd__new.count() could be 0. (In some previous rounds,
> >> > this
> >> > was also 0 without any problem).
> >> >
> >> > Any thoughts on this?
> >> >
> >> > Thank you very much,
> >> > - Guanhua
> >> >
> >> >
> >> > 
> >> > CODE:print "round", round, rdd__new.count()
> >> > 
> >> >   File
> >> >
> >> >
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
> >> > line 542, in count
> >> > 14/05/12 16:20:28 INFO TaskSetManager: Loss was due to
> >> > java.lang.StackOverflowError [duplicate 1]
> >> > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
> >> > 14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times;
> >> > aborting job
> >> >   File
> >> >
> >> >
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
> >> > line 533, in sum
> >> > 14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state
> >> > FAILED
> >> > from TID 1774 because its task set is gone
> >> > return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
> >> >   File
> >> >
> >> >
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
> >> > line 499, in reduce
> >> > vals = self.mapPartitions(func).collect()
> >> >   File
> >> >
> >> >
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
> >> > line 463, in collect
> >> > bytesInJava = self._jrdd.collect().iterator()
> >> >   File
> >> >
> >> >
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 537, in __call__
> >> >   File
> >> >
> >> >
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py",
> >> > line 300, in get_return_value
> >> > py4j.protocol.Py4JJavaError: An error occurred while calling
> >> > o4317.collect.
> >> > : org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1
> >> > times
> >> > (most recent failure: Exception failure: java.lang.StackOverflowError)
> >> > at
> >> >
> >> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> >> > at
> >> >
> >> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
> >> > at
> >> >
> >> >
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >> > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >> > at
> >> >
> >> > org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
> >> > at
> >> >
> >> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> >> > at
> >> >
> >> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> >> > at scala.Option.foreach(Option.scala:236)
> >> > at
> >> >
> >> >
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
> >> > at
> >> >
> >> >
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
> >> > at akka.a

Re: Is there a way to create a SparkContext object?

2014-05-13 Thread Andrew Ash
SparkContext is not serializable, so you can't send it across the cluster
like the rdd.map(t => compute(sc, t._2)) would do.

There is likely a way to express what you're trying to do with an algorithm
that doesn't require serializing SparkContext.  Can you tell us more about
your goals?
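
For example, if compute really only needs some shared data rather than the
context itself, a common restructuring (names here are illustrative) is a
broadcast variable:

// ship the data the computation needs, not the SparkContext
val lookup = sc.broadcast(loadLookupTable())  // loadLookupTable is hypothetical
val result = rdd.map(t => compute(lookup.value, t._2))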

Andrew


On Tue, May 13, 2014 at 2:14 AM, yh18190  wrote:

> Thanks Matei Zaharia. Can I pass it as a parameter as part of a closure?
> For example:
> RDD.map(t => compute(sc, t._2))
>
> Can I use sc inside the map function? Please let me know.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-their-a-way-to-Create-SparkContext-object-tp5612p5647.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Guanhua Yan
Thanks Xiangrui. After some debugging efforts, it turns out that the
problem results from a bug in my code. But it's good to know that a long
lineage could also lead to this problem. I will also try checkpointing to
see whether the performance can be improved.

Best regards,
- Guanhua

On 5/13/14 12:10 AM, "Xiangrui Meng"  wrote:

>You have a long lineage that causes the StackOverflow error. Try
>rdd.checkPoint() and rdd.count() for every 20~30 iterations.
>checkPoint can cut the lineage. -Xiangrui
>
>On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan  wrote:
>> Dear Sparkers:
>>
>> I am using Python spark of version 0.9.0 to implement some iterative
>> algorithm. I got some errors shown at the end of this email. It seems
>>that
>> it's due to the Java Stack Overflow error. The same error has been
>> duplicated on a mac desktop and a linux workstation, both running the
>>same
>> version of Spark.
>>
>> The same line of code works correctly after quite some iterations. At
>>the
>> line of error, rdd__new.count() could be 0. (In some previous rounds,
>>this
>> was also 0 without any problem).
>>
>> Any thoughts on this?
>>
>> Thank you very much,
>> - Guanhua
>>
>>
>> 
>> CODE:print "round", round, rdd__new.count()
>> 
>>   File
>> 
>>"/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/
>>rdd.py",
>> line 542, in count
>> 14/05/12 16:20:28 INFO TaskSetManager: Loss was due to
>> java.lang.StackOverflowError [duplicate 1]
>> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>> 14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times;
>> aborting job
>>   File
>> 
>>"/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/
>>rdd.py",
>> line 533, in sum
>> 14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state
>>FAILED
>> from TID 1774 because its task set is gone
>> return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
>>   File
>> 
>>"/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/
>>rdd.py",
>> line 499, in reduce
>> vals = self.mapPartitions(func).collect()
>>   File
>> 
>>"/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/
>>rdd.py",
>> line 463, in collect
>> bytesInJava = self._jrdd.collect().iterator()
>>   File
>> 
>>"/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j
>>-0.8.1-src.zip/py4j/java_gateway.py",
>> line 537, in __call__
>>   File
>> 
>>"/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j
>>-0.8.1-src.zip/py4j/protocol.py",
>> line 300, in get_return_value
>> py4j.protocol.Py4JJavaError: An error occurred while calling
>>o4317.collect.
>> : org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1
>>times
>> (most recent failure: Exception failure: java.lang.StackOverflowError)
>> at
>> 
>>org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$schedul
>>er$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>> at
>> 
>>org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$schedul
>>er$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
>> at
>> 
>>scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scal
>>a:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> 
>>org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGSch
>>eduler$$abortStage(DAGScheduler.scala:1026)
>> at
>> 
>>org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DA
>>GScheduler.scala:619)
>> at
>> 
>>org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DA
>>GScheduler.scala:619)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> 
>>org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:6
>>19)
>> at
>> 
>>org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun
>>$receive$1.applyOrElse(DAGScheduler.scala:207)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> at
>> 
>>akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(Abstract
>>Dispatcher.scala:386)
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> 
>>scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.jav
>>a:1339)
>> at 
>>scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at
>> 
>>scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.j
>>ava:107)
>>
>> ==
>> The stack overflow error is shown as follows:
>> ==
>>
>> 14/05/12 16:20:28 ERROR Executor: Exception in task ID 1774
>> java.lang.StackOverflowError
>> at java.util.zip.Inflater.inflate(Inflater.java:259)
>> a

Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Marcelo Vanzin
On Mon, May 12, 2014 at 12:14 PM, Matei Zaharia  wrote:
> That API is something the HDFS administrator uses outside of any application 
> to tell HDFS to cache certain files or directories. But once you’ve done 
> that, any existing HDFS client accesses them directly from the cache.

Ah, yeah, sure. What I meant is that Spark itself will not, AFAIK, use
that facility for adding files to the cache or anything like that. But
yes, it does benefit from things already cached.
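
For completeness, if an application really did want to manage the cache
itself, it would go through the HDFS 2.3 APIs directly. A hedged sketch (the
path and pool names are made up, the pool must already exist, and the default
filesystem is assumed to be HDFS):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo

// ask the NameNode to pin a path into the centralized cache
val fs = FileSystem.get(new Configuration()).asInstanceOf[DistributedFileSystem]
fs.addCacheDirective(new CacheDirectiveInfo.Builder()
  .setPath(new Path("/user/aa/hot-dataset"))  // made-up path
  .setPool("spark-pool")                      // made-up, pre-existing pool
  .build())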


> On May 12, 2014, at 11:10 AM, Marcelo Vanzin  wrote:
>
>> Is that true? I believe that API Chanwit is talking about requires
>> explicitly asking for files to be cached in HDFS.
>>
>> Spark automatically benefits from the kernel's page cache (i.e. if
>> some block is in the kernel's page cache, it will be read more
>> quickly). But the explicit HDFS cache is a different thing; Spark
>> applications that want to use it would have to explicitly call the
>> respective HDFS APIs.
>>
>> On Sun, May 11, 2014 at 11:04 PM, Matei Zaharia  
>> wrote:
>>> Yes, Spark goes through the standard HDFS client and will automatically 
>>> benefit from this.
>>>
>>> Matei
>>>
>>> On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi  wrote:
>>>
 Hi all,

 Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
 sc.textFile() and other HDFS-related APIs?

 http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit
>>>
>>
>>
>>
>> --
>> Marcelo
>



-- 
Marcelo


Dead lock running multiple Spark Jobs on Mesos

2014-05-13 Thread Martin Weindel
I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos 0.17.0.

If I run a single Spark Job, the job runs fine on Mesos. Running multiple
Spark Jobs also works, if I'm using the coarse-grained mode
("spark.mesos.coarse" = true).
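
For reference, the coarse-grained mode is just a configuration switch, set
along these lines (master URL illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://host:5050")      // illustrative
  .set("spark.mesos.coarse", "true")   // coarse-grained mode
val sc = new SparkContext(conf)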

But if I run two Spark Jobs in parallel using the fine-grained mode, the
jobs seem to block each other after a few seconds.
And the Mesos UI reports no idle but also no used CPUs in this state.

As soon as I kill one job, the other continues normally. See below for some
log output.
Looks to me as if something strange happens with assigning resources to the
two jobs.

Can anybody give me a hint about the cause? The jobs read some HDFS files,
but have no other communication to external processes.
Or do you have any other suggestions on how to analyse this problem?

Thanks,

Martin

-
Here is the relevant log output of job1:

INFO 17:53:09,247 Missing parents for Stage 2: List()
 INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at mapPartitions
at HighTemperatureSpansPerLogfile.java:92), which is now runnable
 INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2
(MapPartitionsRDD[9] at mapPartitions at
HighTemperatureSpansPerLogfile.java:92)
 INFO 17:53:09,269 Adding task set 2.0 with 1 tasks

 
*** at this point the job was killed *** 
 
 
log output of job2:
 INFO 17:53:04,874 Missing parents for Stage 6: List()
 INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at
ComputeLogFileTimespan.java:71), which is now runnable
 INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6 (MappedRDD[23] at
values at ComputeLogFileTimespan.java:71)
 INFO 17:53:04,882 Adding task set 6.0 with 1 tasks

*** at this point the job 1 was killed *** 
INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor
20140501-141732-308511242-5050-2657-1: ustst019-cep-node2.usu.usu.grp
(PROCESS_LOCAL)
 INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
 INFO 18:01:39,328 Asked to send map output locations for shuffle 2 to
sp...@ustst018-cep-node1.usu.usu.grp:40542
 INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dead-lock-running-multiple-Spark-Jobs-on-Mesos-tp5611.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Turn BLAS on MacOSX

2014-05-13 Thread Debasish Das
Hi,

How do I load native BLAS libraries on Mac ?

I am getting the following errors while running LR and SVM with SGD:

14/05/07 10:48:13 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS

14/05/07 10:48:13 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS

On CentOS it was fine, but on Mac I am getting these warnings.

Also, when it fails to load the native BLAS, does it use Java code for BLAS
operations?

Maybe after the addition of Breeze we should add these details on a page as
well, so that users are aware of them before they report any performance
results.

Thanks.

Deb


Re: 1.0.0 Release Date?

2014-05-13 Thread Anurag Tangri
Hi All,
We are also waiting for this. Does anyone know of a tentative date for this
release?

We are at Spark 0.8.0 right now. Should we wait for Spark 1.0 or upgrade to
Spark 0.9.1?


Thanks,
Anurag Tangri



On Tue, May 13, 2014 at 9:40 AM, bhusted  wrote:

> Can anyone comment on the anticipated date, or a worst-case timeframe, for
> when Spark 1.0.0 will be released?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/1-0-0-Release-Date-tp5664.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: same log4j slf4j error in Spark 0.9.1

2014-05-13 Thread Patrick Wendell
Hey Adrian,

If you are including log4j-over-slf4j.jar in your application, you'll
still need to manually exclude slf4j-log4j12.jar from Spark. However,
it should work once you do that. Before 0.9.1 you couldn't make it
work, even if you added an exclude.
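
For example, in sbt the exclude looks something like this (version
illustrative):

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" exclude("org.slf4j", "slf4j-log4j12")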

- Patrick

On Thu, May 8, 2014 at 1:52 PM, Adrian Mocanu  wrote:
> I recall someone from the Spark team (TD?) saying that Spark 0.9.1 will
> change the logger, and the circular loop error between slf4j and log4j
> wouldn't show up.
>
>
>
> Yet on Spark 0.9.1 I still get
>
> SLF4J: Detected both log4j-over-slf4j.jar AND slf4j-log4j12.jar on the class
> path, preempting StackOverflowError.
>
> SLF4J: See also http://www.slf4j.org/codes.html#log4jDelegationLoop for more
> details.
>
>
>
> Any solutions?
>
>
>
> -Adrian


Re: Reading from .bz2 files with Spark

2014-05-13 Thread Xiangrui Meng
Which Hadoop version did you use? I'm not sure whether Hadoop v2 fixes
the problem you described, but it does contain several fixes to the bzip2
format. -Xiangrui

On Wed, May 7, 2014 at 9:19 PM, Andrew Ash  wrote:
> Hi all,
>
> Is anyone reading and writing to .bz2 files stored in HDFS from Spark with
> success?
>
>
> I'm finding the following results on a recent commit (756c96 from 24hr ago)
> and CDH 4.4.0:
>
> Works: val r = sc.textFile("/user/aa/myfile.bz2").count
> Doesn't work: val r = sc.textFile("/user/aa/myfile.bz2").map((s:String) =>
> s+"| " ).count
>
> Specifically, I'm getting an exception coming out of the bzip2 libraries
> (see below stacktraces), which is unusual because I'm able to read from that
> file without an issue using the same libraries via Pig.  It was originally
> created from Pig as well.
>
> Digging a little deeper I found this line in the .bz2 decompressor's javadoc
> for CBZip2InputStream:
>
> "Instances of this class are not threadsafe." [source]
>
>
> My current working theory is that Spark has a much higher level of
> parallelism than Pig/Hadoop does and thus I get these wild IndexOutOfBounds
> exceptions much more frequently (as in can't finish a run over a little 2M
> row file) vs hardly at all in other libraries.
>
> The only other reference I could find to the issue was in presto-users, but
> the recommendation to leave .bz2 for .lzo doesn't help if I actually do want
> the higher compression levels of .bz2.
>
>
> Would love to hear if I have some kind of configuration issue or if there's
> a bug in .bz2 that's fixed in later versions of CDH, or generally any other
> thoughts on the issue.
>
>
> Thanks!
> Andrew
>
>
>
> Below are examples of some exceptions I'm getting:
>
> 14/05/07 15:09:49 WARN scheduler.TaskSetManager: Loss was due to
> java.lang.ArrayIndexOutOfBoundsException
> java.lang.ArrayIndexOutOfBoundsException: 65535
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.hbCreateDecodeTables(CBZip2InputStream.java:663)
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.createHuffmanDecodingTables(CBZip2InputStream.java:790)
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:762)
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:798)
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502)
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397)
> at
> org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426)
> at java.io.InputStream.read(InputStream.java:101)
> at
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
> at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203)
> at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
>
>
>
>
> java.lang.ArrayIndexOutOfBoundsException: 90
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:900)
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502)
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
> at
> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397)
> at
> org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426)
> at java.io.InputStream.read(InputStream.java:101)
> at
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
> at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203)
> at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:35)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> org.apache.spark.rdd.RDD.o

Re: Turn BLAS on MacOSX

2014-05-13 Thread DB Tsai
Hi wxhsdp,

See https://github.com/scalanlp/breeze/issues/142 and
https://github.com/fommil/netlib-java/issues/60 for details.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, May 13, 2014 at 2:17 AM, wxhsdp  wrote:

> Hi, Xiangrui
>
> I compiled OpenBLAS on EC2 m1.large; when Breeze calls the native lib, an
> error occurs:
>
> INFO: successfully loaded
> /mnt2/wxhsdp/libopenblas/lib/libopenblas_nehalemp-r0.2.9.rc2.so
> [error] (run-main-0) java.lang.UnsatisfiedLinkError:
>
> com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets(Ljava/lang/String;Ljava/lang/String;IIID[DII[DIID[DII)V
> java.lang.UnsatisfiedLinkError:
>
> com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets(Ljava/lang/String;Ljava/lang/String;IIID[DII[DIID[DII)V
> at com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets(Native
> Method)
> at
> com.github.fommil.netlib.NativeSystemBLAS.dgemm(NativeSystemBLAS.java:100)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Turn-BLAS-on-MacOSX-tpp5648.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Variables outside of mapPartitions scope

2014-05-13 Thread DB Tsai
Scala's for-loop is not just a loop; it is not a native loop at the bytecode
level. It creates a couple of objects at runtime and performs a truckload of
method calls on them. As a result, if you refer to variables outside the
for-loop, the whole for-loop closure and any variable inside the loop have to
be serializable. Since the for-loop machinery itself is serializable in Scala,
I guess you have something non-serializable inside the for-loop.

The while-loop in Scala is native, so you won't have this issue if you use a
while-loop.
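
A small sketch of the difference (all names are illustrative; the field
access is what drags in the enclosing object):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class Job(sc: SparkContext) {  // holding a SparkContext makes Job non-serializable
  val offset = 10              // field on the enclosing class

  def forVersion(rdd: RDD[Int]): Unit =
    // the loop body becomes a closure; referring to `offset` captures
    // `this`, so the whole Job instance must be serializable
    for (i <- 0 until 3) rdd.map(_ + offset + i).count()

  def whileVersion(rdd: RDD[Int]): Unit = {
    val localOffset = offset   // copy the field into a local first
    var i = 0
    while (i < 3) {            // plain bytecode loop, no extra closure object
      rdd.map(_ + localOffset + i).count()
      i += 1
    }
  }
}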


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, May 9, 2014 at 1:13 PM, pedro  wrote:

> Right now I am not using any class variables (references to this). All my
> variables are created within the scope of the method I am running.
>
> I did more debugging and found this strange behavior.
> variables here
> for loop
> mapPartitions call
> use variables here
> end mapPartitions
> endfor
>
> This will result in a serialization bug, but this won't:
>
> variables here
> for loop
> create new references to variables here
> mapPartitions call
> use new reference variables here
> end mapPartitions
> endfor
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Variables-outside-of-mapPartitions-scope-tp5517p5528.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Flavio Pompermaier
Great work! Thanks!
On May 13, 2014 3:16 AM, "zhen"  wrote:

> Hi Everyone,
>
> I found it quite difficult to find good examples for Spark RDD API calls.
> So
> my student and I decided to go through the entire API and write examples
> for
> the vast majority of API calls (basically examples for anything that is
> remotely interesting). I think these examples may be useful to other people.
> Hence I have put them up on my web site. There is also a pdf version that
> you can download from the web site.
>
> http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
>
> Please let me know if you find any errors in them. Or any better examples
> you would like me to add into it.
>
> Hope you find it useful.
>
> Zhen
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-tp5529.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Caching in graphX

2014-05-13 Thread Franco Avi
Hi, I'm writing this post because I would like to know a good caching approach
for iterative algorithms in GraphX. So far I have not been able to keep the
execution time of each iteration stable. Is there a way to achieve this?
The code I used is this:

var g = ... // my graph
var prevG: Graph[VD, ED] = null

var i = 0
while (i < maxIter) {

  prevG = g

  g = g.foo()
  g = g.foo1()
  g = g.fooN()

  g.cache()
  g.vertices.count + g.edges.count // force materialization

  prevG.edges.unpersist()
  prevG.vertices.unpersist()

  i += 1
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-in-graphX-tp5482.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Gerard Maas
Hi Zhen,

Thanks a lot for sharing. I'm sure it will be useful for new users.

A small note: On the 'checkpoint' explanation:
sc.setCheckpointDir("my_directory_name")
it would be useful to specify that 'my_directory_name' should exist on all
slaves. As an alternative, you could use an HDFS directory URL as well.
I've seen people trip on that a few times.
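
For instance (host and path illustrative):

sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")  // reachable from every worker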

-kr, Gerard.



On Fri, May 9, 2014 at 11:54 PM, zhen  wrote:

> Hi Everyone,
>
> I found it quite difficult to find good examples for Spark RDD API calls.
> So
> my student and I decided to go through the entire API and write examples
> for
> the vast majority of API calls (basically examples for anything that is
> remotely interesting). I think these examples may be useful to other people.
> Hence I have put them up on my web site. There is also a pdf version that
> you can download from the web site.
>
> http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
>
> Please let me know if you find any errors in them. Or any better examples
> you would like me to add into it.
>
> Hope you find it useful.
>
> Zhen
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-tp5529.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Xiangrui Meng
After checkPoint, call count directly to materialize it. -Xiangrui
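
A minimal sketch of that pattern (initialRDD, numIterations and step() are
placeholders, and sc.setCheckpointDir must have been set beforehand):

var rdd = initialRDD
for (i <- 1 to numIterations) {
  rdd = step(rdd)        // one iteration of the algorithm
  if (i % 25 == 0) {     // i.e. every 20~30 iterations
    rdd.checkpoint()     // mark the RDD for checkpointing
    rdd.count()          // materialize it now, which cuts the lineage
  }
}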

On Tue, May 13, 2014 at 4:20 AM, Mayur Rustagi  wrote:
> We are running into same issue. After 700 or so files the stack overflows,
> cache, persist & checkpointing dont help.
> Basically checkpointing only saves the RDD when it is materialized & it only
> materializes in the end, then it runs out of stack.
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi
>
>
>
> On Tue, May 13, 2014 at 11:40 AM, Xiangrui Meng  wrote:
>>
>> You have a long lineage that causes the StackOverflow error. Try
>> rdd.checkPoint() and rdd.count() for every 20~30 iterations.
>> checkPoint can cut the lineage. -Xiangrui
>>
>> On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan  wrote:
>> > Dear Sparkers:
>> >
>> > I am using Python spark of version 0.9.0 to implement some iterative
>> > algorithm. I got some errors shown at the end of this email. It seems
>> > that
>> > it's due to the Java Stack Overflow error. The same error has been
>> > duplicated on a mac desktop and a linux workstation, both running the
>> > same
>> > version of Spark.
>> >
>> > The same line of code works correctly after quite some iterations. At
>> > the
>> > line of error, rdd__new.count() could be 0. (In some previous rounds,
>> > this
>> > was also 0 without any problem).
>> >
>> > Any thoughts on this?
>> >
>> > Thank you very much,
>> > - Guanhua
>> >
>> >
>> > 
>> > CODE:print "round", round, rdd__new.count()
>> > 
>> >   File
>> >
>> > "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
>> > line 542, in count
>> > 14/05/12 16:20:28 INFO TaskSetManager: Loss was due to
>> > java.lang.StackOverflowError [duplicate 1]
>> > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>> > 14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times;
>> > aborting job
>> >   File
>> >
>> > "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
>> > line 533, in sum
>> > 14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state
>> > FAILED
>> > from TID 1774 because its task set is gone
>> > return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
>> >   File
>> >
>> > "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
>> > line 499, in reduce
>> > vals = self.mapPartitions(func).collect()
>> >   File
>> >
>> > "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
>> > line 463, in collect
>> > bytesInJava = self._jrdd.collect().iterator()
>> >   File
>> >
>> > "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
>> > line 537, in __call__
>> >   File
>> >
>> > "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py",
>> > line 300, in get_return_value
>> > py4j.protocol.Py4JJavaError: An error occurred while calling
>> > o4317.collect.
>> > : org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1
>> > times
>> > (most recent failure: Exception failure: java.lang.StackOverflowError)
>> > at
>> >
>> > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>> > at
>> >
>> > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
>> > at
>> >
>> > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> > at
>> >
>> > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
>> > at
>> >
>> > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
>> > at
>> >
>> > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
>> > at scala.Option.foreach(Option.scala:236)
>> > at
>> >
>> > org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
>> > at
>> >
>> > org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
>> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>> > at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>> > at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> > at
>> >
>> > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> > at
>> >
>> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> > at
>> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.ja

Re: How to use spark-submit

2014-05-13 Thread Sonal Goyal
Hi Stephen,

Sorry I just use plain mvn.

Best Regards,
Sonal
Nube Technologies 






On Mon, May 12, 2014 at 12:29 PM, Stephen Boesch  wrote:

> @Sonal - makes sense.  Is the maven shade plugin runnable within sbt ? If
> so would you care to share those build.sbt (or .scala) lines?  If not, are
> you aware of a similar plugin for sbt?
>
>
>
>
> 2014-05-11 23:53 GMT-07:00 Sonal Goyal :
>
> Hi Stephen,
>>
>> I am using maven shade plugin for creating my uber jar. I have marked
>> spark dependencies as provided.
>>
>> Best Regards,
>> Sonal
>> Nube Technologies 
>>
>> 
>>
>>
>>
>>
>> On Mon, May 12, 2014 at 1:04 AM, Stephen Boesch wrote:
>>
>>> HI Sonal,
>>> Yes I am working towards that same idea.  How did you go about
>>> creating the non-spark-jar dependencies ?  The way I am doing it is a
>>> separate straw-man project that does not include spark but has the external
>>> third party jars included. Then running sbt compile:managedClasspath and
>>> reverse engineering the lib jars from it.  That is obviously not ideal.
>>>
>>> The maven "run" will be useful for other projects built by maven: i will
>>> keep in my notes.
>>>
>>> AFA sbt run-example, it requires additional libraries to be added for my
>>> external dependencies.  I tried several items including  ADD_JARS,
>>>  --driver-class-path  and combinations of extraClassPath. I have deferred
>>> that ad-hoc approach to finding a systematic one.
>>>
>>>
>>>
>>>
>>> 2014-05-08 5:26 GMT-07:00 Sonal Goyal :
>>>
>>> I am creating a jar with only my dependencies and run spark-submit
 through my project mvn build. I have configured the mvn exec goal to the
 location of the script. Here is how I have set it up for my app. The
 mainClass is my driver program, and I am able to send my custom args too.
 Hope this helps.

 
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>exec</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <executable>/home/sgoyal/spark/bin/spark-submit</executable>
    <arguments>
      <argument>${jars}</argument>
      <argument>--class</argument>
      <argument>${mainClass}</argument>
      <argument>--arg</argument>
      <argument>${spark.master}</argument>
      <argument>--arg</argument>
      <argument>${my app arg 1}</argument>
      <argument>--arg</argument>
      <argument>${my arg 2}</argument>
    </arguments>
  </configuration>
</plugin>


 Best Regards,
 Sonal
 Nube Technologies 

  




 On Wed, May 7, 2014 at 6:57 AM, Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Doesn't the run-example script work for you? Also, are you on the
> latest commit of branch-1.0 ?
>
> TD
>
>
> On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta <
> soumya.sima...@gmail.com> wrote:
>
>>
>>
>> Yes, I'm struggling with a similar problem where my class are not
>> found on the worker nodes. I'm using 1.0.0_SNAPSHOT.  I would really
>> appreciate if someone can provide some documentation on the usage of
>> spark-submit.
>>
>> Thanks
>>
>> > On May 5, 2014, at 10:24 PM, Stephen Boesch 
>> wrote:
>> >
>> >
>> > I have a spark streaming application that uses the external
>> streaming modules (e.g. kafka, mqtt, ..) as well.  It is not clear how to
>> properly invoke the spark-submit script: what are the 
>> ---driver-class-path
>> and/or -Dspark.executor.extraClassPath parameters required?
>> >
>> >  For reference, the following error is proving difficult to resolve:
>> >
>> > java.lang.ClassNotFoundException:
>> org.apache.spark.streaming.examples.StreamingExamples
>> >
>>
>
>

>>>
>>
>


Re: Variables outside of mapPartitions scope

2014-05-13 Thread ankurdave
In general, you can find out exactly what's not serializable by adding
-Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS.

Since a this-reference to the enclosing class is often what's causing the
problem, a general workaround is to move the mapPartitions call to a static
method where there is no this-reference. This transforms this:

class A {
  def f() = rdd.mapPartitions(iter => ...)
}

into this:

class A {
  def f() = A.helper(rdd)
}

object A {
  def helper(rdd: RDD[...]) = rdd.mapPartitions(iter => ...)
}




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Variables-outside-of-mapPartitions-scope-tp5517p5527.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Is there any problem on the spark mailing list?

2014-05-13 Thread wxhsdp
I think so; fewer questions and answers these past three days.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-on-the-spark-mailing-list-tp5509p5522.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Dead lock running multiple Spark Jobs on Mesos

2014-05-13 Thread Eugen Cepoi
I have a similar issue (but with Spark 0.9.1) when a shell is active.
Multiple jobs run fine, but when the shell is active (even if at the moment
it is not using any CPU) I encounter the exact same behaviour.

At the moment I don't know what happens or how to solve it, but I was
planning to have a look more in depth these days. I'll keep you posted.

Eugen

PS: all the jobs run in fine-grained mode.


2014-05-12 21:29 GMT+02:00 Martin Weindel :

> I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos 0.17.0.
>
> If I run a single Spark Job, the job runs fine on Mesos. Running multiple
> Spark Jobs also works, if I'm using the coarse-grained mode
> ("spark.mesos.coarse" = true).
>
> But if I run two Spark Jobs in parallel using the fine-grained mode, the
> jobs seem to block each other after a few seconds.
> And the Mesos UI reports no idle but also no used CPUs in this state.
>
> As soon as I kill one job, the other continues normally. See below for some
> log output.
> Looks to me as if something strange happens with assigning resources to the
> both jobs.
>
> Can anybody give me a hint about the cause? The jobs read some HDFS files,
> but have no other communication to external processes.
> Or any other suggestions how to analyse this problem?
>
> Thanks,
>
> Martin
>
> -
> Here is the relevant log output of job1:
>
> INFO 17:53:09,247 Missing parents for Stage 2: List()
>  INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at mapPartitions
> at HighTemperatureSpansPerLogfile.java:92), which is now runnable
>  INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2
> (MapPartitionsRDD[9] at mapPartitions at
> HighTemperatureSpansPerLogfile.java:92)
>  INFO 17:53:09,269 Adding task set 2.0 with 1 tasks
>
> 
> *** at this point the job was killed ***
>
>
> log output of job2:
>  INFO 17:53:04,874 Missing parents for Stage 6: List()
>  INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at
> ComputeLogFileTimespan.java:71), which is now runnable
>  INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6 (MappedRDD[23]
> at
> values at ComputeLogFileTimespan.java:71)
>  INFO 17:53:04,882 Adding task set 6.0 with 1 tasks
>
> 
> *** at this point the job 1 was killed ***
> INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor
> 20140501-141732-308511242-5050-2657-1: ustst019-cep-node2.usu.usu.grp
> (PROCESS_LOCAL)
>  INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
>  INFO 18:01:39,328 Asked to send map output locations for shuffle 2 to
> sp...@ustst018-cep-node1.usu.usu.grp:40542
>  INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Dead-lock-running-multiple-Spark-Jobs-on-Mesos-tp5611.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


user@spark.apache.org

2014-05-13 Thread Herman, Matt (CORP)
unsubscribe



Re: Any ideas on an architecture based on Spark + Spray + Akka

2014-05-13 Thread Chester At Yahoo
We are using the Spray + Akka + Spark stack at Alpine Data Labs.

Chester



Sent from my iPhone

> On May 4, 2014, at 8:37 PM, ZhangYi  wrote:
> 
> Hi all,
> 
> Currently, our project is planning to adopt Spark as its big data platform. 
> For the client side, we have decided to expose a REST API based on Spray. Our 
> domain is the communications field, doing data analysis and statistics for 
> 3G and 4G users. Now, Spark + Spray is brand new for us, and we can't find 
> any best practice via Google. 
> 
> In our opinion, an event-driven architecture may be a good choice for our 
> project. However, more ideas are welcome. Thanks.  
> 
> -- 
> ZhangYi (张逸)
> Developer
> tel: 15023157626
> blog: agiledon.github.com
> weibo: tw张逸
> Sent with Sparrow
> 


Re: Caching in graphX

2014-05-13 Thread ankurdave
Unfortunately it's very difficult to get uncaching right with GraphX due to
the complicated internal dependency structure that it creates. It's
necessary to know exactly what operations you're doing on the graph in order
to unpersist correctly (i.e., in a way that avoids recomputation).

I have a pull request (https://github.com/apache/spark/pull/497) that may
make this a bit easier, but your best option is to use the Pregel API for
iterative algorithms if possible.
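
For reference, a minimal Pregel skeleton looks like the following (a toy
max-propagation, assuming graph: Graph[Double, ED]; Pregel takes care of
caching and uncaching the intermediate graphs internally):

import org.apache.spark.graphx._

val result = Pregel(graph, Double.NegativeInfinity, maxIterations = 20)(
  (id, attr, msg) => math.max(attr, msg),         // vertex program
  triplet =>                                      // send messages
    if (triplet.srcAttr > triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr))
    else Iterator.empty,
  (a, b) => math.max(a, b)                        // merge messages
)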

If that's not possible, leaving things cached has actually not been very
costly in my experience, at least as long as VD and ED are primitive types
to reduce the load on the garbage collector.

Ankur



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-in-graphX-tp5482p5514.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Doubts regarding Shark

2014-05-13 Thread Mayur Rustagi
The table will be cached, but 10 GB (most likely more) will be on disk. You
can check that in the storage tab of the Shark application.

A Java out-of-memory error could mean that your worker memory is too low or
that the memory allocated to Shark is too low.
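
In plain Spark terms, that cache type corresponds to something like the
following, with rdd standing in for the table's underlying RDD:

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps whatever fits in memory and spills the rest to
// disk, which is how a 60 GB table can be "cached" with 50 GB of memory
rdd.persist(StorageLevel.MEMORY_AND_DISK)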


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Thu, May 8, 2014 at 12:42 AM, vinay Bajaj  wrote:

> Hello
>
> I have a few questions regarding Shark.
>
> 1) I have a table of 60 GB and I have total memory of 50 GB, but when I try
> to cache the table it gets cached successfully. How does Shark cache the
> table when there is not enough memory to fit it in memory? And how do the
> cache eviction policies (FIFO and LRU) work while caching the table? While
> creating tables I am using the cache type property MEMORY (storage level:
> memory and disk).
>
> 2) Sometimes while running queries I get a JavaOutOfMemory exception, but
> all tables are cached successfully. Can you tell me the cases, or give some
> example, due to which that error can come?
>
> Regards
> Vinay Bajaj
>


Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Mayur Rustagi
We are running into the same issue. After 700 or so files the stack
overflows; cache, persist & checkpointing don't help.
Basically, checkpointing only saves the RDD when it is materialized, & it
only materializes in the end, so then it runs out of stack.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Tue, May 13, 2014 at 11:40 AM, Xiangrui Meng  wrote:

> You have a long lineage that causes the StackOverflow error. Try
> rdd.checkPoint() and rdd.count() for every 20~30 iterations.
> checkPoint can cut the lineage. -Xiangrui

Re: How to read a multipart s3 file?

2014-05-13 Thread kamatsuoka
Thanks Nicholas!  I looked at those docs several times without noticing that
critical part you highlighted.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-multipart-s3-file-tp5463p5494.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Is there a way to create a SparkContext object?

2014-05-13 Thread yh18190
Thanks, Matei Zaharia. Can I pass it as a parameter as part of closures?
For example:
RDD.map(t => compute(sc, t._2))

Can I use sc inside a map function? Please let me know.
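
For context, the short answer is no: SparkContext lives on the driver and is
not serializable, so it cannot be captured in a closure that runs on executors.
A minimal Scala sketch of the failure and one common restructuring; sc, the
sample data, and the join-based rewrite are illustrative assumptions, and
compute is the hypothetical function from the snippet above:

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._  // brings in pair-RDD operations like join

  val sc = new SparkContext("local[4]", "closure-demo")
  val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2)))
  val lookup = sc.parallelize(Seq(("a", 10), ("b", 20)))

  // Does NOT work: sc (like any driver-side handle) would be captured in the
  // closure and shipped to executors, where it is neither serializable nor usable.
  // val bad = pairs.map { case (k, v) => compute(sc, v) }

  // Works: restructure the driver-side logic, e.g. express a nested lookup as a join.
  val good = pairs.join(lookup).mapValues { case (v, w) => v + w }
  good.collect().foreach(println)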



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-their-a-way-to-Create-SparkContext-object-tp5612p5647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: details about event log

2014-05-13 Thread wxhsdp
Thank you very much, Andrew.

By the definition of "Fetch Wait Time", can I conclude that a task pipelines
block fetching with its real work?


Andrew Or-2 wrote
> Hi wxhsdp,
> 
> These times are computed from Java's System.currentTimeMillis(), which is
> "the
> difference, measured in milliseconds, between the current time and
> midnight, January 1, 1970 UTC." Thus, this quantity doesn't mean much by
> itself, but is only meaningful when you subtract it from another
> System.currentTimeMillis() to find the time elapsed. For instance, in your
> case (Finish Time - Launch Time) = 1862, which means it took 1862 ms for
> the task to complete (but the actual execution only took 1781 ms, the rest
> being overhead).
> 
> Correct, (Shuffle Finish Time - Launch Time) is the total time it took for
> this task to fetch blocks, and (Finish Time - Shuffle Finish Time) is the
> actual execution after fetching all blocks.
> 
> "Fetch Wait Time" on the other hand is the time spent blocking on the
> thread to wait for shuffle blocks while not doing anything else. For
> instance, the example given in the code comments is: "if block B is being
> fetched while the task is not finished with processing block A, it is not
> considered to be blocking on block B."
> 
> "Shuffle Write Time" is the time taken to write the shuffle files (only
> for ShuffleMapTask). It is in nanoseconds, which is slightly inconsistent
> with other values in these metrics.
> 
> By the way, all of this information is available in the code comments:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
> 
> 
> 
> 
> On Tue, May 6, 2014 at 11:10 PM, wxhsdp  wrote:
> 
>> any ideas?  thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/details-about-event-log-tp5411p5476.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
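
To make the subtraction concrete, a small Scala sketch using the timestamps
quoted in the next message of this digest; the shuffle write value here is an
illustrative assumption:

  // Raw event-log times are absolute timestamps (milliseconds since the Unix
  // epoch), so only differences between them are meaningful.
  val launchTime        = 1399882225433L   // "Launch Time"
  val finishTime        = 1399882252948L   // "Finish Time"
  val shuffleFinishTime = 1399882246138L   // "Shuffle Finish Time"
  val shuffleWriteTime  = 3500000L         // "Shuffle Write Time", in NANOseconds (value illustrative)

  val totalTaskMs  = finishTime - launchTime          // 27515 ms end to end
  val fetchPhaseMs = shuffleFinishTime - launchTime   // 20705 ms until all blocks are fetched
  val computeMs    = finishTime - shuffleFinishTime   // 6810 ms of execution after fetching
  val shuffleWriteMs = shuffleWriteTime / 1000000     // convert ns -> ms before comparing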





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/details-about-event-log-tp5411p5624.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


something about pipeline

2014-05-13 Thread wxhsdp
Dear all,

   definition of fetch wait time:
   * Time the task spent waiting for remote shuffle blocks. This only
includes the time
   * blocking on shuffle input data. For instance if block B is being
fetched while the task is
   * still not finished processing block A, it is not considered to be
blocking on block B.

   By the definition of fetch wait time, can I conclude that tasks pipeline
block fetching with the real work? How does Spark decide that a task can be
split by blocks to form such a pipeline?

  If the task is something like:

  val b = a.mapPartitions { itr =>
    timeStamp
    val arr = itr.toArray
    ...
    timeStamp
    arr.toIterator
  }

  Can fetching blocks of RDD a and processing RDD b be pipelined?

Here's the information for my task:
"Launch Time":1399882225433
"Finish Time":1399882252948
"Executor Run Time":27497
"Shuffle Finish Time":1399882246138
"Fetch Wait Time":9377
time spent inside a.mapPartitions is 8287 ms (call it mapPartitions time)

Finish Time - Launch Time = 27515
Shuffle Finish Time - Launch Time = 20705 (call it total shuffle time)
Executor Run Time - total shuffle time = 6792

Total shuffle time is 20705 ms and Fetch Wait Time is 9377 ms, so for the
remaining 20705 - 9377 = 11328 ms the task is doing something else. What is it
doing? The mapPartitions? Or is the mapPartitions executed only after the
shuffle completes? But then the calculated times do not match. I'm confused
and need your help!
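
On the question above: shuffle input is delivered to mapPartitions through an
iterator, so a body that keeps the iterator lazy can overlap block fetching
with processing, while one that drains it up front (as itr.toArray does) must
wait for the whole partition before any work starts. A minimal Scala sketch of
the two shapes; sc, the sample data, and process are illustrative stand-ins:

  import org.apache.spark.SparkContext

  val sc = new SparkContext("local[4]", "pipeline-demo")
  val a  = sc.parallelize(1 to 100, 4)   // stand-in for the RDD `a` above

  def process(x: Int): Int = x * 2       // stand-in for the real per-record work

  // Pipelined: the iterator stays lazy, so each record is processed as it
  // arrives; block fetching and computation can overlap.
  val pipelined = a.mapPartitions(itr => itr.map(process))

  // Not pipelined: toArray drains the iterator first, so the task waits until
  // the whole partition (all its shuffle blocks, when there is a shuffle) has
  // been fetched before any processing starts.
  val materialized = a.mapPartitions { itr =>
    val arr = itr.toArray
    arr.map(process).toIterator
  }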







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/something-about-pipeline-tp5626.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Xiangrui Meng
You have a long lineage that causes the StackOverflowError. Try
rdd.checkpoint() followed by rdd.count() every 20-30 iterations;
checkpoint() cuts the lineage. -Xiangrui
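
A minimal Scala sketch of this pattern (the checkpoint directory, dataset, and
iteration count are illustrative assumptions; in PySpark the calls are
rdd.checkpoint() and rdd.count() as well):

  import org.apache.spark.SparkContext

  val sc = new SparkContext("local[4]", "checkpoint-demo")
  sc.setCheckpointDir("/tmp/spark-checkpoints")  // must be set before checkpointing

  var rdd = sc.parallelize(1 to 1000)
  for (i <- 1 to 100) {
    rdd = rdd.map(_ + 1)        // each iteration adds a stage to the lineage
    if (i % 25 == 0) {          // roughly every 20-30 iterations...
      rdd.checkpoint()          // ...mark the RDD for checkpointing
      rdd.count()               // force a job so the checkpoint actually runs;
    }                           // the lineage is truncated at this point
  }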

On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan  wrote:
> Dear Sparkers:
>
> I am using Python spark of version 0.9.0 to implement some iterative
> algorithm. I got some errors shown at the end of this email. It seems that
> it's due to the Java Stack Overflow error. The same error has been
> duplicated on a mac desktop and a linux workstation, both running the same
> version of Spark.
>
> The same line of code works correctly after quite some iterations. At the
> line of error, rdd__new.count() could be 0. (In some previous rounds, this
> was also 0 without any problem).
>
> Any thoughts on this?
>
> Thank you very much,
> - Guanhua
>
>
> 
> CODE:print "round", round, rdd__new.count()
> 
>   File
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
> line 542, in count
> 14/05/12 16:20:28 INFO TaskSetManager: Loss was due to
> java.lang.StackOverflowError [duplicate 1]
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
> 14/05/12 16:20:28 ERROR TaskSetManager: Task 8419.0:0 failed 1 times;
> aborting job
>   File
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
> line 533, in sum
> 14/05/12 16:20:28 INFO TaskSchedulerImpl: Ignoring update with state FAILED
> from TID 1774 because its task set is gone
> return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
>   File
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
> line 499, in reduce
> vals = self.mapPartitions(func).collect()
>   File
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py",
> line 463, in collect
> bytesInJava = self._jrdd.collect().iterator()
>   File
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> line 537, in __call__
>   File
> "/home1/ghyan/Software/spark-0.9.0-incubating-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o4317.collect.
> : org.apache.spark.SparkException: Job aborted: Task 8419.0:1 failed 1 times
> (most recent failure: Exception failure: java.lang.StackOverflowError)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> ==
> The stack overflow error is shown as follows:
> ==
>
> 14/05/12 16:20:28 ERROR Executor: Exception in task ID 1774
> java.lang.StackOverflowError
> at java.util.zip.Inflater.inflate(Inflater.java:259)
> at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:152)
> at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:116)
> at
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> at
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
> at
> java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2818)
> at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1452)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1511)
> at