Yet another question on saving RDD into files

2014-03-22 Thread Jaonary Rabarisoa
Dear all,

As a Spark newbie, I need some help understanding how saving an RDD to a
file behaves. After reading the post on saving single files efficiently

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-as-a-single-file-efficiently-td3014.html

I understand that each partition of the RDD is saved to a separate file,
is that right? And to get one single file, one should call
coalesce(1, shuffle = true)?

The other use case I have is appending an RDD to an existing file. Is that
possible with Spark? Precisely, I have a map transformation whose results
vary over time, like a big time series: I need to store the result for
further analysis, but if I store the RDD in a different file each time I run
the computation, I may end up with many little files. Pseudocode of my
process:

for every timestamp do
    RDD[Array[Double]].map -> RDD[(timestamp, Double)] -> save to the same file

What would be the best solution for this?
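One common workaround for both points (a sketch, not from the thread: the demo/ paths and file names are hypothetical) is to let each run write its own output directory and then merge/append the Spark-style part files with plain shell afterwards, instead of forcing a coalesce(1) inside the job:

```shell
# Simulate the per-run output directories Spark would produce
# (demo/ paths and contents are hypothetical).
mkdir -p demo/out-1 demo/out-2
printf '1,0.5\n' > demo/out-1/part-00000
printf '2,0.7\n' > demo/out-2/part-00000

# Append every run's part files to one growing time-series file,
# avoiding both coalesce(1) and many little files.
for d in demo/out-*/; do
  cat "$d"part-* >> demo/timeseries.csv
done
```

On HDFS, `hadoop fs -getmerge` can do the per-run merge of part files to a local file; appending across runs would still happen locally.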

Best


Re: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread 林武康
Building Spark needs a lot of memory; I think you should make -Xmx larger,
2g for example.
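A concrete way to apply that advice (a sketch: `_JAVA_OPTIONS` is a JVM-wide override picked up by HotSpot, and 2g/512m are illustrative values; in Spark 0.9 the heap flags are otherwise hardcoded inside the sbt/sbt script):

```shell
# Override the heap flags baked into sbt/sbt via the JVM-wide
# _JAVA_OPTIONS variable (values are illustrative, not tuned).
export _JAVA_OPTIONS="-Xmx2g -XX:MaxPermSize=512m"
sbt/sbt assembly
```

Note that a "Killed" message from the shell usually means the OS killed the process, so if the machine itself is short on RAM, raising -Xmx alone may not help.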

-----Original Message-----
From: Bharath Bhushan manku.ti...@outlook.com
Sent: 2014/3/22 12:50
To: user@spark.apache.org user@spark.apache.org
Subject: unable to build spark - sbt/sbt: line 50: killed

I am getting the following error when trying to build spark. I tried various 
sizes for the -Xmx and other memory related arguments to the java command line, 
but the assembly command still fails.

$ sbt/sbt assembly
...
[info] Compiling 298 Scala sources and 17 Java sources to 
/vagrant/spark-0.9.0-incubating-bin-hadoop2/core/target/scala-2.10/classes...
sbt/sbt: line 50: 10202 Killed  java -Xmx1900m 
-XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=256m -jar ${JAR} $@

Versions of software:
Spark: 0.9.0 (hadoop2 binary)
Scala: 2.10.3
Ubuntu: Ubuntu 12.04.4 LTS - Linux vagrant-ubuntu-precise-64 3.2.0-54-generic
Java: 1.6.0_45 (oracle java 6)

I can still use the binaries in bin/ but I was just trying to check if sbt/sbt 
assembly works fine.

-- Thanks

Configuring shuffle write directory

2014-03-22 Thread Tsai Li Ming
Hi,

Each of my worker nodes has its own unique spark.local.dir.

However, when I run spark-shell, the shuffle writes always go to /tmp,
despite spark.local.dir being set when the worker node was started.

Specifying spark.local.dir in the driver program seems to override the
executors' setting. Is there a way to define it properly on the worker node?
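One 0.9-era way to set this on the worker side (a sketch; /mnt/spark-local is a hypothetical path, and whether the driver's setting still overrides it is exactly the question being asked here):

```shell
# In conf/spark-env.sh on each worker node (path is illustrative).
# SPARK_JAVA_OPTS existed in the 0.9 scripts; executors launched by
# the worker should inherit the property so shuffle files land here
# rather than in /tmp.
export SPARK_JAVA_OPTS="-Dspark.local.dir=/mnt/spark-local"
```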

Thanks!

Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, it works with the smaller file; it can count and map, but not distinct.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3033.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: Re: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread Bharath Bhushan
Thanks for the reply. It turns out that my ubuntu-in-vagrant had only 512MB
of RAM. Increasing it to 1024MB allowed the assembly to finish successfully.
Peak usage was around 780MB.

To: user@spark.apache.org
From: vboylin1...@gmail.com
Subject: Re: unable to build spark - sbt/sbt: line 50: killed
Date: Sat, 22 Mar 2014 20:03:28 +0800

Building Spark needs a lot of memory; I think you should make -Xmx larger,
2g for example.


From: Bharath Bhushan
Sent: 2014/3/22 12:50
To: user@spark.apache.org
Subject: unable to build spark - sbt/sbt: line 50: killed


I am getting the following error when trying to build spark. I tried various 
sizes for the -Xmx and other memory related arguments to the java command line, 
but the assembly command still fails.

$ sbt/sbt assembly
...
[info] Compiling 298 Scala sources and 17 Java sources to 
/vagrant/spark-0.9.0-incubating-bin-hadoop2/core/target/scala-2.10/classes...
sbt/sbt: line 50: 10202 Killed  java -Xmx1900m 
-XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=256m -jar ${JAR} $@

Versions of software:
Spark: 0.9.0 (hadoop2 binary)
Scala: 2.10.3
Ubuntu: Ubuntu 12.04.4 LTS - Linux vagrant-ubuntu-precise-64 3.2.0-54-generic
Java: 1.6.0_45 (oracle java 6)

I can still use the binaries in bin/ but I was just trying to check if sbt/sbt 
assembly works fine.

-- Thanks
  

Re: Re: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread Koert Kuipers
I have found that I am unable to build/test Spark with sbt and Java 6, but
using Java 7 works (and it compiles with a Java 1.6 target, so the binaries
are usable from Java 6).


On Sat, Mar 22, 2014 at 3:11 PM, Bharath Bhushan manku.ti...@outlook.com wrote:

 Thanks for the reply. It turns out that my ubuntu-in-vagrant had only 512MB
 of RAM. Increasing it to 1024MB allowed the assembly to finish
 successfully. Peak usage was around 780MB.

 --
 To: user@spark.apache.org
 From: vboylin1...@gmail.com
 Subject: Re: unable to build spark - sbt/sbt: line 50: killed
 Date: Sat, 22 Mar 2014 20:03:28 +0800


  Building Spark needs a lot of memory; I think you should make -Xmx larger,
 2g for example.
  --
 From: Bharath Bhushan manku.ti...@outlook.com
 Sent: 2014/3/22 12:50
 To: user@spark.apache.org
 Subject: unable to build spark - sbt/sbt: line 50: killed

 I am getting the following error when trying to build spark. I tried
 various sizes for the -Xmx and other memory related arguments to the java
 command line, but the assembly command still fails.

 $ sbt/sbt assembly
 ...
 [info] Compiling 298 Scala sources and 17 Java sources to
 /vagrant/spark-0.9.0-incubating-bin-hadoop2/core/target/scala-2.10/classes...
 sbt/sbt: line 50: 10202 Killed  java -Xmx1900m
 -XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=256m -jar ${JAR} $@

 Versions of software:
 Spark: 0.9.0 (hadoop2 binary)
 Scala: 2.10.3
 Ubuntu: Ubuntu 12.04.4 LTS - Linux vagrant-ubuntu-precise-64
 3.2.0-54-generic
 Java: 1.6.0_45 (oracle java 6)

 I can still use the binaries in bin/ but I was just trying to check if
 sbt/sbt assembly works fine.

 -- Thanks



Re: distinct on huge dataset

2014-03-22 Thread Aaron Davidson
This could be related to the hash collision bug in ExternalAppendOnlyMap in
0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045

You might try setting spark.shuffle.spill to false and see if that runs any
longer (turning off shuffle spill is dangerous, though, as it may cause
Spark to OOM if your reduce partitions are too large).
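For a quick test of this suggestion, the property can be passed to a 0.9 spark-shell session through SPARK_JAVA_OPTS (a sketch; the spark-shell path may differ in your layout):

```shell
# Disable shuffle spilling for this session only; watch for executor
# OOMs if the reduce partitions are large.
SPARK_JAVA_OPTS="-Dspark.shuffle.spill=false" ./bin/spark-shell
```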



On Sat, Mar 22, 2014 at 10:00 AM, Kane kane.ist...@gmail.com wrote:

 I mean everything works with the small file. With the huge file, only count
 and map work; distinct doesn't.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3034.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, that helped, at least it was able to advance a bit further.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3038.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: distinct on huge dataset

2014-03-22 Thread Kane
But I was wrong - map also fails on the big file, and setting
spark.shuffle.spill doesn't help. Map fails with the same error.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: worker keeps getting disassociated upon a failed job (Spark version 0.9.0)

2014-03-22 Thread sam
I have this problem too. Eventually the job fails (on the UI) and hangs the
terminal until I Ctrl-C. (Logs below.)

Now the Spark docs explain that the heartbeat configuration can be tweaked
to handle GC hangs. I'm wondering if this is symptomatic of pushing the
cluster a little too hard (we were also running an HDFS balance, which died
of an OOM).

What sort of values should I try increasing the configurables to?
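As a starting point (a sketch with illustrative numbers, not tuned values; these are the 0.9-era property names, so verify them against the configuration docs for your version):

```shell
# Illustrative starting values only. Raising the BlockManager slave
# timeout keeps long GC pauses from being treated as dead executors;
# the 45000ms seen in heartbeat warnings is its default in this era.
export SPARK_JAVA_OPTS="-Dspark.storage.blockManagerSlaveTimeoutMs=180000 -Dspark.akka.timeout=120"
```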

14/03/22 21:45:47 ERROR scheduler.TaskSchedulerImpl: Lost executor 6 on
ip-172-31-0-126.ec2.internal: remote Akka client disassociated
14/03/22 21:45:47 INFO scheduler.TaskSetManager: Re-queueing tasks for 6
from TaskSet 1.0
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 720 (task 1.0:248)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 722 (task 1.0:250)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 686 (task 1.0:214)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 698 (task 1.0:226)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 707 (task 1.0:235)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 709 (task 1.0:237)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 694 (task 1.0:222)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 688 (task 1.0:216)
14/03/22 21:45:47 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 5)
14/03/22 21:45:47 INFO storage.BlockManagerMasterActor: Trying to remove
executor 6 from BlockManagerMaster.
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor updated:
app-20140322213226-0044/6 is now FAILED (Command exited with code 137)
14/03/22 21:45:47 INFO cluster.SparkDeploySchedulerBackend: Executor
app-20140322213226-0044/6 removed: Command exited with code 137
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor added:
app-20140322213226-0044/9 on
worker-20140321161205-ip-172-31-0-126.ec2.internal-50034
(ip-172-31-0-126.ec2.internal:50034) with 8 cores
14/03/22 21:45:47 INFO cluster.SparkDeploySchedulerBackend: Granted executor
ID app-20140322213226-0044/9 on hostPort ip-172-31-0-126.ec2.internal:50034
with 8 cores, 13.5 GB RAM
14/03/22 21:45:47 INFO storage.BlockManagerMaster: Removed 6 successfully in
removeExecutor
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor updated:
app-20140322213226-0044/9 is now RUNNING
14/03/22 21:45:49 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Added rdd_6_236 in memory on ec2-54-84-166-37.compute-1.amazonaws.com:56804
(size: 72.5 MB, free: 3.4 GB)
14/03/22 21:45:49 INFO scheduler.TaskSetManager: Starting task 1.0:216 as
TID 729 on executor 8: ec2-54-84-166-37.compute-1.amazonaws.com
(PROCESS_LOCAL)

 more stuff happens 

14/03/22 21:52:09 ERROR scheduler.TaskSchedulerImpl: Lost executor 12 on
ip-172-31-8-63.ec2.internal: remote Akka client disassociated
14/03/22 21:52:09 INFO scheduler.TaskSetManager: Re-queueing tasks for 12
from TaskSet 1.0
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 828 (task 1.0:339)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 830 (task 1.0:305)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 824 (task 1.0:302)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 827 (task 1.0:313)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 826 (task 1.0:338)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 829 (task 1.0:311)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 823 (task 1.0:314)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 825 (task 1.0:312)
14/03/22 21:52:09 INFO scheduler.DAGScheduler: Executor lost: 12 (epoch 10)
14/03/22 21:52:09 INFO storage.BlockManagerMasterActor: Trying to remove
executor 12 from BlockManagerMaster.
14/03/22 21:52:09 INFO storage.BlockManagerMaster: Removed 12 successfully
in removeExecutor
14/03/22 21:52:10 INFO cluster.SparkDeploySchedulerBackend: Executor 11
disconnected, so removing it
14/03/22 21:52:10 ERROR scheduler.TaskSchedulerImpl: Lost executor 11 on
ec2-54-84-151-18.compute-1.amazonaws.com: remote Akka client disassociated
14/03/22 21:52:10 INFO scheduler.TaskSetManager: Re-queueing tasks for 11
from TaskSet 1.0
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 837 (task 1.0:331)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 831 (task 1.0:341)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 839 (task 1.0:347)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 836 (task 1.0:284)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 794 (task 1.0:271)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 838 (task 1.0:273)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 841 (task 1.0:296)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 840 (task 1.0:276)
14/03/22 21:52:10 INFO scheduler.DAGScheduler: Executor lost: 11 (epoch 11)
14/03/22 21:52:10 INFO storage.BlockManagerMasterActor: Trying to remove
executor 11 from BlockManagerMaster.
14/03/22 21:52:10 INFO storage.BlockManagerMaster: 

Re: Largest input data set observed for Spark.

2014-03-22 Thread Usman Ghani
I am having similar issues with much smaller data sets. I am using the
Spark EC2 scripts to launch clusters, but I almost always end up with
straggling executors that take over a node's CPU and memory and never finish.



On Thu, Mar 20, 2014 at 1:54 PM, Soila Pertet Kavulya skavu...@gmail.com wrote:

 Hi Reynold,

 Nice! What spark configuration parameters did you use to get your job to
 run successfully on a large dataset? My job is failing on 1TB of input data
 (uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory
 errors just lost executors.

 Thanks,

 Soila
 On Mar 20, 2014 11:29 AM, Reynold Xin r...@databricks.com wrote:

 I'm not really at liberty to discuss details of the job. It involves some
 expensive aggregated statistics, and it took 10 hours to complete (mostly
 bottlenecked by network I/O).





 On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman 
 suren.hira...@velos.io wrote:

 Reynold,

 How complex was that job (I guess in terms of number of transforms and
 actions) and how long did that take to process?

 -Suren



 On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin r...@databricks.com
 wrote:

  Actually we just ran a job with 70TB+ compressed data on 28 worker
 nodes -
  I didn't count the size of the uncompressed data, but I am guessing it
 is
  somewhere between 200TB to 700TB.
 
 
 
  On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com
 wrote:
 
   All,
   What is the largest input data set y'all have come across that has
 been
   successfully processed in production using spark. Ball park?
  
 



 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hira...@velos.io
 W: www.velos.io





Re: Spark Job stuck

2014-03-22 Thread Usman Ghani
Were you able to figure out what this was?

You can try setting spark.akka.askTimeout to a larger value. That might
help.
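For instance (a sketch; in the 0.9 era the value is in seconds, and 120 is only an illustrative number):

```shell
# Raise the Akka ask timeout so slow responses are not treated as
# failures (illustrative value; check your version's docs).
export SPARK_JAVA_OPTS="-Dspark.akka.askTimeout=120"
```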


On Thu, Mar 20, 2014 at 10:24 PM, mohit.goyal mohit.go...@guavus.com wrote:

 Hi,

 I have run a Spark application to process input data of size ~14GB with
 executor memory of 10GB. The job got stuck with the message below:

 14/03/21 05:02:07 WARN storage.BlockManagerMasterActor: Removing
 BlockManager BlockManagerId(0, guavus-0102bf, 49347, 0) with no recent
 heart
 beats: 85563ms exceeds 45000ms

 But the job completed successfully if I increase executor memory to 40GB.

 Any idea??

 Thanks
 Mohit Goyal




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Sprak-Job-stuck-tp2979.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: distinct on huge dataset

2014-03-22 Thread Andrew Ash
FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and
have it disabled now. The specific error behavior was that a join would
consistently return one count of rows with spill enabled and another count
with it disabled.

Sent from my mobile phone
On Mar 22, 2014 1:52 PM, Kane kane.ist...@gmail.com wrote:

 But i was wrong - map also fails on big file and setting
 spark.shuffle.spill
 doesn't help. Map fails with the same error.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.