Yet another question on saving RDD into files
Dear all, as a Spark newbie I need some help understanding how saving an RDD to a file behaves. After reading the post on saving single files efficiently (http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-as-a-single-file-efficiently-td3014.html), I understand that each partition of the RDD is saved into a separate file, is that right? And in order to get one single file, one should call coalesce(1, shuffle = true), right? The other use case I have is appending an RDD to an existing file. Is that possible with Spark? Precisely, I have a map transformation whose results vary over time, like a big time series: I need to store the results for further analysis, but if I store the RDD in a different file each time I run the computation, I may end up with many little files. Pseudo-code of my process is as follows: every timestamp do RDD[Array[Double]].map - RDD[(timestamp,Double)].save to the same file What would be the best solution for that? Best
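A sketch of the two points discussed in this thread, in Scala (assumptions: an existing SparkContext `sc`, HDFS-style paths, and the directory-per-run naming are mine, not a Spark API; saveAsTextFile writes one part-file per partition and cannot append to an existing file):

```scala
import org.apache.spark.rdd.RDD

// To force a single output file, collapse to one partition first.
// shuffle = true redistributes the data through a shuffle so the
// upstream stages still run in parallel:
//   result.coalesce(1, shuffle = true).saveAsTextFile("hdfs:///out/single")

// Since saveAsTextFile cannot append, a common workaround is one
// output directory per run, named by timestamp:
def runDir(base: String, timestamp: Long): String = s"$base/ts=$timestamp"

// every timestamp:
//   val result: RDD[(Long, Double)] = input.map(x => (timestamp, f(x)))
//   result.saveAsTextFile(runDir("hdfs:///series", timestamp))
// Later, read all runs back together for analysis with a glob:
//   val series = sc.textFile("hdfs:///series/ts=*")
```

To avoid accumulating many little files, the per-run directories can be periodically compacted by reading them back with a glob and re-saving with coalesce.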
Re: unable to build spark - sbt/sbt: line 50: killed
A large amount of memory is needed to build Spark; I think you should make -Xmx larger, 2g for example. - Original Message - From: Bharath Bhushan manku.ti...@outlook.com Sent: 2014/3/22 12:50 To: user@spark.apache.org Subject: unable to build spark - sbt/sbt: line 50: killed I am getting the following error when trying to build Spark. I tried various sizes for -Xmx and other memory-related arguments to the java command line, but the assembly command still fails. $ sbt/sbt assembly ... [info] Compiling 298 Scala sources and 17 Java sources to /vagrant/spark-0.9.0-incubating-bin-hadoop2/core/target/scala-2.10/classes... sbt/sbt: line 50: 10202 Killed java -Xmx1900m -XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=256m -jar ${JAR} $@ Versions of software: Spark: 0.9.0 (hadoop2 binary) Scala: 2.10.3 Ubuntu: Ubuntu 12.04.4 LTS - Linux vagrant-ubuntu-precise-64 3.2.0-54-generic Java: 1.6.0_45 (Oracle Java 6) I can still use the binaries in bin/, but I was just trying to check whether sbt/sbt assembly works fine. -- Thanks
Configuring shuffle write directory
Hi, each of my worker nodes has its own unique spark.local.dir. However, when I run spark-shell, the shuffle writes are always written to /tmp, despite spark.local.dir being set when the worker node is started. It seems that specifying spark.local.dir in the driver program overrides the executors' setting. Is there a way to define it properly on the worker nodes? Thanks!
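One thing to try (a sketch, assuming the standalone deploy mode and Spark 0.9 property mechanisms; the path is a placeholder): since Spark properties set in the driver are shipped to the executors, set spark.local.dir in each worker's environment and leave it unset in the driver.

```shell
# conf/spark-env.sh on each worker node (path is a placeholder)
export SPARK_JAVA_OPTS="-Dspark.local.dir=/mnt/spark-local"
```

Then make sure the driver program does not set spark.local.dir at all, so it cannot override the per-worker value.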
Re: distinct on huge dataset
Yes, it works with the smaller file; it can count and map, but not distinct. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3033.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: unable to build spark - sbt/sbt: line 50: killed
Thanks for the reply. It turns out that my ubuntu-in-vagrant had only 512MB of RAM. Increasing it to 1024MB allowed the assembly to finish successfully. Peak usage was around 780MB. To: user@spark.apache.org From: vboylin1...@gmail.com Subject: Re: unable to build spark - sbt/sbt: line 50: killed Date: Sat, 22 Mar 2014 20:03:28 +0800 A large amount of memory is needed to build Spark; I think you should make -Xmx larger, 2g for example.
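For reference, the VM memory can be raised in the Vagrantfile; a minimal sketch (assumptions: the VirtualBox provider and the box name are mine, since the actual provisioning setup here is unknown):

```ruby
# Vagrantfile: give the VM enough RAM for sbt/sbt assembly
# (1024MB was sufficient in this thread; peak usage was around 780MB)
Vagrant.configure("2") do |config|
  config.vm.box = "precise64"   # assumption: an Ubuntu 12.04 base box
  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--memory", "1024"]
  end
end
```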
Re: unable to build spark - sbt/sbt: line 50: killed
I have found that I am unable to build/test Spark with sbt and Java 6, but using Java 7 works (and it compiles with Java target version 1.6, so the binaries are usable from Java 6). On Sat, Mar 22, 2014 at 3:11 PM, Bharath Bhushan manku.ti...@outlook.com wrote: Thanks for the reply. It turns out that my ubuntu-in-vagrant had only 512MB of RAM. Increasing it to 1024MB allowed the assembly to finish successfully. Peak usage was around 780MB.
Re: distinct on huge dataset
This could be related to the hash-collision bug in ExternalAppendOnlyMap in 0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045 You might try setting spark.shuffle.spill to false and see if that runs any longer (turning off shuffle spill is dangerous, though, as it may cause Spark to OOM if your reduce partitions are too large). On Sat, Mar 22, 2014 at 10:00 AM, Kane kane.ist...@gmail.com wrote: I mean everything works with the small file. With the huge file only count and map work; distinct doesn't. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3034.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
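The suggested workaround can be set before creating the context; a minimal sketch (assuming Spark 0.9's SparkConf API; app name and master are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Disable shuffle spilling to sidestep the SPARK-1045 hash-collision
// bug in ExternalAppendOnlyMap. Caution: without spilling, overly
// large reduce partitions can OOM the executors.
val conf = new SparkConf()
  .setAppName("distinct-on-huge-dataset")
  .set("spark.shuffle.spill", "false")
val sc = new SparkContext(conf)
```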
Re: distinct on huge dataset
Yes, that helped; at least it was able to advance a bit further. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3038.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: distinct on huge dataset
But I was wrong: map also fails on the big file, and setting spark.shuffle.spill doesn't help. Map fails with the same error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: worker keeps getting disassociated upon a failed job (Spark version 0.9.0)
I have this problem too. Eventually the job fails (on the UI) and hangs the terminal until I hit Ctrl-C (logs below). Now, the Spark docs explain that the heartbeat configuration can be tweaked to handle GC pauses. I'm wondering if this is symptomatic of pushing the cluster a little too hard (we were also running an HDFS balance, which died of an OOM). What sort of values should I try increasing the configurables to?
14/03/22 21:45:47 ERROR scheduler.TaskSchedulerImpl: Lost executor 6 on ip-172-31-0-126.ec2.internal: remote Akka client disassociated
14/03/22 21:45:47 INFO scheduler.TaskSetManager: Re-queueing tasks for 6 from TaskSet 1.0
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 720 (task 1.0:248)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 722 (task 1.0:250)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 686 (task 1.0:214)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 698 (task 1.0:226)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 707 (task 1.0:235)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 709 (task 1.0:237)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 694 (task 1.0:222)
14/03/22 21:45:47 WARN scheduler.TaskSetManager: Lost TID 688 (task 1.0:216)
14/03/22 21:45:47 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 5)
14/03/22 21:45:47 INFO storage.BlockManagerMasterActor: Trying to remove executor 6 from BlockManagerMaster.
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor updated: app-20140322213226-0044/6 is now FAILED (Command exited with code 137)
14/03/22 21:45:47 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140322213226-0044/6 removed: Command exited with code 137
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor added: app-20140322213226-0044/9 on worker-20140321161205-ip-172-31-0-126.ec2.internal-50034 (ip-172-31-0-126.ec2.internal:50034) with 8 cores
14/03/22 21:45:47 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140322213226-0044/9 on hostPort ip-172-31-0-126.ec2.internal:50034 with 8 cores, 13.5 GB RAM
14/03/22 21:45:47 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
14/03/22 21:45:47 INFO client.AppClient$ClientActor: Executor updated: app-20140322213226-0044/9 is now RUNNING
14/03/22 21:45:49 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Added rdd_6_236 in memory on ec2-54-84-166-37.compute-1.amazonaws.com:56804 (size: 72.5 MB, free: 3.4 GB)
14/03/22 21:45:49 INFO scheduler.TaskSetManager: Starting task 1.0:216 as TID 729 on executor 8: ec2-54-84-166-37.compute-1.amazonaws.com (PROCESS_LOCAL)
... more stuff happens ...
14/03/22 21:52:09 ERROR scheduler.TaskSchedulerImpl: Lost executor 12 on ip-172-31-8-63.ec2.internal: remote Akka client disassociated
14/03/22 21:52:09 INFO scheduler.TaskSetManager: Re-queueing tasks for 12 from TaskSet 1.0
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 828 (task 1.0:339)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 830 (task 1.0:305)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 824 (task 1.0:302)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 827 (task 1.0:313)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 826 (task 1.0:338)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 829 (task 1.0:311)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 823 (task 1.0:314)
14/03/22 21:52:09 WARN scheduler.TaskSetManager: Lost TID 825 (task 1.0:312)
14/03/22 21:52:09 INFO scheduler.DAGScheduler: Executor lost: 12 (epoch 10)
14/03/22 21:52:09 INFO storage.BlockManagerMasterActor: Trying to remove executor 12 from BlockManagerMaster.
14/03/22 21:52:09 INFO storage.BlockManagerMaster: Removed 12 successfully in removeExecutor
14/03/22 21:52:10 INFO cluster.SparkDeploySchedulerBackend: Executor 11 disconnected, so removing it
14/03/22 21:52:10 ERROR scheduler.TaskSchedulerImpl: Lost executor 11 on ec2-54-84-151-18.compute-1.amazonaws.com: remote Akka client disassociated
14/03/22 21:52:10 INFO scheduler.TaskSetManager: Re-queueing tasks for 11 from TaskSet 1.0
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 837 (task 1.0:331)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 831 (task 1.0:341)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 839 (task 1.0:347)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 836 (task 1.0:284)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 794 (task 1.0:271)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 838 (task 1.0:273)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 841 (task 1.0:296)
14/03/22 21:52:10 WARN scheduler.TaskSetManager: Lost TID 840 (task 1.0:276)
14/03/22 21:52:10 INFO scheduler.DAGScheduler: Executor lost: 11 (epoch 11)
14/03/22 21:52:10 INFO storage.BlockManagerMasterActor: Trying to remove executor 11 from BlockManagerMaster.
14/03/22 21:52:10 INFO storage.BlockManagerMaster:
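For the heartbeat tuning mentioned above, a sketch of the relevant settings (the property names and values are my assumption based on the Spark 0.9 configuration docs; tune conservatively):

```scala
import org.apache.spark.SparkConf

// Raise the BlockManager heartbeat timeout so long GC pauses (or a
// heavily loaded cluster) don't get executors marked as dead.
val conf = new SparkConf()
  .set("spark.storage.blockManagerSlaveTimeoutMs", "300000") // up from 45000
  .set("spark.akka.timeout", "200")                          // seconds
```

Note that raising timeouts only hides the symptom if executors are genuinely dying: "Command exited with code 137" in the log above is a SIGKILL, which often means the OS OOM killer, so it is also worth checking memory headroom on the workers.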
Re: Largest input data set observed for Spark.
I am having similar issues with much smaller data sets. I am using the Spark EC2 scripts to launch clusters, but I almost always end up with straggling executors that take over a node's CPU and memory and end up never finishing. On Thu, Mar 20, 2014 at 1:54 PM, Soila Pertet Kavulya skavu...@gmail.com wrote: Hi Reynold, Nice! What Spark configuration parameters did you use to get your job to run successfully on a large dataset? My job is failing on 1TB of input data (uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory errors, just lost executors. Thanks, Soila On Mar 20, 2014 11:29 AM, Reynold Xin r...@databricks.com wrote: I'm not really at liberty to discuss details of the job. It involves some expensive aggregated statistics, and took 10 hours to complete (mostly bottlenecked by network IO). On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Reynold, How complex was that job (I guess in terms of number of transforms and actions) and how long did it take to process? -Suren On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin r...@databricks.com wrote: Actually we just ran a job with 70TB+ compressed data on 28 worker nodes - I didn't count the size of the uncompressed data, but I am guessing it is somewhere between 200TB and 700TB. On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com wrote: All, what is the largest input data set y'all have come across that has been successfully processed in production using Spark? Ballpark? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hira...@velos.io W: www.velos.io
Re: Spark Job stuck
Were you able to figure out what this was? You can try setting spark.akka.askTimeout to a larger value. That might help. On Thu, Mar 20, 2014 at 10:24 PM, mohit.goyal mohit.go...@guavus.com wrote: Hi, I have run the Spark application to process input data of size ~14GB with executor memory 10GB. The job got stuck with the message below: 14/03/21 05:02:07 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, guavus-0102bf, 49347, 0) with no recent heart beats: 85563ms exceeds 45000ms But the job completed successfully if I increased executor memory to 40GB. Any idea? Thanks Mohit Goyal -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sprak-Job-stuck-tp2979.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
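The suggestion above as a config fragment (assuming Spark 0.9 property names; the values are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Longer Akka ask timeout, as suggested above (value in seconds):
  .set("spark.akka.askTimeout", "120")
  // The warning shows the 45000ms BlockManager heartbeat timeout being
  // exceeded, so raising that limit may also help:
  .set("spark.storage.blockManagerSlaveTimeoutMs", "180000")
```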
Re: distinct on huge dataset
FWIW, I've seen correctness errors with spark.shuffle.spill on 0.9.0 and have it disabled now. The specific error behavior was that a join would consistently return one count of rows with spill enabled and a different count with it disabled. Sent from my mobile phone On Mar 22, 2014 1:52 PM, Kane kane.ist...@gmail.com wrote: But I was wrong: map also fails on the big file, and setting spark.shuffle.spill doesn't help. Map fails with the same error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html Sent from the Apache Spark User List mailing list archive at Nabble.com.