Yet another question on saving RDD into files

2014-03-22 Thread Jaonary Rabarisoa
Dear all, As a Spark newbie, I need some help to understand how saving an RDD to a file behaves. After reading the post on saving single files efficiently http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-as-a-single-file-efficiently-td3014.html I understand that each partition of the
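
[Editor's note: a minimal Scala sketch of the behaviour discussed in this thread — saveAsTextFile writes one part file per partition, so coalescing to one partition first yields a single output file. The master, app name, and paths below are illustrative assumptions, not from the original message.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup; master, app name, and paths are placeholders.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-save-example"))
val rdd = sc.textFile("hdfs:///input/data.txt")

// saveAsTextFile writes one part-NNNNN file per partition of the RDD.
rdd.saveAsTextFile("hdfs:///output/one-file-per-partition")

// Coalescing to a single partition first yields a single part file,
// at the cost of funnelling all the data through one task.
rdd.coalesce(1).saveAsTextFile("hdfs:///output/single-file")
```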

RE: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread 林武康
A large amount of memory is needed to build Spark; I think you should make Xmx larger, 2g for example. -Original Message- From: Bharath Bhushan manku.ti...@outlook.com Sent: ‎2014/‎3/‎22 12:50 To: user@spark.apache.org user@spark.apache.org Subject: unable to build spark - sbt/sbt: line 50: killed I am getting the

Configuring shuffle write directory

2014-03-22 Thread Tsai Li Ming
Hi, Each of my worker nodes has its own unique spark.local.dir. However, when I run spark-shell, the shuffle writes are always written to /tmp despite spark.local.dir being set when the worker node is started. Does specifying spark.local.dir for the driver program override the executor's setting? Is
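
[Editor's note: a sketch of the driver-side half of this configuration — spark.local.dir set in the driver's SparkConf. Whether this value then overrides the directory configured on each worker is exactly the question raised above; the master, app name, and path are placeholders.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: spark.local.dir set from the driver's SparkConf.
val conf = new SparkConf()
  .setMaster("local[2]")                       // placeholder master
  .setAppName("shuffle-dir-example")           // placeholder app name
  .set("spark.local.dir", "/data/spark-tmp")   // placeholder path
val sc = new SparkContext(conf)
```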

Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, it works with a smaller file; it can count and map, but not distinct. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3033.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
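
[Editor's note: for context on why distinct is the first operation to fail — unlike count or map, it requires a shuffle. Roughly what rdd.distinct() does under the hood (an approximation of the implementation; `rdd` stands for the dataset in question):]

```scala
// distinct() is implemented as a shuffle via reduceByKey, which is why it
// hits shuffle-related limits that count() and map() avoid.
val distinctValues = rdd
  .map(x => (x, null))
  .reduceByKey((a, _) => a)   // the shuffle happens here
  .map(_._1)
```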

RE: RE: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread Bharath Bhushan
Thanks for the reply. It turns out that my ubuntu-in-vagrant had only 512MB of RAM. Increasing it to 1024MB allowed the assembly to finish successfully. Peak usage was around 780MB. To: user@spark.apache.org From: vboylin1...@gmail.com Subject: RE: unable to build spark - sbt/sbt: line 50:

Re: RE: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread Koert Kuipers
I have found that I am unable to build/test Spark with sbt and Java 6, but using Java 7 works (and it compiles with Java target version 1.6, so the binaries are usable from Java 6). On Sat, Mar 22, 2014 at 3:11 PM, Bharath Bhushan manku.ti...@outlook.com wrote: Thanks for the reply. It turns out that

Re: distinct on huge dataset

2014-03-22 Thread Aaron Davidson
This could be related to the hash collision bug in ExternalAppendOnlyMap in 0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045 You might try setting spark.shuffle.spill to false and see if that runs any longer (turning off shuffle spill is dangerous, though, as it may cause Spark to OOM
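
[Editor's note: a sketch of the suggested workaround — disabling spark.shuffle.spill to sidestep the ExternalAppendOnlyMap hash-collision bug (SPARK-1045) on 0.9.0. Master and app name are placeholders; as warned above, disabling spill keeps shuffle data in memory and may cause OOM on large jobs.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the suggested workaround: turn off shuffle spill.
val conf = new SparkConf()
  .setMaster("local[2]")               // placeholder master
  .setAppName("spill-disabled")        // placeholder app name
  .set("spark.shuffle.spill", "false")
val sc = new SparkContext(conf)
```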

Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, that helped, at least it was able to advance a bit further. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3038.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: distinct on huge dataset

2014-03-22 Thread Kane
But I was wrong: map also fails on the big file, and setting spark.shuffle.spill doesn't help. Map fails with the same error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html Sent from the Apache Spark User List mailing

Re: worker keeps getting disassociated upon a failed job spark version 0.90

2014-03-22 Thread sam
I have this problem too. Eventually the job fails (on the UI) and hangs the terminal until I CTRL + C. (Logs below) Now, the Spark docs explain that the heartbeat configuration can be tweaked to handle GC hangs. I'm wondering if this is symptomatic of pushing the cluster a little too hard (we
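
[Editor's note: a heavily hedged sketch of the heartbeat-related settings alluded to here. The property names and values below are from memory of the 0.9-era Akka failure-detector settings and should be checked against the Spark configuration page for that release before use; master and app name are placeholders.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: loosen the Akka heartbeat/failure-detector settings so long
// GC pauses are less likely to disassociate workers. Verify property names
// and defaults against the Spark 0.9 docs.
val conf = new SparkConf()
  .setMaster("local[2]")                                  // placeholder master
  .setAppName("gc-tolerant-heartbeats")                   // placeholder app name
  .set("spark.akka.heartbeat.pauses", "600")              // assumed property name
  .set("spark.akka.heartbeat.interval", "1000")           // assumed property name
  .set("spark.akka.failure-detector.threshold", "300.0")  // assumed property name
val sc = new SparkContext(conf)
```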

Re: Largest input data set observed for Spark.

2014-03-22 Thread Usman Ghani
I am having similar issues with much smaller data sets. I am using the Spark EC2 scripts to launch clusters, but I almost always end up with straggling executors that take over a node's CPU and memory and end up never finishing. On Thu, Mar 20, 2014 at 1:54 PM, Soila Pertet Kavulya

Re: Spark Job stuck

2014-03-22 Thread Usman Ghani
Were you able to figure out what this was? You can try setting spark.akka.askTimeout to a larger value. That might help. On Thu, Mar 20, 2014 at 10:24 PM, mohit.goyal mohit.go...@guavus.com wrote: Hi, I have run the spark application to process input data of size ~14GB with executor memory
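
[Editor's note: a sketch of the suggestion above — raising spark.akka.askTimeout (the value is in seconds). The value 120 is an arbitrary illustrative choice; master and app name are placeholders.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise the Akka ask timeout as suggested in the reply.
val conf = new SparkConf()
  .setMaster("local[2]")                // placeholder master
  .setAppName("larger-ask-timeout")     // placeholder app name
  .set("spark.akka.askTimeout", "120")  // seconds; illustrative value
val sc = new SparkContext(conf)
```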

Re: distinct on huge dataset

2014-03-22 Thread Andrew Ash
FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and have it disabled now. The specific error behavior was that a join would consistently return one count of rows with spill enabled and another count with it disabled. Sent from my mobile phone On Mar 22, 2014 1:52 PM, Kane
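
[Editor's note: a hypothetical way to check for the discrepancy described here — run the same join with spill enabled and disabled and compare the row counts. Everything in this sketch (data, key distribution, app names) is made up for illustration and is not from the original report.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical reproduction sketch: compare join counts with spill on/off.
def joinCount(spill: Boolean): Long = {
  val conf = new SparkConf()
    .setMaster("local[2]")                       // placeholder master
    .setAppName(s"join-spill-$spill")            // placeholder app name
    .set("spark.shuffle.spill", spill.toString)
  val sc = new SparkContext(conf)
  try {
    val left  = sc.parallelize(1 to 100000).map(i => (i % 10000, i))
    val right = sc.parallelize(1 to 100000).map(i => (i % 10000, i))
    left.join(right).count()
  } finally {
    sc.stop()
  }
}

// On an affected 0.9.0 setup the two counts reportedly differ; on a fixed
// version they should match.
val withSpill    = joinCount(spill = true)
val withoutSpill = joinCount(spill = false)
```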