Re: Some questions about task distribution and execution in Spark

2013-10-03 Thread Shay Seng
Inlined. On Wed, Oct 2, 2013 at 1:00 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Shangyu, (1) When we read in a local file with SparkContext.textFile and do some map/reduce job on it, how will Spark decide which worker node to send data to? Will the data be divided/partitioned equally
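A minimal spark-shell sketch of the partitioning side of this question, assuming the 0.8-era API in which textFile takes an optional minimum number of splits (the path and split count here are illustrative):

    // Ask Spark to split the input into at least 8 partitions.
    val lines = sc.textFile("/home/user/data.txt", 8)
    println(lines.partitions.length)  // partitions actually created

    // Each partition becomes one task; the scheduler assigns tasks to
    // workers, preferring nodes where the data is local when it can.
    val totalChars = lines.map(_.length).reduce(_ + _)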

Re: Some questions about task distribution and execution in Spark

2013-10-03 Thread Mark Hamstra
No, that is not what allowLocal means. For a very few actions, the DAGScheduler will run the job locally (in a separate thread on the master node) if the RDD in the action has a single partition and no dependencies in its lineage. If allowLocal is false, that doesn't mean that
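A small sketch of a job that could qualify for this local execution path, assuming a spark-shell session (per Mark's description: a single partition and no dependencies in the lineage):

    // One partition, no parent RDDs in its lineage.
    val tiny = sc.parallelize(1 to 10, 1)

    // Actions such as first() and take(n) pass allowLocal = true, so this
    // job may run in a thread on the master rather than on a worker.
    tiny.first()

    // A shuffled RDD has dependencies in its lineage, so the same action
    // runs on the cluster instead.
    sc.parallelize(Seq((1, "a"), (2, "b"))).groupByKey().first()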

Re: Some questions about task distribution and execution in Spark

2013-10-03 Thread Shay Seng
Ok, even if my understanding of allowLocal is incorrect, nevertheless: (1) I'm loading a local file; (2) the tasks seem to be getting executed on a slave node (ip-10-129-25-28), which is not my master node?? On Thu, Oct 3, 2013 at 12:22 PM, Mark Hamstra m...@clearstorydata.com wrote: No,

Re: Some questions about task distribution and execution in Spark

2013-10-03 Thread Eduardo Berrocal
The spark code is in my /home directory, which is shared over NFS to all nodes, so all workers should be able to access the same file. On Thu, Oct 3, 2013 at 2:34 PM, Mark Hamstra m...@clearstorydata.com wrote: But the worker has to be on a node that has local access to the file. On Thu, Oct

Troubleshooting and how to interpret the logs

2013-10-03 Thread Ashish Rangole
Hi, Trying to figure out what it means when the application (driver program) logs end with lines like the ones below. This is with the application running on Spark 0.8.0 on EC2. Any help will be greatly appreciated. Thanks! 13/10/03 16:17:33 INFO cluster.ClusterTaskSetManager:

Re: java.lang.AbstractMethodError

2013-10-03 Thread Martin Weindel
Hi Eduardo, it seems to me that your second problem is caused by inconsistent, i.e. different, classes in the master and worker JVMs. Are you sure that you have replaced the changed FlatMapFunction on all worker nodes and also on the master? Regards, Martin 13/10/03 13:27:44 INFO

Re: java.lang.AbstractMethodError

2013-10-03 Thread Eduardo Berrocal
Hi Martin, Yes, that is how it seems. However, it is unlikely that is the case, because I have all Spark classes in my home directory, which is mounted over NFS on all nodes. Unless there is something else I am missing... Edu On Thu, Oct 3, 2013 at 3:29 PM, Martin Weindel martin.wein...@gmail.com wrote:

Re: java.lang.AbstractMethodError

2013-10-03 Thread Martin Weindel
Hi Eduardo, if you are using Spark 0.7.3, I remember that I additionally had to replace the class file at spark-0.7.3/core/target/scala-2.9.3/classes/spark/api/java/function/ Martin On 03.10.2013 22:35, Eduardo Berrocal wrote: Hi Martin, Yes, that is how it seems. However, it is unlikely

Re: Some questions about task distribution and execution in Spark

2013-10-03 Thread Shay Seng
Ah, ok. Thanks for the clarification. When I create a file that is only visible on the master, I get the following error... f.map(l => l.split(" ")).collect() 13/10/03 20:38:48 INFO util.NativeCodeLoader: Loaded the native-hadoop library 13/10/03 20:38:48 WARN snappy.LoadSnappy: Snappy native library not
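Two ways to avoid this, sketched under the assumption that the driver and workers do not share a filesystem (all paths are illustrative):

    // Option 1: read from storage that every worker can reach, e.g. HDFS.
    val f = sc.textFile("hdfs://namenode:9000/data/input.txt")

    // Option 2: ship a small driver-local file to every node with addFile,
    // then open the shipped copy on the worker via SparkFiles.get.
    import org.apache.spark.SparkFiles
    sc.addFile("/home/user/local-only.txt")
    val words = sc.parallelize(1 to 1).flatMap { _ =>
      scala.io.Source.fromFile(SparkFiles.get("local-only.txt")).getLines()
    }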

Re: java.lang.AbstractMethodError

2013-10-03 Thread Eduardo Berrocal
And by .java, I mean .scala On Thu, Oct 3, 2013 at 4:03 PM, Eduardo Berrocal eberr...@hawk.iit.edu wrote: That directory is not present in version 0.8.0 (the one I am using). However, the files for FlatMapFunction are present in Spark 0.8.0: $ find ./ -name FlatMapFunction*

Re: java.lang.AbstractMethodError

2013-10-03 Thread Eduardo Berrocal
That directory is not present in version 0.8.0 (the one I am using). However, the files for FlatMapFunction are present in Spark 0.8.0: $ find ./ -name FlatMapFunction*

Re: java.lang.AbstractMethodError

2013-10-03 Thread Eduardo Berrocal
Ok, I found the mistake! It really came to me by inspiration; otherwise I don't know how it would suddenly have occurred to me. The problem is that I packed my application in a jar file with the old spark-assembly-0.8.0-incubating-hadoop1.0.4.jar in it. So that is the reason why the workers and master have
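A common way to avoid packing a Spark assembly into the application jar is to mark the Spark dependency as provided, so the cluster's own classes are always used at runtime. A minimal build.sbt sketch for the 0.8.0-incubating era (the coordinates shown are my assumption of the published artifact):

    name := "my-spark-app"

    scalaVersion := "2.9.3"

    // "provided" keeps spark-core out of the assembled application jar, so
    // the master and workers always load the cluster's own Spark classes.
    libraryDependencies +=
      "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-incubating" % "provided"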

Sort order of RDD rows

2013-10-03 Thread Mingyu Kim
Hi all, Is the sort order guaranteed if you apply operations like map(), filter() or distinct() after a sort in a distributed setting (run on a cluster of machines backed by HDFS)? In other words, does rdd.sortByKey().map() have the same sort order as rdd.sortByKey()? If so, is it documented

Re: Sort order of RDD rows

2013-10-03 Thread Matei Zaharia
Yes, it is for these map-like operations. The only time it isn't is when you change the RDD's partitioner, e.g. by doing sortByKey or groupByKey. It would definitely be good to document this more formally. Matei On Oct 3, 2013, at 3:33 PM, Mingyu Kim m...@palantir.com wrote: Hi all,
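A small illustration of the distinction, assuming a spark-shell session (the collect() results noted in the comments are illustrative):

    val sorted = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"))).sortByKey()

    // map and filter operate within partitions and do not move rows, so
    // the sorted order survives them.
    sorted.map { case (k, v) => (k, v.toUpperCase) }.collect()  // (1,A), (2,B), (3,C)
    sorted.filter { case (k, _) => k != 2 }.collect()           // (1,a), (3,c)

    // groupByKey installs a new partitioner and shuffles the data, so the
    // previous ordering is no longer guaranteed.
    sorted.groupByKey().collect()  // order not guaranteed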

Re: Sort order of RDD rows

2013-10-03 Thread Mingyu Kim
Got it. Thanks a lot! On Thursday, October 3, 2013 at 6:00 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yes, it