Re: Why collect() has a stage but first() not?

2014-02-19 Thread Aaron Davidson
first() is allowed to run locally, which means that the driver will execute the action itself without launching any tasks. This is also true of take(n) for sufficiently small n, for instance. On Wed, Feb 19, 2014 at 9:55 AM, David Thomas dt5434...@gmail.com wrote: If I perform a 'collect'
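
For illustration, here is a minimal Scala sketch of the difference, assuming a spark-shell session where sc is already defined; the RDD and its size are made up.

    val data = sc.parallelize(1 to 1000000, 100)

    // collect() schedules tasks on the executors and pulls every partition
    // back to the driver, so it shows up as a full stage.
    val everything = data.collect()

    // first() (and take(n) for small n) may run locally: the driver computes
    // just the first partition itself, so no stage is launched.
    val one = data.first()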

Re: Why collect() has a stage but first() not?

2014-02-19 Thread Aaron Davidson
But my RDD is placed on the worker nodes. So how can the driver perform the action by itself? On Wed, Feb 19, 2014 at 10:57 AM, Aaron Davidson ilike...@gmail.com wrote: first() is allowed to run locally, which means that the driver will execute the action itself without launching any

Re: Using local[N] gets Too many open files?

2014-02-16 Thread Aaron Davidson
If you are intentionally opening many files at once and getting that error, then it is a fixable OS issue. Please check out this discussion regarding changing the file limit in /etc/limits.conf: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-td1464.html If you feel that

Re: [External] Re: Too many open files

2014-02-13 Thread Aaron Davidson
Note that you may just be changing the soft limit, and are still being capped by the hard (system-wide) limit. Changing the /etc/limits.conf file as specified above allows you to modify both the soft and hard limits, and requires a restart of the machine to take effect. On Thu, Feb 13, 2014 at

Re: Shuffle file not found Exception

2014-02-09 Thread Aaron Davidson
This sounds bad, and probably related to shuffle file consolidation. Turning off consolidation would probably get you working again, but I'd really love to track down the bug. Do you know if any tasks fail before those errors start occurring? It's very possible that another exception is occurring
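
As a hedged sketch of the workaround mentioned above (not a fix for the underlying bug), shuffle file consolidation can be disabled through the spark.shuffle.consolidateFiles property in 0.8.x-era Spark; the master URL and app name are placeholders.

    import org.apache.spark.SparkContext

    // Fall back to one shuffle file per map task per reducer by turning
    // consolidation off before the SparkContext is created.
    System.setProperty("spark.shuffle.consolidateFiles", "false")

    val sc = new SparkContext("spark://master:7077", "MyApp")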

Re: Hash Join in Spark

2014-02-03 Thread Aaron Davidson
This method is doing very little. Line 2 constructs the CoGroupedRDD, which will do all the real work. Note that while this cogroup function just groups 2 RDDs together, CoGroupedRDD allows general n-way cogrouping, so it takes a Seq[RDD(K, _)] rather than just 2 such key-value RDDs. The rest of
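
A small illustrative example of the two-RDD case, assuming a spark-shell session (so sc and the pair-RDD implicits are in scope); the data is invented.

    val orders  = sc.parallelize(Seq(("alice", 1), ("bob", 2), ("alice", 3)))
    val refunds = sc.parallelize(Seq(("alice", 10)))

    // cogroup builds a CoGroupedRDD underneath: for every key it yields the
    // values from each input RDD, grouped side by side.
    orders.cogroup(refunds).collect().foreach { case (key, (os, rs)) =>
      println(key + " -> orders=" + os + " refunds=" + rs)
    }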

Re: Spark app gets slower as it gets executed more times

2014-02-02 Thread Aaron Davidson
Are you seeing any exceptions in between running apps? Does restarting the master/workers actually cause Spark to speed back up again? It's possible, for instance, that you run out of disk space, which should cause exceptions but not go away by restarting the master/workers. One thing to worry

Re: default parallelism in trunk

2014-02-02 Thread Aaron Davidson
Could you give an example where default parallelism is set to 2 where it didn't used to be? Here is the relevant section for the spark standalone mode:
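
For context, a hedged sketch of pinning default parallelism explicitly (via a system property, as in 0.8.x-era standalone mode) so that operations without an explicit partition count behave predictably; the master URL and numbers are placeholders.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    System.setProperty("spark.default.parallelism", "16")

    val sc = new SparkContext("spark://master:7077", "ParallelismCheck")

    // reduceByKey with no explicit partition count falls back to the default.
    val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1))).reduceByKey(_ + _)
    println(counts.partitions.length)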

Re: How to access global kryo instance?

2014-01-06 Thread Aaron Davidson
On Tue, Jan 7, 2014 at 2:52 AM, Aaron Davidson ilike...@gmail.com wrote: Please take a look at the source code -- it's relatively friendly, and very useful for digging into Spark internals! (KryoSerializer: https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache
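
As a rough sketch (not the linked source itself), one can construct a KryoSerializer directly and round-trip an object; this assumes the 0.8.x-style no-argument constructor, which reads settings such as spark.kryo.registrator from system properties (newer versions take a SparkConf instead).

    import org.apache.spark.serializer.KryoSerializer

    // Assumption: 0.8.x no-arg constructor configured via system properties.
    val ser  = new KryoSerializer().newInstance()
    val buf  = ser.serialize(Map("answer" -> 42))
    val back = ser.deserialize[Map[String, Int]](buf)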

Re: ADD_JARS doesn't properly work for spark-shell

2014-01-04 Thread Aaron Davidson
Buendia buendia...@gmail.com wrote: On Sun, Jan 5, 2014 at 2:28 AM, Aaron Davidson ilike...@gmail.com wrote: Additionally, which version of Spark are you running? 0.8.1. Unfortunately, this doesn't work either: MASTER=local[2] ADD_JARS=/path/to/my/jar SPARK_CLASSPATH=/path/to/my/jar

Re: NPE while reading broadcast variable.

2013-12-30 Thread Aaron Davidson
Could you post the stack trace you see for the NPE? On Mon, Dec 30, 2013 at 11:31 AM, Archit Thakur archit279tha...@gmail.com wrote: I am still getting it. I googled and found a similar open problem on stackoverflow:

Re: How to set Akka frame size

2013-12-24 Thread Aaron Davidson
The error you're receiving is because the Akka frame size must be a positive Java Integer, i.e., less than 2^31. However, the frame size is not intended to be nearly the size of the job memory -- it is the smallest unit of data transfer that Spark does. In this case, your task result size is
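
A minimal sketch of setting the frame size, assuming the 0.8.x-style system-property configuration; spark.akka.frameSize is given in MB, and the master URL is a placeholder.

    import org.apache.spark.SparkContext

    // Keep the frame size modest (tens to low hundreds of MB); it is a unit
    // of transfer, not the job's memory budget, and must fit in a Java int
    // once converted to bytes.
    System.setProperty("spark.akka.frameSize", "64")

    val sc = new SparkContext("spark://master:7077", "FrameSizeExample")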

Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Aaron Davidson
I'd be fine with one-way mirrors here (Apache threads being reflected in Google groups) -- I have no idea how one is supposed to navigate the Apache list to look for historic threads. On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts maspo...@gmail.com wrote: Thanks very much for the prompt and

Re: problems with standalone cluster

2013-12-12 Thread Aaron Davidson
You might also check the spark/work/ directory for application (Executor) logs on the slaves. On Tue, Nov 19, 2013 at 6:13 PM, Umar Javed umarj.ja...@gmail.com wrote: I have a scala script that I'm trying to run on a Spark standalone cluster with just one worker (existing on the master node).

Re: groupBy() with really big groups fails

2013-12-09 Thread Aaron Davidson
This is very likely due to memory issues. The problem is that each reducer (partition of the groupBy) builds an in-memory table of that partition. If you have very few partitions, this will fail, so the solution is to simply increase the number of reducers. For example: sc.parallelize(1 to
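
A sketch of the idea, assuming a spark-shell session with sc defined; the key function and partition counts are only illustrative.

    val nums = sc.parallelize(1 to 10000000)

    // groupBy's second argument is the number of reduce partitions. Each
    // reducer holds its whole partition's groups in memory, so more
    // partitions means a smaller in-memory table per task.
    val few  = nums.groupBy(i => i % 100000, 8)    // prone to OOM
    val many = nums.groupBy(i => i % 100000, 512)  // smaller tables per reducer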

Re: Serializable incompatible with Externalizable error

2013-12-03 Thread Aaron Davidson
This discussion seems to indicate the possibility of a mismatch between one side being Serializable and the other being Externalizable: https://forums.oracle.com/thread/2147644 In general, the semantics of Serializable can be pretty strange as it doesn't really behave the same as usual

Re: spark-shell not working on standalone cluster (java.io.IOException: Cannot run program compute-classpath.sh)

2013-11-25 Thread Aaron Davidson
There is a pull request currently to fix this exact issue, I believe, at https://github.com/apache/incubator-spark/pull/192. It's very small and only touches the script files, so you could apply it to your current version and distribute it to the workers. The fix here is that you add an additional

Re: oome from blockmanager

2013-11-22 Thread Aaron Davidson
compression or seeing more than 48k DiskBlockObjectWriters to account for the remaining memory used. On Fri, Nov 22, 2013 at 9:05 AM, Aaron Davidson ilike...@gmail.com wrote: Great, thanks for the feedback. It sounds like you're using the LZF compression scheme -- switching to Snappy should see

Re: debugging a Spark error

2013-11-18 Thread Aaron Davidson
Have you looked a the Spark executor logs? They're usually located in the $SPARK_HOME/work/ directory. If you're running in a cluster, they'll be on the individual slave nodes. These should hopefully reveal more information. On Mon, Nov 18, 2013 at 3:42 PM, Chris Grier

Re: EC2 node submit jobs to separate Spark Cluster

2013-11-18 Thread Aaron Davidson
The main issue with running a spark-shell locally is that it orchestrates the actual computation, so you want it to be close to the actual Worker nodes for latency reasons. Running a spark-shell on EC2 in the same region as the Spark cluster avoids this problem. The error you're seeing seems to

Re: foreachPartition in Java

2013-11-17 Thread Aaron Davidson
Also, in general, you can workaround shortcomings in the Java API by converting to a Scala RDD (using JavaRDD's rdd() method). The API tends to be much clunkier since you have to jump through some hoops to talk to a Scala API in Java, though. In this case, JavaRDD's mapPartition() method will

Re: number of splits for standalone cluster mode

2013-11-17 Thread Aaron Davidson
The number of splits can be configured when reading the file, as an argument to textFile(), sequenceFile(), etc (see the docs: http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext@textFile(String,Int):RDD[String]). Note that this is a minimum, however, as
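
A brief example of passing the minimum split count, assuming a spark-shell session; the HDFS path is made up.

    // The second argument is the minimum number of splits; Hadoop may still
    // create more if the input has more blocks, but never fewer.
    val logs = sc.textFile("hdfs:///data/access.log", 64)
    println(logs.partitions.length)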

Re: failure notice

2013-11-17 Thread Aaron Davidson
Apache Ant(TM) version 1.8.2 compiled on May 18 2012 Mvh/BR Egon Kidmose On Sun, Nov 17, 2013 at 7:40 PM, Aaron Davidson ilike...@gmail.com wrote: Could you report your ant/Ivy version? Just run ant -version The fundamental problem is that Ivy is stupidly thinking .orbit is the file

Re: Memory configuration in local mode

2013-11-16 Thread Aaron Davidson
are using local mode, you can just pass -Xmx32g to the JVM that is launching spark and it will have that much memory. On Fri, Nov 15, 2013 at 6:30 PM, Aaron Davidson ilike...@gmail.com wrote: One possible workaround would be to use the local-cluster Spark mode. This is normally used only

Re: Memory configuration in local mode

2013-11-15 Thread Aaron Davidson
One possible workaround would be to use the local-cluster Spark mode. This is normally used only for testing, but it will actually spawn a separate process for the executor. The format is: new SparkContext("local-cluster[1,4,32000]") This will spawn 1 Executor that is allocated 4 cores and 32GB
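
A hedged sketch of both options from this thread, with placeholder sizes: local-cluster mode spawns a real executor process, while plain local mode just needs a bigger driver heap.

    import org.apache.spark.SparkContext

    // local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]:
    // here, 1 executor with 4 cores and roughly 32 GB.
    val sc = new SparkContext("local-cluster[1,4,32000]", "BigLocalJob")

    // For plain local[N] mode the equivalent is simply launching the driver
    // JVM with a larger heap, e.g. -Xmx32g.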

Re: mapping of shuffle outputs to reduce tasks

2013-11-10 Thread Aaron Davidson
It is responsible for a subset of shuffle blocks. MapTasks split up their data, creating one shuffle block for every reducer. During the shuffle phase, the reducer will fetch all shuffle blocks that were intended for it. On Sun, Nov 10, 2013 at 9:38 PM, Umar Javed umarj.ja...@gmail.com wrote:

Re: oome from blockmanager

2013-11-05 Thread Aaron Davidson
As a followup on this, the memory footprint of all shuffle metadata has been greatly reduced. For your original workload with 7k mappers, 7k reducers, and 5 machines, the total metadata size should have decreased from ~3.3 GB to ~80 MB. On Tue, Oct 29, 2013 at 9:07 AM, Aaron Davidson ilike

Re: Executor could not connect to Driver?

2013-11-01 Thread Aaron Davidson
I've seen this happen before due to the driver doing long GCs when the driver machine was heavily memory-constrained. For this particular issue, simply freeing up memory used by other applications fixed the problem. On Fri, Nov 1, 2013 at 12:14 AM, Liu, Raymond raymond@intel.com wrote: Hi

Re: repartitioning RDDS

2013-10-31 Thread Aaron Davidson
Stephen is exactly correct, I just wanted to point out that in Spark 0.8.1 and above, the repartition function has been added to be a clearer way to accomplish what you want. (Coalescing into a larger number of partitions doesn't make much linguistic sense.) On Thu, Oct 31, 2013 at 7:48 AM,
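
To illustrate, a small sketch assuming Spark 0.8.1 or later and a spark-shell session; the path and partition counts are placeholders.

    val lines = sc.textFile("hdfs:///data/events", 8)

    // coalesce is the natural spelling for reducing the partition count;
    // repartition (added in 0.8.1) is the clearer way to increase it.
    val fewer = lines.coalesce(4)
    val more  = lines.repartition(64)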

Re: Spark cluster memory configuration for spark-shell

2013-10-29 Thread Aaron Davidson
You are correct. If you are just using spark-shell in local mode (i.e., without cluster), you can set the SPARK_MEM environment variable to give the driver more memory. E.g.: SPARK_MEM=24g ./spark-shell Otherwise, if you're using a real cluster, the driver shouldn't require a significant amount

Re: gc/oome from 14,000 DiskBlockObjectWriters

2013-10-25 Thread Aaron Davidson
Snappy sounds like it'd be a better solution here. LZF requires a pretty sizeable buffer per stream (accounting for the 300k you're seeing). It looks like you have 7000 reducers, and each one requires an LZF-compressed stream. Snappy has a much lower overhead per stream, so I'd give it a try.
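
A hedged sketch of switching the codec, assuming a Spark version of that era where the Snappy codec class is available; the master URL and app name are placeholders.

    import org.apache.spark.SparkContext

    // Snappy keeps a much smaller buffer per open compressed stream than LZF,
    // which matters when thousands of reducer streams are open at once.
    System.setProperty("spark.io.compression.codec",
                       "org.apache.spark.io.SnappyCompressionCodec")

    val sc = new SparkContext("spark://master:7077", "SnappyShuffle")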

Re: Broken link in quickstart

2013-10-22 Thread Aaron Davidson
Thanks for the heads up! I have submitted a pull request to fix it (#98: https://github.com/apache/incubator-spark/pull/98), so it should be corrected soon. In the meantime, if anyone is curious, the real link should be

Re: Help with Initial Cluster Configuration / Tuning

2013-10-22 Thread Aaron Davidson
On the other hand, I totally agree that memory usage in Spark is rather opaque, and is one area where we could do a lot better in terms of communicating issues, through both docs and instrumentation. At least with serialization and such, you can get meaningful exceptions (hopefully), but OOMs

Re: Spark REPL produces error on a piece of scala code that works in pure Scala REPL

2013-10-12 Thread Aaron Davidson
with peculiarity of the Scala shell: https://groups.google.com/forum/#!searchin/spark-users/error$3A$20type$20mismatch|sort:relevance/spark-users/bwAmbUgxWrA/HwP4Nv4adfEJ On Fri, Oct 11, 2013 at 6:01 PM, Aaron Davidson ilike...@gmail.com wrote: Playing around with this a little more, it seems

Re: Spark REPL produces error on a piece of scala code that works in pure Scala REPL

2013-10-11 Thread Aaron Davidson
Playing around with this a little more, it seems that classOf[Animal] is this.Animal in Spark and Animal in normal Scala. Also, trying to do something like this: class Zoo[A : this.Animal](thing: A) { } works in Scala but throws a weird error in Spark: error: type Animal is not a member of

Re: spark_ec2 script in 0.8.0 and mesos

2013-10-08 Thread Aaron Davidson
Also, please post feature requests here: http://spark-project.atlassian.net Make sure to search prior to posting to avoid duplicates. On Tue, Oct 8, 2013 at 11:50 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Shay, We actually don't support Mesos in the EC2 scripts anymore -- sorry

Re: spark through vpn, SPARK_LOCAL_IP

2013-10-05 Thread Aaron Davidson
You might also try setting spark.driver.host to the correct IP via SPARK_JAVA_OPTS in conf/spark-env.sh as well, e.g., -Dspark.driver.host=192.168.250.47 On Sat, Oct 5, 2013 at 2:45 PM, Aaron Babcock aaron.babc...@gmail.com wrote: Hello, I am using spark through a vpn. My driver machine