Problems with broadcasting a large data structure

2014-01-07 Thread Sebastian Schelter
Spark repeatedly fails to broadcast a large object on a cluster of 25 machines for me. I get log messages like this: [spark-akka.actor.default-dispatcher-4] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(3, cloud-33.dima.tu-berlin.de, 42185, 0) with no
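
For context, the broadcast pattern under discussion looks roughly like this; a minimal sketch with made-up names (loadBigTable, records, r.id), not Sebastian's actual code, assuming an sc is available as in the Spark shell:

    // Hypothetical large lookup structure; the real object in the thread is not shown.
    val bigLookup: Map[Long, Double] = loadBigTable()

    // Ship one read-only copy to each worker instead of one per task.
    val bc = sc.broadcast(bigLookup)

    val scored = records.map(r => bc.value.getOrElse(r.id, 0.0))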

Re: NoSuchMethodError running Spark on YARN

2014-01-07 Thread Sean Owen
Sandy, looks like a version mismatch: compiling against Spark HEAD and running against 0.8.1? The method was added to typesafe config on Oct 8 2012 and released in about version 0.6: https://github.com/typesafehub/config/commit/5f486f65ac68745ca89059a5b6b144c2daa5d157#diff-3d5ac6ed49837be68d4f47d3b96b1c81
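
One common fix for this kind of mismatch, sketched here assuming an sbt build (not Sandy's actual setup), is to compile against the same Spark release that runs on the cluster so that transitive dependencies such as com.typesafe:config line up:

    // build.sbt: depend on the cluster's Spark release instead of a locally built HEAD
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.1-incubating" % "provided"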

split an RDD by percentage

2014-01-07 Thread redocpot
Hi, I want to split an RDD by a certain percentage, like 10% (split the RDD into 10 pieces). Ideally, the preferred function would look like this: def deterministicSplit[T](dataSet: RDD[T], nb: Int): Array[RDD[T]] = { /* code */ } The dataSet is an RDD sorted by its key. For example, if nb = 10 here,
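
A minimal sketch of what such a helper could look like, assuming the RDD has already been keyed by a Long position index in [0, count); this is not an existing Spark API, just an illustration:

    import org.apache.spark.rdd.RDD

    def deterministicSplit[T](dataSet: RDD[(Long, T)], nb: Int): Array[RDD[(Long, T)]] = {
      val total = dataSet.count()
      val sliceSize = math.ceil(total.toDouble / nb).toLong
      (0 until nb).map { i =>
        val lo = i * sliceSize
        val hi = math.min((i + 1) * sliceSize, total)
        // Each slice is a lazy filter over the parent RDD, so the split is deterministic.
        dataSet.filter { case (idx, _) => idx >= lo && idx < hi }
      }.toArray
    }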

Re: is it forgotten to document how to set SPARK_WORKER_DIR?

2014-01-07 Thread Archit Thakur
What do you mean by worker dir? On Tue, Jan 7, 2014 at 11:43 AM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, all I’m trying to change my worker dir to a mounted disk with larger space, but I found no documentation telling me how to do this; I had to check the source code and found the

Re: is it forgotten to document how to set SPARK_WORKER_DIR?

2014-01-07 Thread Nan Zhu
On worker machines you will find $SPARK_HOME/work/; that's the worker dir. -- Nan Zhu On Tuesday, January 7, 2014 at 7:29 AM, Archit Thakur wrote: What do you mean by worker dir? On Tue, Jan 7, 2014 at 11:43 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
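
For reference, the setting Nan is looking for can be exported in conf/spark-env.sh on each worker before restarting it; the path below is only an example:

    # conf/spark-env.sh
    export SPARK_WORKER_DIR=/mnt/bigdisk/spark-work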

Re: the spark worker assignment Question?

2014-01-07 Thread Aureliano Buendia
On Thu, Jan 2, 2014 at 5:52 PM, Andrew Ash and...@andrewash.com wrote: That sounds right, Mayur. Also, in 0.8.1 I hear there's a new repartition method that you might be able to use to further distribute the data. But if your data is so small that it fits in just a couple blocks, why are you
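
The repartition call Andrew mentions (added in 0.8.1) spreads a small input over more partitions so every worker gets tasks; a rough sketch with made-up names:

    // Fan a small (few-block) input out to many partitions before the heavy work.
    val spread = smallInput.repartition(100)   // 100 is arbitrary; roughly 2-4x the total cores is typical
    val results = spread.map(expensiveComputation)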

How to time transformations and provide more detailed progress report?

2014-01-07 Thread Aureliano Buendia
Hi, When we time an action it includes all the transformations' timings too, and it is not clear which transformation takes how long. Is there a way of timing each transformation separately? Also, does Spark provide a way of more detailed progress reporting, broken down into transformation steps? For

Re: Problems with broadcasting a large data structure

2014-01-07 Thread Aureliano Buendia
What's the size of your large object to be broadcast? On Tue, Jan 7, 2014 at 8:55 AM, Sebastian Schelter s...@apache.org wrote: Spark repeatedly fails to broadcast a large object on a cluster of 25 machines for me. I get log messages like this: [spark-akka.actor.default-dispatcher-4] WARN

Re: the spark worker assignment Question?

2014-01-07 Thread Andrew Ash
If small-file is hosted in HDFS, I think the default is one partition per HDFS block. If it fits in one block (64MB each by default), that might be just one partition. Sent from my mobile phone On Jan 7, 2014 8:46 AM, Aureliano Buendia buendia...@gmail.com wrote: On Thu, Jan 2, 2014 at 5:52
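
You can also ask for more partitions when reading the file; a sketch assuming the 0.8.x textFile signature, where the second argument is a minimum number of splits:

    // Request at least 16 partitions even if the file fits in a single HDFS block.
    val data = sc.textFile("hdfs:///path/to/small-file", 16)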

Re: the spark worker assignment Question?

2014-01-07 Thread Andrew Ash
I think that would do what you want. I'm guessing in ... you have an rdd and then call .collect on it -- normally this would be a bad idea because of large data sizes, but if you KNOW that it's small then you can force it through just that one machine. On Tue, Jan 7, 2014 at 9:20 AM, Aureliano
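
The pattern Andrew describes, sketched with made-up names (the elided code from the thread is not reproduced):

    // Safe only because the data is known to be small.
    val rows: Array[String] = smallRdd.collect()
    // From here on this is plain local Scala on the driver, i.e. a single machine.
    val combined = rows.mkString("\n")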

Re: the spark worker assignment Question?

2014-01-07 Thread Aureliano Buendia
On Tue, Jan 7, 2014 at 6:04 PM, Andrew Ash and...@andrewash.com wrote: I think that would do what you want. I'm guessing in ... you have an rdd and then call .collect on it -- normally this would be a bad idea because of large data sizes, but if you KNOW that it's small then you can force it

Re: How to time transformations and provide more detailed progress report?

2014-01-07 Thread Mark Hamstra
When we time an action it includes all the transformations timings too, and it is not clear which transformation takes how long. Is there a way of timing each transformation separately? Not really, because even though you may logically specify several different transformations within your
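
Because transformations are lazy and get pipelined into stages, there is no built-in per-transformation timer. A common workaround, sketched below with made-up names (raw, parse, isValid), is to cache each intermediate RDD and time an action that forces it; note that this changes the execution plan, so the numbers are only indicative:

    def timed[A](label: String)(body: => A): A = {
      val start = System.nanoTime()
      val result = body
      println(label + " took " + (System.nanoTime() - start) / 1e9 + " s")
      result
    }

    val parsed = raw.map(parse).cache()
    timed("map(parse)")      { parsed.count() }   // forces just this step
    val filtered = parsed.filter(isValid).cache()
    timed("filter(isValid)") { filtered.count() }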

Re: How to make Spark merge the output file?

2014-01-07 Thread Nan Zhu
Hi, all Thanks for the reply. I actually need to provide a single file to an external system to process it… it seems that I have to make the consumer of the file support multiple inputs. Best, -- Nan Zhu On Tuesday, January 7, 2014 at 12:37 PM, Aaron Davidson wrote: HDFS, since 0.21
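
Two workarounds that commonly come up for this (sketches, not from the thread): write through a single partition, or keep parallel output and merge the part files afterwards with HDFS tooling.

    // Option 1: funnel everything through one partition (one writer, so this can be slow).
    results.coalesce(1).saveAsTextFile("hdfs:///out/single")

    // Option 2: write normally, then merge outside Spark, e.g.
    //   hadoop fs -getmerge /out/many merged.txt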

Re: How to make Spark merge the output file?

2014-01-07 Thread Debasish Das
Hi Nan, A cleaner approach is to expose a RESTful service to the external system. The external system calls the service with the appropriate API. For Scala, Spray can be used to build these services. Twitter OSS also has many examples of this service design. Thanks. Deb On Tue, Jan 7, 2014 at 10:25

Spark on Yarn classpath problems

2014-01-07 Thread Eric Kimbrel
I am trying to run Spark version 0.8.1 on Hadoop 2.2.0-cdh5.0.0-beta-1 with YARN. I am using the YARN Client with yarn-standalone mode as described here: http://spark.incubator.apache.org/docs/latest/running-on-yarn.html To simplify matters, I’ll say my application code is all contained in

EC2 scripts documentations lacks how to actually run applications

2014-01-07 Thread Aureliano Buendia
Hi, The EC2 documentation (http://spark.incubator.apache.org/docs/0.8.1/ec2-scripts.html) has a section called 'Running Applications', but it actually lacks the step that should describe how to run the application. The spark_ec2

ship MatrixFactorizationModel with each partition?

2014-01-07 Thread Nan Zhu
Hi, all I’m trying the ALS in MLlib; the following is my code: val result = als.run(ratingRDD) val allMovies = ratingRDD.map(rating => rating.product).distinct() val allUsers = ratingRDD.map(rating => rating.user).distinct() val allUserMoviePair = allUsers.cartesian(allMovies)

Spark SequenceFile Java API Repeat Key Values

2014-01-07 Thread Michael Quinlan
I've spent some time trying to import data into an RDD using the Spark Java API, but am not able to properly load data stored in a Hadoop v1.1.1 sequence file with key and value types both LongWritable. I've attached a copy of the sequence file to this posting. It contains 3000 key, value pairs.

RE: Spark on Yarn classpath problems

2014-01-07 Thread Liu, Raymond
Not found in which part of the code? If in the SparkContext thread, say on the AM, --addJars should work. If on tasks, then --addJars won't work; you need to use --file=local://xxx etc., though I'm not sure it is available in 0.8.1. And bundling everything into a single jar should also work; if it doesn't, something might be wrong

Re: ship MatrixFactorizationModel with each partition?

2014-01-07 Thread Matei Zaharia
Sorry, you actually can’t call predict() on the cluster because the model contains some RDDs. There was a recent patch that added a parallel predict method, here: https://github.com/apache/incubator-spark/pull/328/files. You can grab the code from that method there (which does a join) and call
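
The join-based approach Matei points to looks roughly like this; a sketch assuming the model returned by als.run (the `result` from Nan's snippet) exposes userFeatures and productFeatures as (id, Array[Double]) pair RDDs, as in that patch:

    import org.apache.spark.SparkContext._               // pair-RDD operations (join)
    import org.apache.spark.mllib.recommendation.Rating

    val predictions = allUserMoviePair                   // RDD[(user, product)]
      .join(result.userFeatures)                         // (user, (product, userVec))
      .map { case (user, (product, uVec)) => (product, (user, uVec)) }
      .join(result.productFeatures)                      // (product, ((user, uVec), prodVec))
      .map { case (product, ((user, uVec), pVec)) =>
        val score = uVec.zip(pVec).map { case (a, b) => a * b }.sum
        Rating(user, product, score)
      }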

Re: Spark SequenceFile Java API Repeat Key Values

2014-01-07 Thread Matei Zaharia
Yeah, unfortunately sequenceFile() reuses the Writable object across records. If you plan to use each record repeatedly (e.g. cache it), you should clone them using a map function. It was originally designed assuming you only look at each record once, but it’s poorly documented. Matei On Jan
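
For the LongWritable case in this thread, the map-based copy Matei describes could look like this; a sketch that simply unboxes the values (Writable has no public clone, so new values are built by copying):

    import org.apache.hadoop.io.LongWritable

    val pairs = sc.sequenceFile(path, classOf[LongWritable], classOf[LongWritable])
      // Copy each record out of the reused Writables so caching keeps distinct values.
      .map { case (k, v) => (k.get, v.get) }
      .cache()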

Re: Spark SequenceFile Java API Repeat Key Values

2014-01-07 Thread Andrew Ash
Matei, do you mean something like A rather than B below? A) rdd.map(_.clone).cache B) rdd.cache I'd be happy to add documentation if there's a good place for it, but I'm not sure where that would be. On Tue, Jan 7, 2014 at 9:35 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Re: Spark SequenceFile Java API Repeat Key Values

2014-01-07 Thread Matei Zaharia
Yup, a) would make it work. I’d actually prefer that we change it so it clones the objects by default, and add a boolean flag (default false) for people who want to reuse objects. We’d have to do the same in hadoopRDD and the various versions of that as well. Matei On Jan 8, 2014, at 12:38

shark not able to connect to spark master

2014-01-07 Thread danoomistmatiste
Hi, I have installed spark-0.8.1-incubating. Both my master and worker processes are running, and the master is listening on port 7077. I have also installed Shark (0.8.0). When I try to start the Shark shell (./shark-withinfo), I get this exception: 14/01/07 21:49:21 INFO client.Client$ClientActor:

Re: Spark SequenceFile Java API Repeat Key Values

2014-01-07 Thread Andrew Ash
Agreed on the clone-by-default approach -- this reused-object gotcha has hit several people I know when using Avro. We should be careful not to ignore the performance impact that made Hadoop reuse objects in the first place, though. I'm not sure what this means in practice, though; you either