Re: Problems with broadcast large datastructure

2014-01-12 Thread lihu
Oh, I was misled by the following log info; I thought the broadcast variable was sent back to the driver. So the result sent back to the driver has no relationship with the broadcast variable, but then what is it, since it seems no data should be sent back? *org.apache.spark.executor.Executor - Serialized s

Re: Problems with broadcast large datastructure

2014-01-12 Thread Mosharaf Chowdhury
Size calculation is correct, but broadcast happens from the driver to the workers. Btw, your code is broadcasting 400MB 30 times, and the copies are not being evicted from the cache fast enough, which, I think, is causing the blockManagers to run out of memory. On Sun, Jan 12, 2014 at 9:34 PM, lihu wrote: >
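The arithmetic behind the eviction concern can be sketched in plain Scala (a back-of-the-envelope estimate; the 1e8-element Array[Int] and the 4-bytes-per-Int figure are assumptions for illustration, not taken from the thread):

```scala
// Back-of-the-envelope estimate of block manager pressure when a ~400MB
// broadcast is repeated 30 times without the earlier copies being evicted.
val elements = 100000000L                 // assume an Array[Int] of 1e8 elements
val bytesPerInt = 4L                      // raw payload, ignoring object headers
val oneBroadcastMB = elements * bytesPerInt / (1024 * 1024)   // ~381 MB
val iterations = 30
// If none of the earlier copies is evicted, every executor's block
// manager must hold all of them simultaneously.
val totalMB = oneBroadcastMB * iterations                     // ~11.4 GB
println(s"one broadcast: $oneBroadcastMB MB, retained total: $totalMB MB")
```

Holding on the order of 11 GB of stale broadcast blocks per executor would plausibly exhaust a typical 2014-era heap, which matches the OOM symptom described above.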

Re: Problems with broadcast large datastructure

2014-01-12 Thread lihu
Yes, I am just using the code snippet from the broadcast example, running it in the spark-shell. I thought broadcast meant the driver sends to the executors and the executors send back; is there something wrong with how I calculate the broadcast size? *val MAX_ITER = 30* *val num = 1* *var ar
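Since the snippet above is truncated, here is a standalone way to sanity-check the serialized size of a value outside Spark (a sketch assuming plain Java serialization; Spark's configured serializer may produce a somewhat different byte count):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Measure how many bytes a value occupies after Java serialization.
def serializedSizeBytes(value: AnyRef): Long = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(value)
  out.close()
  buffer.size().toLong
}

val arr = new Array[Int](1000000)   // 1 million Ints, ~4 MB of raw payload
println(s"serialized size: ${serializedSizeBytes(arr)} bytes")
```

The serialized size should come out slightly above 4 MB (payload plus a small stream header), which is a quick way to confirm that a 1e8-element array really is in the ~400MB range being discussed.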

Re: Spark on google compute engine

2014-01-12 Thread Debasish Das
Hi Aureliano, Look for the Google Compute Engine scripts in the Typesafe repo. They recently tested Akka Cluster on 2400 nodes on Google Compute Engine. You should be able to reuse the scripts. Thanks. Deb On Sun, Jan 12, 2014 at 8:00 PM, Aureliano Buendia wrote: > Hi, > > Has anyone worked on a s

Re: Problems with broadcast large datastructure

2014-01-12 Thread Mosharaf Chowdhury
Broadcast is supposed to send data from the driver to the executors, not in the other direction. Can you share the code snippet you are using to broadcast? -- Mosharaf Chowdhury http://www.mosharaf.com/ On Sun, Jan 12, 2014 at 8:52 PM, lihu wrote: > In my opinion, the spark system is for big d

Re: Problems with broadcast large datastructure

2014-01-12 Thread lihu
In my opinion, Spark is a system for big data, so 400M does not seem big. I read some slides about broadcast; my understanding was that the executor sends the broadcast variable back to the driver, and that each executor owns a complete copy of the broadcast variable. In my experiment, I have 20 machines, eac

Stalling during large iterative PySpark jobs

2014-01-12 Thread Jeremy Freeman
I'm reliably hitting a bug in PySpark where jobs with many iterative calculations on cached data stall out. The data is a folder of ~40 text files, each with 2 million rows and 360 entries per row; total size is ~250GB. I'm testing with the KMeans analyses included as examples (though I see the same er

Re: Problems with broadcast large datastructure

2014-01-12 Thread Mosharaf Chowdhury
400MB isn't really that big. Broadcast is expected to work with several GB of data, even in larger clusters (100s of machines). If you are using the default HttpBroadcast, then Akka isn't used to move the broadcasted data. But the block manager can run out of memory if you repetitively broadcast la
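For the Spark versions of this era, the broadcast implementation mentioned above was selectable via a system property set before the SparkContext is created (a config sketch; the TorrentBroadcastFactory class name matches the 0.8.x-era codebase, and whether switching helps in this particular case is not established in the thread):

```scala
// Select the BitTorrent-style broadcast instead of the default
// HttpBroadcast; this must be set before the SparkContext is constructed.
System.setProperty("spark.broadcast.factory",
  "org.apache.spark.broadcast.TorrentBroadcastFactory")
```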

Re: Problems with broadcast large datastructure

2014-01-12 Thread Aureliano Buendia
On Mon, Jan 13, 2014 at 4:17 AM, lihu wrote: > I have run into the same problem as you. > I have a cluster of 20 machines, and I just ran the broadcast example; all I did was change the data size in the example to 400M, which is really a small data size. > Is 400 MB a really small size f

Re: Problems with broadcast large datastructure

2014-01-12 Thread lihu
I have run into the same problem as you. I have a cluster of 20 machines, and I just ran the broadcast example; all I did was change the data size in the example to 400M, which is really a small data size, but I still hit the same problem. *So I wonder maybe the broadcast capacity is w

Spark on google compute engine

2014-01-12 Thread Aureliano Buendia
Hi, Has anyone worked on a script similar to spark-ec2 for Google Compute Engine? Google Compute Engine claims faster instance start-up times, and that, together with by-the-minute charging, makes it a desirable choice for Spark.

Re: Development version error on sbt compile publish-local

2014-01-12 Thread Patrick Wendell
Ah okay - glad you got it working... it must be due to a corruption somewhere in sbt's state. On Sun, Jan 12, 2014 at 2:18 AM, Shing Hing Man wrote: > There is no error if I do sbt/sbt clean between "sbt compile publish-local" > and "sbt/sbt assembly". Namely > > 1) sbt/sbt clean > 2) sbt/sbt co

Re: Problem running example GroupByTest from scala command line

2014-01-12 Thread Patrick Wendell
You should launch with "java" and not "scala". The "scala" command in newer versions manually adds a specific version of Akka to the classpath, which conflicts with the version Spark is using. This causes the error you are seeing. It's discussed in this thread on the dev list: http://apac

Problem running example GroupByTest from scala command line

2014-01-12 Thread Shing Hing Man
Hi, I am using the development version of Spark from git://github.com/apache/incubator-spark.git with Scala 2.10.3. The example GroupByTest runs successfully using: matmsh@gauss:~/Downloads/spark/github/incubator-spark> bin/run-example org.apache.spark.examples.GroupByTest local The script

Re: Development version error on sbt compile publish-local

2014-01-12 Thread Shing Hing Man
There is no error if I do sbt/sbt clean between "sbt compile publish-local" and "sbt/sbt assembly". Namely:
1) sbt/sbt clean
2) sbt/sbt compile publish-local
3) sbt/sbt clean
4) SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
Now I have the spark jars in my local ivy repository and I can run spark

Re: Development version error on sbt compile publish-local

2014-01-12 Thread Shing Hing Man
Hi, Thanks for your reply! sbt/sbt clean does not help. I did the following in the incubator-spark directory and still get the same error as before.
1) sbt/sbt clean
2) SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
3) sbt/sbt compile publish-local
Shing On Sunday, January 12, 2014 12:32 AM