Re: DataFrame RDDs

2013-11-19 Thread andy petrella
Indeed, the Scala version could be blocking (I'm not sure why it needs 2.11; maybe Miles uses quasiquotes...) Andy On Tue, Nov 19, 2013 at 8:48 AM, Anwar Rizal anriza...@gmail.com wrote: I had that in mind too when Miles Sabin presented Shapeless at Scala.IO Paris last month. If anybody

Reuse the Buffer Array in the map function?

2013-11-19 Thread Wenlei Xie
Hi, I am trying to do some tasks with the following style of map function: rdd.map { e => val a = new Array[Int](100); ...some calculation... } But here the array a is really just used as a temporary buffer and can be reused. I am wondering if I can avoid constructing it every time? (As it

How to start spark Workers with fixed number of Cores?

2013-11-19 Thread ioannis.deligiannis
I am trying to start 12 workers with 2 cores on each Node using the following: In spark-env.sh (copied in every slave) I have set: SPARK_WORKER_INSTANCES=12 SPARK_WORKER_CORES=2 I start Scala console with: SPARK_WORKER_CORES=2 SPARK_MEM=3g MASTER=spark://x:7077

Re: App master failed to find application jar in the master branch on YARN

2013-11-19 Thread guojc
Hi Tom, Thank you for your response. I have double-checked that I uploaded both jars to the same folder on HDFS. I think the <name>fs.default.name</name> you pointed out is the old deprecated name for the fs.defaultFS config according

Re: Reuse the Buffer Array in the map function?

2013-11-19 Thread Mark Hamstra
mapWith can make this use case even simpler. On Nov 19, 2013, at 1:29 AM, Sebastian Schelter ssc.o...@googlemail.com wrote: You can use mapPartitions, which allows you to apply the map function elementwise to all elements of a partition. Here you can place custom code around your
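
A minimal sketch of the buffer-reuse idea via mapPartitions, assuming the `sc` provided by spark-shell; the RDD contents, buffer size, and per-element computation are placeholders:

    import java.util.Arrays

    val rdd = sc.parallelize(1 to 1000)
    val result = rdd.mapPartitions { iter =>
      val buffer = new Array[Int](100)   // allocated once per partition, not once per element
      iter.map { e =>
        Arrays.fill(buffer, 0)           // reset the scratch space before each element
        // ... some calculation that uses `buffer` as temporary storage ...
        e * 2                            // placeholder result per element
      }
    }
    println(result.count())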

Re: multiple concurrent jobs

2013-11-19 Thread Mark Hamstra
On Tue, Nov 19, 2013 at 6:50 AM, Yadid Ayzenberg ya...@media.mit.edu wrote: Hi all, According to the documentation, Spark standalone currently only supports a FIFO scheduling system. I understand it's possible to limit the number of cores a job uses by setting spark.cores.max. When running
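
A driver-side sketch of the spark.cores.max cap mentioned above, in the 0.8-era style where configuration is passed as Java system properties before the SparkContext is created; the master URL, core count, and app name are placeholders:

    import org.apache.spark.SparkContext

    // Cap this application at 3 cores so other standalone apps can run alongside it.
    System.setProperty("spark.cores.max", "3")
    val sc = new SparkContext("spark://master:7077", "capped-app")
    println(sc.parallelize(1 to 100).count())
    sc.stop()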

Re: multiple concurrent jobs

2013-11-19 Thread Mark Hamstra
According to the documentation, Spark standalone currently only supports a FIFO scheduling system. That's not true: http://spark.incubator.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools [sorry for the prior misfire] On Tue, Nov 19, 2013 at 7:30 AM, Mark Hamstra
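
A sketch of the fair-scheduler pools described on that page (this is scheduling within a single application), assuming the 0.8-era property-based configuration; the master URL, app name, and pool name are placeholders:

    import org.apache.spark.SparkContext

    System.setProperty("spark.scheduler.mode", "FAIR")    // enable fair scheduling before the context exists
    val sc = new SparkContext("spark://master:7077", "shared-app")

    sc.setLocalProperty("spark.scheduler.pool", "pool1")  // jobs submitted from this thread go to pool1
    println(sc.parallelize(1 to 1000).count())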

Re: multiple concurrent jobs

2013-11-19 Thread Prashant Sharma
I think that is "Scheduling Within an Application", and he asked about scheduling across apps. Actually, Spark standalone supports two ways of scheduling, both of which are FIFO type. http://spark.incubator.apache.org/docs/latest/spark-standalone.html One is spread-out mode and the other is use-as-few-nodes-as-possible [1]

Re: App master failed to find application jar in the master branch on YARN

2013-11-19 Thread Tom Graves
The property is deprecated but will still work. Either one is fine. Launching the job from the namenode is fine. I brought up a cluster with 2.0.5-alpha and built the latest Spark master branch and it runs fine for me. It looks like namenode 2.0.5-alpha won't even start with the defaultFS of

Re: multiple concurrent jobs

2013-11-19 Thread Mark Hamstra
Ah, sorry -- misunderstood the question. On Nov 19, 2013, at 7:48 AM, Prashant Sharma scrapco...@gmail.com wrote: I think that is "Scheduling Within an Application", and he asked about scheduling across apps. Actually, Spark standalone supports two ways of scheduling, both of which are FIFO type.

Re: multiple concurrent jobs

2013-11-19 Thread Prashant Sharma
Actually it is not documented anywhere; maybe we can do that. I was having a tough time thinking about how to implement this in SPARK-743. On Tue, Nov 19, 2013 at 9:33 PM, Yadid Ayzenberg ya...@media.mit.edu wrote: My bad - I should have stated that up front. I guess it was kind of implicit within my

Re: HttpBroadcast strange behaviour, bug?

2013-11-19 Thread Eugen Cepoi
Yes, sure, for usual tests it is fine, but the broadcast is only done if we are not in local mode (at least it seems so). In SparkContext we have def broadcast[T](value: T) = env.broadcastManager.newBroadcast[T](value, isLocal) and isLocal is computed from the master name (local or local[...]). Now if

Re: multiple concurrent jobs

2013-11-19 Thread Mark Hamstra
No, it's my fault for not reading more carefully. We do use a somewhat overloaded and specialized lexicon to describe Spark, which helps when it is used uniformly, but penalizes those who leap to misunderstanding. Prashant is correct that the largest granularity thing that a user launches to do

Re: HttpBroadcast strange behaviour, bug?

2013-11-19 Thread Sriram Ramachandrasekaran
aah, yes. I missed that. I looked into the code. Both TreeBroadcast and HttpBroadcast don't do the send or write, respectively. Will wait for other inputs. On Tue, Nov 19, 2013 at 10:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Yes, sure, for usual tests it is fine, but the broadcast is only done

Adding PySpark to an existing Python installation

2013-11-19 Thread Michal Romaniuk
Hi, I would like to use Spark to distribute some computations that rely on my existing Python installation. I know that Spark includes its own Python but it would be much easier to just install a package and perhaps do a bit of configuration. Thanks, Michal

Re: HttpBroadcast strange behaviour, bug?

2013-11-19 Thread Eugen Cepoi
This is the code creating the TreeMap: object CaseInsensitiveOrdered extends Ordering[String] { def compare(x: String, y: String): Int = x.compareToIgnoreCase(y) } TreeMap[String, JobTitle](dico.toArray: _*)(CaseInsensitiveOrdered) This is the map that is broadcast. BTW, if I remove the
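
A spark-shell sketch of broadcasting a case-insensitive TreeMap along the lines described above; JobTitle is stood in for by Int, the dictionary contents are made up, and `with Serializable` is added defensively so the ordering ships with the map:

    import scala.collection.immutable.TreeMap

    object CaseInsensitiveOrdered extends Ordering[String] with Serializable {
      def compare(x: String, y: String): Int = x.compareToIgnoreCase(y)
    }

    val dico = Map("Engineer" -> 1, "Manager" -> 2)   // placeholder dictionary
    val titles = TreeMap[String, Int](dico.toArray: _*)(CaseInsensitiveOrdered)
    val bc = sc.broadcast(titles)                     // shipped once to each executor
    sc.parallelize(Seq("ENGINEER", "manager")).map(t => bc.value.get(t)).collect()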

Re: multiple concurrent jobs

2013-11-19 Thread Imran Rashid
btw, there is an open PR to allow spreadOut to be configured per-app, instead of per-cluster https://github.com/apache/incubator-spark/pull/136 On Tue, Nov 19, 2013 at 11:20 AM, Mark Hamstra m...@clearstorydata.com wrote: No, it's my fault for not reading more carefully. We do use a somewhat

Re: multiple concurrent jobs

2013-11-19 Thread Yadid Ayzenberg
Yes, I see - I misused the term job. On 11/19/13 12:20 PM, Mark Hamstra wrote: No, it's my fault for not reading more carefully. We do use a somewhat overloaded and specialized lexicon to describe Spark, which helps when it is used uniformly, but penalizes those who leap to

Re: multiple concurrent jobs

2013-11-19 Thread Yadid Ayzenberg
Assuming I also want to run n concurrent jobs of the following type: each RDD is of the same form (JavaPairRDD), and I would like to run the same transformation on all RDDs. The brute force way would be to instantiate n threads and submit a job from each thread. Would this way be valid as
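
A sketch of the brute-force approach described above: several driver-side threads, each running the same transformation and an action on its own RDD through the shared (thread-safe) SparkContext from spark-shell; the RDD contents and thread count are made up:

    val rdds = (1 to 3).map(_ => sc.parallelize(1 to 1000).map(x => (x % 10, x)))

    val threads = rdds.map { rdd =>
      new Thread(new Runnable {
        def run() {
          val reduced = rdd.reduceByKey(_ + _)   // same transformation on each RDD
          println(reduced.count())               // each action submits its own job
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())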

Re: Adding PySpark to an existing Python installation

2013-11-19 Thread Josh Rosen
PySpark doesn't include a Python interpreter; by default, it will use your system `python`. The pyspark script ( https://github.com/apache/incubator-spark/blob/master/pyspark) just performs some setup of environment variables, adds the PySpark python dependencies to PYTHONPATH, and adds some code

Re: configuring final partition length

2013-11-19 Thread Josh Rosen
I think that the reduce() action is implemented as mapPartitions.collect().reduce(), so the number of result tasks is determined by the degree of parallelism of the RDD being reduced. Some operations, like reduceByKey(), accept a `numPartitions` argument for configuring the number of reducers:
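
A spark-shell sketch of the numPartitions argument mentioned above; the key space, input size, and partition counts are arbitrary:

    val pairs = sc.parallelize(1 to 10000, 8).map(x => (x % 100, 1))
    val reduced = pairs.reduceByKey(_ + _, 4)   // ask for 4 reduce-side partitions
    println(reduced.partitions.size)            // 4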

RAM question from the Shell

2013-11-19 Thread Gary Malouf
We have a 4-node Spark cluster with 3 GB of RAM available per executor (via the spark.executor.memory setting). When we run a Spark job, we see the following output: Using Scala version 2.9.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_21) Initializing interpreter... Creating

Re: RAM question from the Shell

2013-11-19 Thread Gary Malouf
To explain more, we upgraded from 0.7.3 to 0.9 incubating snapshot today and are getting out of memory errors very quickly even though our cluster has plenty of RAM and the data is relatively small: Using Scala version 2.9.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_21) Initializing

Re: RAM question from the Shell

2013-11-19 Thread Ewen Cheslack-Postava
This line: 13/11/19 23:17:20 INFO MemoryStore: MemoryStore started with capacity 323.9 MB looks like what you'd get if you haven't set spark.executor.memory (or SPARK_MEM). Without setting it you'll get the default of 512m per executor, and 0.66 of that for the cache. -Ewen - Ewen
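
A driver-side sketch (0.8-era property style) of setting the per-executor memory before the SparkContext is created, which should move the MemoryStore capacity off the 512m default; the master URL, app name, and "3g" value are placeholders:

    import org.apache.spark.SparkContext

    System.setProperty("spark.executor.memory", "3g")   // must be set before the context exists
    val sc = new SparkContext("spark://master:7077", "bigger-executors")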

Re: RAM question from the Shell

2013-11-19 Thread Gary Malouf
In our spark-env.sh, we have: export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so export ADD_JARS=/opt/spark/mx-lib/verrazano_2.9.3-0.1-SNAPSHOT-assembly.jar if [ -z $SPARK_JAVA_OPTS ] ; then SPARK_JAVA_OPTS=-Xss20m -Dspark.local.dir=/opt/spark/tmp -Dspark.executor.memory=3g

Re: RAM question from the Shell

2013-11-19 Thread Josh Rosen
Is spark-env.sh being sourced prior to running your job? The spark-shell script handles this automatically, but you may need to `source spark-env.sh` in the shell that runs your driver program in order for these environment variables to be set. On Tue, Nov 19, 2013 at 4:25 PM, Gary Malouf

Re: convert streaming data into table

2013-11-19 Thread Michael Kun Yang
The use case is to fit a moving average model for stock prices with the form: x_n = \sum_{i = 1}^k \alpha_i * x_{n - i} Can you please provide me the pseudo-code? On Tue, Nov 19, 2013 at 4:30 PM, andy petrella andy.petre...@gmail.com wrote: 1/ you mean like reshape in R? 2/ Or you mean by
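
A plain-Scala illustration (no Spark) of the weighted form quoted above, x_n = sum_{i=1}^{k} alpha_i * x_{n-i}, computed over a sliding window; the coefficients and price series are made-up numbers:

    val alpha  = Array(0.5, 0.3, 0.2)                 // k = 3 coefficients, alpha_1 first
    val prices = Array(10.0, 10.5, 10.2, 10.8, 11.0)

    // Each window holds x_{n-k} .. x_{n-1}; reversing pairs the most recent
    // price with alpha_1, the next with alpha_2, and so on.
    val fitted = prices.sliding(alpha.length).map { window =>
      window.reverse.zip(alpha).map { case (x, a) => a * x }.sum
    }.toList
    println(fitted)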

Re: RAM question from the Shell

2013-11-19 Thread Gary Malouf
Hi Josh, We are running our job within the spark shell, so I think it is being sourced. We are on Mesos 0.13 and Spark 0.9 SNAPSHOT taken from around 9am eastern this morning. Any other possible culprits? On Tue, Nov 19, 2013 at 7:30 PM, Josh Rosen rosenvi...@gmail.com wrote: Is

Re: convert streaming data into table

2013-11-19 Thread Michael Kun Yang
Sure, thanks a lot. If anyone has dealt with this issue before, I would appreciate the help :) On Tue, Nov 19, 2013 at 4:56 PM, andy petrella andy.petre...@gmail.com wrote: h ok. Providing pseudo-code would require me to be a bit more awake -- 2 AM in Belgium... have to go to sleep, otherwise the

Exception: Input path does not exist when using function objectFile

2013-11-19 Thread 杨强
Hi, all. I wrote two programs: A.scala and B.scala. A.scala writes the trained model to HDFS with: _wcount_rdd.saveAsObjectFile(save_path) I used the command hadoop fs -ls $save_path and found a directory named $save_path: [root@gd39 spark-0.8.0-incubating] # hadoop fs -ls
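
A spark-shell sketch of the saveAsObjectFile / objectFile round trip; the HDFS path and the pair data are placeholders, and the reading program has to refer to exactly the same URI (scheme included) as the writer for the input path to be found:

    val savePath = "hdfs:///user/me/wcount_model"   // placeholder path
    sc.parallelize(Seq(("spark", 3), ("hadoop", 2))).saveAsObjectFile(savePath)

    val restored = sc.objectFile[(String, Int)](savePath)   // read it back as typed pairs
    println(restored.collect().toList)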

problems with standalone cluster

2013-11-19 Thread Umar Javed
I have a Scala script that I'm trying to run on a Spark standalone cluster with just one worker (running on the master node). But the application just hangs. Here's the worker log output at the time of starting the job: 13/11/19 18:03:13 INFO Worker: Asked to launch executor

Re: App master failed to find application jar in the master branch on YARN

2013-11-19 Thread guojc
Hi Tom, Thank you for your help. I finally found the problem. It was a silly mistake on my part. After checking out the git repository, I forgot to change spark-env.sh under the conf folder to add the YARN config folder. I guess it might be helpful to display a warning message about that. Anyway, thank you for