Re: Rename filter() into keep(), remove() or take() ?

2014-02-27 Thread Nick Pentreath
Agree that filter is perhaps unintuitive. Though the Scala collections API has filter and filterNot which together provide context that makes it more intuitive. And yes the change could be via added methods that don't break existing API. Still overall I would be -1 on this unless a
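As a rough illustration of the naming point above, here is a minimal sketch in plain Python (not the Spark or Scala API) of how a "keep" operation and its complement read side by side. The names `is_even`, `numbers`, `kept`, and `dropped` are made up for this example; `itertools.filterfalse` stands in for Scala's `filterNot`.

```python
from itertools import filterfalse


def is_even(n):
    """Predicate used for both the keeping and the dropping variant."""
    return n % 2 == 0


numbers = [1, 2, 3, 4, 5, 6]

# filter keeps the elements for which the predicate is true.
kept = list(filter(is_even, numbers))

# filterfalse (the counterpart of Scala's filterNot) keeps the elements
# for which the predicate is false.
dropped = list(filterfalse(is_even, numbers))

print(kept)
print(dropped)
```

Seeing the two operations together is what makes the direction of `filter` unambiguous, which is the context the Scala collections API provides.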

Re: Spark app gets slower as it gets executed more times

2014-02-27 Thread Aureliano Buendia
On Fri, Feb 7, 2014 at 7:48 AM, Aaron Davidson ilike...@gmail.com wrote: Sorry for delay, by long-running I just meant if you were running an iterative algorithm that was slowing down over time. We have observed this in the spark-perf benchmark; as file system state builds up, the job can

Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
Yes! Spark streaming programs are just like any spark program and so any ec2 cluster setup using the spark-ec2 scripts can be used to run spark streaming programs as well. On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Does the ec2 support for spark 0.9

Re: Spark streaming on ec2

2014-02-27 Thread Aureliano Buendia
On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Yes! Spark streaming programs are just like any spark program and so any ec2 cluster setup using the spark-ec2 scripts can be used to run spark streaming programs as well. Great. Does it come with any input

Re: ReduceByKey or groupByKey to Count?

2014-02-27 Thread Mayur Rustagi
sortByKey would be better, I think, as I am not sure groupByKey will sort the keyspace globally. I would say you should take input K, V; groupByKey K,V = K,Seq(V..); partitionBy default partitioner (hash); sortByKey K,Seq(V..). Output this; the only thing is, if you need K,V pairs you will have to
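The pipeline described above (group values by key, then sort by key) can be sketched in plain Python to show the shape of the data at each step. This is only a single-machine stand-in under stated assumptions: `group_by_key` and `sort_by_key` are hypothetical helpers written for this example, not Spark's `groupByKey` or `sortByKey`, and the hash-partitioning step is omitted because it has no effect on a single in-memory list.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Turn (K, V) pairs into a mapping K -> list of V (like K, Seq(V..))."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def sort_by_key(grouped):
    """Return (K, list of V) entries ordered by key."""
    return sorted(grouped.items(), key=lambda item: item[0])


pairs = [("b", 2), ("a", 1), ("b", 3), ("c", 4), ("a", 5)]

grouped = group_by_key(pairs)
ordered = sort_by_key(grouped)

# A per-key count falls out of the grouped form by taking each list's length.
counts = [(key, len(values)) for key, values in ordered]

print(ordered)
print(counts)
```

If only counts are needed, accumulating a running total per key avoids materializing every value list, which is the usual argument for a reduce-style approach over grouping.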

Re: Build Spark Against CDH5

2014-02-27 Thread Brian Brunner
Just as a second note, I am able to build the source in the official 0.9.0 release (http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Build-Spark-Against-CDH5-tp2129p2130.html Sent

Running Spark with Python 2.7.5+

2014-02-27 Thread nicholas.chammas
The provided Spark EC2 scripts (https://spark.incubator.apache.org/docs/0.9.0/ec2-scripts.html) and default AMI ship with Python 2.6.8. I would like to use Python 2.7.5 or later. I believe that among the 2.x versions, 2.7 is the most popular. What's the easiest way to get my Spark cluster on Python

Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
Yes, the default spark EC2 cluster runs the standalone deploy mode. Since Spark 0.9, the standalone deploy mode allows you to launch the driver app within the cluster itself and automatically restart it if it fails. You can read about launching your app inside the cluster