Re: ReduceByKey or groupByKey to Count?

2014-02-27 Thread dmpour23
I am probably not stating my problem correctly and have not yet fully understood the Java Spark API. This is what I would like to do: read a file (which is not sorted), sort the file by a key extracted from each line, then partition the initial file into k files. The only restriction is that all lines a

Rename filter() into keep(), remove() or take() ?

2014-02-27 Thread Bertrand Dechoux
Hi, It might seem like a trivial issue, but even though it is somehow a standard name, filter() is not really explicit about how it works. Sure, it makes sense to provide a filter function, but what happens when it returns true? Is the current element removed or kept? It is not really obvious.

Re: JVM error

2014-02-27 Thread Evgeniy Shishkin
On 27 Feb 2014, at 07:22, Aaron Davidson wrote: > Setting spark.executor.memory is indeed the correct way to do this. If you > want to configure this in spark-env.sh, you can use > export SPARK_JAVA_OPTS=" -Dspark.executor.memory=20g" > (make sure to append the variable if you've been using SPA
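
For context, in the Spark 0.9 era the same property could also be set programmatically before creating the SparkContext; a minimal sketch, where the master URL and the 20g value are purely illustrative:

    // Sketch only: set the property before the context is created (Spark 0.9 era).
    // Master URL and memory size are illustrative, not from the thread.
    System.setProperty("spark.executor.memory", "20g")
    val sc = new org.apache.spark.SparkContext("spark://master:7077", "MyApp")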

Re: Rename filter() into keep(), remove() or take() ?

2014-02-27 Thread Nick Pentreath
filter comes from the Scala collection method "filter". I'd say it's best to keep in line with the Scala collections API, as Spark has done with RDDs generally (map, flatMap, take etc), so that it is easier and natural for developers to apply the same thinking for Scala (parallel) collections to Sp

Re: Rename filter() into keep(), remove() or take() ?

2014-02-27 Thread Bertrand Dechoux
I understand the explanation, but I had to try. That said, the change could be made without breaking anything, but that's another story. Regards Bertrand Bertrand Dechoux On Thu, Feb 27, 2014 at 2:05 PM, Nick Pentreath wrote: > filter comes from the Scala collection method "filter". I'd say it's

Re: Rename filter() into keep(), remove() or take() ?

2014-02-27 Thread Nick Pentreath
Agree that filter is perhaps unintuitive. Though the Scala collections API has "filter" and "filterNot", which together provide context that makes it more intuitive. And yes, the change could be made via added methods that don't break the existing API. Still, overall I would be -1 on this unless a signif
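
To make the semantics concrete: filter keeps the elements for which the predicate returns true, and Scala's filterNot drops them. A small sketch with plain Scala collections (RDD.filter behaves the same way):

    val nums = Seq(1, 2, 3, 4, 5)
    nums.filter(_ % 2 == 0)    // kept where the predicate is true: Seq(2, 4)
    nums.filterNot(_ % 2 == 0) // dropped where the predicate is true: Seq(1, 3, 5)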

RE: window every n elements instead of time based

2014-02-27 Thread Adrian Mocanu
If there is somewhere I can +1 this feature, let me know. My use case is for financial indicators (math formulas), and a lot of them go by window count, like moving averages. Thanks A From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: February-26-14 2:05 PM To: user@spark.apache.org Cc:

is RDD failure transparent to stream consumer

2014-02-27 Thread Adrian Mocanu
Is RDD failure transparent to a Spark stream consumer, except for the slowdown needed to recreate the RDD? After reading the papers on RDDs and DStreams from the Spark homepage I believe it is, but I'd like a confirmation. Thanks -Adrian

Re: Spark app gets slower as it gets executed more times

2014-02-27 Thread Aureliano Buendia
On Fri, Feb 7, 2014 at 7:48 AM, Aaron Davidson wrote: > Sorry for delay, by long-running I just meant if you were running an > iterative algorithm that was slowing down over time. We have observed this > in the spark-perf benchmark; as file system state builds up, the job can > slow down. Once th

Spark streaming on ec2

2014-02-27 Thread Aureliano Buendia
Hi, Does the ec2 support for spark 0.9 also include spark streaming? If not, is there an equivalent?

Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
Yes! Spark streaming programs are just like any spark program and so any ec2 cluster setup using the spark-ec2 scripts can be used to run spark streaming programs as well. On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia wrote: > Hi, > > Does the ec2 support for spark 0.9 also include spark

Re: Spark streaming on ec2

2014-02-27 Thread Aureliano Buendia
On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das wrote: > Yes! Spark streaming programs are just like any spark program and so any > ec2 cluster setup using the spark-ec2 scripts can be used to run spark > streaming programs as well. > Great. Does it come with any input source support as well? (Eg

Re: ReduceByKey or groupByKey to Count?

2014-02-27 Thread Mayur Rustagi
sortByKey would be better, I think, as I am not sure groupByKey will sort the keyspace globally. I would say you should take input (K, V), groupByKey to (K, Seq(V..)), partitionBy the default partitioner (hash), sortByKey on (K, Seq(V..)), and output this. The only thing is, if you need K,V pairs you will have to const
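
A rough sketch of that pipeline, assuming an existing SparkContext sc, a hypothetical extractKey function, and k as the desired number of output files (sortByKey range-partitions the keyspace globally, so saving with k partitions yields k sorted files):

    import org.apache.spark.SparkContext._ // pair-RDD operations such as sortByKey

    // Sketch only: extractKey and the paths are placeholders.
    def extractKey(line: String): String = line.split(",")(0)

    val k = 4 // desired number of output files (illustrative)
    val lines  = sc.textFile("hdfs:///input/unsorted.txt")
    val keyed  = lines.map(line => (extractKey(line), line))
    // sortByKey sorts globally across k range partitions.
    val sorted = keyed.sortByKey(ascending = true, numPartitions = k)
    sorted.values.saveAsTextFile("hdfs:///output/sorted") // one file per partition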

IncompatibleClassChangeError while running a spark program

2014-02-27 Thread Usman Ghani
Exception in thread "main" java.lang.IncompatibleClassChangeError: class org.apache.spark.util.InnerClosureFinder has interface org.objectweb.asm.ClassVisitor as super class at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631) at java.l

Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
Zookeeper is automatically set up in the cluster as Spark uses Zookeeper. However, you have to setup your own input source like Kafka or Flume. TD On Thu, Feb 27, 2014 at 10:32 AM, Aureliano Buendia wrote: > > > > On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das < > tathagata.das1...@gmail.com> w

What's the difference between map and transform in spark streaming?

2014-02-27 Thread Aureliano Buendia
Hi, Is transform supposed to be a generalized form of map? They seem to be doing the same job.

Re: What's the difference between map and transform in spark streaming?

2014-02-27 Thread Mayur Rustagi
DStream -> RDD -> partitions -> rows. map: works on each row. mapPartitions: works on each partition. transform: works on each RDD. Mayur Rustagi Ph: +919632149971 http://www.sigmoidanalytics.com https://twitter.com/mayur_rustagi On Thu, Feb 27, 2014 at 1:42 PM, Aurelian

Re: unit testing with spark

2014-02-27 Thread Ameet Kini
Turns out that my race condition was caused by sbt running suites in parallel. I had to disable it like so: parallelExecution in Test := false This is mentioned in the sbt docs http://www.scala-sbt.org/0.13.1/docs/Detailed-Topics/Testing#disable-parallel-execution-of-tests With suites running se
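
For reference, that setting goes in the sbt build definition; a minimal build.sbt sketch:

    // build.sbt: run test suites sequentially so they don't race on shared state
    parallelExecution in Test := false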

Re: Build Spark Against CDH5

2014-02-27 Thread Brian Brunner
Just as a second note, I am able to build the source in the official 0.9.0 release (http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Build-Spark-Against-CDH5-tp2129p2130.html Sent f

Running Spark with Python 2.7.5+

2014-02-27 Thread nicholas.chammas
The provided Spark EC2 scripts and default AMI ship with Python 2.6.8. I would like to use Python 2.7.5 or later. I believe that among the 2.x versions, 2.7 is the most popular. What's the easiest way to get my Spark cluster on Python

Re: Running Spark with Python 2.7.5+

2014-02-27 Thread Bryn Keller
Hi Nick, All the nodes of the cluster need to have the same Python setup (path and version). So if, e.g. you start running in 2.7.5 on the master and it ships code to nodes that have 2.6.x, you'll get invalid opcode errors. Thanks, Bryn On Thu, Feb 27, 2014 at 3:48 PM, nicholas.chammas < nichol

Scalatra servlet with actors and SparkContext

2014-02-27 Thread Ognen Duzlevski
I spent a week trying to figure this out and I think I finally did. My write up is here: http://corripio.blogspot.com/2014/02/scalatra-with-actors-command-spark-at.html I am sure for most of you this is basic - sorry for wasting bandwidth if this is the case. I am a Scala/Spark noob so for me t

Re: IncompatibleClassChangeError while running a spark program

2014-02-27 Thread Sandy Ryza
Hi Usman, This is a known issue that stems from Spark dependencies using two different versions of ASM - https://spark-project.atlassian.net/browse/SPARK-782. How are you setting up the classpath for your app? -Sandy On Thu, Feb 27, 2014 at 11:59 AM, Usman Ghani wrote: > Exception in thread

Re: Spark streaming on ec2

2014-02-27 Thread Aureliano Buendia
How about spark stream app itself? Does the ec2 script also provide means for daemonizing and monitoring spark streaming apps which are supposed to run 24/7? If not, any suggestions for how to do this? On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das wrote: > Zookeeper is automatically set up in t

Re: IncompatibleClassChangeError while running a spark program

2014-02-27 Thread Usman Ghani
Thanks Sandy. Actually found out that I had three versions of the ASM lib: one in its own jar, one in mockito, and one in spark. I moved the standalone one and downloaded the mockito-core version of the jar, and things seem to be working. But for future reference, the mockito-all and asm jars
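
A hypothetical build.sbt sketch of the dependency change described; the version number is illustrative, and the right fix depends on which jars are actually on the classpath:

    // Sketch: depend on mockito-core (pulls its deps explicitly) rather than
    // the bundled mockito-all jar, per the conflict described above.
    libraryDependencies += "org.mockito" % "mockito-core" % "1.9.5" % "test"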

Re: window every n elements instead of time based

2014-02-27 Thread Tathagata Das
Well, it has been +1'd in our heads. ;) We will keep this in mind. TD On Thu, Feb 27, 2014 at 7:45 AM, Adrian Mocanu wrote: > If there is somewhere I can +1 this feature let me know. > > > > My use case is for financial indicators (math formulas) and a lot of them > go by window count like mo

Re: Need some tutorials and examples about customized partitioner

2014-02-27 Thread Tao Xiao
Also thanks Matei 2014-02-26 15:19 GMT+08:00 Matei Zaharia : > Take a look at the "advanced Spark features" talk here too: > http://ampcamp.berkeley.edu/amp-camp-one-berkeley-2012/. > > Matei > > On Feb 25, 2014, at 6:22 PM, Tao Xiao wrote: > > Thank you Mayur, I think that will help me a lot >

Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
Yes, the default spark EC2 cluster runs the standalone deploy mode. Since Spark 0.9, the standalone deploy mode allows you to launch the driver app within the cluster itself and automatically restart it if it fails. You can read about launching your app inside the cluster here

Re: What's the difference between map and transform in spark streaming?

2014-02-27 Thread Tathagata Das
To clarify further: *yourDStream.map(record => yourFunction(record))* will do something on every record in every RDD in the DStream, which essentially means every record in the DStream. But *yourDStream.transform(rdd => anotherFunction(rdd))* allows you to do arbitrary stuff on every RDD in th
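
A small sketch of the distinction, assuming words is a DStream[String] and counts is a DStream[(String, Int)] that already exist:

    import org.apache.spark.SparkContext._ // pair-RDD operations such as sortByKey

    // map: applied per record, across every RDD in the stream
    val upper = words.map(word => word.toUpperCase)

    // transform: applied per RDD, so any RDD operation is available
    val sorted = counts.transform(rdd => rdd.sortByKey())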

Re: Having Spark read a JSON file

2014-02-27 Thread Nicholas Chammas
Thanks for the direction, Paul and Deb. I'm currently reading in the data using sc.textFile() and Python's json.loads(). It turns out that the big JSON data sources I care most about happen to be structured so that there is one object per line, even though the objects are correctly strung together
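
A hedged Scala analogue of that textFile-plus-parse approach for one-object-per-line data (paths are placeholders; scala.util.parsing.json was the standard-library parser of that era):

    import scala.util.parsing.json.JSON

    // Each line holds one complete JSON object, so lines parse independently.
    // parseFull returns Option[Any]; flatMap silently drops unparseable lines.
    val records = sc.textFile("hdfs:///data/events.json")
      .flatMap(line => JSON.parseFull(line))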

Re: Running Spark with Python 2.7.5+

2014-02-27 Thread Nicholas Chammas
Makes sense. I'll give it a shot and check back here if that doesn't work. Are there plans to upgrade the EC2 deployment scripts and/or AMI to have Python 2.7 by default? If so, is there a ticket somewhere I can follow? Nick On Thu, Feb 27, 2014 at 6:50 PM, Bryn Keller wrote: > Hi Nick, > > A

Re: Running Spark with Python 2.7.5+

2014-02-27 Thread Josh Rosen
There's an open ticket to update the Python version: https://spark-project.atlassian.net/browse/SPARK-922. In that ticket, I included instructions for a workaround to manually update a cluster to Python 2.7. Did you set the PYSPARK_PYTHON environment variable to the name of your new Python execut

Re: Running Spark with Python 2.7.5+

2014-02-27 Thread Nicholas Chammas
Perfect. Thanks Josh. I've added myself as a watcher on the ticket. (By the way, when I upgraded to 2.7 I replaced 2.6 so the executable name didn't change.) On Fri, Feb 28, 2014 at 12:12 AM, Josh Rosen wrote: > There's an open ticket to update the Python version: > https://spark-project.atlas

Re: Implementing a custom Spark shell

2014-02-27 Thread Sampo Niskanen
Hi, Thanks for the pointers. I did get my code working within the normal spark-shell. However, since I'm building a separate analysis service which pulls in the Spark libraries using SBT, I'd much rather have the custom shell incorporated in that, instead of having to use the default downloadabl