Re: Defining SparkShell Init?

2014-02-18 Thread Andrew Ash
Why would Scala 2.11 change things here? I'm not familiar with what features you're referring to. I would support a prelude file in ~/.sparkrc or similar that is automatically imported on spark-shell startup if it exists. Sent from my mobile phone On Feb 17, 2014 9:11 PM, Prashant Sharma
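For illustration only (no such prelude mechanism exists in Spark at this point), a ~/.sparkrc file as proposed here might simply hold statements to be evaluated when the shell starts; everything below is hypothetical:

    // Hypothetical ~/.sparkrc contents, assuming the proposed feature were implemented:
    // each line would be evaluated in the spark-shell on startup.
    import org.apache.spark.SparkContext._
    import scala.math._

    // Small helpers one always wants at hand in the shell.
    def peek[T](rdd: org.apache.spark.rdd.RDD[T], n: Int = 10): Array[T] = rdd.take(n)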

Re: How to efficiently join this two complicated rdds

2014-02-18 Thread Eugen Cepoi
Hi, What is the size of RDD two? You want to map a line from RDD one to multiple values from RDD two and get the sum of all of them? So as a result you would have an RDD of the size of RDD1, containing a number per line? 2014-02-18 8:06 GMT+01:00 hanbo hanbo...@gmail.com: Sincerely thank you for

Re: DStream.saveAsTextFiles() saves nothing

2014-02-18 Thread Amit Behera
Hi Robin, Please make sure whether any new file is taken as input or not. If that is fine, then please check what exact error is coming up on the cluster web UI. Thanks & Regards, Amit Ku. Behera

Re: How to efficiently join this two complicated rdds

2014-02-18 Thread hanbo
Thanks for your reply: What is the size of RDD two? RDD two is a paired RDD; during iteration, its size may vary from 4 to 450. You want to map a line from RDD one to multiple values from RDD two and get the sum of all of them? Yes. So as a result you would have an

Re: How to efficiently join this two complicated rdds

2014-02-18 Thread Guillaume Pitel
Here's what I would do: RDD1 : "1" "2" "3" "1" "3" "5" RDD2 : ("1", 11) ("2", 22) ("3", 33) ("5", 55) 1/ flatMap your lines from RDD1 to RDD1bis (key, lineId) (you'll have to use a mapPartitionsWithIndex
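A rough Scala sketch of that approach (the names and the line-id scheme below are illustrative, not from the thread): explode each line of RDD1 into (key, lineId) pairs with mapPartitionsWithIndex, join against RDD2, and sum the joined values back per line.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Sketch: rdd1 holds the keys of each line, rdd2 holds (key, value) pairs.
    def sumPerLine(rdd1: RDD[Seq[String]], rdd2: RDD[(String, Int)]): RDD[(Long, Int)] = {
      // Give every line a stable id without shuffling: (partition index, offset in partition).
      val rdd1bis = rdd1.mapPartitionsWithIndex { (partId, iter) =>
        iter.zipWithIndex.flatMap { case (keys, offset) =>
          val lineId = (partId.toLong << 32) | offset
          keys.map(key => (key, lineId))
        }
      }
      rdd1bis.join(rdd2)                              // (key, (lineId, value))
             .map { case (_, (lineId, v)) => (lineId, v) }
             .reduceByKey(_ + _)                      // summed value per original line
    }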

Testing if an RDD is empty?

2014-02-18 Thread Sampo Niskanen
Hi, I couldn't find any documentation on how to test whether an RDD is empty. I'm doing a reduce operation, but it throws an UnsupportedOperationException if the RDD is empty. I'd like to check if the RDD is empty before calling reduce. On the Google Groups list Reynold Xin had suggested using

Re: Testing if an RDD is empty?

2014-02-18 Thread Cheng Lian
Unfortunately there isn't an easy way to test whether an RDD is empty unless you count it. But fortunately, RDD.fold can solve your problem. All you need to do is to provide a zero value. For an RDD r, r.reduce(f) is roughly equivalent to r.fold(r.first)(f), that is, the first element of r is
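A minimal sketch, assuming an RDD[Int] and addition as the combining function:

    // reduce throws UnsupportedOperationException on an empty RDD;
    // fold with an explicit zero value simply returns that zero.
    val total = rdd.fold(0)(_ + _)   // 0 when rdd is empty
    // val total = rdd.reduce(_ + _) // would fail on an empty RDD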

Java Spark job significantly slower than Python

2014-02-18 Thread Korb, Michael [USA]
Hi, I'm experimenting with a Spark analytic on a 9-node cluster, and the Python version runs in about 5 minutes, whereas the Java version with all the same SparkContext configurations (and everything else being equal) takes 40+ minutes. Does anyone know what may be causing this performance

RE: checkpoint and not running out of disk space

2014-02-18 Thread Adrian Mocanu
Hi Tathagata! Thanks for getting back to me. I have 2 follow-up questions. Q3: You've explained how data is checkpointed, but never addressed the part about new batches overwriting old batches. Is it true that data is overwritten, or is it that all data is saved, resulting in a huge blob? Q4: I

Re: Interleaving stages

2014-02-18 Thread David Thomas
Here is an example code that is bundled with Spark: for (i <- 1 to ITERATIONS) { println("On iteration " + i) val gradient = points.map { p => (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x }.reduce(_ + _) w -= gradient } As you can see, an action is called on

NetworkWordCount Tests

2014-02-18 Thread Eduardo Costa Alfaia
Hi Guys, I am doing some tests with the NetworkWordCount Scala code, where I am counting and summing a stream of words received from the network using the foreach action, thanks TD. Firstly I began with this scenario: 1 Master + 1 Worker (also acting as the stream source) and I have obtained the

reduceByKey() is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

2014-02-18 Thread bethesda
I am getting this error when trying the code from the following page in the shell: http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html I believe that DStream[(String, Int)] is precisely the class that should have this function, so I am confused. Thanks! scala val

Re: Spark + MongoDB

2014-02-18 Thread Sampo Niskanen
Hi, Since getting Spark + MongoDB to work together was not very obvious (at least to me) I wrote a tutorial about it in my blog with an example application: http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ Hope it's of use to someone else as well. Cheers, *Sampo Niskanen*

Re: Spark + MongoDB

2014-02-18 Thread Matei Zaharia
Very cool, thanks for writing this. I’ll link it from our website. Matei On Feb 18, 2014, at 12:44 PM, Sampo Niskanen sampo.niska...@wellmo.com wrote: Hi, Since getting Spark + MongoDB to work together was not very obvious (at least to me) I wrote a tutorial about it in my blog with an

Spark cannot find a class at runtime for a standalone Scala program

2014-02-18 Thread Soumya Simanta
I'm using sbt to build and run my Spark driver program. It's complaining that it cannot find a class (twitter4j/Status) even though my code compiles fine. Do I need to package all external dependencies into one fat jar? If yes, can someone tell me the preferred way of doing it with sbt? I'm new

unit testing with spark

2014-02-18 Thread Ameet Kini
I'm writing unit tests with Spark and need some help. I've already read this helpful article: http://blog.quantifind.com/posts/spark-unit-test/ There are a couple of differences in my testing environment versus the blog. 1. I'm using FunSpec instead of FunSuite. So my tests look like class

Re: Building a Standalone App in Scala and graphX

2014-02-18 Thread dachuan
Hi, sorry about this, my question is not related to yours; I am just checking around the mailing list in the hope of finding the solution to my question. I am compiling a standalone Scala Spark program, but it reports sbt.ResolveException: unresolved dependency:

Is DStream read from the beginning upon node crash or from where it left off

2014-02-18 Thread Adrian Mocanu
Hi, I have a question on achieving fault tolerance when counting with Spark and storing the aggregate count into Cassandra. Example input: RDD 1 [a,a,a], RDD 2 [a,a]. After aggregation of RDD1 (i.e. map + reduceByKey) we get Map: [a -> 3], and after aggregation of RDD2 we get Map: [a -> 2]. Now let's store
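As a quick sketch of the aggregation step described above (map + reduceByKey), the per-batch counts could be computed like this:

    import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

    // ["a", "a", "a"]  ->  ("a", 3)
    val counts = rdd.map(word => (word, 1)).reduceByKey(_ + _)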

RE: Is DStream read from the beginning upon node crash or from where it left off

2014-02-18 Thread Adrian Mocanu
Addendum: Forgot to mention that I use a StreamingContext: val streamingContext = new StreamingContext(conf, Seconds(10)) And I have no idea how option a), reading the DStream from the beginning, would be implemented in Spark, but I think it might instead read it somewhere in the middle or last

Re: checkpoint and not running out of disk space

2014-02-18 Thread Tathagata Das
A3: The basic RDD model is that the dataset is immutable. As new batches of data come in, each batch is treated as an RDD. Then RDD transformations are applied to create new RDDs. When some of these RDDs are checkpointed, they create separate HDFS files. So, yes, the checkpoint files will keep

Re: reduceByKey() is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

2014-02-18 Thread Tathagata Das
You have to import StreamingContext._ That imports implicit conversions that allow reduceByKey() to be applied on DStreams with key-value pairs. TD On Tue, Feb 18, 2014 at 12:03 PM, bethesda swearinge...@mac.com wrote: I am getting this error when trying to code from the following page in
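For example, assuming `pairs` is the DStream[(String, Int)] from the streaming guide:

    import org.apache.spark.streaming.StreamingContext._   // implicit pair-DStream conversions

    // With the import in scope, reduceByKey compiles on a DStream of key-value pairs.
    val wordCounts = pairs.reduceByKey(_ + _)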

Re: checkpoint and not running out of disk space

2014-02-18 Thread dachuan
I'm sorry, but what does "it's not fundamentally not possible to avoid checkpointing" mean? Are you saying that for these two stateful streaming apps, it's possible to avoid checkpointing? How? On Tue, Feb 18, 2014 at 5:44 PM, Tathagata Das tathagata.das1...@gmail.comwrote: A3: The basic RDD model

Re: question about compiling SimpleApp

2014-02-18 Thread dachuan
Thanks for your reply. I have changed scalaVersion := "2.10" to scalaVersion := "2.10.3" and then everything is good. So this is a documentation bug :) dachuan. On Tue, Feb 18, 2014 at 6:50 PM, Denny Lee denny.g@gmail.com wrote: What version of Scala are you using? For example, if you're using
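For reference, a sketch of how the quick-start simple.sbt might look with the fix applied (the Spark artifact version shown is an assumption based on the 0.9 timeframe):

    name := "Simple Project"

    version := "1.0"

    scalaVersion := "2.10.3"   // a full patch version; plain "2.10" fails to resolve

    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"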

Re: question about compiling SimpleApp

2014-02-18 Thread Andrew Ash
Dachuan, Where did you find that faulty documentation? I'd like to get it fixed. Thanks! Andrew On Tue, Feb 18, 2014 at 4:15 PM, dachuan hdc1...@gmail.com wrote: Thanks for your reply. I have changed scalaVersion := 2.10 to scalaVersion := 2.10.3 then everything is good. So this is a

Re: question about compiling SimpleApp

2014-02-18 Thread Nan Zhu
oh~ Sorry, Andrew, I just made the PR, it's an error in the site config.yml Best, -- Nan Zhu On Tuesday, February 18, 2014 at 7:16 PM, Andrew Ash wrote: Dachuan, Where did you find that faulty documentation? I'd like to get it fixed. Thanks! Andrew On Tue, Feb 18, 2014 at

Re: question about compiling SimpleApp

2014-02-18 Thread dachuan
http://spark.incubator.apache.org/docs/latest/quick-start.html it's on this page. On Tue, Feb 18, 2014 at 7:24 PM, Nan Zhu zhunanmcg...@gmail.com wrote: oh~ Sorry, Andrew, I just made the PR, it's an error in the site config.yml Best, -- Nan Zhu On Tuesday, February 18, 2014 at 7:16 PM,

Re: question about compiling SimpleApp

2014-02-18 Thread Mayur Rustagi
I guess here: http://spark.incubator.apache.org/docs/latest/quick-start.html Regards Mayur Mayur Rustagi Ph: +919632149971 http://www.sigmoidanalytics.com https://twitter.com/mayur_rustagi On Tue, Feb 18, 2014 at 4:16 PM, Andrew Ash and...@andrewash.com

Re: How to efficiently join this two complicated rdds

2014-02-18 Thread zhaoxw12
Thank you very much for your answer. We have tried the method above before. These are the problems we ran into while doing so. 1. We want to avoid the collect method because we do this step inside the iteration and RDD2 changes in every iteration, so the speed, memory usage and network traffic bother us a lot.

Mutating RDD

2014-02-18 Thread David Thomas
Let's say I have an RDD of text files from HDFS. During the runtime, is it possible to check for new files in a particular directory and if present, add them to the existing RDD?

Resource Allocation: Spark on Mesos

2014-02-18 Thread Sai Prasanna
Does Mesos offer a given resource to only one framework to avoid conflicts, or does it offer it to many frameworks simultaneously and resolve it similarly to the concept of late binding? Because http://mesos.apache.org/documentation/latest/app-framework-development-guide/ says that there is a possibility of

Re: Question on web UI

2014-02-18 Thread Nan Zhu
the driver is running on the machine where you run a command like ./spark-shell, but in 0.9 you can run an in-cluster driver: http://spark.incubator.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster Best, -- Nan Zhu On Tuesday, February 18, 2014 at 10:06 PM,

Re: Testing if an RDD is empty?

2014-02-18 Thread Cheng Lian
Yea, for testing whether an RDD is empty, Guillaume's *mapPartitions* solution is definitely better than *count* since it doesn't require traversing all elements :-) On Tue, Feb 18, 2014 at 10:08 PM, Guillaume Pitel guillaume.pi...@exensa.com wrote: I think you could do something like that : (but Cheng's
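Since Guillaume's snippet is truncated above, here is one possible shape of such a check (a sketch, not necessarily his exact code): each partition only looks at its first element, so the RDD is never fully traversed.

    import org.apache.spark.rdd.RDD

    def isEmpty[T](rdd: RDD[T]): Boolean =
      // One Boolean per partition ("does it hold anything?"), combined on the driver.
      !rdd.mapPartitions(iter => Iterator(iter.hasNext)).fold(false)(_ || _)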

Re: How to efficiently join this two complicated rdds

2014-02-18 Thread hanbo
Thank you for your reply. We have tried this method before, but step 2 is very time consuming because the number of values per key is not well distributed. Some keys in the lines of RDD1 are very dense, but others are very sparse. After the join, the splits containing dense keys are very large and

Unable to submit an application to standalone cluster which on hdfs.

2014-02-18 Thread samuel281
I'm trying to launch an application inside the cluster (standalone mode). According to the docs, the jar-url can be in either file:// or hdfs:// format (https://spark.incubator.apache.org/docs/latest/spark-standalone.html). But when I tried to run spark-class, it seemed unable to parse the hdfs://xx format.

Re: Spark Streaming on a cluster

2014-02-18 Thread Tathagata Das
The default zeromq receiver that comes with the Spark repository does not guarantee on which machine the zeromq receiver will be launched; it can be on any of the worker machines in the cluster, NOT the application machine (called the driver in our terms). And your understanding of the code and the

Re: Spark cannot find a class at runtime for a standalone Scala program

2014-02-18 Thread Akhil Das
You can create a *lib* directory at the root of your project home and put all the jars in it. Or you can specify the location of your jars by setting the *.setJars* property in the SparkConf like: val conf = new SparkConf() .setMaster("mesos://akhldz:5050")
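A slightly fuller sketch of that approach (the master URL, app name and jar path below are placeholders, not values from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://akhldz:5050")
      .setAppName("TwitterApp")
      .setJars(Seq("target/scala-2.10/my-app-assembly.jar"))   // shipped to the executors
    val sc = new SparkContext(conf)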

Spark Streaming windowing driven by absolute time?

2014-02-18 Thread Aries Kong
hi all, It seems that windowing in Spark Streaming is driven by absolute time, not by the timestamp of the data; can anybody kindly explain why? What can I do if I need windowing driven by the data timestamp? Thanks! Aries.Kong

Re: How to use FlumeInputDStream in spark cluster?

2014-02-18 Thread Tathagata Das
It could be that the hostname that Spark uses to identify the node is different from the one you are providing. Are you using the Spark standalone mode? In that case, you can check out the hostnames that Spark is seeing and use that name. Let me know if that works out. TD On Mon, Feb 17, 2014

Re: checkpoint and not running out of disk space

2014-02-18 Thread Tathagata Das
Apologies if that was confusing. I meant that in a streaming application where a few specific DStream operations like updateStateByKey, etc. are used, you have to enable checkpointing. In fact, the system automatically enables checkpointing with some default interval for DStreams that are generated
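Concretely, enabling checkpointing is a single call on the StreamingContext (a sketch assuming a SparkConf named conf; the directory path is illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(conf, Seconds(10))
    // Required for stateful operations such as updateStateByKey.
    ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")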

Re: Spark Streaming on a cluster

2014-02-18 Thread Sourav Chandra
Hi TD, in case of multiple streams will the streaming code be like: val ssc = ... (1 to n).foreach { val nwStream = kafkaUtils.createStream(...) nwStream.flatMap(...).map(...).reduceByKeyAndWindow(...).foreachRDD(saveToCassandra()) } ssc.start() Will it create any problem in execution (like
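A cleaned-up sketch of the pattern in the question (conf, n, zkQuorum, the topic map, the window size and saveToCassandra are placeholders carried over from the question, not verified code): several Kafka receivers created against the same StreamingContext, each with its own transformation chain, started with a single ssc.start().

    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(conf, Seconds(10))
    (1 to n).foreach { i =>
      // One Kafka receiver per iteration, each with its own consumer group.
      val nwStream = KafkaUtils.createStream(ssc, zkQuorum, "group-" + i, Map("topic" -> 1))
      nwStream.flatMap(_._2.split(" "))
              .map(word => (word, 1))
              .reduceByKeyAndWindow(_ + _, Seconds(60))
              .foreachRDD(rdd => saveToCassandra(rdd))   // the user's own persistence function
    }
    ssc.start()
    ssc.awaitTermination()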

Re: unit testing with spark

2014-02-18 Thread Heiko Braun
Take a look at the trait the spark tests are using: https://github.com/apache/incubator-spark/blob/master/core/src/test/scala/org/apache/spark/SharedSparkContext.scala?source=cc /Heiko On 18 Feb 2014, at 22:36, Ameet Kini ameetk...@gmail.com wrote: I'm writing unit tests with Spark and need
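For FunSpec specifically, a minimal sketch of the same idea (not the actual SharedSparkContext source) could look like this: one local SparkContext shared by all examples in a spec and stopped afterwards.

    import org.apache.spark.SparkContext
    import org.scalatest.{BeforeAndAfterAll, FunSpec}

    trait LocalSparkContext extends BeforeAndAfterAll { self: FunSpec =>
      @transient var sc: SparkContext = _

      override def beforeAll() {
        // One context for the whole spec; "local" keeps the test self-contained.
        sc = new SparkContext("local", "test")
        super.beforeAll()
      }

      override def afterAll() {
        if (sc != null) sc.stop()
        super.afterAll()
      }
    }

    class ExampleSpec extends FunSpec with LocalSparkContext {
      describe("an RDD") {
        it("sums its elements") {
          assert(sc.parallelize(1 to 3).reduce(_ + _) === 6)
        }
      }
    }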

Re: How to use FlumeInputDStream in spark cluster?

2014-02-18 Thread anoldbrain
Both standalone mode and mesos were tested, with the same outcome. After your suggestion, I tried again in standalone mode and specified the host with what was written in the log of a worker node. The problem remains. A bit more detail, the bind failed error is reported on the driver node. Say,