Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-05 Thread Gerard Maas
Hi all, I'm currently working on creating a set of docker images to facilitate local development with Spark/streaming on Mesos (+zk, hdfs, kafka) After solving the initial hurdles to get things working together in docker containers, now everything seems to start up correctly and the Mesos UI

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-05 Thread Gerard Maas
of clusters with Docker, I'm interested in your findings on sharing the settings of Kafka and Zookeeper across nodes. How many brokers and zookeepers do you use? Regards, On Mon, May 5, 2014 at 10:11 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi all, I'm currently working on creating a set

Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Gerard Maas
Hi Zhen, Thanks a lot for sharing. I'm sure it will be useful for new users. A small note: On the 'checkpoint' explanation: sc.setCheckpointDir(my_directory_name) it would be useful to specify that 'my_directory_name' should exist in all slaves. As an alternative you could use an HDFS directory

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-14 Thread Gerard Maas
...@us.ibm.com - (512) 286-6075 Gerard Maas ---05/05/2014 04:18:08 PM---Hi Benjamin, Yes, we initially used a modified version of the AmpLabs docker scripts From

Re: is Mesos falling out of favor?

2014-05-16 Thread Gerard Maas
By looking at your config, I think there's something wrong with your setup. One of the key elements of Mesos is that you are abstracted from where the execution of your task takes place. The SPARK_EXECUTOR_URI tells Mesos where to find the 'framework' (in Mesos jargon) required to execute a job.

Re: issue with Scala, Spark and Akka

2014-05-20 Thread Gerard Maas
This error message says it can't find the config for the Akka subsystem. That is typically included in the Spark assembly. First, you need to compile your Spark distro by running sbt/sbt assembly in the SPARK_HOME dir. Then, use the SPARK_HOME (through env or configuration) to point to your

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
for it to work. The Spark REPL works differently. It uses some dark magic to send the working session to the workers. -kr, Gerard. On Wed, May 21, 2014 at 2:47 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi Tobias, I was curious about this issue and tried to run your example on my local

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Tobias, On Wed, May 21, 2014 at 5:45 PM, Tobias Pfeiffer t...@preferred.jp wrote: first, thanks for your explanations regarding the jar files! No prob :-) On Thu, May 22, 2014 at 12:32 AM, Gerard Maas gerard.m...@gmail.com wrote: I was discussing it with my fellow Sparkers here and I

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Andrew, Thanks for the current doc. I'd almost gotten to the point where I thought that my custom code needed to be included in the SPARK_EXECUTOR_URI but that can't possibly be correct. The Spark workers that are launched on Mesos slaves should start with the Spark core jars and then

Spark Job Server first steps

2014-05-22 Thread Gerard Maas
Hi, I'm starting to explore the Spark Job Server contributed by Ooyala [1], running from the master branch. I started by developing and submitting a simple job and the JAR check gave me errors on a seemingly good jar. I disabled the fingerprint checking on the jar and I could submit it, but

Re: Spark Job Server first steps

2014-05-22 Thread Gerard Maas
, copy, print, distribute or rely on this email. On 22 May 2014 18:25, Gerard Maas gerard.m...@gmail.com wrote: Hi, I'm starting to explore the Spark Job Server contributed by Ooyala [1], running from the master branch. I started by developing and submitting a simple job and the JAR check

How to create RDDs from another RDD?

2014-06-02 Thread Gerard Maas
The RDD API has functions to join multiple RDDs, such as PairRDD.join or PairRDD.cogroup, that take another RDD as input. e.g. firstRDD.join(secondRDD) I'm looking for ways to do the opposite: split an existing RDD. What is the right way to create derived RDDs from an existing RDD? e.g.
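
A minimal sketch of the filter-based approach usually suggested for this (my own illustration, not from the thread), assuming the keys to split on are known up front:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object SplitRddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("split-rdd"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

        // Cache the parent once, then derive one RDD per key by filtering.
        // Each filter is lazy and cheap to define; the cache avoids recomputing the parent.
        pairs.cache()
        val keys = Seq("a", "b", "c")
        val splits: Map[String, RDD[(String, Int)]] =
          keys.map(k => k -> pairs.filter { case (key, _) => key == k }).toMap

        splits.foreach { case (k, rdd) => println(s"$k -> ${rdd.count()}") }
        sc.stop()
      }
    }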

Re: Reconnect to an application/RDD

2014-06-03 Thread Gerard Maas
I don't think that's supported by default, as when the standalone context closes, the related RDDs will be GC'ed. You should explore the Spark Job Server, which allows you to cache RDDs by name and reuse them within a context. https://github.com/ooyala/spark-jobserver -kr, Gerard. On Tue, Jun 3,

Re: How to create RDDs from another RDD?

2014-06-03 Thread Gerard Maas
. Can you go into more detail about why you want to split one RDD into several? On Mon, Jun 2, 2014 at 1:13 PM, Gerard Maas gerard.m...@gmail.com wrote: The RDD API has functions to join multiple RDDs, such as PairRDD.join or PairRDD.cogroup that take another RDD as input. e.g. firstRDD.join

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-03 Thread Gerard Maas
Have you tried re-compiling your job against the 1.0 release? On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: Hi All, I've been experiencing a very strange error after upgrade from Spark 0.9 to 1.0 - it seems that the saveAsTextFile function is throwing

Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Gerard Maas
I think that you have two options: - to run your code locally, you can use local mode by using the 'local' master, like so: new SparkConf().setMaster("local[4]") where 4 is the number of cores assigned to the local mode. - to run your code remotely you need to build the jar with dependencies and
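
For illustration, a small sketch of the local-mode option (the app name and the toy computation are mine):

    import org.apache.spark.{SparkConf, SparkContext}

    // Local mode: "local[4]" runs Spark in-process with 4 worker threads,
    // convenient for running and debugging from the IDE.
    val conf = new SparkConf().setMaster("local[4]").setAppName("local-debug")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).reduce(_ + _))   // 55
    sc.stop()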

Re: Scheduling code for Spark

2014-06-07 Thread Gerard Maas
Hi, The scheduling related code can be found at: https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler The DAG (Directed Acyclic Graph) scheduler is a good start point:

Re: How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Gerard Maas
You can consult the docs at : https://spark.apache.org/docs/latest/api/scala/index.html#package In particular, the rdd docs contain the explanation of each method : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD Kr, Gerard On Jun 8, 2014 1:00 PM, Carter

Re: initial basic question from new user

2014-06-12 Thread Gerard Maas
The goal of rdd.persist is to create a cached RDD that breaks the DAG lineage. Therefore, computations *in the same job* that use that RDD can re-use that intermediate result, but it's not meant to survive between job runs. for example: val baseData =
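
A short sketch of that idea, with a hypothetical HDFS path and an existing SparkContext 'sc':

    import org.apache.spark.storage.StorageLevel

    // Sketch only: the HDFS path is hypothetical.
    val baseData = sc.textFile("hdfs:///data/events.txt")
      .map(_.split(","))
      .persist(StorageLevel.MEMORY_ONLY)

    // Both actions below reuse the cached intermediate result within this job run;
    // the cache does not survive across separate submissions of the job.
    val total    = baseData.count()
    val nonEmpty = baseData.filter(_.nonEmpty).count()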

Re: NullPointerExceptions when using val or broadcast on a standalone cluster.

2014-06-12 Thread Gerard Maas
That stack trace is quite similar to the one that is generated when trying to do a collect within a closure. In this case, it feels wrong to collect in a closure, but I wonder what's the reason behind the NPE. Curious to know whether they are related. Here's a very simple example: rdd1.flatMap(x =>

Re: guidance on simple unit testing with Sprk

2014-06-14 Thread Gerard Maas
Ll mlll On Jun 14, 2014 4:05 AM, Matei Zaharia matei.zaha...@gmail.com wrote: You need to factor your program so that it’s not just a main(). This is not a Spark-specific issue, it’s about how you’d unit test any program in general. In this case, your main() creates a SparkContext, so you

Memory footprint of Calliope: Spark - Cassandra writes

2014-06-16 Thread Gerard Maas
Hi, I've been doing some testing with Calliope as a way to do batch load from Spark into Cassandra. My initial results are promising on the performance area, but worrisome on the memory footprint side. I'm generating N records of about 50 bytes each and using the UPDATE mutator to insert them

Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Gerard Maas
Hi, (My excuses for the cross-post from SO) I'm trying to create Cassandra SSTables from the results of a batch computation in Spark. Ideally, each partition should create the SSTable for the data it holds in order to parallelize the process as much as possible (and probably even stream it to

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Gerard Maas
On Wed, Jun 25, 2014 at 8:44 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, (My excuses for the cross-post from SO) I'm trying to create Cassandra SSTables from the results of a batch computation in Spark. Ideally, each partition should create the SSTable for the data it holds in order

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-27 Thread Gerard Maas
for you. Regards, Rohit Founder CEO, Tuplejump, Inc. www.tuplejump.com The Data Engineering Platform On Thu, Jun 26, 2014 at 12:44 AM, Gerard Maas gerard.m...@gmail.com wrote: Thanks Nick. We used the CassandraOutputFormat through Calliope

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-27 Thread Gerard Maas
. www.tuplejump.com The Data Engineering Platform On Thu, Jun 26, 2014 at 12:44 AM, Gerard Maas gerard.m...@gmail.com wrote: Thanks Nick. We used the CassandraOutputFormat through Calliope. The Calliope API makes the CassandraOutputFormat quite accessible

Re: Spark Streaming, external windowing?

2014-07-16 Thread Gerard Maas
Hi Sargun, There have been few discussions on the list recently about the topic. The short answer is that this is not supported at the moment. This is a particularly good thread as it discusses the current state and limitations:

Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
Using a case class as a key doesn't seem to work properly. [Spark 1.0.0] A minimal example: case class P(name: String); val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")); sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res : Array[(P, Int)] =

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
,ArrayBuffer(1, 1))) On Tue, Jul 22, 2014 at 4:20 PM, Gerard Maas gerard.m...@gmail.com wrote: Using a case class as a key doesn't seem to work properly. [Spark 1.0.0] A minimal example: case class P(name: String); val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")); sc.parallelize(ps).map(x => (x, 1

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
, Gerard Maas gerard.m...@gmail.com wrote: Just to narrow down the issue, it looks like the issue is in 'reduceByKey' and derivatives like 'distinct'. groupByKey() seems to work: sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().collect res: Array[(String, Iterable[Int])] = Array((charly

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Gerard Maas
, 2014 at 5:37 PM, Gerard Maas gerard.m...@gmail.com wrote: Yes, right. 'sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().collect' An oversight from my side. Thanks!, Gerard. On Tue, Jul 22, 2014 at 5:24 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I can confirm this bug

Re: store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread Gerard Maas
To read/write from/to Cassandra I recommend you to use the Spark-Cassandra connector at [1] Using it, saving a Spark Streaming RDD to Cassandra is fairly easy: sparkConfig.set(CassandraConnectionHost, cassandraHost) val sc = new SparkContext(sparkConfig) val ssc = new StreamingContext(sc,
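
A minimal sketch of that approach, assuming the DataStax spark-cassandra-connector is on the classpath and the keyspace/table already exist; the host, keyspace, table and column names are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import com.datastax.spark.connector._              // SomeColumns and RDD helpers
    import com.datastax.spark.connector.streaming._    // adds saveToCassandra to DStreams

    val conf = new SparkConf()
      .setAppName("stream-to-cassandra")
      .set("spark.cassandra.connection.host", "cassandra-host")   // placeholder host
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)           // any DStream source
    lines.map(l => (l, l.length))
      .saveToCassandra("my_keyspace", "my_table", SomeColumns("line", "length"))

    ssc.start()
    ssc.awaitTermination()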

[Spark Streaming] Tracking/solving 'block input not found'

2014-09-04 Thread Gerard Maas
Hello Sparkers, I'm currently running load tests on a Spark Streaming job. When the task duration increases beyond the batchDuration, the job becomes unstable. In the logs I see tasks failed with the following message: Job aborted due to stage failure: Task 266.0:1 failed 4 times, most recent

Re: advice sought on spark/cassandra input development - scala or python?

2014-09-04 Thread Gerard Maas
Johnny, Currently, probably the easiest (and most performant way) to integrate Spark and Cassandra is using the spark-cassandra-connector [1] Given an rdd, saving it to cassandra is as easy as: rdd.saveToCassandra(keyspace, table, Seq(columns)) We tried many 'hand crafted' options to interact

[SparkStreaming] task failure with 'Unknown exception in doAs'

2014-09-18 Thread Gerard Maas
My Spark Streaming job (running on Spark 1.0.2) stopped working today and consistently throws the exception below. No code changed for it, so I'm really puzzled about the cause of the issue. Looks like a security issue at HDFS level. Has anybody seen this exception and maybe know the root cause?

Re: [SparkStreaming] task failure with 'Unknown exception in doAs'

2014-09-18 Thread Gerard Maas
Found it! (with sweat on my forehead) The job was actually running on Mesos using a Spark 1.1.0 executor. I guess there's some incompatibility between the 1.0.2 and the 1.1 versions - still quite weird. -kr, Gerard. On Thu, Sep 18, 2014 at 12:29 PM, Gerard Maas gerard.m...@gmail.com wrote

Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-17 Thread Gerard Maas
Hi, We have been implementing several Spark Streaming jobs that are basically processing data and inserting it into Cassandra, sorting it among different keyspaces. We've been following the pattern: dstream.foreachRDD(rdd => val records = rdd.map(elem => record(elem))

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Gerard Maas
Pinging TD -- I'm sure you know :-) -kr, Gerard. On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, We have been implementing several Spark Streaming jobs that are basically processing data and inserting it into Cassandra, sorting it among different keyspaces

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-22 Thread Gerard Maas
on the driver…. mn On Oct 20, 2014, at 3:07 AM, Gerard Maas gerard.m...@gmail.com wrote: Pinging TD -- I'm sure you know :-) -kr, Gerard. On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, We have been implementing several Spark Streaming jobs

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-22 Thread Gerard Maas
recordStream = dstream.map(elem => record(elem)) targets.foreach{target => recordStream.filter(record => isTarget(target,record)).foreachRDD(_.writeToCassandra(target,table))} ? kr, Gerard. On Wed, Oct 22, 2014 at 2:12 PM, Gerard Maas gerard.m...@gmail.com wrote: Thanks Matt, Unlike the feared

Re: Spark Cassandra Connector proper usage

2014-10-23 Thread Gerard Maas
Hi Ashic, At the moment I see two options: 1) You could use the CassandraConnector object to execute your specialized query. The recommended pattern is to do that within a rdd.foreachPartition(...) in order to amortize DB connection setup over the number of elements in one partition. Something
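
A sketch of option 1, assuming the DataStax connector's CassandraConnector; the keyspace, table and CQL statement are placeholders:

    import com.datastax.spark.connector.cql.CassandraConnector
    import org.apache.spark.rdd.RDD

    def updateCounts(ids: RDD[String]): Unit = {
      // CassandraConnector is serializable, so it can be created on the driver
      // and used inside the closure. One session per partition amortizes setup cost.
      val connector = CassandraConnector(ids.sparkContext.getConf)
      ids.foreachPartition { partition =>
        connector.withSessionDo { session =>
          partition.foreach { id =>
            session.execute(s"UPDATE my_ks.my_counters SET hits = hits + 1 WHERE id = '$id'")
          }
        }
      }
    }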

Re: unable to make a custom class as a key in a pairrdd

2014-10-24 Thread Gerard Maas
There's an issue in the way case classes are handled on the REPL and you won't be able to use a case class as a key. See: https://issues.apache.org/jira/browse/SPARK-2620 BTW, case classes already implement equals and hashCode; there's no need to implement those again. Given that you already

Re: Saving to Cassandra from Spark Streaming

2014-10-28 Thread Gerard Maas
Looks like you're having some classpath issues. Are you providing your spark-cassandra-driver classes to your job? sparkConf.setJars(Seq(jars...)) ? On Tue, Oct 28, 2014 at 5:34 PM, Harold Nguyen har...@nexgate.com wrote: Hi all, I'm having trouble troubleshooting this particular block of

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread Gerard Maas
:37 AM, Gerard Maas gerard.m...@gmail.com wrote: PS: Just to clarify my statement: Unlike the feared RDD operations on the driver, it's my understanding that these Dstream ops on the driver are merely creating an execution plan for each RDD. With feared RDD operations on the driver I meant

Registering custom metrics

2014-10-30 Thread Gerard Maas
Hi, I've been exploring the metrics exposed by Spark and I'm wondering whether there's a way to register job-specific metrics that could be exposed through the existing metrics system. Would there be an example somewhere? BTW, documentation about how the metrics work could be improved. I

Re: EmptyRDD

2014-11-14 Thread Gerard Maas
If I remember correctly, EmptyRDD is private [spark] You can create an empty RDD using the spark context: val emptyRdd = sc.emptyRDD -kr, Gerard. On Fri, Nov 14, 2014 at 11:22 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: To get an empty RDD, I did this: I have an rdd with one
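
For reference, a tiny sketch of that suggestion against an existing SparkContext 'sc':

    // sc.emptyRDD produces an RDD with no partitions and no elements;
    // give it a type parameter if downstream code needs a concrete element type.
    val emptyInts = sc.emptyRDD[Int]
    println(emptyInts.count())              // 0
    println(emptyInts.collect().toList)     // List()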

Re: EmptyRDD

2014-11-14 Thread Gerard Maas
at 11:35 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Thank You Gerard. I was trying val emptyRdd = sc.EmptyRDD. Yes it works but I am not able to do *emptyRdd.collect.foreach(println)* Thank You On Fri, Nov 14, 2014 at 3:58 PM, Gerard Maas gerard.m...@gmail.com wrote: If I remember

Re: Functions in Spark

2014-11-17 Thread Gerard Maas
One 'rule of thumb' is to use rdd.toDebugString and check the lineage for ShuffleRDD. As long as there's no need for restructuring the RDD, operations can be pipelined on each partition. rdd.toDebugString is your friend :-) -kr, Gerard. On Mon, Nov 17, 2014 at 7:37 AM, Mukesh Jha
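
A quick sketch of what that check looks like in the shell (the intermediate RDD names printed by toDebugString vary by Spark version):

    val pairs = sc.parallelize(1 to 100).map(x => (x % 10, x))
    println(pairs.toDebugString)                      // narrow lineage, no shuffle yet
    println(pairs.reduceByKey(_ + _).toDebugString)   // a ShuffledRDD appears in the lineage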

Spark Streaming Metrics

2014-11-20 Thread Gerard Maas
As the Spark Streaming tuning guide indicates, the key indicators of a healthy streaming job are: - Processing Time - Total Delay The Spark UI page for the Streaming job [1] shows these two indicators but the metrics source for Spark Streaming (StreamingSource.scala) [2] does not. Any reasons

Re: spark code style

2014-11-21 Thread Gerard Maas
I suppose that here function(x) = function3(function2(function1(x))) In that case, the difference will be modularity and readability of your program. If function{1,2,3} are logically different steps and potentially reusable somewhere else, I'd keep them separate. A sequence of map
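
A small sketch of the two styles with toy functions (the names f1/f2/f3 are mine); both run in a single pipelined stage:

    // Same result either way; chained maps are still pipelined within one stage,
    // so the choice is mostly about modularity and readability.
    val f1 = (x: Int) => x + 1
    val f2 = (x: Int) => x * 2
    val f3 = (x: Int) => x.toString

    val rdd = sc.parallelize(1 to 5)
    val viaOneMap  = rdd.map(x => f3(f2(f1(x))))
    val viaChained = rdd.map(f1).map(f2).map(f3)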

Re: Spark Streaming Metrics

2014-11-21 Thread Gerard Maas
Looks like metrics are not a hot topic to discuss - yet so important to sleep well when jobs are running in production. I've created Spark-4537 https://issues.apache.org/jira/browse/SPARK-4537 to track this issue. -kr, Gerard. On Thu, Nov 20, 2014 at 9:25 PM, Gerard Maas gerard.m...@gmail.com

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Gerard Maas
Hi TD, We also struggled with this error for a long while. The recurring scenario is when the job takes longer to compute than the job interval and a backlog starts to pile up. Hint: if the DStream storage level is set to MEMORY_ONLY_SER and memory runs out, then you will get a 'Cannot

Mesos killing Spark Driver

2014-11-27 Thread Gerard Maas
Hi, We are currently running our Spark + Spark Streaming jobs on Mesos, submitting our jobs through Marathon. We see with some regularity that the Spark Streaming driver gets killed by Mesos and then restarted on some other node by Marathon. I've no clue why Mesos is killing the driver and

Re: Mesos killing Spark Driver

2014-11-28 Thread Gerard Maas
[Ping] Any hints? On Thu, Nov 27, 2014 at 3:38 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, We are currently running our Spark + Spark Streaming jobs on Mesos, submitting our jobs through Marathon. We see with some regularity that the Spark Streaming driver gets killed by Mesos

Re: Spark-Streaming: output to cassandra

2014-12-04 Thread Gerard Maas
I guess he's already doing so, given the 'saveToCassandra' usage. What I don't understand is the question how do I specify a batch. That doesn't make much sense to me. Could you explain further? -kr, Gerard. On Thu, Dec 4, 2014 at 5:36 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can

Re: NoSuchMethodError: writing spark-streaming data to cassandra

2014-12-09 Thread Gerard Maas
You're using two conflicting versions of the connector: the Scala version at 1.1.0 and the Java version at 1.0.4. I don't use Java, but I guess you only need the Java dependency for your job - and with the version fixed. <dependency> <groupId>com.datastax.spark</groupId>

Specifying number of executors in Mesos

2014-12-09 Thread Gerard Maas
Hi, We've a number of Spark Streaming /Kafka jobs that would benefit of an even spread of consumers over physical hosts in order to maximize network usage. As far as I can see, the Spark Mesos scheduler accepts resource offers until all required Mem + CPU allocation has been satisfied. This

Re: Saving Data only if Dstream is not empty

2014-12-09 Thread Gerard Maas
We have a similar case in which we don't want to save data to Cassandra if the data is empty. In our case, we filter the initial DStream to process messages that go to a given table. To do so, we're using something like this: dstream.foreachRDD{ (rdd,time) => tables.foreach{ table => val
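
A sketch of that pattern, assuming the spark-cassandra-connector for the save; the keyspace, column name and the routing predicate are hypothetical stand-ins:

    import org.apache.spark.streaming.dstream.DStream
    import com.datastax.spark.connector._          // saveToCassandra on RDDs, SomeColumns

    // Write each batch only when the per-table filtered data is non-empty.
    def saveNonEmpty(dstream: DStream[String], tables: Seq[String]): Unit =
      dstream.foreachRDD { rdd =>
        tables.foreach { table =>
          val forTable = rdd.filter(record => record.startsWith(table))
          if (forTable.take(1).nonEmpty)           // cheap emptiness check
            forTable.map(Tuple1(_)).saveToCassandra("my_keyspace", table, SomeColumns("payload"))
        }
      }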

Re: Spark steaming : work with collect() but not without collect()

2014-12-11 Thread Gerard Maas
Have you tried with kafkaStream.foreachRDD(rdd => { rdd.foreach(...) }) ? Would that make a difference? On Thu, Dec 11, 2014 at 10:24 AM, david david...@free.fr wrote: Hi, We use the following Spark Streaming code to collect and process Kafka events: kafkaStream.foreachRDD(rdd => {

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Gerard Maas
If the timestamps in the logs are to be trusted, it looks like your driver is dying with that *java.io.FileNotFoundException*, and therefore the workers lose their connection and close down. -kr, Gerard. On Thu, Dec 11, 2014 at 7:39 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Try to add

Re: Session for connections?

2014-12-11 Thread Gerard Maas
I'm doing the same thing for using Cassandra, For Cassandra, use the Spark-Cassandra connector [1], which does the Session management, as described by TD, for you. [1] https://github.com/datastax/spark-cassandra-connector -kr, Gerard. On Thu, Dec 11, 2014 at 1:55 PM, Ashic Mahtab

Re: RDD.aggregate?

2014-12-11 Thread Gerard Maas
There's some explanation and an example here: http://stackoverflow.com/questions/26611471/spark-data-processing-with-grouping/26612246#26612246 -kr, Gerard. On Thu, Dec 11, 2014 at 7:15 PM, ll duy.huynh@gmail.com wrote: any explaination on how aggregate works would be much appreciated. i

Re: Read data from SparkStreaming from Java socket.

2014-12-14 Thread Gerard Maas
Are you using a bufferedPrintWriter? that's probably a different flushing behaviour. Try doing out.flush() after out.write(...) and you will have the same result. This is Spark unrelated btw. -kr, Gerard.

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Gerard Maas
Hi, I don't get what the problem is. That map to selected columns looks like the way to go given the context. What's not working? Kr, Gerard On Dec 14, 2014 5:17 PM, Denny Lee denny.g@gmail.com wrote: I have a large number of files within HDFS that I would like to do a group by statement ala

Re: DStream demultiplexer based on a key

2014-12-14 Thread Gerard Maas
Hi Jean-Pascal, At Virdata we do a similar thing to 'bucketize' our data to different keyspaces in Cassandra. The basic construction would be to filter the DStream (or the underlying RDD) for each key and then apply the usual storage operations on that new data set. Given that, in your case, you

Re: DStream demultiplexer based on a key

2014-12-14 Thread Gerard Maas
in memory anyway? Also any experience with minutes long batch interval? Thanks for the quick answer! On Sun, Dec 14, 2014 at 11:17 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi Jean-Pascal, At Virdata we do a similar thing to 'bucketize' our data to different keyspaces in Cassandra

Re: Data Loss - Spark streaming

2014-12-16 Thread Gerard Maas
Hi Jeniba, The second part of this meetup recording has a very good answer to your question. TD explains the current behavior and the on-going work in Spark Streaming to fix HA. https://www.youtube.com/watch?v=jcJq3ZalXD8 -kr, Gerard. On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson

Re: Why so many tasks?

2014-12-16 Thread Gerard Maas
Creating an RDD from a wildcard like this: val data = sc.textFile("/user/foo/myfiles/*") will create 1 partition for each file found. 1000 files = 1000 partitions. A task is the work of one stage (defined as a sequence of transformations) applied to a single partition, so 1000 partitions = 1000 tasks per stage.
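
A small sketch of how to inspect the partition count and reduce it with coalesce (the target of 100 is arbitrary):

    // Each file matched by the glob becomes at least one partition (large files may split further).
    val data = sc.textFile("/user/foo/myfiles/*")
    println(data.partitions.length)

    // If that produces far more tasks per stage than needed, coalesce to a smaller number:
    val fewer = data.coalesce(100)
    println(fewer.partitions.length)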

Re: Passing Spark Configuration from Driver (Master) to all of the Slave nodes

2014-12-16 Thread Gerard Maas
Hi Demi, Thanks for sharing. What we usually do is let the driver read the configuration for the job and pass the config object to the actual job as a serializable object. That way avoids the need of a centralized config sharing point that needs to be accessed from the workers. as you have

Re: Appending an incrental value to each RDD record

2014-12-16 Thread Gerard Maas
You would do: rdd.zipWithIndex. This gives you an RDD[(Original, Long)] where the second element is the index. To have an (index, original) tuple, you will need to map that previous RDD to the desired shape: rdd.zipWithIndex.map(_.swap) -kr, Gerard. On Tue, Dec 16, 2014 at 4:12 PM,
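
Putting it together as a runnable sketch against an existing SparkContext 'sc':

    val rdd = sc.parallelize(Seq("a", "b", "c"))

    val withIndex  = rdd.zipWithIndex()        // RDD[(String, Long)] -- element first, index second
    val indexFirst = withIndex.map(_.swap)     // RDD[(Long, String)] -- index first

    indexFirst.collect().foreach(println)      // (0,a) (1,b) (2,c)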

Re: spark streaming kafa best practices ?

2014-12-17 Thread Gerard Maas
Patrick, I was wondering why one would choose for rdd.map vs rdd.foreach to execute a side-effecting function on an RDD. -kr, Gerard. On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell pwend...@gmail.com wrote: The second choice is better. Once you call collect() you are pulling all of the

Re: Get the value of DStream[(String, Iterable[String])]

2014-12-17 Thread Gerard Maas
You can create a DStream that contains the count, transforming the grouped windowed RDD, like this: val errorCount = grouping.map{case (k,v) => v.size } If you need to preserve the key: val errorCount = grouping.map{case (k,v) => (k,v.size) } or, if you don't care about the content of the

Re: Does Spark 1.2.0 support Scala 2.11?

2014-12-19 Thread Gerard Maas
Check out the 'compiling for Scala 2.11' instructions: http://spark.apache.org/docs/1.2.0/building-spark.html#building-for-scala-211 -kr, Gerard. On Fri, Dec 19, 2014 at 12:00 PM, Jonathan Chayat jonatha...@supersonic.com wrote: The following ticket:

Re: Scala Lazy values and partitions

2014-12-19 Thread Gerard Maas
It will be instantiated once per VM, which translates to once per executor. -kr, Gerard. On Fri, Dec 19, 2014 at 12:21 PM, Ashic Mahtab as...@live.com wrote: Hi Guys, Are scala lazy values instantiated once per executor, or once per partition? For example, if I have: object Something =
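
A sketch illustrating the once-per-executor behaviour with an object holding a lazy val (the 'Heavy' object and UUID resource are just stand-ins):

    import org.apache.spark.rdd.RDD

    // The object (and its lazy val) lives once per JVM, i.e. once per executor,
    // regardless of how many partitions that executor processes.
    object Heavy {
      lazy val resource: java.util.UUID = {
        println("initializing")                // printed once per executor JVM
        java.util.UUID.randomUUID()
      }
    }

    def tag(rdd: RDD[Int]): RDD[(String, Int)] =
      rdd.mapPartitions(it => it.map(x => (Heavy.resource.toString, x)))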

Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
Hi, After facing issues with the performance of some of our Spark Streaming jobs, we invested quite some effort figuring out the factors that affect the performance characteristics of a Streaming job. We defined an empirical model that helps us reason about Streaming jobs and applied it to tune

Re: Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
mode? I'm making changes to the spark mesos scheduler and I think we can propose a best way to achieve what you mentioned. Tim Sent from my iPhone On Dec 22, 2014, at 8:33 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi, After facing issues with the performance of some of our Spark

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Gerard Maas
Hi, I'm not sure what you are asking: Whether we can use spouts and bolts in Spark (= no) or whether we can do streaming in Spark: http://spark.apache.org/docs/latest/streaming-programming-guide.html -kr, Gerard. On Tue, Dec 23, 2014 at 9:03 AM, Ajay ajay.ga...@gmail.com wrote: Hi, Can

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Gerard Maas
Streaming). The idea is to use Spark as a in-memory computation engine and static data coming from Cassandra/Hbase and streaming data from Storm. Thanks Ajay On Tue, Dec 23, 2014 at 2:03 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, I'm not sure what you are asking: Whether we

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
and post the code (if possible). In a nutshell, your processing time exceeds the batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das

Re: Discourse: A proposed alternative to the Spark User list

2015-01-23 Thread Gerard Maas
+1 On Fri, Jan 23, 2015 at 5:58 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That sounds good to me. Shall I open a JIRA / PR about updating the site community page? On Fri, 23 Jan 2015 at 4:37 AM Patrick Wendell patr...@databricks.com wrote: Hey Nick, So I think what we can

Spark (Streaming?) holding on to Mesos resources

2015-01-26 Thread Gerard Maas
Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs running on the Mesos cluster. For example: This is a job that has spark.cores.max = 4 and spark.executor.memory=3g

Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Gerard Maas
(looks like the list didn't like a HTML table on the previous email. My excuses for any duplicates) Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs running on the

Re: Writing RDD to a csv file

2015-02-03 Thread Gerard Maas
This is more of a Scala question, so probably next time you'd like to address a Scala forum, e.g. http://stackoverflow.com/questions/tagged/scala val optArrStr: Option[Array[String]] = ??? optArrStr.map(arr => arr.mkString(",")).getOrElse("") // empty string or whatever default value you have for this.

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Gerard Maas
I have been contributing to SO for a while now. Here are a few observations I'd like to contribute to the discussion: The level of questions on SO is often more entry-level. Harder questions (that require expertise in a certain area) remain unanswered for a while. Same questions here on the

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
://www.virdata.com/tuning-spark/#Partitions) -kr, Gerard. On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas gerard.m...@gmail.com wrote: So the system has gone from 7msg in 4.961 secs (median) to 106msgs in 4,761 seconds. I think there's evidence that setup costs are quite high in this case

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761 secs. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Gerard Maas
line, that is to fix the number of streams and change the input and output channels dynamically. But could not make it work (seems that the receiver is not allowing any change in the config after it started). thanks, On Wed, Jan 21, 2015 at 10:49 AM, Gerard Maas gerard.m...@gmail.com wrote

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Gerard Maas
One possible workaround could be to orchestrate launch/stopping of Streaming jobs on demand as long as the number of jobs/streams stay within the boundaries of the resources (cores) you've available. e.g. if you're using Mesos, Marathon offers a REST interface to manage job lifecycle. You will

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2015-01-21 Thread Gerard Maas
Hi Mukesh, How are you creating your receivers? Could you post the (relevant) code? -kr, Gerard. On Wed, Jan 21, 2015 at 9:42 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hello Guys, I've re partitioned my kafkaStream so that it gets evenly distributed among the executors and the results

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-17 Thread Gerard Maas
+1 for TypeSafe config Our practice is to include all spark properties under a 'spark' entry in the config file alongside job-specific configuration: A config file would look like: spark { master = cleaner.ttl = 123456 ... } job { context { src = foo action =
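
A small sketch of that pattern, assuming the Typesafe config library and an application.conf with a 'spark' section as above; the property-copy loop is my own illustration:

    import com.typesafe.config.ConfigFactory
    import org.apache.spark.SparkConf
    import scala.collection.JavaConverters._

    // Load application.conf and copy every entry under the 'spark' section into the SparkConf,
    // leaving the rest of the file ('job { ... }') to the job's own settings.
    val config       = ConfigFactory.load()
    val sparkSection = config.getConfig("spark")

    val conf = new SparkConf().setAppName("config-driven-job")
    sparkSection.entrySet().asScala.foreach { entry =>
      conf.set("spark." + entry.getKey, entry.getValue.unwrapped().toString)
    }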

Re: Spark (Streaming?) holding on to Mesos resources

2015-01-29 Thread Gerard Maas
://issues.apache.org/jira/browse/MESOS-1688 On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs

Re: how to split key from RDD for compute UV

2015-01-27 Thread Gerard Maas
Hi, Did you try asking this on StackOverflow? http://stackoverflow.com/questions/tagged/apache-spark I'd also suggest adding some sample data to help others understand your logic. -kr, Gerard. On Tue, Jan 27, 2015 at 1:14 PM, 老赵 laozh...@sina.cn wrote: Hello All, I am writing a simple

Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Gerard Maas
Jan 2015 22:28, Gerard Maas gerard.m...@gmail.com wrote: (looks like the list didn't like a HTML table on the previous email. My excuses for any duplicates) Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread Gerard Maas
Hi, Could you add the code where you create the Kafka consumer? -kr, Gerard. On Wed, Jan 7, 2015 at 3:43 PM, francois.garil...@typesafe.com wrote: Hi Mukesh, If my understanding is correct, each Stream only has a single Receiver. So, if you have each receiver consuming 9 partitions, you

Re: Registering custom metrics

2015-01-08 Thread Gerard Maas
) bytes } .saveAsTextFile(text) Is there a way to achieve this with the MetricSystem? ᐧ On Mon, Jan 5, 2015 at 10:24 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi, Yes, I managed to create a register custom metrics by creating an implementation

Re: Join RDDs with DStreams

2015-01-08 Thread Gerard Maas
You are looking for dstream.transform(rdd => rdd.op(otherRdd)) The docs contain an example on how to use transform. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams -kr, Gerard. On Thu, Jan 8, 2015 at 5:50 PM, Asim Jalis asimja...@gmail.com
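
A minimal sketch of transform joining each micro-batch against a static reference RDD (the reference data and socket source are hypothetical):

    import org.apache.spark.SparkContext._                 // pair-RDD operations (Spark 1.x)
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.StreamingContext

    // 'ssc' is an existing StreamingContext.
    def enrich(ssc: StreamingContext): DStream[(String, (String, String))] = {
      val reference = ssc.sparkContext.parallelize(Seq(("user1", "gold"), ("user2", "silver")))
      val events    = ssc.socketTextStream("localhost", 9999).map(line => (line, "event"))
      // transform gives access to the underlying RDD of each batch, so regular RDD ops apply.
      events.transform(rdd => rdd.join(reference))
    }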

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread Gerard Maas
), kafkaStreams); On Wed, Jan 7, 2015 at 8:21 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, Could you add the code where you create the Kafka consumer? -kr, Gerard. On Wed, Jan 7, 2015 at 3:43 PM, francois.garil...@typesafe.com wrote: Hi Mukesh, If my understanding is correct, each

Re: Writing Spark Streaming Programs

2015-03-19 Thread Gerard Maas
Try writing this Spark Streaming idiom in Java and you'll choose Scala soon enough: dstream.foreachRDD{ rdd => rdd.foreachPartition( partition => ...) } When deciding between Java and Scala for Spark, IMHO Scala has the upper hand. If you're concerned with readability, have a look at the Scala

Re: Unable to saveToCassandra while cassandraTable works fine

2015-03-12 Thread Gerard Maas
This: java.lang.NoSuchMethodError almost always indicates a version conflict somewhere. It looks like you are using Spark 1.1.1 with the cassandra-spark connector 1.2.0. Try aligning those. Those metrics were introduced recently in the 1.2.0 branch of the cassandra connector. Either upgrade your

Re: Partitioning

2015-03-13 Thread Gerard Maas
In spark-streaming, the consumers will fetch data and put it into 'blocks'. Each block becomes a partition of the RDD generated during that batch interval. The size of each block is controlled by the conf: 'spark.streaming.blockInterval'. That is, the amount of data the consumer can collect in
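
For reference, a sketch of where that knob is set (200ms is the default block interval, shown explicitly here):

    import org.apache.spark.SparkConf

    // Partitions per receiver per batch ~ batchInterval / blockInterval,
    // e.g. a 10s batch with a 200ms block interval yields roughly 50 blocks/partitions.
    // Value format: plain milliseconds in Spark 1.x ("200"); later versions also accept "200ms".
    val conf = new SparkConf()
      .setAppName("streaming-partitioning")
      .set("spark.streaming.blockInterval", "200")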
