Re: Sessionization using updateStateByKey

2015-07-15 Thread Sean McNamara
I would just like to add that we do something very similar at Webtrends; updateStateByKey has been a life-saver for our sessionization use cases. Cheers, Sean On Jul 15, 2015, at 11:20 AM, Silvio Fiorito silvio.fior...@granturing.com wrote: Hi Cody,
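A minimal sketch of the updateStateByKey sessionization pattern discussed in this thread (the Session type and field names are illustrative, not Webtrends' actual code; note that updateStateByKey also requires a checkpoint directory via ssc.checkpoint):

    import org.apache.spark.streaming.dstream.DStream

    case class Session(eventCount: Long, lastSeenMs: Long)

    // events: (userId, eventTimestampMs)
    def sessionize(events: DStream[(String, Long)]): DStream[(String, Session)] =
      events.updateStateByKey[Session] { (batch: Seq[Long], state: Option[Session]) =>
        val prev = state.getOrElse(Session(0L, 0L))
        val lastSeen =
          if (batch.isEmpty) prev.lastSeenMs
          else math.max(prev.lastSeenMs, batch.max)
        // Return None instead of Some(...) to expire the session and drop the key.
        Some(Session(prev.eventCount + batch.size, lastSeen))
      }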

Re: Counters in Spark

2015-02-13 Thread Sean McNamara
.map is just a transformation, so no work is actually performed until an action is called on the result. Try adding a .count(), like so: inputRDD.map { x => counter += 1 }.count() In case it is helpful, here are the docs on what exactly the transformations and actions are:
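The counter in that snippet only illustrates the laziness point; a closure-captured variable is copied to each executor, so its updates never make it back to the driver on a real cluster. A minimal sketch of both points using an accumulator, Spark's mechanism for counters (sc.longAccumulator is the Spark 2.x API; sc.accumulator was the 1.x equivalent):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("counter-example").setMaster("local[*]"))
    val counter = sc.longAccumulator("rows seen")
    val inputRDD = sc.parallelize(1 to 100)

    // map is lazy: nothing has executed yet, so counter.value is still 0
    val mapped = inputRDD.map { x => counter.add(1); x }

    // count() is an action, so the map (and the accumulator updates) run now
    mapped.count()
    println(counter.value)  // 100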

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Sean McNamara
We have gone down a similar path at Webtrends; Spark has worked amazingly well for us in this use case. Our solution goes from REST directly into Spark and back out to the UI instantly. Here is the resulting product in case you are curious (and please pardon the self-promotion):

Re: Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Sean McNamara
Hi Kane - http://spark.apache.org/docs/latest/tuning.html has excellent information that may be helpful. In particular, increasing the number of tasks may help, as well as confirming that you don't have more data landing on a single key than you expect. Also, if you are using Spark 1.2.0,
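A minimal sketch of the "increase the number of tasks" suggestion, assuming an RDD of key/value pairs (the names and the partition count are illustrative):

    import org.apache.spark.rdd.RDD

    // An explicit numPartitions spreads the shuffle over more, smaller
    // tasks, lowering the per-task memory pressure in reduceByKey.
    def sumByKey(pairs: RDD[(String, Long)]): RDD[(String, Long)] =
      pairs.reduceByKey(_ + _, numPartitions = 2000)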

Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Sean McNamara
"Can the 0.8.1.1 client still talk to 0.8.0 versions of Kafka?" Yes, it can. 0.8.1 is fully compatible with 0.8; it is buried on this page: http://kafka.apache.org/documentation.html In addition to the pom version bump, SPARK-2492 would bring the Kafka streaming receiver (which was originally

Re: foreachPartition and task status

2014-10-14 Thread Sean McNamara
Are you using Spark Streaming? On Oct 14, 2014, at 10:35 AM, Salman Haq sal...@revmetrix.com wrote: Hi, In my application, I am successfully using foreachPartition to write large amounts of data into a Cassandra database. What is the recommended practice if the application wants to know
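A minimal sketch of the foreachPartition pattern the question describes; openSession and write are hypothetical stand-ins, not a specific Cassandra driver API:

    import org.apache.spark.rdd.RDD

    def save(records: RDD[(String, String)]): Unit =
      records.foreachPartition { partition =>
        // Hypothetical helper: open one connection per partition, not per record
        val session = openSession()
        try partition.foreach { case (key, value) => session.write(key, value) }  // hypothetical write call
        finally session.close()
      }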

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Sean McNamara
Would you mind sharing the code leading to your createStream? Are you also setting group.id? Thanks, Sean On Oct 10, 2014, at 4:31 PM, Abraham Jacob abe.jac...@gmail.com wrote: Hi Folks, I am seeing some strange behavior when using the Spark Kafka connector in Spark Streaming. I
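A minimal sketch of the code leading up to createStream with the Spark 1.x Kafka receiver, showing where the consumer group.id goes (hostnames, group, and topic names are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-stream-example")
    val ssc = new StreamingContext(conf, Seconds(10))

    val stream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",        // ZooKeeper quorum
      "my-consumer-group",   // the consumer group.id asked about above
      Map("my-topic" -> 1))  // topic -> number of receiver threads

    stream.print()
    ssc.start()
    ssc.awaitTermination()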

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Sean McNamara
PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Would you mind sharing the code leading to your createStream? Are you also setting group.id? Thanks, Sean On Oct 10, 2014, at 4:31 PM, Abraham Jacob abe.jac...@gmail.com

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Sean McNamara
it (https://issues.apache.org/jira/browse/SPARK-2492). Thanks Jerry From: Abraham Jacob [mailto:abe.jac...@gmail.com] Sent: Saturday, October 11, 2014 6:57 AM To: Sean McNamara Cc: user@spark.apache.org Subject: Re: Spark Streaming KafkaUtils Issue

Re: balancing RDDs

2014-06-25 Thread Sean McNamara
subsequent queries unless task scheduling time beats spark.locality.wait. This can cause overall low performance for all subsequent tasks. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Jun 24, 2014 at 4:10 AM, Sean
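A minimal sketch of the spark.locality.wait setting referenced above; the value shown is illustrative (the default is 3s, and "0s" disables the wait so tasks schedule immediately at a less-local level):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("locality-example")
      // How long the scheduler waits for a data-local slot before
      // falling back to a less-local one.
      .set("spark.locality.wait", "0s")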

balancing RDDs

2014-06-23 Thread Sean McNamara
We have a use case where we'd like something to execute once on each node, and I thought it would be good to ask here. Currently we achieve this by setting the parallelism to the number of nodes and using a mod partitioner: val balancedRdd = sc.parallelize( (0 until Settings.parallelism)
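A hedged reconstruction of where that truncated snippet appears to be going, based on the description "parallelism set to the number of nodes plus a mod partitioner" (a literal stands in for Settings.parallelism; the rest is an assumption, not the original code):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("balance-example"))
    val numNodes = 8  // illustrative stand-in for Settings.parallelism

    // Integer keys 0..n-1 hash to themselves, so HashPartitioner(n)
    // acts as a mod partitioner: element i lands in partition i.
    val balancedRdd = sc
      .parallelize((0 until numNodes).map(i => (i, i)), numNodes)
      .partitionBy(new HashPartitioner(numNodes))

    // With one partition per node, each node runs the per-node work once.
    balancedRdd.foreachPartition { _ =>
      // node-local setup goes here
    }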