Running Spark in Local Mode vs. Single Node Cluster

2014-09-22 Thread kriskalish
I'm in a situation where I'm running Spark Streaming on a single machine right now. The plan is to ultimately run it on a cluster, but for the next couple of months it will probably stay on one machine. I tried to do some digging and I can't find any indication of whether it's better to run Spark as

Re: Weird aggregation results when reusing objects inside reduceByKey

2014-09-22 Thread kriskalish
Thanks for the insight, I didn't realize there was internal object reuse going on. Is this a mechanism of Scala/Java or is this a mechanism of Spark? I actually just converted the code to use immutable case classes everywhere, so it will be a little tricky to test foldByKey(). I'll try to get to

Weird aggregation results when reusing objects inside reduceByKey

2014-09-15 Thread kriskalish
I have a pretty simple Scala Spark aggregation job that sums up the number of occurrences of two types of events. I have run into situations where it seems to generate values that are clearly incorrect when checked against the raw data. First I have a Record object which I use to do my

Computing mean and standard deviation by key

2014-08-01 Thread kriskalish
I have what seems like a relatively straightforward task to accomplish, but I cannot seem to figure it out from the Spark documentation or searching the mailing list. I have an RDD[(String, MyClass)] that I would like to group by the key, and calculate the mean and standard deviation of the foo