I'm in a situation where I'm running Spark Streaming on a single machine
right now. The plan is to ultimately run it on a cluster, but for the next
couple of months it will probably stay on one machine.
I tried to do some digging and I can't find any indication of whether it's
better to run Spark in local mode or as a single-node standalone cluster.
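To make the question concrete, here's a rough sketch of the two single-machine
setups I'm weighing; the app name, source, port, and batch interval are just
placeholders for my real job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SingleMachineStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("single-machine-streaming")
      // Option A: local mode, one JVM, one worker thread per core
      .setMaster("local[*]")
      // Option B: a standalone master/worker started on this same machine
      // .setMaster("spark://localhost:7077")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source and output, standing in for the real streaming job
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}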
Thanks for the insight; I didn't realize there was internal object reuse
going on. Is this a mechanism of Scala/Java, or is it a mechanism of Spark?
I actually just converted the code to use immutable case classes everywhere,
so it will be a little tricky to test foldByKey(). I'll try to get to it when
I can.
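For what it's worth, when I do get to it, my foldByKey() test will probably
look something like the stripped-down sketch below (the Record fields here
are made up; the real ones track my two event types). Since the fold function
returns a new Record instead of mutating one, reuse of the zero value
shouldn't be able to skew the totals.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical immutable record; the real fields track the two event types.
case class Record(typeA: Long, typeB: Long) {
  def merge(other: Record): Record =
    Record(typeA + other.typeA, typeB + other.typeB)
}

object FoldByKeyTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("foldByKey-test").setMaster("local[*]"))

    val events = sc.parallelize(Seq(
      ("k1", Record(1L, 0L)),
      ("k1", Record(0L, 1L)),
      ("k2", Record(1L, 0L))))

    // The zero value is immutable and merge() always returns a fresh Record,
    // so nothing in the fold ever mutates a shared object.
    val totals = events.foldByKey(Record(0L, 0L))((a, b) => a.merge(b))

    totals.collect().foreach(println)
    sc.stop()
  }
}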
I have a pretty simple Scala Spark aggregation job that sums up the number of
occurrences of two types of events. I have run into situations where it
generates values that are clearly incorrect once I check them against the
raw data.
First, I have a Record object which I use to do my counting.
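The gist of the job, with made-up field names and a toy input standing in for
the real data, is something like this: each raw line becomes a (key, Record)
pair holding one occurrence of one event type, and the records are then
summed per key.

import org.apache.spark.{SparkConf, SparkContext}

// Simplified stand-in for my Record class: one counter per event type.
case class Record(typeA: Long, typeB: Long)

object EventCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("event-counts").setMaster("local[*]"))

    // Toy input in place of the real data: "key,eventType" per line
    val raw = sc.parallelize(Seq("k1,typeA", "k1,typeB", "k1,typeA", "k2,typeB"))

    val counts = raw
      .map { line =>
        val Array(key, eventType) = line.split(",")
        // Each line contributes one occurrence of exactly one event type
        val record =
          if (eventType == "typeA") Record(1L, 0L) else Record(0L, 1L)
        (key, record)
      }
      .reduceByKey((a, b) => Record(a.typeA + b.typeA, a.typeB + b.typeB))

    counts.collect().foreach(println)
    sc.stop()
  }
}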
I have what seems like a relatively straightforward task to accomplish, but I
cannot seem to figure it out from the Spark documentation or by searching the
mailing list.
I have an RDD[(String, MyClass)] that I would like to group by the key, and
then calculate the mean and standard deviation of the foo field for each key.
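The best I have come up with so far is to accumulate (count, sum, sum of
squares) per key with aggregateByKey and derive the two statistics from those
at the end, roughly as in the sketch below (I'm assuming here that MyClass
just wraps a numeric foo), but I'm not sure whether this is the idiomatic way
to do it.

import org.apache.spark.{SparkConf, SparkContext}

// Assumed shape of MyClass: it just carries a numeric foo value.
case class MyClass(foo: Double)

object MeanAndStdDev {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mean-stddev").setMaster("local[*]"))

    val data = sc.parallelize(Seq(
      ("a", MyClass(1.0)), ("a", MyClass(3.0)), ("b", MyClass(2.0))))

    val stats = data
      .mapValues(_.foo)
      // Accumulate (count, sum, sum of squares) per key without grouping
      .aggregateByKey((0L, 0.0, 0.0))(
        { case ((n, s, sq), x) => (n + 1, s + x, sq + x * x) },
        { case ((n1, s1, sq1), (n2, s2, sq2)) => (n1 + n2, s1 + s2, sq1 + sq2) })
      .mapValues { case (n, s, sq) =>
        val mean = s / n
        val stddev = math.sqrt(sq / n - mean * mean) // population std deviation
        (mean, stddev)
      }

    stats.collect().foreach(println)
    sc.stop()
  }
}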