Hi All,
After some hair pulling, I've reached the realisation that an operation I
am currently doing via:
myRDD.groupByKey.mapValues(func)
should be done more efficiently using aggregateByKey or combineByKey. Both
of these methods would do, and they seem very similar to me in terms of
their ...
..., mergeCombiners.
Hope this helps!
Liquan
On Sun, Sep 28, 2014 at 11:59 PM, David Rowe davidr...@gmail.com wrote the message quoted above.
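For reference, here is a minimal sketch of the same aggregation written all three ways, assuming sc is an existing SparkContext, the pairs are (String, Double), and that func amounts to a simple per-key sum; the data and the name rdd are illustrative, not taken from the thread.

import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.x style)

// Illustrative data; "func" is stood in for by a per-key sum.
val rdd = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)))

// 1) groupByKey + mapValues: every value is shuffled across the network
//    before the function ever runs.
val viaGroup = rdd.groupByKey().mapValues(_.sum)

// 2) aggregateByKey: a zero value plus a within-partition op and a
//    cross-partition op, so partial sums are combined map-side.
val viaAggregate = rdd.aggregateByKey(0.0)(_ + _, _ + _)

// 3) combineByKey: the most general form, built from createCombiner,
//    mergeValue and mergeCombiners.
val viaCombine = rdd.combineByKey(
  (v: Double) => v,                    // createCombiner
  (acc: Double, v: Double) => acc + v, // mergeValue
  (a: Double, b: Double) => a + b)     // mergeCombiners

The flip side is that groupByKey is still the right choice when the function genuinely needs every value for a key at once rather than a running combination.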
Hi Andrew,
I can't speak for Theodore, but I would find that incredibly useful.
Dave
On Wed, Sep 24, 2014 at 11:24 AM, Andrew Ash and...@andrewash.com wrote:
Hi Theodore,
What do you mean by module diagram? A high level architecture diagram of
how the classes are organized into packages?
Hi,
I've seen this problem before, and I'm not convinced it's GC.
When Spark shuffles, it writes a lot of small files to store the data to be
sent to other executors (AFAICT). According to what I've read around the
place, the intention is that these files be stored in disk buffers, and
since ...
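For anyone tuning this, here is a minimal sketch of the shuffle settings that were relevant around Spark 1.1; the app name and master are placeholders, and which of these applies depends on the Spark version in use.

import org.apache.spark.{SparkConf, SparkContext}

// With the hash-based shuffle, each map task writes one file per reduce
// partition, so a stage can create roughly (map tasks) x (reduce partitions)
// small files. Consolidating them, or using the sort-based shuffle added in
// Spark 1.1, reduces that count.
val conf = new SparkConf()
  .setAppName("shuffle-tuning-sketch")            // placeholder
  .setMaster("local[2]")                          // placeholder; set for your cluster
  .set("spark.shuffle.consolidateFiles", "true")  // applies to the hash shuffle
  .set("spark.shuffle.manager", "sort")           // sort-based shuffle (1.1+)
val sc = new SparkContext(conf)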
I generally call values.stats, e.g.:
val stats = myPairRdd.values.stats
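To make that concrete, a minimal sketch, assuming sc is an existing SparkContext and the values are Doubles; the data is made up.

import org.apache.spark.SparkContext._ // implicit DoubleRDDFunctions (Spark 1.x style)

val myPairRdd = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

// .values drops the keys, giving an RDD[Double]; .stats() then computes
// count, mean, stdev, min and max over all values in a single pass.
val stats = myPairRdd.values.stats()
println(stats.mean)  // 2.0
println(stats.stdev)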
On Fri, Sep 12, 2014 at 4:46 PM, rzykov rzy...@gmail.com wrote:
Is it possible to use DoubleRDDFunctions
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
for calculating mean ...
Oh I see, I think you're trying to do something like (in SQL):
SELECT order, AVG(price) FROM orders GROUP BY order
In this case, I'm not aware of a way to use the DoubleRDDFunctions, since
you have a single RDD of pairs where each pair is of type (KeyType,
Iterable[Double]).
It seems to me ...
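A minimal sketch of that per-key mean without groupByKey, assuming an RDD[(String, Double)] of (order id, price) pairs and an existing SparkContext sc; the names and data are illustrative. aggregateByKey keeps only a running (sum, count) per key, so no Iterable[Double] is ever materialised.

import org.apache.spark.SparkContext._

val orders = sc.parallelize(Seq(("o1", 10.0), ("o1", 20.0), ("o2", 5.0)))

val meanByKey = orders
  .aggregateByKey((0.0, 0L))(
    (acc, price) => (acc._1 + price, acc._2 + 1), // fold a price into (sum, count)
    (a, b) => (a._1 + b._1, a._2 + b._2))         // merge partial (sum, count) pairs
  .mapValues { case (sum, count) => sum / count }

meanByKey.collect().foreach(println) // (o1,15.0), (o2,5.0)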