Also, you may want to use .lookup() instead of .filter()
def lookup(key: K): Seq[V]
Return the list of values in the RDD for key key. This operation is done
efficiently if the RDD has a known partitioner by only searching the
partition that the key maps to.
You might want to partition your first
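To make the point concrete, here is a plain-Python toy model (not Spark; all names here are invented for illustration) of why a lookup on an RDD with a known partitioner only has to scan the one partition the key hashes to, while a filter scans every partition:

```python
# Toy model of a hash-partitioned key-value RDD.
NUM_PARTITIONS = 4  # arbitrary choice for this sketch

def partition_of(key):
    # stand-in for Spark's HashPartitioner
    return hash(key) % NUM_PARTITIONS

def partition_by(pairs):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for k, v in pairs:
        parts[partition_of(k)].append((k, v))
    return parts

def lookup(parts, key):
    # scans only the partition the key maps to
    return [v for k, v in parts[partition_of(key)] if k == key]

def filter_lookup(parts, key):
    # scans every partition, like rdd.filter(_._1 == key)
    return [v for part in parts for k, v in part if k == key]

parts = partition_by([("a", 1), ("b", 2), ("a", 3)])
assert lookup(parts, "a") == filter_lookup(parts, "a") == [1, 3]
```

Both return the same values; the difference is how much data gets touched to find them.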
shahabm wrote
I noticed that rdd.cache() does not take effect immediately; rather, due to Spark's lazy evaluation, it happens only at the moment you perform some map/reduce actions. Is this true?
Yes, .cache() is lazily evaluated: nothing is materialized until an action runs.
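A minimal sketch of that lazy-evaluation behavior, in plain Python rather than Spark (the LazyRDD class is made up for illustration): transformations only record work to do, and an action is what actually triggers it.

```python
# Toy model of lazy evaluation: .map() records a function,
# .collect() (an action) is what actually runs the pipeline.
class LazyRDD:
    def __init__(self, data, transforms=None):
        self.data = data
        self.transforms = transforms or []

    def map(self, f):  # transformation: returns a new lazy RDD, runs nothing
        return LazyRDD(self.data, self.transforms + [f])

    def collect(self):  # action: forces evaluation
        out = self.data
        for f in self.transforms:
            out = [f(x) for x in out]
        return out

calls = []
rdd = LazyRDD([1, 2, 3]).map(lambda x: (calls.append(x), x * 2)[1])
assert calls == []                 # nothing has run yet
assert rdd.collect() == [2, 4, 6]  # the action triggers the map
assert calls == [1, 2, 3]
```

.cache() fits the same pattern: it marks the RDD for persistence, and the data is only stored when the first action forces it to be computed.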
shahabm wrote
If this is the case, how can I
nsareen wrote
1) Does the filter function scan every element saved in the RDD? If my RDD
represents 10 million rows, and I want to work on only 1000 of them, is
there an efficient way of filtering the subset without having to scan
every element?
Using .take(1000) may give you a biased sample.
Also available is .sample(), which will randomly sample your RDD with or
without replacement and return an RDD.
Note that .sample() takes a fraction, so it doesn't return an exact number
of elements, e.g.:
rdd.sample(true, .0001, 1)
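A plain-Python sketch of why the returned size varies (this is stdlib Python, not Spark): a fraction-based sample without replacement amounts to an independent coin flip per element, so the count comes out near fraction * n rather than exactly at it. (Sampling with replacement, as in the `true` above, instead draws a per-element count, but the size is similarly inexact.)

```python
import random

# Bernoulli sampling: keep each element independently with
# probability `fraction`; the sample size varies around fraction * n.
rng = random.Random(1)   # fixed seed, like the 1 in rdd.sample(true, .0001, 1)
n, fraction = 100_000, 0.001
sample = [x for x in range(n) if rng.random() < fraction]
# len(sample) is close to 100 here, but generally not exactly 100
```

If you need an exact number of elements, Spark also provides .takeSample(withReplacement, num, seed), which returns them as a local array rather than an RDD.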
I'm not sure about .union(), but at least in the case of .join(), as long as
you have hash partitioned the original RDDs and persisted them, calls to
.join() take advantage of already knowing which partition the keys are on,
and will not repartition rdd1.
val rdd1 = log.partitionBy(new HashPartitioner(numPartitions)).persist() // numPartitions chosen to suit your data
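A plain-Python toy model (not Spark; all names invented for illustration) of why the join avoids a shuffle: once both datasets are partitioned by the same hash partitioner, matching keys already live in the partition with the same index, so each partition pair can be joined locally with no data movement.

```python
# Toy model of joining two co-partitioned key-value datasets.
NUM_PARTITIONS = 4

def partition_by(pairs):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for k, v in pairs:
        parts[hash(k) % NUM_PARTITIONS].append((k, v))
    return parts

def local_join(p1, p2):
    # join one co-located partition pair; no cross-partition traffic
    right = {}
    for k, v in p2:
        right.setdefault(k, []).append(v)
    return [(k, (v, w)) for k, v in p1 for w in right.get(k, [])]

rdd1 = partition_by([("a", 1), ("b", 2)])
rdd2 = partition_by([("a", 10), ("c", 30)])
joined = [pair for p1, p2 in zip(rdd1, rdd2) for pair in local_join(p1, p2)]
assert joined == [("a", (1, 10))]
```

Without the shared partitioner, matching keys could sit in different partitions and the join would first have to repartition (shuffle) one or both sides.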
Hi,
In the talk A Deeper Understanding of Spark Internals, it was mentioned
that for some operators, Spark can spill to disk across keys (in 1.1:
.groupByKey(), .reduceByKey(), .sortByKey()), but that, as a limitation of
the shuffle at that time, each single key-value pair still had to fit in memory.
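A plain-Python sketch of the distinction behind that limitation (toy code, not Spark internals): groupByKey materializes every value for a key as one list, which is the key-value pair that must fit in memory, while reduceByKey folds values into a single running result per key as they arrive.

```python
# Toy contrast between grouping and reducing by key.
def group_by_key(pairs):
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)  # all values kept per key
    return groups

def reduce_by_key(pairs, f):
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v  # one value per key
    return acc

pairs = [("a", 1), ("a", 2), ("b", 5), ("a", 3)]
assert group_by_key(pairs) == {"a": [1, 2, 3], "b": [5]}
assert reduce_by_key(pairs, lambda x, y: x + y) == {"a": 6, "b": 5}
```

This is why a hot key can blow up a groupByKey even when the rest of the shuffle spills fine: its single grouped value grows with every occurrence of the key.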