Re: Does filter on an RDD scan every data item ?

2014-12-11 Thread dsiegel
Also, you may want to use .lookup() instead of .filter():

    def lookup(key: K): Seq[V]

Return the list of values in the RDD for the given key. This operation is done efficiently if the RDD has a known partitioner, by searching only the partition that the key maps to. You might want to partition your first ...
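A minimal sketch of the pattern, assuming the spark-shell context (sc and the pair-RDD implicits in scope) and made-up data: hash-partition the pair RDD and persist it, so that .lookup() only scans the one partition the key hashes to.

    import org.apache.spark.HashPartitioner

    // Hash-partition and persist so the partitioner is known and reused.
    val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
      .partitionBy(new HashPartitioner(8))
      .persist()

    // Only the partition that key 2 hashes to is searched.
    val values: Seq[String] = pairs.lookup(2)  // Seq("b")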

Re: How to enforce RDD to be cached?

2014-12-03 Thread dsiegel
shahabm wrote: "I noticed that rdd.cache() is not happening immediately; rather, due to the lazy nature of Spark, it happens just at the moment you perform some map/reduce actions. Is this true?"

Yes: .cache() is lazy. It only marks the RDD for caching; the data is actually materialized in the cache the first time an action computes it.

shahabm wrote: "If this is the case, how can I ..."
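A minimal sketch of the behavior, assuming the spark-shell context (sc in scope) and a hypothetical input file:

    // .cache() only marks the RDD; an action is what fills the cache.
    val rdd = sc.textFile("data.txt").map(_.length)  // "data.txt" is hypothetical
    rdd.cache()   // nothing computed or cached yet
    rdd.count()   // first action: computes the RDD and populates the cache
    rdd.count()   // subsequent actions read from the cache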

Re: Does filter on an RDD scan every data item ?

2014-12-03 Thread dsiegel
nsareen wrote: "1) Does the filter function scan every element saved in the RDD? If my RDD represents 10 million rows, and I want to work on only 1000 of them, is there an efficient way of filtering the subset without having to scan every element?"

Using .take(1000) may give a biased sample. You ...
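A minimal sketch of the trade-off, with made-up data: .filter() evaluates its predicate against every element, while .take(n) pulls just enough elements from the first partition(s), which is cheap but biased toward the start of the dataset.

    val rdd = sc.parallelize(1 to 10000000)
    val filtered = rdd.filter(_ % 1000 == 0)   // scans all partitions
    val first1000 = rdd.take(1000)             // fast, but just the first
                                               // elements, not a random subset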

Re: Does filter on an RDD scan every data item ?

2014-12-03 Thread dsiegel
Also available is .sample(), which will randomly sample your RDD, with or without replacement, and returns an RDD. .sample() takes a fraction, so it doesn't return an exact number of elements, e.g. rdd.sample(true, 0.0001, 1).
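A minimal sketch, with made-up data. Note that if you need an exact number of sampled elements, .takeSample() is an action that returns exactly that many (so the result must fit on the driver):

    val rdd = sc.parallelize(1 to 10000000)
    // Roughly fraction * count elements, returned as an RDD.
    val approx = rdd.sample(withReplacement = true, fraction = 0.0001, seed = 1)
    // Exactly num elements, returned as a local Array.
    val exact = rdd.takeSample(withReplacement = false, num = 1000, seed = 1)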

Re: Insert new data into specific partition of an RDD

2014-12-03 Thread dsiegel
I'm not sure about .union(), but at least in the case of .join(), as long as you have hash-partitioned the original RDDs and persisted them, calls to .join() take advantage of already knowing which partition the keys are on, and will not repartition rdd1.

    val rdd1 = log.partitionBy(new ...
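The original snippet is truncated; here is a minimal sketch of the pattern it describes, with log and other as hypothetical pair RDDs:

    import org.apache.spark.HashPartitioner

    // rdd1 carries a known partitioner and is persisted, so the join can
    // route other's records to rdd1's existing partitions instead of
    // reshuffling rdd1.
    val rdd1 = log.partitionBy(new HashPartitioner(100)).persist()
    val joined = rdd1.join(other)  // rdd1 is not repartitioned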

single key-value pair fitting in memory

2014-12-03 Thread dsiegel
Hi, in the talk "A Deeper Understanding of Spark Internals" it was mentioned that for some operators Spark can spill to disk across keys (in 1.1: .groupByKey(), .reduceByKey(), .sortByKey()), but that, as a limitation of the shuffle at that time, each single key-value pair must fit in memory.
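A minimal sketch of why that limit matters, with made-up data: after .groupByKey(), the value of a single pair is the entire collection of values for that key, so one hot key can exceed memory, whereas .reduceByKey() combines values map-side and never materializes the full group as one pair.

    val pairs = sc.parallelize(Seq(("hot", 1), ("hot", 1), ("cold", 1)))
    val grouped = pairs.groupByKey()        // one pair holds all of "hot"'s values
    val counts = pairs.reduceByKey(_ + _)   // per-key result stays a single Int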