Also, you may want to use .lookup() instead of .filter()
def lookup(key: K): Seq[V]
Return the list of values in the RDD for key key. This operation is done
efficiently if the RDD has a known partitioner by only searching the
partition that the key maps to.
You might want to partition your first
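To make the point concrete, here is a plain-Python toy model (not Spark; all names here are invented for illustration) of why a lookup on an RDD with a known partitioner only has to scan the one partition the key hashes to, while a filter scans every partition:

```python
# Toy model of a hash-partitioned key-value RDD.
NUM_PARTITIONS = 4  # arbitrary choice for this sketch

def partition_of(key):
    # stand-in for Spark's HashPartitioner
    return hash(key) % NUM_PARTITIONS

def partition_by(pairs):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for k, v in pairs:
        parts[partition_of(k)].append((k, v))
    return parts

def lookup(parts, key):
    # scans only the partition the key maps to
    return [v for k, v in parts[partition_of(key)] if k == key]

def filter_lookup(parts, key):
    # scans every partition, like rdd.filter(_._1 == key)
    return [v for part in parts for k, v in part if k == key]

parts = partition_by([("a", 1), ("b", 2), ("a", 3)])
assert lookup(parts, "a") == filter_lookup(parts, "a") == [1, 3]
```

Both return the same values; the difference is how much data gets touched to find them.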
shahabm wrote
I noticed that rdd.cache() does not take effect immediately; rather, due to Spark's lazy evaluation, it happens only at the moment you perform some map/reduce actions. Is this true?
Yes, .cache() is lazily evaluated: nothing is materialized until an action runs.
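A minimal sketch of that lazy-evaluation behavior, in plain Python rather than Spark (the LazyRDD class is made up for illustration): transformations only record work to do, and an action is what actually triggers it.

```python
# Toy model of lazy evaluation: .map() records a function,
# .collect() (an action) is what actually runs the pipeline.
class LazyRDD:
    def __init__(self, data, transforms=None):
        self.data = data
        self.transforms = transforms or []

    def map(self, f):  # transformation: returns a new lazy RDD, runs nothing
        return LazyRDD(self.data, self.transforms + [f])

    def collect(self):  # action: forces evaluation
        out = self.data
        for f in self.transforms:
            out = [f(x) for x in out]
        return out

calls = []
rdd = LazyRDD([1, 2, 3]).map(lambda x: (calls.append(x), x * 2)[1])
assert calls == []                 # nothing has run yet
assert rdd.collect() == [2, 4, 6]  # the action triggers the map
assert calls == [1, 2, 3]
```

.cache() fits the same pattern: it marks the RDD for persistence, and the data is only stored when the first action forces it to be computed.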
shahabm wrote
If this is the case, how can I
nsareen wrote
1) Does the filter function scan every element saved in the RDD? If my RDD
represents 10 million rows, and I want to work on only 1000 of them, is
there an efficient way of filtering the subset without having to scan
every element?
Using .take(1000) may give you a biased sample.
Also available is .sample(), which will randomly sample your RDD with or
without replacement and return an RDD.
Note that .sample() takes a fraction, so it doesn't return an exact number
of elements, e.g.:
rdd.sample(true, .0001, 1)
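A plain-Python sketch of why the returned size varies (this is stdlib Python, not Spark): a fraction-based sample without replacement amounts to an independent coin flip per element, so the count comes out near fraction * n rather than exactly at it. (Sampling with replacement, as in the `true` above, instead draws a per-element count, but the size is similarly inexact.)

```python
import random

# Bernoulli sampling: keep each element independently with
# probability `fraction`; the sample size varies around fraction * n.
rng = random.Random(1)   # fixed seed, like the 1 in rdd.sample(true, .0001, 1)
n, fraction = 100_000, 0.001
sample = [x for x in range(n) if rng.random() < fraction]
# len(sample) is close to 100 here, but generally not exactly 100
```

If you need an exact number of elements, Spark also provides .takeSample(withReplacement, num, seed), which returns them as a local array rather than an RDD.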
I'm not sure about .union(), but at least in the case of .join(), as long as
you have hash partitioned the original RDDs and persisted them, calls to
.join() take advantage of already knowing which partition the keys are on,
and will not repartition rdd1.
val rdd1 = log.partitionBy(new HashPartitioner(numPartitions)).persist() // numPartitions chosen to suit your data
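A plain-Python toy model (not Spark; all names invented for illustration) of why the join avoids a shuffle: once both datasets are partitioned by the same hash partitioner, matching keys already live in the partition with the same index, so each partition pair can be joined locally with no data movement.

```python
# Toy model of joining two co-partitioned key-value datasets.
NUM_PARTITIONS = 4

def partition_by(pairs):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for k, v in pairs:
        parts[hash(k) % NUM_PARTITIONS].append((k, v))
    return parts

def local_join(p1, p2):
    # join one co-located partition pair; no cross-partition traffic
    right = {}
    for k, v in p2:
        right.setdefault(k, []).append(v)
    return [(k, (v, w)) for k, v in p1 for w in right.get(k, [])]

rdd1 = partition_by([("a", 1), ("b", 2)])
rdd2 = partition_by([("a", 10), ("c", 30)])
joined = [pair for p1, p2 in zip(rdd1, rdd2) for pair in local_join(p1, p2)]
assert joined == [("a", (1, 10))]
```

Without the shared partitioner, matching keys could sit in different partitions and the join would first have to repartition (shuffle) one or both sides.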
Hi,
In the talk A Deeper Understanding of Spark Internals, it was mentioned
that for some operators, Spark can spill to disk across keys (in 1.1:
.groupByKey(), .reduceByKey(), .sortByKey()), but that, as a limitation of
the shuffle at that time, each single key-value pair still had to fit in memory.
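A plain-Python sketch of the distinction behind that limitation (toy code, not Spark internals): groupByKey materializes every value for a key as one list, which is the key-value pair that must fit in memory, while reduceByKey folds values into a single running result per key as they arrive.

```python
# Toy contrast between grouping and reducing by key.
def group_by_key(pairs):
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)  # all values kept per key
    return groups

def reduce_by_key(pairs, f):
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v  # one value per key
    return acc

pairs = [("a", 1), ("a", 2), ("b", 5), ("a", 3)]
assert group_by_key(pairs) == {"a": [1, 2, 3], "b": [5]}
assert reduce_by_key(pairs, lambda x, y: x + y) == {"a": 6, "b": 5}
```

This is why a hot key can blow up a groupByKey even when the rest of the shuffle spills fine: its single grouped value grows with every occurrence of the key.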