map vs mapPartitions

2016-08-08 Thread rtijoriwala
Hi All, I am a newbie to spark and want to know if there is any performance difference between map vs mapPartitions if I am doing strictly a per item transformation? For e.g. reversedWords = words.map(w => w.reverse()); vs. reversedWords = words.mapPartitions(pwordsIterator => {

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
Then how performance of mapPartitions is faster than map? On Thu, Jun 25, 2015 at 6:40 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Spark creates a RecordReader and uses next() on it when you call input.next(). (See

Re: map vs mapPartitions

2015-06-25 Thread Hao Ren
It's not the number of executors that matters, but the # of the CPU cores of your cluster. Each partition will be loaded on a core for computing. e.g. A cluster of 3 nodes has 24 cores, and you divide the RDD in 24 partitions (24 tasks for narrow dependency). Then all the 24 partitions will be

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
say source is HDFS,And file is divided in 10 partitions. so what will be input contains. public IterableInteger call(IteratorString input) say I have 10 executors in job each having single partition. will it have some part of partition or complete. And if some when I call input.next() - it

Re: map vs mapPartitions

2015-06-25 Thread Daniel Darabos
Spark creates a RecordReader and uses next() on it when you call input.next(). (See https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L215) How the RecordReader works is an HDFS question, but it's safe to say there is no difference between using

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
yes, 1 partition per core and mapPartitions apply function on each partition. Question is Does complete partition loads in memory so that function can be applied to it or its an iterator and iterator.next() loads next record and if yes then how is it efficient than map which also works on 1

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
Also, I've noticed that .map() actually creates a MapPartitionsRDD under the hood. SO I think the real difference is just in the API that's being exposed. You can do a map() and not have to think about the partitions at all or you can do a .mapPartitions() and be able to do things like chunking

Fwd: map vs mapPartitions

2015-06-25 Thread Hao Ren
-- Forwarded message -- From: Hao Ren inv...@gmail.com Date: Thu, Jun 25, 2015 at 7:03 PM Subject: Re: map vs mapPartitions To: Shushant Arora shushantaror...@gmail.com In fact, map and mapPartitions produce RDD of the same type: MapPartitionsRDD. Check RDD api source code below

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
I don't know exactly what's going on under the hood but I would not assume that just because a whole partition is not being pulled into memory @ one time that that means each record is being pulled at 1 time. That's the beauty of exposing Iterators Iterables in an API rather than collections-

map vs mapPartitions

2015-06-25 Thread Shushant Arora
Does mapPartitions keep complete partitions in memory of executor as iterable. JavaRDDString rdd = jsc.textFile(path); JavaRDDInteger output = rdd.mapPartitions(new FlatMapFunctionIteratorString, Integer() { public IterableInteger call(IteratorString input) throws Exception { ListInteger output

Re: map vs mapPartitions

2015-06-25 Thread Sean Owen
No, or at least, it depends on how the source of the partitions was implemented. On Thu, Jun 25, 2015 at 12:16 PM, Shushant Arora shushantaror...@gmail.com wrote: Does mapPartitions keep complete partitions in memory of executor as iterable. JavaRDDString rdd = jsc.textFile(path);