If you look at the source code you'll see that collectAsMap is merely a convenience 
function on PairRDDs; the only interesting detail is that it uses a mutable 
HashMap to optimize building maps with many keys. That said, .collect() 
is called either way, so the data still ends up on the driver.

https://github.com/apache/spark/blob/f85aa06464a10f5d1563302fd76465dded475a12/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L741-L753
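For reference, the part that runs after the collect() can be sketched in plain Scala. This is an illustrative stand-in, not the Spark source itself: a local Array plays the role of the collected RDD, and the names are made up for the example.

```scala
import scala.collection.mutable

object CollectAsMapSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the Array[(K, V)] that rdd.collect() would return.
    val data: Array[(String, Int)] = Array("a" -> 1, "b" -> 2, "a" -> 3)

    // What rdd.collect().toMap amounts to: build an immutable Map.
    val viaToMap: Map[String, Int] = data.toMap

    // What collectAsMap() does after its internal collect(): a mutable
    // HashMap, pre-sized with sizeHint to avoid rehashing on many keys.
    val map = new mutable.HashMap[String, Int]
    map.sizeHint(data.length)
    data.foreach { case (k, v) => map.put(k, v) }

    // In both cases a duplicate key keeps the last value seen.
    println(viaToMap("a"))
    println(map("a"))
  }
}
```

Either way the whole data set is materialized on the driver first, so neither variant helps with a key/value RDD that is too large to collect.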


-adrian




On 10/20/15, 12:35 PM, "kali.tumm...@gmail.com" <kali.tumm...@gmail.com> wrote:

>Hi All, 
>
>Is there any performance impact when I use collectAsMap on my RDD instead of
>rdd.collect().toMap ?
>
>I have a key/value RDD and I want to convert it to a HashMap. As far as I know,
>collect() is not efficient on large data sets since it runs on the driver. Can I
>use collectAsMap instead, and is there any performance impact?
>
>Original:-
> val QuoteHashMap = QuoteRDD.collect().toMap
> val QuoteRDDData = QuoteHashMap.values.toSeq
> val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
> QuoteRDDSet.saveAsTextFile(Quotepath)
>
>Change:-
> val QuoteHashMap = QuoteRDD.collectAsMap()
> val QuoteRDDData = QuoteHashMap.values.toSeq
> val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(", "").replace(")", "")))
> QuoteRDDSet.saveAsTextFile(Quotepath)
>
>
>
>Thanks
>Sri 
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/difference-between-rdd-collect-toMap-to-rdd-collectAsMap-tp25139.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
