If you look at the source code you'll see that collectAsMap is merely a convenience method on pair RDDs. The only interesting detail is that it uses a mutable HashMap to build the result efficiently when there are many keys. That said, it calls .collect() internally anyway, so it still pulls the entire RDD to the driver, just like rdd.collect().toMap.
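For reference, the method body boils down to roughly the following. This is a sketch paraphrased from the linked source, not the exact Spark code: `collectAsMapSketch` and its `collected` parameter are illustrative names, and `collected` stands in for the Array[(K, V)] that rdd.collect() has already brought to the driver.

```scala
import scala.collection.mutable

// Sketch of what PairRDDFunctions.collectAsMap does after collect():
// build a HashMap from the collected pairs, pre-sizing it to avoid
// repeated rehashing when there are many keys.
def collectAsMapSketch[K, V](collected: Array[(K, V)]): Map[K, V] = {
  val map = new mutable.HashMap[K, V]
  map.sizeHint(collected.length) // the optimization mentioned above
  collected.foreach { case (k, v) => map.put(k, v) }
  map.toMap // returned as an immutable Map here for simplicity
}

val result = collectAsMapSketch(Array(1 -> "a", 1 -> "b", 2 -> "c"))
// result == Map(1 -> "b", 2 -> "c")
```

Note that, exactly like collect().toMap, only one value per key survives (later pairs overwrite earlier ones), so if your RDD has duplicate keys neither variant gives you a multimap.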
https://github.com/apache/spark/blob/f85aa06464a10f5d1563302fd76465dded475a12/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L741-L753

-adrian

On 10/20/15, 12:35 PM, "kali.tumm...@gmail.com" <kali.tumm...@gmail.com> wrote:

>Hi All,
>
>Is there any performance impact when I use collectAsMap on my RDD instead of
>rdd.collect().toMap?
>
>I have a key-value RDD and I want to convert it to a HashMap. As far as I know,
>collect() is not efficient on large data sets, since it runs on the driver. Can I
>use collectAsMap instead? Is there any performance impact?
>
>Original:
>    val QuoteHashMap = QuoteRDD.collect().toMap
>    val QuoteRDDData = QuoteHashMap.values.toSeq
>    val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x =>
>      x.toString.replace("(", "").replace(")", "")))
>    QuoteRDDSet.saveAsTextFile(Quotepath)
>
>Change:
>    val QuoteHashMap = QuoteRDD.collectAsMap()
>    val QuoteRDDData = QuoteHashMap.values.toSeq
>    val QuoteRDDSet = sc.parallelize(QuoteRDDData.map(x =>
>      x.toString.replace("(", "").replace(")", "")))
>    QuoteRDDSet.saveAsTextFile(Quotepath)
>
>Thanks
>Sri
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/difference-between-rdd-collect-toMap-to-rdd-collectAsMap-tp25139.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.