Two simple suggestions:

1. No need to call zipWithIndex twice; reuse the earlier RDD dt.
2. Replace zipWithIndex with zipWithUniqueId, which does not trigger a Spark job.
Below is your code with the above changes:

    var dataRDD = sc.textFile("/test.csv").map(_.split(","))
    val dt = dataRDD.zipWithUniqueId.map(_.swap)
    val newCol1 = dt.map { case (i, x) => (i, x(1) + x(18)) }
    val newCol2 = newCol1.join(dt).map(x => function(.........))

Hope this helps.
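One note on the trade-off, since the two methods number elements differently: zipWithIndex assigns consecutive indices 0..n-1 but has to run a job first to count the elements in each partition, while zipWithUniqueId gives the k-th item of partition p the id k * numPartitions + p, so the ids are unique but not necessarily consecutive. A minimal sketch (the sample RDD here is just for illustration, assuming your SparkContext sc):

    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

    // Triggers a Spark job to compute partition sizes; ids are consecutive.
    rdd.zipWithIndex.collect()    // Array((a,0), (b,1), (c,2), (d,3))

    // No job triggered; ids are unique but may have gaps across partitions.
    rdd.zipWithUniqueId.collect() // Array((a,0), (b,2), (c,1), (d,3))

This does not affect the join above, because both newCol1 and dt are keyed by the same ids taken from dt.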