Hi Oleg/Andrew,

Thanks very much for the prompt response. We expect thousands of lat/lon pairs for each IP address, and that is my concern with the Cartesian product approach. Currently, for a small sample of this data (5000 rows), I am grouping by IP address and then computing the distance between lat/lon coordinates using array manipulation techniques. I understand that this approach will not hold up as the data volume grows. My code is as follows:

    import org.apache.spark.rdd.RDD

    // Read the CSV and split each line into its comma-separated fields.
    val dataset: RDD[String] = sc.textFile("x.csv")
    val data = dataset.map(l => l.split(","))
    // Key each (lat, lon) pair by IP address (column 3), then group by IP.
    val grpData = data.map(r => (r(3), (r(1).toDouble, r(2).toDouble))).groupByKey()

Now I have the data grouped by IP address. After a collect() it has the type Array[(String, Iterable[(Double, Double)])], e.g.

    Array((ip1, ArrayBuffer((lat1,lon1), (lat2,lon2), (lat3,lon3))))

Next I have to find the distance between (lat1,lon1) and (lat2,lon2), then between (lat1,lon1) and (lat3,lon3), and so on for all combinations. This is where I get stuck. Please guide me on this.

Thanks again.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7016.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
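
For reference, the per-IP step described above (the distance between every pair of points belonging to one IP) could be sketched roughly as follows. This is plain Scala with no Spark dependency. The names PairwiseDistances, haversineKm and pairwiseDistances are hypothetical, and the haversine great-circle formula with a mean Earth radius of 6371 km is an assumption, since the post does not say which distance measure is wanted.

```scala
object PairwiseDistances {
  // Assumed mean Earth radius in kilometres, used by the haversine formula.
  val EarthRadiusKm = 6371.0

  // Great-circle distance in km between two (lat, lon) points given in degrees.
  def haversineKm(a: (Double, Double), b: (Double, Double)): Double = {
    val (lat1, lon1) = a
    val (lat2, lon2) = b
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusKm * math.asin(math.sqrt(h))
  }

  // Every unordered pair (i < j) of one IP's points, with its distance.
  // Index-based so that repeated coordinates are still paired up.
  def pairwiseDistances(
      points: Iterable[(Double, Double)]
  ): Seq[((Double, Double), (Double, Double), Double)] = {
    val pts = points.toIndexedSeq
    for {
      i <- pts.indices
      j <- (i + 1) until pts.size
    } yield (pts(i), pts(j), haversineKm(pts(i), pts(j)))
  }
}
```

On the grouped RDD this could presumably be applied as grpData.mapValues(PairwiseDistances.pairwiseDistances). Note that n points yield n(n-1)/2 pairs, so a few thousand points for one IP already means millions of distances computed inside a single task, and groupByKey needs all values for a key to fit in memory on one executor.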