Lakshmi, this is orthogonal to your question, but I'm sharing it in case it's useful. It sounds like you're trying to determine the home location of a user, or something similar.
If that's the problem statement, the data pattern may suggest a far more computationally efficient approach. For example, first map all (lat, long) pairs into geocells of a desired resolution (e.g., 10 m or 100 m), then count occurrences of geocells instead. There are simple libraries that map any (lat, long) pair to a geocell (string) ID very efficiently; a rough sketch follows below the quoted message.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen


On Wed, Jun 4, 2014 at 3:49 AM, lmk <lakshmi.muralikrish...@gmail.com> wrote:

> Hi,
> I am a new Spark user. Please let me know how to handle the following
> scenario:
>
> I have a data set with the following fields:
> 1. DeviceId
> 2. Latitude
> 3. Longitude
> 4. IP address
> 5. Datetime
> 6. Mobile application name
>
> With the above data, I would like to perform the following steps:
> 1. Collect all (lat, lon) pairs for each IP address:
>    (ip1, (lat1, lon1), (lat2, lon2))
>    (ip2, (lat3, lon3), (lat4, lon4))
> 2. For each IP:
>    1. Find the distance between each coordinate pair and all the other
>       pairs under the same IP.
>    2. Select those coordinates whose distances fall under a specific
>       threshold (say 100 m).
>    3. Find the coordinate pair with the maximum occurrences.
>
> In this case, how can I iterate and compare each coordinate pair with all
> the other pairs?
> Can this be done in a distributed manner, given that this data set will
> have a few million records?
> Can we do this with map/reduce commands?
>
> Thanks.
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel-tp6905.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
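P.S. For concreteness, here is a minimal sketch of the geocell idea in Spark's RDD API (Spark 1.x style, matching the era of this thread). It assumes the ch.hsr "geohash" Java library for the cell IDs; the `Ping` record type and `mostFrequentCellPerIp` name are hypothetical, just for illustration. A 7-character geohash cell is roughly 150 m on a side, so tune the precision to your distance threshold.

    import org.apache.spark.SparkContext._   // Spark 1.x pair-RDD implicits
    import org.apache.spark.rdd.RDD
    import ch.hsr.geohash.GeoHash

    // Hypothetical record type for one row of the data set.
    case class Ping(deviceId: String, lat: Double, lon: Double, ip: String)

    // For each IP, find the geocell with the most observations.
    def mostFrequentCellPerIp(pings: RDD[Ping]): RDD[(String, (String, Long))] =
      pings
        // Map each (lat, lon) to a compact geocell string ID.
        // 7 geohash characters ~ 150 m resolution; adjust as needed.
        .map(p => ((p.ip, GeoHash.geoHashStringWithCharacterPrecision(p.lat, p.lon, 7)), 1L))
        // Count occurrences of each (ip, cell) pair in one distributed pass.
        .reduceByKey(_ + _)
        // Keep, per IP, the cell with the highest count -- the likely "home" cell.
        .map { case ((ip, cell), n) => (ip, (cell, n)) }
        .reduceByKey((a, b) => if (a._2 >= b._2) a else b)

Note that this touches each record once and aggregates with reduceByKey, so there is no all-pairs distance comparison at all: points within the same cell are grouped for free by the cell ID, which is what makes it so much cheaper than the O(n^2) pairwise approach on millions of records.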