When you group by IP address in step 1 to get this:

(ip1,(lat1,lon1),(lat2,lon2))
(ip2,(lat3,lon3),(lat4,lon4))

How many lat/lon locations do you expect for each IP address? Both the average and the maximum are of interest.

Andrew

On Wed, Jun 4, 2014 at 5:29 AM, Oleg Proudnikov <oleg.proudni...@gmail.com> wrote:

> It is possible if you use a cartesian product to produce all possible
> pairs for each IP address and 2 stages of map-reduce:
> - first by pairs of points to find the total of each pair and
> - second by IP address to find the pair for each IP address with the
>   maximum count.
>
> Oleg
>
> On 4 June 2014 11:49, lmk <lakshmi.muralikrish...@gmail.com> wrote:
>
>> Hi,
>> I am a new Spark user. Please let me know how to handle the following
>> scenario:
>>
>> I have a data set with the following fields:
>> 1. DeviceId
>> 2. latitude
>> 3. longitude
>> 4. ip address
>> 5. Datetime
>> 6. Mobile application name
>>
>> With the above data, I would like to perform the following steps:
>> 1. Collect all lat and lon for each ipaddress
>>    (ip1,(lat1,lon1),(lat2,lon2))
>>    (ip2,(lat3,lon3),(lat4,lon4))
>> 2. For each IP,
>>    1. Find the distance between each lat and lon coordinate pair and all
>>       the other pairs under the same IP
>>    2. Select those coordinates whose distances fall under a specific
>>       threshold (say 100m)
>>    3. Find the coordinate pair with the maximum occurrences
>>
>> In this case, how can I iterate and compare each coordinate pair with all
>> the other pairs?
>> Can this be done in a distributed manner, as this data set is going to have
>> a few million records?
>> Can we do this in map/reduce commands?
>>
>> Thanks.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel-tp6905.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> --
> Kind regards,
>
> Oleg
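
For illustration, the two-stage approach Oleg describes in the quoted message can be sketched locally in plain Python. This is not Spark code and not anyone's actual implementation; the function names (haversine_m, best_pair_per_ip), the (ip, lat, lon) record layout, and the use of the haversine formula for distance are all assumptions made for the sketch. On an RDD the two counting stages would presumably become keyed aggregations (for example with reduceByKey), but that mapping is left out here.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (approximation)


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def best_pair_per_ip(records, threshold_m=100.0):
    """records: iterable of (ip, lat, lon) tuples (hypothetical layout)."""
    # Step 1: collect all (lat, lon) observations for each IP address.
    points_by_ip = defaultdict(list)
    for ip, lat, lon in records:
        points_by_ip[ip].append((lat, lon))

    # Stage 1: key by (ip, pair of points) and count every pair of
    # observations that lies within the distance threshold.
    pair_counts = Counter()
    for ip, points in points_by_ip.items():
        for p, q in combinations(points, 2):
            if haversine_m(p, q) <= threshold_m:
                # Sort so that (p, q) and (q, p) share one key.
                pair_counts[(ip, tuple(sorted((p, q))))] += 1

    # Stage 2: key by IP and keep the pair with the maximum count.
    best = {}
    for (ip, pair), count in pair_counts.items():
        if ip not in best or count > best[ip][1]:
            best[ip] = (pair, count)
    return best
```

Calling best_pair_per_ip(records) returns a dict from IP address to (pair, count); an IP with no pair under the threshold is simply absent. An IP with n observations produces n(n-1)/2 candidate pairs, which is why the maximum number of locations per IP matters for whether this scales.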