Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng,
Sorry again.

In this method, I see that the values for
  a <- positions.iterator
  b <- positions.iterator
always remain the same. When I tried b <- positions.iterator.next instead, it
throws an error: value filter is not a member of (Double, Double)

Is there something I am missing out here?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7033.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread Christopher Nguyen
Lakshmi, this is orthogonal to your question, but in case it's useful.

It sounds like you're trying to determine the home location of a user, or
something similar.

If that's the problem statement, the data pattern may suggest a far more
computationally efficient approach. For example, first map all (lat,long)
pairs into geocells of a desired resolution (e.g., 10m or 100m), then count
occurrences of geocells instead. There are simple libraries to map any
(lat,long) pairs into a geocell (string) ID very efficiently.
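As a rough illustration of that idea (not a real geocell library -- `geocell` and `homeCell` here are hypothetical names, and truncating to 3 decimal places is only an approximate ~100 m grid), the counting could look like this. Plain Scala collections are used so the sketch is self-contained; the same `map`/`reduceByKey`-style shape carries over to RDDs:

```scala
// Crude stand-in for a geocell library: truncate coordinates to a fixed
// precision and join them into a string cell ID. ~3 decimal places of a
// degree is on the order of 100 m.
def geocell(lat: Double, lon: Double, precision: Int = 3): String = {
  val factor = math.pow(10, precision)
  val cellLat = math.floor(lat * factor) / factor
  val cellLon = math.floor(lon * factor) / factor
  s"$cellLat:$cellLon"
}

// Count geocell occurrences per IP and keep the most frequent cell.
def homeCell(points: Seq[(String, (Double, Double))]): Map[String, (String, Int)] = {
  val counts = points
    .map { case (ip, (lat, lon)) => ((ip, geocell(lat, lon)), 1) }
    .groupBy(_._1)
    .map { case (key, hits) => (key, hits.size) }
  counts.groupBy { case ((ip, _), _) => ip }.map { case (ip, cells) =>
    val ((_, cell), count) = cells.maxBy(_._2)
    (ip, (cell, count))
  }
}
```

This replaces the quadratic all-pairs comparison with a linear count, which is the point of the geocell approach.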

--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Wed, Jun 4, 2014 at 3:49 AM, lmk lakshmi.muralikrish...@gmail.com
wrote:

 Hi,
 I am a new Spark user. Please let me know how to handle the following
 scenario:

 I have a data set with the following fields:
 1. DeviceId
 2. latitude
 3. longitude
 4. ip address
 5. Datetime
 6. Mobile application name

 With the above data, I would like to perform the following steps:
 1. Collect all lat and lon for each ipaddress
 (ip1,(lat1,lon1),(lat2,lon2))
 (ip2,(lat3,lon3),(lat4,lon4))
 2. For each IP,
 1. Find the distance between each lat and lon coordinate pair and all
 the other pairs under the same IP
 2. Select those coordinates whose distances fall under a specific
 threshold (say 100m)
 3. Find the coordinate pair with the maximum occurrences

 In this case, how can I iterate and compare each coordinate pair with all
 the other pairs?
 Can this be done in a distributed manner, as this data set is going to have
 a few million records?
 Can we do this in map/reduce commands?

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel-tp6905.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
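The 100 m threshold in step 2.2 of the question implies a great-circle distance between decimal-degree coordinates. A minimal haversine sketch (the function name and the mean-radius constant are this example's own choices, not something from the thread):

```scala
// Haversine great-circle distance in metres between two (lat, lon) points
// given in decimal degrees.
def haversineMetres(a: (Double, Double), b: (Double, Double)): Double = {
  val R = 6371000.0 // mean Earth radius in metres
  val dLat = math.toRadians(b._1 - a._1)
  val dLon = math.toRadians(b._2 - a._2)
  val h = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(a._1)) * math.cos(math.toRadians(b._1)) *
      math.pow(math.sin(dLon / 2), 2)
  2 * R * math.asin(math.sqrt(h))
}
```

Any of the pairwise approaches discussed below would plug a function like this in as the distance measure.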



Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng,
Thanks a lot. That solved my problem.

Thanks again for the quick response and solution.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7047.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Oleg Proudnikov
 It is possible if you use a cartesian product to produce all possible
pairs for each IP address, followed by 2 stages of map-reduce:
 - first, by pairs of points, to find the count of each pair, and
 - second, by IP address, to find the pair with the maximum count for
each IP address.
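A sketch of those two stages, assuming the points are already grouped per IP. Plain Scala collections stand in for RDDs here so the example is runnable on its own; on Spark the same shape would use flatMap, reduceByKey on (ip, pair) keys, then reduceByKey on ip (the function name `bestPair` is this example's, not Oleg's):

```scala
// Two map-reduce stages over all unordered pairs of points per IP.
def bestPair(
    grouped: Seq[(String, Seq[(Double, Double)])]
): Map[String, (((Double, Double), (Double, Double)), Int)] = {
  // "cartesian" step: emit every unordered pair of points under one IP
  val pairs = grouped.flatMap { case (ip, pts) =>
    for (i <- pts.indices; j <- i + 1 until pts.size)
      yield ((ip, (pts(i), pts(j))), 1)
  }
  // stage 1: reduce by (ip, pair) to count each pair of points
  val counts = pairs.groupBy(_._1).map { case (key, hits) => (key, hits.size) }
  // stage 2: reduce by ip to keep the pair with the maximum count
  counts.groupBy { case ((ip, _), _) => ip }.map { case (ip, perIp) =>
    val ((_, pair), count) = perIp.maxBy(_._2)
    (ip, (pair, count))
  }
}
```

Note the pair generation is quadratic in the number of points per IP, which is the scalability concern raised later in the thread.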

Oleg







-- 
Kind regards,

Oleg


Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Andrew Ash
When you group by IP address in step 1 to this:

(ip1,(lat1,lon1),(lat2,lon2))
(ip2,(lat3,lon3),(lat4,lat5))

How many lat/lon locations do you expect for each IP address? The avg and max
are both interesting.

Andrew





Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread lmk
Hi Oleg/Andrew,
Thanks much for the prompt response.

We expect thousands of lat/lon pairs for each IP address. And that is my
concern with the Cartesian product approach. 
Currently for a small sample of this data (5000 rows) I am grouping by IP
address and then computing the distance between lat/lon coordinates using
array manipulation techniques. 
But I understand this approach is not right when the data volume goes up.
My code is as follows:

val dataset: RDD[String] = sc.textFile("x.csv")
val data = dataset.map(l => l.split(","))
val grpData = data.map(r =>
(r(3), (r(1).toDouble, r(2).toDouble))).groupByKey()

Now, I have the data grouped by ipaddress as Array[(String,
Iterable[(Double, Double)])]
ex..
 Array((ip1,ArrayBuffer((lat1,lon1), (lat2,lon2), (lat3,lon3))))

Now I have to find the distance between (lat1,lon1) and (lat2,lon2) and then
between (lat1,lon1) and (lat3,lon3) and so on for all combinations.

This is where I get stuck. Please guide me on this.
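One way past that point is to turn each group's Iterable into a List and use `combinations(2)` to enumerate the unordered pairs. A self-contained sketch with plain collections (on the RDD this would sit inside a mapValues; `closePairs` and the injected `dist` function are this example's names, with 100.0 standing in for the thread's 100 m threshold):

```scala
// For each IP, keep the coordinate pairs whose distance is under the
// threshold. dist is any distance function, e.g. a haversine in metres.
def closePairs(
    grpData: Seq[(String, Iterable[(Double, Double)])],
    dist: ((Double, Double), (Double, Double)) => Double,
    thresholdMetres: Double = 100.0
): Seq[(String, List[((Double, Double), (Double, Double))])] =
  grpData.map { case (ip, coords) =>
    // combinations(2) enumerates each unordered pair exactly once
    val kept = coords.toList.combinations(2).collect {
      case List(a, b) if dist(a, b) < thresholdMetres => (a, b)
    }.toList
    (ip, kept)
  }
```

As with the cartesian-product suggestion earlier in the thread, this is quadratic per IP, so it only stays practical while each IP's group is modest in size.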

Thanks Again.
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7016.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.