Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng,
Sorry again.

In this method, I see that the values for
  a <- positions.iterator
  b <- positions.iterator
always remain the same. When I tried b <- positions.iterator.next instead, it
throws an error: value filter is not a member of (Double, Double)

Is there something I am missing out here?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7033.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread Christopher Nguyen
Lakshmi, this is orthogonal to your question, but in case it's useful.

It sounds like you're trying to determine the home location of a user, or
something similar.

If that's the problem statement, the data pattern may suggest a far more
computationally efficient approach. For example, first map all (lat,long)
pairs into geocells of a desired resolution (e.g., 10m or 100m), then count
occurrences of geocells instead. There are simple libraries to map any
(lat,long) pairs into a geocell (string) ID very efficiently.
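As a rough illustration of that idea (not a real geocell library -- `geocell` and `homeCell` here are hypothetical names, and truncating to 3 decimal places is only an approximate ~100 m grid), the counting could look like this. Plain Scala collections are used so the sketch is self-contained; the same `map`/`reduceByKey`-style shape carries over to RDDs:

```scala
// Crude stand-in for a geocell library: truncate coordinates to a fixed
// precision and join them into a string cell ID. ~3 decimal places of a
// degree is on the order of 100 m.
def geocell(lat: Double, lon: Double, precision: Int = 3): String = {
  val factor = math.pow(10, precision)
  val cellLat = math.floor(lat * factor) / factor
  val cellLon = math.floor(lon * factor) / factor
  s"$cellLat:$cellLon"
}

// Count geocell occurrences per IP and keep the most frequent cell.
def homeCell(points: Seq[(String, (Double, Double))]): Map[String, (String, Int)] = {
  val counts = points
    .map { case (ip, (lat, lon)) => ((ip, geocell(lat, lon)), 1) }
    .groupBy(_._1)
    .map { case (key, hits) => (key, hits.size) }
  counts.groupBy { case ((ip, _), _) => ip }.map { case (ip, cells) =>
    val ((_, cell), count) = cells.maxBy(_._2)
    (ip, (cell, count))
  }
}
```

This replaces the quadratic all-pairs comparison with a linear count, which is the point of the geocell approach.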

--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Wed, Jun 4, 2014 at 3:49 AM, lmk lakshmi.muralikrish...@gmail.com
wrote:

 Hi,
 I am a new Spark user. Please let me know how to handle the following
 scenario:

 I have a data set with the following fields:
 1. DeviceId
 2. latitude
 3. longitude
 4. ip address
 5. Datetime
 6. Mobile application name

 With the above data, I would like to perform the following steps:
 1. Collect all lat and lon for each ipaddress
 (ip1,(lat1,lon1),(lat2,lon2))
 (ip2,(lat3,lon3),(lat4,lon4))
 2. For each IP,
 1. Find the distance between each lat and lon coordinate pair and all
 the other pairs under the same IP
 2. Select those coordinates whose distances fall under a specific
 threshold (say 100m)
 3. Find the coordinate pair with the maximum occurrences

 In this case, how can I iterate and compare each coordinate pair with all
 the other pairs?
 Can this be done in a distributed manner, as this data set is going to have
 a few million records?
 Can we do this in map/reduce commands?

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel-tp6905.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
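The 100 m threshold in step 2.2 of the question implies a great-circle distance between decimal-degree coordinates. A minimal haversine sketch (the function name and the mean-radius constant are this example's own choices, not something from the thread):

```scala
// Haversine great-circle distance in metres between two (lat, lon) points
// given in decimal degrees.
def haversineMetres(a: (Double, Double), b: (Double, Double)): Double = {
  val R = 6371000.0 // mean Earth radius in metres
  val dLat = math.toRadians(b._1 - a._1)
  val dLon = math.toRadians(b._2 - a._2)
  val h = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(a._1)) * math.cos(math.toRadians(b._1)) *
      math.pow(math.sin(dLon / 2), 2)
  2 * R * math.asin(math.sqrt(h))
}
```

Any of the pairwise approaches discussed below would plug a function like this in as the distance measure.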



Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng,
Thanks a lot. That solved my problem.

Thanks again for the quick response and solution.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7047.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Oleg Proudnikov
 It is possible if you use a cartesian product to produce all possible
pairs for each IP address, followed by 2 stages of map-reduce:
 - first, by pairs of points, to find the count of each pair, and
 - second, by IP address, to find the pair with the maximum count for
each IP address.
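A sketch of those two stages, assuming the points are already grouped per IP. Plain Scala collections stand in for RDDs here so the example is runnable on its own; on Spark the same shape would use flatMap, reduceByKey on (ip, pair) keys, then reduceByKey on ip (the function name `bestPair` is this example's, not Oleg's):

```scala
// Two map-reduce stages over all unordered pairs of points per IP.
def bestPair(
    grouped: Seq[(String, Seq[(Double, Double)])]
): Map[String, (((Double, Double), (Double, Double)), Int)] = {
  // "cartesian" step: emit every unordered pair of points under one IP
  val pairs = grouped.flatMap { case (ip, pts) =>
    for (i <- pts.indices; j <- i + 1 until pts.size)
      yield ((ip, (pts(i), pts(j))), 1)
  }
  // stage 1: reduce by (ip, pair) to count each pair of points
  val counts = pairs.groupBy(_._1).map { case (key, hits) => (key, hits.size) }
  // stage 2: reduce by ip to keep the pair with the maximum count
  counts.groupBy { case ((ip, _), _) => ip }.map { case (ip, perIp) =>
    val ((_, pair), count) = perIp.maxBy(_._2)
    (ip, (pair, count))
  }
}
```

Note the pair generation is quadratic in the number of points per IP, which is the scalability concern raised later in the thread.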

Oleg







-- 
Kind regards,

Oleg


Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Andrew Ash
When you group by IP address in step 1 to this:

(ip1,(lat1,lon1),(lat2,lon2))
(ip2,(lat3,lon3),(lat4,lat5))

How many lat/lon locations do you expect for each IP address? The avg and max
are both interesting.

Andrew





Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread lmk
Hi Oleg/Andrew,
Thanks much for the prompt response.

We expect thousands of lat/lon pairs for each IP address. And that is my
concern with the Cartesian product approach. 
Currently for a small sample of this data (5000 rows) I am grouping by IP
address and then computing the distance between lat/lon coordinates using
array manipulation techniques. 
But I understand this approach is not right when the data volume goes up.
My code is as follows:

val dataset: RDD[String] = sc.textFile("x.csv")
val data = dataset.map(l => l.split(","))
val grpData = data.map(r =>
(r(3), (r(1).toDouble, r(2).toDouble))).groupByKey()

Now, I have the data grouped by ipaddress as Array[(String,
Iterable[(Double, Double)])]
ex..
 Array((ip1,ArrayBuffer((lat1,lon1), (lat2,lon2), (lat3,lon3))))

Now I have to find the distance between (lat1,lon1) and (lat2,lon2) and then
between (lat1,lon1) and (lat3,lon3) and so on for all combinations.

This is where I get stuck. Please guide me on this.
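One way past that point is to turn each group's Iterable into a List and use `combinations(2)` to enumerate the unordered pairs. A self-contained sketch with plain collections (on the RDD this would sit inside a mapValues; `closePairs` and the injected `dist` function are this example's names, with 100.0 standing in for the thread's 100 m threshold):

```scala
// For each IP, keep the coordinate pairs whose distance is under the
// threshold. dist is any distance function, e.g. a haversine in metres.
def closePairs(
    grpData: Seq[(String, Iterable[(Double, Double)])],
    dist: ((Double, Double), (Double, Double)) => Double,
    thresholdMetres: Double = 100.0
): Seq[(String, List[((Double, Double), (Double, Double))])] =
  grpData.map { case (ip, coords) =>
    // combinations(2) enumerates each unordered pair exactly once
    val kept = coords.toList.combinations(2).collect {
      case List(a, b) if dist(a, b) < thresholdMetres => (a, b)
    }.toList
    (ip, kept)
  }
```

As with the cartesian-product suggestion earlier in the thread, this is quadratic per IP, so it only stays practical while each IP's group is modest in size.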

Thanks Again.
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7016.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.