Re: Joining data using Latitude, Longitude

2015-03-13 Thread Ankur Srivastava
Hi Everyone,

Thank you for your suggestions; based on them I was able to move forward.

I am now generating a Geohash for every lat/lon pair in our reference
data and then building a trie of all the Geohashes.

I then broadcast that trie and use it to find the nearest Geohash for the
data I want to join with.
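
For readers following along, here is a minimal sketch of the kind of
structure described above. It is not the code used in this thread: the
class name GeohashTrie and the sample hashes are made up for illustration,
and "nearest" is approximated by "shares the longest prefix", which can
miss a closer point that sits in a neighbouring cell.

```python
class GeohashTrie:
    """Prefix trie over geohash strings, stored as nested dicts."""

    _END = "$"  # terminator key; '$' is not part of the geohash alphabet

    def __init__(self):
        self.root = {}

    def insert(self, geohash):
        node = self.root
        for ch in geohash:
            node = node.setdefault(ch, {})
        node[self._END] = geohash  # remember the full hash at the leaf

    def longest_prefix_matches(self, query):
        """Return (shared_prefix_length, stored hashes under that prefix)."""
        node, depth = self.root, 0
        for ch in query:
            if ch not in node:
                break
            node = node[ch]
            depth += 1
        return depth, sorted(self._collect(node))

    def _collect(self, node):
        # Iterative walk so deep tries do not hit the recursion limit.
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key == self._END:
                    found.append(child)
                else:
                    stack.append(child)
        return found


# Hypothetical reference hashes, for illustration only.
trie = GeohashTrie()
for h in ["9q8yyk", "9q8yym", "9q9p1d", "dr5reg"]:
    trie.insert(h)

depth, candidates = trie.longest_prefix_matches("9q8yyz")
print(depth, candidates)
```

A nested-dict trie like this carries a lot of per-node overhead in memory,
which is worth keeping in mind when the whole structure is broadcast.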

It works well on a small data set, but with big data sets I am getting OOM
errors. I have set spark.storage.memoryFraction=0.4, but I still see many
partitions being recomputed.

Which settings should I look at when broadcasting big objects of 3-4 GB?

Thanks
Ankur

Re: Joining data using Latitude, Longitude

2015-03-12 Thread Andrew Musselman
Ted Dunning and Ellen Friedman's Time Series Databases has a section on
this with some approaches to geo-encoding:

https://www.mapr.com/time-series-databases-new-ways-store-and-access-data
http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf

Re: Joining data using Latitude, Longitude

2015-03-11 Thread Ankur Srivastava
Thank you everyone!! I have started implementing the join using the geohash,
with the first 4 characters of the hash as the key.

Can I assign a confidence factor, in terms of distance, based on the number
of matching characters in the hash?
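
One way to reason about this question, as a rough sketch rather than a
definitive answer: two hashes that share their first n characters lie in
the same n-character cell, so the cell size gives an upper bound on their
distance. The reverse does not hold, since two points on either side of a
cell border can be very close and still share few or no characters. The
function names, the spherical Earth radius, and the flat-cell diagonal
below are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius of a spherical Earth model (assumption)


def shared_prefix_len(a, b):
    """Number of leading characters two geohashes have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cell_size_degrees(precision):
    """(lat_span, lon_span) in degrees of a geohash cell of this length."""
    bits = 5 * precision        # each base-32 character carries 5 bits
    lon_bits = (bits + 1) // 2  # bits alternate, starting with longitude
    lat_bits = bits // 2
    return 180.0 / 2 ** lat_bits, 360.0 / 2 ** lon_bits


def max_distance_km(precision, at_lat=0.0):
    """Approximate cell diagonal: a bound for points sharing this prefix."""
    lat_span, lon_span = cell_size_degrees(precision)
    km_per_deg = 2 * math.pi * EARTH_RADIUS_KM / 360.0
    height = lat_span * km_per_deg
    width = lon_span * km_per_deg * math.cos(math.radians(at_lat))
    return math.hypot(width, height)


for n in range(1, 9):
    print(n, round(max_distance_km(n), 3))
```

With a 4-character key the cell is 360/1024 degrees wide and 180/1024
degrees tall, so a match on 4 characters only says the two points are in
the same cell of roughly that size, not how close they are inside it.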

I will also look at the other options listed here.

Thanks
Ankur

Re: Joining data using Latitude, Longitude

2015-03-11 Thread Manas Kar
There are a few techniques currently available.
GeoMesa, which uses GeoHash, can also prove useful
(https://github.com/locationtech/geomesa).

Another potential candidate is
https://github.com/Esri/gis-tools-for-hadoop, especially
https://github.com/Esri/geometry-api-java for lower-level customization.

If you want to ask questions like "what is near me", these are the basic
steps:
1) Index your geometry data using an R-tree.
2) Write your joiner logic so that it takes advantage of the index tree to
give you faster access.
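
The two steps above can be sketched roughly as follows. To keep the
example short and dependency-free it uses a uniform grid as a stand-in for
the R-tree, which is a different and simpler index; the cell size, the
function names, the sample rows, and the equirectangular distance
approximation are all assumptions for illustration. It only finds a
neighbour within the surrounding 3x3 block of cells and does not handle
the longitude wrap at +/-180.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # spherical Earth model (assumption)
CELL_DEG = 0.1            # grid cell size in degrees (tuning assumption)


def cell_of(lat, lon):
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_index(reference_points):
    """Step 1: bucket (id, lat, lon) reference rows by grid cell."""
    index = defaultdict(list)
    for ref_id, lat, lon in reference_points:
        index[cell_of(lat, lon)].append((ref_id, lat, lon))
    return index


def approx_km(lat1, lon1, lat2, lon2):
    """Equirectangular approximation; reasonable for short distances."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(x, y)


def nearest(index, lat, lon):
    """Step 2: probe only the 3x3 block of cells around the query point."""
    row, col = cell_of(lat, lon)
    best = None
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            for ref_id, rlat, rlon in index.get((row + dr, col + dc), ()):
                d = approx_km(lat, lon, rlat, rlon)
                if best is None or d < best[1]:
                    best = (ref_id, d)
    return best  # None when no reference point is in the probed cells


# Hypothetical reference rows, for illustration only.
refs = [("a", 37.77, -122.42), ("b", 37.80, -122.27), ("c", 40.71, -74.00)]
index = build_index(refs)
print(nearest(index, 37.78, -122.41))
```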

Thanks
Manas


Re: Joining data using Latitude, Longitude

2015-03-10 Thread Akhil Das
Are you using SparkSQL for the join? In that case I'm not quite sure you
have a lot of options for joining on the nearest co-ordinate. If you are
using normal Spark code (creating key-value pairs keyed on lat,lon) you can
apply some logic such as trimming the lat,lon. If you want a more precise
computation then you are better off using the haversine formula.
http://www.movable-type.co.uk/scripts/latlong.html
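
A minimal sketch of both ideas, under stated assumptions: "trimming" is
read here as rounding each coordinate to a fixed number of decimals to
form a join key, and the haversine distance assumes a spherical Earth with
a mean radius of 6371 km. The function names and the sample coordinates
are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius for the spherical model


def trimmed_key(lat, lon, decimals=2):
    """Coarse join key: nearby points often (not always) get the same key."""
    return (round(lat, decimals), round(lon, decimals))


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp against tiny floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


print(trimmed_key(37.77493, -122.41942))
print(haversine_km(37.77493, -122.41942, 37.80, -122.27))
```

The trimmed key narrows the candidates cheaply; the haversine distance can
then rank whatever candidates share a key.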


Re: Joining data using Latitude, Longitude

2015-03-10 Thread John Meehan
There are some techniques you can use if you geohash
(http://en.wikipedia.org/wiki/Geohash) the lat-lngs. They will naturally be
sorted by proximity (with some edge cases, so watch out). If you go the join
route, either by trimming the lat-lngs or geohashing them, you’re essentially
grouping nearby locations into buckets, but you have to consider the borders
of the buckets, since the nearest location may actually be in an adjacent
bucket. Here’s a paper that discusses an implementation:
http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf
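
To make the bucket-border point concrete, here is a small sketch of the
standard geohash encoding together with one possible way to collect the
bucket of a point plus its eight neighbours. The neighbour step simply
re-encodes the point shifted by one cell in each direction, which is an
assumption of this example and not taken from the paper above; function
names are illustrative.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Standard geohash: interleave longitude and latitude bisection bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, value, bit_count = [], 0, 0
    use_lon = True  # the first bit refines longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | int(bit)
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            value, bit_count = 0, 0
    return "".join(chars)


def candidate_buckets(lat, lon, precision=6):
    """The point's own bucket plus the adjacent buckets around it."""
    bits = 5 * precision
    lat_span = 180.0 / 2 ** (bits // 2)
    lon_span = 360.0 / 2 ** ((bits + 1) // 2)
    buckets = set()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nlat = lat + dy * lat_span
            if not -90.0 <= nlat <= 90.0:
                continue  # there is no cell beyond the poles
            # Wrap longitude back into [-180, 180).
            nlon = (lon + dx * lon_span + 180.0) % 360.0 - 180.0
            buckets.add(geohash_encode(nlat, nlon, precision))
    return buckets


print(geohash_encode(42.6, -5.6, precision=5))
print(sorted(candidate_buckets(42.6, -5.6, precision=5)))
```

Joining each query point against all of its candidate buckets, rather than
only its own, is one way to catch a nearest location that falls just across
a border. Points on opposite sides of the equator or the prime meridian show
the edge case clearly: they can be a few hundred metres apart and still
differ in the first character of the hash.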
