Re: Joining data using Latitude, Longitude
Hi Everyone,

Thank you for your suggestions; based on them I was able to move forward.

I am now generating a Geohash for every lat/lon pair in our reference data and building a trie of all the Geohashes. I then broadcast that trie and use it to search for the nearest Geohash for the data I want to join with.

It works well on a small data set, but with big data sets I am getting OOM errors. I have changed spark.storage.memoryFraction to 0.4, but I still see many partitions getting recalculated.

Which settings should I look at when broadcasting big objects of 3-4 GB?

Thanks
Ankur

On Wed, Mar 11, 2015 at 8:58 AM, Ankur Srivastava ankur.srivast...@gmail.com wrote:

Thank you everyone!!

I have started implementing the join using the geohash, with the first 4 characters of the hash as the key.

Can I assign a confidence factor in terms of distance based on the number of matching characters in the hash code?

I will also look at the other options listed here.

Thanks
Ankur

On Wed, Mar 11, 2015, 6:18 AM Manas Kar manasdebashis...@gmail.com wrote:

There are a few techniques currently available.

GeoMesa, which uses GeoHash, may also prove useful (https://github.com/locationtech/geomesa). Another potential candidate is https://github.com/Esri/gis-tools-for-hadoop, especially https://github.com/Esri/geometry-api-java for inner customization.

If you want to ask questions like "what is near me", these are the basic steps:

1) Index your geometry data using an R-tree.
2) Write your joiner logic so that it takes advantage of the index tree to get faster access.

Thanks
Manas

On Wed, Mar 11, 2015 at 5:55 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:

Ted Dunning and Ellen Friedman's Time Series Databases has a section on this with some approaches to geo-encoding:

https://www.mapr.com/time-series-databases-new-ways-store-and-access-data
http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf

On Tue, Mar 10, 2015 at 3:53 PM, John Meehan jnmee...@gmail.com wrote:

There are some techniques you can use if you geohash (http://en.wikipedia.org/wiki/Geohash) the lat-lngs. They will naturally be sorted by proximity (with some edge cases, so watch out).

If you go the join route, either by trimming the lat-lngs or geohashing them, you’re essentially grouping nearby locations into buckets, but you have to consider the borders of the buckets, since the nearest location may actually be in an adjacent bucket.

Here’s a paper that discusses an implementation: http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf

On Mar 9, 2015, at 11:42 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

Are you using SparkSQL for the join? In that case I'm not quite sure you have a lot of options to join on the nearest co-ordinate. If you are using normal Spark code (by creating a key-pair on lat,lon), you can apply certain logic such as trimming the lat,lon. If you want a more precise computation, you are better off using the haversine formula: http://www.movable-type.co.uk/scripts/latlong.html
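
For reference, the trie lookup described at the top of this message could look roughly like the sketch below. It is plain Python rather than the actual Spark job, and the class name GeohashTrie, its methods, and the record ids are all hypothetical. It returns the reference records that share the longest prefix with a query geohash; as noted elsewhere in the thread, that is not guaranteed to be the true nearest neighbour when the query sits near a cell border.

```python
class GeohashTrie:
    """Hypothetical character trie keyed by geohash strings."""

    def __init__(self):
        self.children = {}  # next character -> child node
        self.ids = []       # ids of reference records whose geohash ends here

    def insert(self, geohash, record_id):
        node = self
        for ch in geohash:
            node = node.children.setdefault(ch, GeohashTrie())
        node.ids.append(record_id)

    def longest_prefix_matches(self, geohash):
        """Return (matched_length, ids of all records under the deepest matching node)."""
        node, depth = self, 0
        for ch in geohash:
            nxt = node.children.get(ch)
            if nxt is None:
                break
            node, depth = nxt, depth + 1
        return depth, node._collect()

    def _collect(self):
        # Gather ids stored at this node and in every subtree below it.
        out = list(self.ids)
        for child in self.children.values():
            out.extend(child._collect())
        return out


trie = GeohashTrie()
trie.insert("u4pruy", "r1")
trie.insert("u4prxx", "r2")
trie.insert("s00000", "r3")
depth, ids = trie.longest_prefix_matches("u4pruz")
```

One node object per character is convenient but not compact, so the size of a structure like this may be worth measuring before broadcasting it.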
Re: Joining data using Latitude, Longitude
Ted Dunning and Ellen Friedman's Time Series Databases has a section on this with some approaches to geo-encoding: https://www.mapr.com/time-series-databases-new-ways-store-and-access-data http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf
Re: Joining data using Latitude, Longitude
Thank you everyone!!

I have started implementing the join using the geohash, with the first 4 characters of the hash as the key.

Can I assign a confidence factor in terms of distance based on the number of matching characters in the hash code?

I will also look at the other options listed here.

Thanks
Ankur
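
On the question of a confidence factor based on matching characters: a shared prefix does say something, but only in one direction. Two hashes that share n characters lie in the same n-character cell, so the cell size bounds how far apart they can be; two points with no shared characters can still be very close if they sit on opposite sides of a cell border. The sketch below illustrates this in plain Python. The function names and the KM_PER_DEGREE constant are assumptions made for illustration, and the distance is a flat-plane approximation taken at the equator.

```python
import math

KM_PER_DEGREE = 111.32  # assumed approximate length of one degree at the equator


def common_prefix_len(a, b):
    """Number of leading characters two geohashes share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cell_size_deg(precision):
    """(height, width) in degrees of a geohash cell with `precision` characters."""
    bits = 5 * precision
    lon_bits = (bits + 1) // 2  # longitude gets the extra bit when the total is odd
    lat_bits = bits // 2
    return 180.0 / 2 ** lat_bits, 360.0 / 2 ** lon_bits


def max_distance_km(shared_chars):
    """Rough upper bound on separation: diagonal of the shared cell,
    measured at the equator where cells are widest."""
    height, width = cell_size_deg(shared_chars)
    return math.hypot(height * KM_PER_DEGREE, width * KM_PER_DEGREE)


shared = common_prefix_len("u4pruy", "u4prxx")  # these two share "u4pr"
bound_km = max_distance_km(shared)
```

With a 4-character key, a cell is 180/1024 degrees high and 360/1024 degrees wide, so matches on that key are within a few tens of kilometres of each other, while non-matches are not necessarily far apart.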
Re: Joining data using Latitude, Longitude
There are a few techniques currently available.

GeoMesa, which uses GeoHash, may also prove useful (https://github.com/locationtech/geomesa). Another potential candidate is https://github.com/Esri/gis-tools-for-hadoop, especially https://github.com/Esri/geometry-api-java for inner customization.

If you want to ask questions like "what is near me", these are the basic steps:

1) Index your geometry data using an R-tree.
2) Write your joiner logic so that it takes advantage of the index tree to get faster access.

Thanks
Manas
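
The two steps Manas lists (index the reference geometry, then write joiner logic that uses the index) can be illustrated with a much simpler structure than an R-tree. The sketch below swaps in a uniform grid index, not an R-tree, purely to show the shape of the approach in plain Python. All names (cell_of, build_index, nearest) and the sample points are hypothetical. It only looks at the query cell and its 8 neighbours, ignores wrap-around at the antimeridian, and returns None when nothing is that close.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km; radius_km is an assumed mean Earth radius."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def cell_of(lat, lon, size_deg):
    """Grid cell (row, col) that contains the point."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def build_index(points, size_deg):
    """Step 1: bucket reference points (id, lat, lon) by grid cell."""
    index = defaultdict(list)
    for pid, lat, lon in points:
        index[cell_of(lat, lon, size_deg)].append((pid, lat, lon))
    return index


def nearest(index, lat, lon, size_deg):
    """Step 2: search the query cell and its 8 neighbours for the closest point."""
    row, col = cell_of(lat, lon, size_deg)
    best = None
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            for pid, plat, plon in index.get((row + dr, col + dc), ()):
                d = haversine_km(lat, lon, plat, plon)
                if best is None or d < best[1]:
                    best = (pid, d)
    return best


reference = [("a", 40.0, -74.0), ("b", 40.02, -74.0), ("c", 41.0, -75.0)]
index = build_index(reference, 0.05)
match = nearest(index, 40.019, -74.001, 0.05)
```

An R-tree handles uneven point density and non-point geometries better than a fixed grid; the grid is only the smallest structure that shows the idea.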
Re: Joining data using Latitude, Longitude
Are you using SparkSQL for the join? In that case I'm not quite sure you have a lot of options to join on the nearest co-ordinate. If you are using normal Spark code (by creating a key-pair on lat,lon), you can apply certain logic such as trimming the lat,lon. If you want a more precise computation, you are better off using the haversine formula: http://www.movable-type.co.uk/scripts/latlong.html
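
As a rough illustration of the two options mentioned here, the sketch below shows a join key made by trimming (rounding) the coordinates and a haversine distance function, in plain Python. The function names, the EARTH_RADIUS_KM constant, and the choice of 2 decimal places are assumptions for the example, not anything prescribed in this thread.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def trimmed_key(lat, lon, decimals=2):
    """Join key made by rounding the coordinates; nearby points often share a key."""
    return (round(lat, decimals), round(lon, decimals))


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


key = trimmed_key(37.77493, -122.41942)
one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)
```

The trimmed key is cheap to join on but only groups points coarsely; the haversine distance can then be used to rank the candidates that share a key.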
Re: Joining data using Latitude, Longitude
There are some techniques you can use if you geohash (http://en.wikipedia.org/wiki/Geohash) the lat-lngs. They will naturally be sorted by proximity (with some edge cases, so watch out).

If you go the join route, either by trimming the lat-lngs or geohashing them, you’re essentially grouping nearby locations into buckets, but you have to consider the borders of the buckets, since the nearest location may actually be in an adjacent bucket.

Here’s a paper that discusses an implementation: http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf
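
To make the bucket-border caveat concrete, here is a small plain-Python sketch of geohash encoding plus a helper that lists the 8 cells surrounding a bucket, so a join can check adjacent buckets as well. The function names are hypothetical and this is not taken from the linked paper. Neighbours are found by re-encoding points one cell width or height away from the cell centre, with longitude wrapped and rows beyond the poles skipped.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Standard geohash: interleave longitude and latitude bisection bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, use_lon = [], True  # the first bit refines longitude
    while len(bits) < precision * 5:
        rng, val = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        idx = 0
        for b in bits[i:i + 5]:
            idx = idx * 2 + b
        chars.append(BASE32[idx])
    return "".join(chars)


def geohash_bbox(gh):
    """Bounding box (lat_lo, lat_hi, lon_lo, lon_hi) of a geohash cell."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    use_lon = True
    for ch in gh:
        idx = BASE32.index(ch)
        for shift in (4, 3, 2, 1, 0):
            rng = lon_rng if use_lon else lat_rng
            mid = (rng[0] + rng[1]) / 2
            if (idx >> shift) & 1:
                rng[0] = mid
            else:
                rng[1] = mid
            use_lon = not use_lon
    return lat_rng[0], lat_rng[1], lon_rng[0], lon_rng[1]


def geohash_neighbors(gh):
    """Set of the (up to) 8 cells that touch the given cell."""
    lat_lo, lat_hi, lon_lo, lon_hi = geohash_bbox(gh)
    height, width = lat_hi - lat_lo, lon_hi - lon_lo
    lat_c, lon_c = (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2
    out = set()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            lat = lat_c + dy * height
            if not -90.0 <= lat <= 90.0:
                continue  # no cell beyond the poles
            lon = (lon_c + dx * width + 180.0) % 360.0 - 180.0  # wrap longitude
            out.add(geohash_encode(lat, lon, len(gh)))
    return out


east = geohash_encode(0.0001, 0.0001, 5)   # just east of the prime meridian
west = geohash_encode(0.0001, -0.0001, 5)  # just west of it
```

The two points above are only a few metres apart yet their hashes do not share even the first character, which is the kind of edge case to watch out for; checking geohash_neighbors(east) recovers the western bucket.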