Hi Everyone,

Thank you for your suggestions; based on them I was able to move forward.

I am now generating a Geohash for all the lat/lon pairs in our reference
data and then building a trie of all the Geohashes.

I then broadcast that trie and use it to search for the nearest Geohash
for the data I want to join with.
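
Roughly, the approach looks like this (a simplified sketch; PrefixTrie
below is just a stand-in for my actual trie, and the encoding assumes the
ch.hsr geohash-java library):

import ch.hsr.geohash.GeoHash
import org.apache.spark.SparkContext
import scala.collection.mutable

// Stand-in trie: keeps hashes sorted and finds an entry sharing the
// longest prefix with a query hash.
class PrefixTrie extends Serializable {
  private val entries = mutable.TreeSet.empty[String]

  def insert(hash: String): Unit = entries += hash

  def nearest(hash: String): Option[String] = {
    var n = hash.length
    while (n > 0) {
      val prefix = hash.take(n)
      val it = entries.iteratorFrom(prefix)   // first entry >= prefix
      if (it.hasNext) {
        val candidate = it.next()
        if (candidate.startsWith(prefix)) return Some(candidate)
      }
      n -= 1
    }
    None
  }
}

def buildAndBroadcast(sc: SparkContext, refPoints: Seq[(Double, Double)]) = {
  val trie = new PrefixTrie
  refPoints.foreach { case (lat, lon) =>
    trie.insert(GeoHash.withCharacterPrecision(lat, lon, 8).toBase32)
  }
  sc.broadcast(trie)  // shipped once per executor, not once per task
}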

It works well on small data sets, but on big data sets I am getting OOMs.
I have changed spark.storage.memoryFraction to 0.4, but I still see many
partitions being recalculated.

What are the different settings I should look at when broadcasting big
objects of 3-4 GB?
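
For concreteness, this is the kind of configuration I am experimenting
with so far (Spark 1.x setting names; the values are guesses, not
recommendations):

import org.apache.spark.SparkConf

// Settings that affect large broadcasts; values illustrative only.
val conf = new SparkConf()
  .set("spark.driver.memory", "8g")    // driver builds and holds the trie
                                       // (usually must be set via spark-submit)
  .set("spark.executor.memory", "8g")  // each executor stores a full copy
  .set("spark.storage.memoryFraction", "0.4")   // cache vs. execution split
  .set("spark.serializer",
    "org.apache.spark.serializer.KryoSerializer") // smaller serialized form
  .set("spark.broadcast.blockSize", "4096")      // torrent chunk size, in KB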

Thanks
Ankur

On Wed, Mar 11, 2015 at 8:58 AM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Thank you everyone!! I have started implementing the join using the
> geohash, using the first 4 characters of the hash as the key.
>
> Can I assign a confidence factor, in terms of distance, based on the
> number of matching characters in the hash?
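>
> The mapping I have in mind is roughly this (a rough sketch; the cell
> sizes are the standard approximate geohash dimensions, and note that two
> nearby points on either side of a cell border can still share only a
> short prefix):
>
> // Approximate max cell dimension in km for a shared prefix of length n.
> val cellSizeKm = Map(
>   1 -> 5000.0, 2 -> 1250.0, 3 -> 156.0, 4 -> 39.1,
>   5 -> 4.89, 6 -> 1.22, 7 -> 0.153, 8 -> 0.0382
> )
>
> def sharedPrefixLen(a: String, b: String): Int =
>   a.zip(b).takeWhile { case (x, y) => x == y }.size
>
> // Distance is bounded above by the size of the smallest shared cell.
> def maxDistanceKm(a: String, b: String): Double =
>   cellSizeKm.getOrElse(math.min(sharedPrefixLen(a, b), 8),
>     Double.PositiveInfinity)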
>
> I will also look at the other options listed here.
>
> Thanks
> Ankur
>
> On Wed, Mar 11, 2015, 6:18 AM Manas Kar <manasdebashis...@gmail.com>
> wrote:
>
>> There are a few techniques currently available.
>> GeoMesa, which uses GeoHash, can also prove useful
>> (https://github.com/locationtech/geomesa).
>>
>> Another potential candidate is
>> https://github.com/Esri/gis-tools-for-hadoop, especially
>> https://github.com/Esri/geometry-api-java for deeper customization.
>>
>> If you want to ask questions like "what is nearby me?", these are the
>> basic steps (a sketch follows the list):
>> 1) Index your geometry data with an R-tree.
>> 2) Write your joiner logic so that it takes advantage of the index to
>> get you faster access.
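>>
>> A minimal sketch of those two steps, assuming the JTS STRtree as the
>> R-tree implementation (package names are from recent JTS; older
>> releases used com.vividsolutions.jts):
>>
>> import org.locationtech.jts.geom.{Coordinate, Envelope}
>> import org.locationtech.jts.index.strtree.STRtree
>> import scala.collection.JavaConverters._
>>
>> // Step 1: index reference points in an R-tree keyed by their envelope.
>> def buildIndex(points: Seq[(Double, Double)]): STRtree = {
>>   val tree = new STRtree()
>>   points.foreach { case (lat, lon) =>
>>     tree.insert(new Envelope(new Coordinate(lon, lat)), (lat, lon))
>>   }
>>   tree
>> }
>>
>> // Step 2: joiner logic -- query a small window around the probe point
>> // and take the closest candidate (planar distance; fine for small eps).
>> def nearest(tree: STRtree, lat: Double, lon: Double,
>>             eps: Double): Option[(Double, Double)] = {
>>   val window = new Envelope(lon - eps, lon + eps, lat - eps, lat + eps)
>>   val hits = tree.query(window).asScala.collect {
>>     case (rLat: Double, rLon: Double) => (rLat, rLon)
>>   }
>>   if (hits.isEmpty) None
>>   else Some(hits.minBy { case (rLat, rLon) =>
>>     math.hypot(rLat - lat, rLon - lon)
>>   })
>> }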
>>
>> Thanks
>> Manas
>>
>>
>> On Wed, Mar 11, 2015 at 5:55 AM, Andrew Musselman <
>> andrew.mussel...@gmail.com> wrote:
>>
>>> Ted Dunning and Ellen Friedman's "Time Series Databases" has a section
>>> on this with some approaches to geo-encoding:
>>>
>>> https://www.mapr.com/time-series-databases-new-ways-store-and-access-data
>>> http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf
>>>
>>> On Tue, Mar 10, 2015 at 3:53 PM, John Meehan <jnmee...@gmail.com> wrote:
>>>
>>>> There are some techniques you can use if you geohash
>>>> <http://en.wikipedia.org/wiki/Geohash> the lat-lngs.  They will
>>>> naturally be sorted by proximity (with some edge cases, so watch out).  If
>>>> you go the join route, either by trimming the lat-lngs or geohashing them,
>>>> you’re essentially grouping nearby locations into buckets, but you have to
>>>> consider the borders of the buckets, since the nearest location may actually
>>>> be in an adjacent bucket.  Here’s a paper that discusses an implementation:
>>>> http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf
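>>>>
>>>> To handle those borders, one common trick is to probe a point’s bucket
>>>> plus its eight neighbors.  A minimal sketch, assuming the ch.hsr
>>>> geohash-java library (its GeoHash.getAdjacent() returns the eight
>>>> neighboring cells):
>>>>
>>>> import ch.hsr.geohash.GeoHash
>>>>
>>>> // Bucket keys to probe for a point: its own cell plus the 8 adjacent
>>>> // cells, so neighbors just across a cell border are not missed.
>>>> def bucketsToProbe(lat: Double, lon: Double, precision: Int): Seq[String] = {
>>>>   val center = GeoHash.withCharacterPrecision(lat, lon, precision)
>>>>   (center +: center.getAdjacent.toSeq).map(_.toBase32)
>>>> }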
>>>>
>>>> On Mar 9, 2015, at 11:42 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>>> wrote:
>>>>
>>>> Are you using SparkSQL for the join? In that case I'm not quite sure
>>>> you have a lot of options for joining on the nearest coordinate. If you are
>>>> using plain Spark code (creating key pairs on lat,lon) you can apply
>>>> logic like trimming the lat,lon, etc. If you need more precise distances,
>>>> you are better off using the haversine formula
>>>> <http://www.movable-type.co.uk/scripts/latlong.html>.
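>>>>
>>>> For reference, a standard haversine implementation (great-circle
>>>> distance, assuming a spherical Earth of radius 6371 km):
>>>>
>>>> // Great-circle distance in km between two lat/lon points (degrees).
>>>> def haversineKm(lat1: Double, lon1: Double,
>>>>                 lat2: Double, lon2: Double): Double = {
>>>>   val dLat = math.toRadians(lat2 - lat1)
>>>>   val dLon = math.toRadians(lon2 - lon1)
>>>>   val a = math.pow(math.sin(dLat / 2), 2) +
>>>>     math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
>>>>     math.pow(math.sin(dLon / 2), 2)
>>>>   2 * 6371.0 * math.asin(math.sqrt(a))
>>>> }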
>>>>
>>>>
>>>>
>>>
>>
