Mahout is getting some very fast knn code in version 0.8.

The basic workflow is that you would first do a large-scale clustering of
the data.  Then you would make a second pass that uses the clustering to
facilitate fast search for nearby points.
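
To make that concrete, here is a minimal sketch (plain Java, not the
actual Mahout code) of the second-pass idea: assign the query point to
its nearest centroid and then compare it only against the points in that
one cluster.  The centroid array and the cluster membership map are
assumed to already be in memory, and for lat/lng data you would likely
swap the squared Euclidean distance for a great-circle distance.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative sketch only -- not the Mahout API.
public class ClusteredKnnSketch {
  private final double[][] centroids;                  // [cluster][dimension]
  private final Map<Integer, List<double[]>> members;  // cluster id -> its points

  public ClusteredKnnSketch(double[][] centroids,
                            Map<Integer, List<double[]>> members) {
    this.centroids = centroids;
    this.members = members;
  }

  // Approximate k nearest neighbors: search only inside the query's cluster.
  public List<double[]> nearest(final double[] query, int k) {
    int best = 0;
    for (int c = 1; c < centroids.length; c++) {
      if (dist(query, centroids[c]) < dist(query, centroids[best])) {
        best = c;
      }
    }
    List<double[]> candidates = new ArrayList<double[]>(members.get(best));
    Collections.sort(candidates, new Comparator<double[]>() {
      public int compare(double[] a, double[] b) {
        return Double.compare(dist(query, a), dist(query, b));
      }
    });
    return candidates.subList(0, Math.min(k, candidates.size()));
  }

  private static double dist(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;   // squared Euclidean is enough for ranking
  }
}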

The clustering will require two map-reduce jobs: one to find the cluster
definitions and a second, map-only job to assign points to clusters in a
form to be used by the second pass.  The second pass is a map-only process
as well.
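
As a rough illustration of the map-only assignment pass (again, not
Mahout's implementation), a mapper along these lines could read the
"id \t lat,lng" lines from the message below, load the centroids from a
path given in the job configuration (the "centroids.path" key is just an
assumption), and emit (cluster id, original line):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch only.  Assigns each "id\tlat,lng" input line to its
// nearest centroid and emits (cluster id, original line) so a later pass can
// restrict its neighbor search to a single cluster.
public class AssignToClusterMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  private double[][] centroids;   // [cluster][{lat, lng}]

  @Override
  protected void setup(Context context) throws IOException {
    // Assumption: a "lat,lng"-per-line centroid file, produced by the
    // clustering job, is readable at a local/NFS path passed in the job
    // configuration under the (made-up) key "centroids.path".
    String path = context.getConfiguration().get("centroids.path");
    List<double[]> loaded = new ArrayList<double[]>();
    BufferedReader in = new BufferedReader(new FileReader(path));
    String line;
    while ((line = in.readLine()) != null) {
      String[] c = line.split(",");
      loaded.add(new double[] {
          Double.parseDouble(c[0]), Double.parseDouble(c[1])});
    }
    in.close();
    centroids = loaded.toArray(new double[loaded.size()][]);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t");
    String[] coords = parts[1].split(",");
    double lat = Double.parseDouble(coords[0]);
    double lng = Double.parseDouble(coords[1]);

    // Find the nearest centroid for this point.
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dLat = lat - centroids[c][0];
      double dLng = lng - centroids[c][1];
      double d = dLat * dLat + dLng * dLng;
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    context.write(new IntWritable(best), value);
  }
}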

In order to make this as efficient as possible, it is desirable to use a
distribution of Hadoop that allows you to directly map the cluster data
structures into shared memory.  If you have NFS access to your cluster,
this is easy.  Otherwise, it is considerably trickier.
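
With NFS, plain java.nio memory mapping is enough to get the centroid
data shared through the OS page cache across the mapper JVMs on a node.
A minimal sketch, assuming the centroids were written out as a flat
sequence of doubles; the path and the file layout are assumptions, not
anything Mahout-specific:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Illustrative sketch only.  Assumes the centroids were written to an
// NFS-visible file as a flat sequence of doubles: lat0, lng0, lat1, lng1, ...
public class MappedCentroids {
  public static double[][] load(String nfsPath) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(nfsPath, "r");
    FileChannel channel = raf.getChannel();
    // Read-only mapping: the pages live in the OS page cache and can be
    // shared by every mapper JVM on the same node.
    MappedByteBuffer mapped =
        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    DoubleBuffer doubles = mapped.asDoubleBuffer();

    double[][] centroids = new double[doubles.remaining() / 2][2];
    for (int i = 0; i < centroids.length; i++) {
      centroids[i][0] = doubles.get();   // lat
      centroids[i][1] = doubles.get();   // lng
    }
    raf.close();
    return centroids;
  }
}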

On Mon, Aug 27, 2012 at 4:15 PM, dexter morgan <dextermorga...@gmail.com> wrote:

> Dear list,
>
> Let's say I have a file like this:
> id \t lat,lng <-- structure
>
> 1\t40.123,-50.432
> 2\t41.431,-43.32
> ...
> ...
> Let's call it 'points.txt'.
> I'm trying to build a map-reduce job that runs over this BIG points file
> and outputs a file that will look like:
> id[lat,lng] \t [list of points in JSON standard] <--- structure
>
> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
> 2[41.431,-43.32]\t[[40.123,-50.432],...[,]]
> ...
>
> Basically it should run on ITSELF, and grab for each point the N (it will
> be an argument for the job) CLOSEST points (the mappers should calculate
> the distance).
>
> Distributed cache is not an option, so what else?  Not sure whether to
> classify it as a map-join, a reduce-join, or both?
> Would you do this in Hive somehow?
> Is it feasible in a single job?
>
> Would LOVE to hear your suggestions, code (if you think it's not that
> hard), or whatnot.
> BTW using CDH3 - rev 3 (20.23)
>
> Thanks!
>
