Hi Sean,

Great!

Is there any sample code implementing Locality Sensitive Hashing with
Spark, in either scala or python?

"However if your rule is really like "must match column A and B and
then closest value in column C then just ordering everything by A, B,
C lets you pretty much read off the answer from the result set
directly. Everything is closest to one of its two neighbors."

This is interesting since we can use Lead/Lag Windowing function if we have
only one continuous column. However,
our rule is "must match column A and B and then closest values in column C
and D - for any ID with column E = 0, and the closest ID with Column E = 1".
The distance metric between ID1 (with Column E =0) and ID2 (with Column E
=1) is defined as
abs( C1/C1 - C2/C1 ) + abs (D1/D1 - D2/D1)
One cannot do
abs( (C1/C1 + D1/D1) - (C2/C1 + D2/ D1) )


Any further tips?

Best,
Rex



On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen <so...@cloudera.com> wrote:

> The key is really to specify the distance metric that defines
> "closeness" for you. You have features that aren't on the same scale,
> and some that aren't continuous. You might look to clustering for
> ideas here, though mostly you just want to normalize the scale of
> dimensions to make them comparable.
>
> You can find nearest neighbors by brute force. If speed really matters
> you can consider locality sensitive hashing, which isn't that hard to
> implement and can give a lot of speed for a small cost in accuracy.
>
> However if your rule is really like "must match column A and B and
> then closest value in column C then just ordering everything by A, B,
> C lets you pretty much read off the answer from the result set
> directly. Everything is closest to one of its two neighbors.
>
> On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX <aoi...@gmail.com> wrote:
> > Given a table
> >
> >>     $cat data.csv
> >>
> >>     ID,State,City,Price,Number,Flag
> >>     1,CA,A,100,1000,0
> >>     2,CA,A,96,1010,1
> >>     3,CA,A,195,1010,1
> >>     4,NY,B,124,2000,0
> >>     5,NY,B,128,2001,1
> >>     6,NY,C,24,30000,0
> >>     7,NY,C,27,30100,1
> >>     8,NY,C,29,30200,0
> >>     9,NY,C,39,33000,1
> >
> >
> > Expected Result:
> >
> >     ID0, ID1
> >     1,2
> >     4,5
> >     6,7
> >     8,7
> >
> > for each ID with Flag=0 above, we want to find another ID from Flag=1,
> with
> > the same "State" and "City", and the nearest Price and Number normalized
> by
> > the corresponding values of that ID with Flag=0.
> >
> > For example, ID = 1 and ID=2, has the same State and City, but different
> > FLAG.
> > After normalized the Price and Number (Price divided by 100, Number
> divided
> > by 1000), the distance between ID=1 and ID=2 is defined as :
> > abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
> >
> >
> > What's the best way to find such nearest neighbor? Any valuable tips
> will be
> > greatly appreciated!
> >
> >
>

Reply via email to