The key is to specify the distance metric that defines "closeness"
for your problem. Your features aren't on the same scale, and some
aren't continuous. You might look to clustering for ideas here, though
mostly you just want to normalize the scale of the dimensions to make
them comparable.
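
As a rough illustration, here is the metric from your message written
as a plain Python function. The name normalized_distance and the dict
layout are just placeholders I made up, and it assumes the Flag=0 row
has non-zero values in the columns being compared.

```python
def normalized_distance(query, candidate, columns=("Price", "Number")):
    """Sum of absolute differences, each scaled by the query row's own value.

    Assumes query[c] is non-zero for every compared column.
    """
    return sum(abs(query[c] - candidate[c]) / abs(query[c]) for c in columns)


# ID=1 (Flag=0) against ID=2 (Flag=1) from the example table.
row1 = {"Price": 100, "Number": 1000}
row2 = {"Price": 96, "Number": 1010}
print(normalized_distance(row1, row2))  # approximately 0.05
```

Note that this metric is not symmetric: it scales by the Flag=0 row, so
swapping the arguments generally gives a different value.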

You can find nearest neighbors by brute force. If speed really matters,
you can consider locality-sensitive hashing, which isn't that hard to
implement and can give a large speedup for a small cost in accuracy.
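
A minimal brute-force sketch in plain Python (not Spark), using the
table from your message: group the Flag=1 rows by (State, City), then
for each Flag=0 row take the candidate with the smallest distance. The
function name is a placeholder of mine, and the distance is the one you
defined, scaled by the Flag=0 row.

```python
import csv
import io
from collections import defaultdict

DATA = """ID,State,City,Price,Number,Flag
1,CA,A,100,1000,0
2,CA,A,96,1010,1
3,CA,A,195,1010,1
4,NY,B,124,2000,0
5,NY,B,128,2001,1
6,NY,C,24,30000,0
7,NY,C,27,30100,1
8,NY,C,29,30200,0
9,NY,C,39,33000,1
"""


def nearest_by_brute_force(rows):
    """For each Flag=0 row, return (ID, ID of closest Flag=1 row in same State/City)."""
    candidates = defaultdict(list)
    for r in rows:
        if r["Flag"] == 1:
            candidates[(r["State"], r["City"])].append(r)

    def distance(q, c):
        # Differences are scaled by the Flag=0 (query) row's own values.
        return (abs(q["Price"] - c["Price"]) / q["Price"]
                + abs(q["Number"] - c["Number"]) / q["Number"])

    pairs = []
    for q in rows:
        group = candidates.get((q["State"], q["City"]), [])
        if q["Flag"] == 0 and group:
            best = min(group, key=lambda c: distance(q, c))
            pairs.append((q["ID"], best["ID"]))
    return pairs


rows = [
    {"ID": int(r["ID"]), "State": r["State"], "City": r["City"],
     "Price": float(r["Price"]), "Number": float(r["Number"]),
     "Flag": int(r["Flag"])}
    for r in csv.DictReader(io.StringIO(DATA))
]
print(nearest_by_brute_force(rows))  # [(1, 2), (4, 5), (6, 7), (8, 7)]
```

The cost is quadratic within each (State, City) group, which is fine
as long as no single group is huge.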

However, if your rule is really like "must match column A and B and
then closest value in column C", then just ordering everything by A, B,
C lets you pretty much read off the answer from the result set
directly. Everything is closest to one of its two neighbors.

On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX <aoi...@gmail.com> wrote:
> Given a table
>
>>     $cat data.csv
>>
>>     ID,State,City,Price,Number,Flag
>>     1,CA,A,100,1000,0
>>     2,CA,A,96,1010,1
>>     3,CA,A,195,1010,1
>>     4,NY,B,124,2000,0
>>     5,NY,B,128,2001,1
>>     6,NY,C,24,30000,0
>>     7,NY,C,27,30100,1
>>     8,NY,C,29,30200,0
>>     9,NY,C,39,33000,1
>
>
> Expected Result:
>
>     ID0, ID1
>     1,2
>     4,5
>     6,7
>     8,7
>
> For each ID with Flag=0 above, we want to find another ID with Flag=1 that
> has the same "State" and "City" and the nearest Price and Number, normalized
> by the corresponding values of the ID with Flag=0.
>
> For example, ID=1 and ID=2 have the same State and City, but a different
> Flag.
> After normalizing the Price and Number (Price divided by 100, Number divided
> by 1000), the distance between ID=1 and ID=2 is defined as:
> abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
>
>
> What's the best way to find such a nearest neighbor? Any valuable tips will
> be greatly appreciated!
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
