Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

Sean Owen Tue, 13 Sep 2016 12:55:38 -0700

Given the nature of your metric, I don't think you can use things like
LSH which more or less depend on a continuous metric space. This is
too specific to fit into a general framework usefully I think, but, I
think you can solve this directly with some code without much trouble.


On Tue, Sep 13, 2016 at 8:45 PM, Mobius ReX <aoi...@gmail.com> wrote:
> Hi Sean,
>
> Great!
>
> Is there any sample code implementing Locality Sensitive Hashing with Spark,
> in either scala or python?
>
> "However if your rule is really like "must match column A and B and
> then closest value in column C then just ordering everything by A, B,
> C lets you pretty much read off the answer from the result set
> directly. Everything is closest to one of its two neighbors."
>
> This is interesting since we can use Lead/Lag Windowing function if we have
> only one continuous column. However,
> our rule is "must match column A and B and then closest values in column C
> and D - for any ID with column E = 0, and the closest ID with Column E = 1".
> The distance metric between ID1 (with Column E =0) and ID2 (with Column E
> =1) is defined as
> abs( C1/C1 - C2/C1 ) + abs (D1/D1 - D2/D1)
> One cannot do
> abs( (C1/C1 + D1/D1) - (C2/C1 + D2/ D1) )
>
>
> Any further tips?
>
> Best,
> Rex
>
>
>
> On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> The key is really to specify the distance metric that defines
>> "closeness" for you. You have features that aren't on the same scale,
>> and some that aren't continuous. You might look to clustering for
>> ideas here, though mostly you just want to normalize the scale of
>> dimensions to make them comparable.
>>
>> You can find nearest neighbors by brute force. If speed really matters
>> you can consider locality sensitive hashing, which isn't that hard to
>> implement and can give a lot of speed for a small cost in accuracy.
>>
>> However if your rule is really like "must match column A and B and
>> then closest value in column C then just ordering everything by A, B,
>> C lets you pretty much read off the answer from the result set
>> directly. Everything is closest to one of its two neighbors.
>>
>> On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX <aoi...@gmail.com> wrote:
>> > Given a table
>> >
>> >>     $cat data.csv
>> >>
>> >>     ID,State,City,Price,Number,Flag
>> >>     1,CA,A,100,1000,0
>> >>     2,CA,A,96,1010,1
>> >>     3,CA,A,195,1010,1
>> >>     4,NY,B,124,2000,0
>> >>     5,NY,B,128,2001,1
>> >>     6,NY,C,24,30000,0
>> >>     7,NY,C,27,30100,1
>> >>     8,NY,C,29,30200,0
>> >>     9,NY,C,39,33000,1
>> >
>> >
>> > Expected Result:
>> >
>> >     ID0, ID1
>> >     1,2
>> >     4,5
>> >     6,7
>> >     8,7
>> >
>> > for each ID with Flag=0 above, we want to find another ID from Flag=1,
>> > with
>> > the same "State" and "City", and the nearest Price and Number normalized
>> > by
>> > the corresponding values of that ID with Flag=0.
>> >
>> > For example, ID = 1 and ID=2, has the same State and City, but different
>> > FLAG.
>> > After normalized the Price and Number (Price divided by 100, Number
>> > divided
>> > by 1000), the distance between ID=1 and ID=2 is defined as :
>> > abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
>> >
>> >
>> > What's the best way to find such nearest neighbor? Any valuable tips
>> > will be
>> > greatly appreciated!
>> >
>> >
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

Reply via email to