Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

Mobius ReX Tue, 13 Sep 2016 21:45:07 -0700

Yes, that's it! Thank you, Ayan.



On Tue, Sep 13, 2016 at 5:50 PM, ayan guha <guha.a...@gmail.com> wrote:

> >>> df.show()
>
> +----+---+---+-----+-----+-----+
> |city|flg| id|  nbr|price|state|
> +----+---+---+-----+-----+-----+
> |  CA|  0|  1| 1000|  100|    A|
> |  CA|  1|  2| 1010|   96|    A|
> |  CA|  1|  3| 1010|  195|    A|
> |  NY|  0|  4| 2000|  124|    B|
> |  NY|  1|  5| 2001|  128|    B|
> |  NY|  0|  6|30000|   24|    C|
> |  NY|  1|  7|30100|   27|    C|
> |  NY|  0|  8|30200|   29|    C|
> |  NY|  1|  9|33000|   39|    C|
> +----+---+---+-----+-----+-----+
>
>
> >>> flg0 = df.filter(df.flg==0)
> >>> flg1 = df.filter(df.flg!=0)
> >>> flg0.registerTempTable("t_flg0")
> >>> flg1.registerTempTable("t_flg1")
>
> >>> j = sqlContext.sql("select *, rank() over (partition by id0 order by
> dist) r from (select *,x.id as id0,y.id as id1, abs(x.nbr/1000 -
> y.nbr/1000) + abs(x.price/100 - y.price/100) as dist from t_flg0 x inner
> join t_flg1 y on (x.city=y.city and x.state=y.state))x ")
>
>
> >>> j.show()
>
> city flg  id   nbr price state city flg  id   nbr price state id0 id1
> dist   r
>   *CA* *0* *1* *1000* *100* *    A* *  CA* *1* *2* *1010* *96* *    A* *1*
> *2* *0.05* *1*
>   CA 0 1 1000 100     A   CA 1 3 1010 195     A 1 3 0.96 2
>   *NY* *0* *4* *2000* *124* *    B* *  NY* *1* *5* *2001* *128* *    B*
> *4* *5* *0.041* *1*
>  * NY* *0* *6* *30000* *24* *    C* *  NY* *1* *7* *30100* *27* *    C*
> *6* *7* *0.13* *1*
>   NY 0 6 30000 24     C   NY 1 9 33000 39     C 6 9 3.15 2
>   *NY* *0* *8* *30200* *29* *    C* *  NY* *1* *7* *30100* *27* *    C*
> *8* *7* *0.12* *1*
>   NY 0 8 30200 29     C   NY 1 9 33000 39     C 8 9 2.9 2
> Wherever r=1, you got a match.
>
>
>
> On Wed, Sep 14, 2016 at 5:45 AM, Mobius ReX <aoi...@gmail.com> wrote:
>
>> Hi Sean,
>>
>> Great!
>>
>> Is there any sample code implementing Locality Sensitive Hashing with
>> Spark, in either scala or python?
>>
>> "However if your rule is really like "must match column A and B and
>> then closest value in column C then just ordering everything by A, B,
>> C lets you pretty much read off the answer from the result set
>> directly. Everything is closest to one of its two neighbors."
>>
>> This is interesting since we can use Lead/Lag Windowing function if we
>> have only one continuous column. However,
>> our rule is "must match column A and B and then closest values in column
>> C and D - for any ID with column E = 0, and the closest ID with Column E = 
>> 1".
>> The distance metric between ID1 (with Column E =0) and ID2 (with Column E
>> =1) is defined as
>> abs( C1/C1 - C2/C1 ) + abs (D1/D1 - D2/D1)
>> One cannot do
>> abs( (C1/C1 + D1/D1) - (C2/C1 + D2/ D1) )
>>
>>
>> Any further tips?
>>
>> Best,
>> Rex
>>
>>
>>
>> On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> The key is really to specify the distance metric that defines
>>> "closeness" for you. You have features that aren't on the same scale,
>>> and some that aren't continuous. You might look to clustering for
>>> ideas here, though mostly you just want to normalize the scale of
>>> dimensions to make them comparable.
>>>
>>> You can find nearest neighbors by brute force. If speed really matters
>>> you can consider locality sensitive hashing, which isn't that hard to
>>> implement and can give a lot of speed for a small cost in accuracy.
>>>
>>> However if your rule is really like "must match column A and B and
>>> then closest value in column C then just ordering everything by A, B,
>>> C lets you pretty much read off the answer from the result set
>>> directly. Everything is closest to one of its two neighbors.
>>>
>>> On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX <aoi...@gmail.com> wrote:
>>> > Given a table
>>> >
>>> >>     $cat data.csv
>>> >>
>>> >>     ID,State,City,Price,Number,Flag
>>> >>     1,CA,A,100,1000,0
>>> >>     2,CA,A,96,1010,1
>>> >>     3,CA,A,195,1010,1
>>> >>     4,NY,B,124,2000,0
>>> >>     5,NY,B,128,2001,1
>>> >>     6,NY,C,24,30000,0
>>> >>     7,NY,C,27,30100,1
>>> >>     8,NY,C,29,30200,0
>>> >>     9,NY,C,39,33000,1
>>> >
>>> >
>>> > Expected Result:
>>> >
>>> >     ID0, ID1
>>> >     1,2
>>> >     4,5
>>> >     6,7
>>> >     8,7
>>> >
>>> > for each ID with Flag=0 above, we want to find another ID from Flag=1,
>>> with
>>> > the same "State" and "City", and the nearest Price and Number
>>> normalized by
>>> > the corresponding values of that ID with Flag=0.
>>> >
>>> > For example, ID = 1 and ID=2, has the same State and City, but
>>> different
>>> > FLAG.
>>> > After normalized the Price and Number (Price divided by 100, Number
>>> divided
>>> > by 1000), the distance between ID=1 and ID=2 is defined as :
>>> > abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
>>> >
>>> >
>>> > What's the best way to find such nearest neighbor? Any valuable tips
>>> will be
>>> > greatly appreciated!
>>> >
>>> >
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

Reply via email to