Yes, that's it! Thank you, Ayan.
On Tue, Sep 13, 2016 at 5:50 PM, ayan guha <guha.a...@gmail.com> wrote: > >>> df.show() > > +----+---+---+-----+-----+-----+ > |city|flg| id| nbr|price|state| > +----+---+---+-----+-----+-----+ > | CA| 0| 1| 1000| 100| A| > | CA| 1| 2| 1010| 96| A| > | CA| 1| 3| 1010| 195| A| > | NY| 0| 4| 2000| 124| B| > | NY| 1| 5| 2001| 128| B| > | NY| 0| 6|30000| 24| C| > | NY| 1| 7|30100| 27| C| > | NY| 0| 8|30200| 29| C| > | NY| 1| 9|33000| 39| C| > +----+---+---+-----+-----+-----+ > > > >>> flg0 = df.filter(df.flg==0) > >>> flg1 = df.filter(df.flg!=0) > >>> flg0.registerTempTable("t_flg0") > >>> flg1.registerTempTable("t_flg1") > > >>> j = sqlContext.sql("select *, rank() over (partition by id0 order by > dist) r from (select *,x.id as id0,y.id as id1, abs(x.nbr/1000 - > y.nbr/1000) + abs(x.price/100 - y.price/100) as dist from t_flg0 x inner > join t_flg1 y on (x.city=y.city and x.state=y.state))x ") > > > >>> j.show() > > city flg id nbr price state city flg id nbr price state id0 id1 > dist r > *CA* *0* *1* *1000* *100* * A* * CA* *1* *2* *1010* *96* * A* *1* > *2* *0.05* *1* > CA 0 1 1000 100 A CA 1 3 1010 195 A 1 3 0.96 2 > *NY* *0* *4* *2000* *124* * B* * NY* *1* *5* *2001* *128* * B* > *4* *5* *0.041* *1* > * NY* *0* *6* *30000* *24* * C* * NY* *1* *7* *30100* *27* * C* > *6* *7* *0.13* *1* > NY 0 6 30000 24 C NY 1 9 33000 39 C 6 9 3.15 2 > *NY* *0* *8* *30200* *29* * C* * NY* *1* *7* *30100* *27* * C* > *8* *7* *0.12* *1* > NY 0 8 30200 29 C NY 1 9 33000 39 C 8 9 2.9 2 > Wherever r=1, you got a match. > > > > On Wed, Sep 14, 2016 at 5:45 AM, Mobius ReX <aoi...@gmail.com> wrote: > >> Hi Sean, >> >> Great! >> >> Is there any sample code implementing Locality Sensitive Hashing with >> Spark, in either scala or python? >> >> "However if your rule is really like "must match column A and B and >> then closest value in column C then just ordering everything by A, B, >> C lets you pretty much read off the answer from the result set >> directly. Everything is closest to one of its two neighbors." >> >> This is interesting since we can use Lead/Lag Windowing function if we >> have only one continuous column. However, >> our rule is "must match column A and B and then closest values in column >> C and D - for any ID with column E = 0, and the closest ID with Column E = >> 1". >> The distance metric between ID1 (with Column E =0) and ID2 (with Column E >> =1) is defined as >> abs( C1/C1 - C2/C1 ) + abs (D1/D1 - D2/D1) >> One cannot do >> abs( (C1/C1 + D1/D1) - (C2/C1 + D2/ D1) ) >> >> >> Any further tips? >> >> Best, >> Rex >> >> >> >> On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen <so...@cloudera.com> wrote: >> >>> The key is really to specify the distance metric that defines >>> "closeness" for you. You have features that aren't on the same scale, >>> and some that aren't continuous. You might look to clustering for >>> ideas here, though mostly you just want to normalize the scale of >>> dimensions to make them comparable. >>> >>> You can find nearest neighbors by brute force. If speed really matters >>> you can consider locality sensitive hashing, which isn't that hard to >>> implement and can give a lot of speed for a small cost in accuracy. >>> >>> However if your rule is really like "must match column A and B and >>> then closest value in column C then just ordering everything by A, B, >>> C lets you pretty much read off the answer from the result set >>> directly. Everything is closest to one of its two neighbors. >>> >>> On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX <aoi...@gmail.com> wrote: >>> > Given a table >>> > >>> >> $cat data.csv >>> >> >>> >> ID,State,City,Price,Number,Flag >>> >> 1,CA,A,100,1000,0 >>> >> 2,CA,A,96,1010,1 >>> >> 3,CA,A,195,1010,1 >>> >> 4,NY,B,124,2000,0 >>> >> 5,NY,B,128,2001,1 >>> >> 6,NY,C,24,30000,0 >>> >> 7,NY,C,27,30100,1 >>> >> 8,NY,C,29,30200,0 >>> >> 9,NY,C,39,33000,1 >>> > >>> > >>> > Expected Result: >>> > >>> > ID0, ID1 >>> > 1,2 >>> > 4,5 >>> > 6,7 >>> > 8,7 >>> > >>> > for each ID with Flag=0 above, we want to find another ID from Flag=1, >>> with >>> > the same "State" and "City", and the nearest Price and Number >>> normalized by >>> > the corresponding values of that ID with Flag=0. >>> > >>> > For example, ID = 1 and ID=2, has the same State and City, but >>> different >>> > FLAG. >>> > After normalized the Price and Number (Price divided by 100, Number >>> divided >>> > by 1000), the distance between ID=1 and ID=2 is defined as : >>> > abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05 >>> > >>> > >>> > What's the best way to find such nearest neighbor? Any valuable tips >>> will be >>> > greatly appreciated! >>> > >>> > >>> >> >> > > > -- > Best Regards, > Ayan Guha >