Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?
Yes, that's it! Thank you, Ayan.

On Tue, Sep 13, 2016 at 5:50 PM, ayan guha wrote:

> >>> df.show()
> +----+---+---+-----+-----+-----+
> |city|flg| id|  nbr|price|state|
> +----+---+---+-----+-----+-----+
> |  CA|  0|  1| 1000|  100|    A|
> |  CA|  1|  2| 1010|   96|    A|
> |  CA|  1|  3| 1010|  195|    A|
> |  NY|  0|  4| 2000|  124|    B|
> |  NY|  1|  5| 2001|  128|    B|
> |  NY|  0|  6|    3|   24|    C|
> |  NY|  1|  7|30100|   27|    C|
> |  NY|  0|  8|30200|   29|    C|
> |  NY|  1|  9|33000|   39|    C|
> +----+---+---+-----+-----+-----+
>
> >>> flg0 = df.filter(df.flg==0)
> >>> flg1 = df.filter(df.flg!=0)
> >>> flg0.registerTempTable("t_flg0")
> >>> flg1.registerTempTable("t_flg1")
>
> >>> j = sqlContext.sql("select *, rank() over (partition by id0 order by dist) r "
> ...                    "from (select *, x.id as id0, y.id as id1, "
> ...                    "abs(x.nbr/1000 - y.nbr/1000) + abs(x.price/100 - y.price/100) as dist "
> ...                    "from t_flg0 x inner join t_flg1 y "
> ...                    "on (x.city = y.city and x.state = y.state)) x")
>
> >>> j.show()
> |city|flg| id|  nbr|price|state|city|flg| id|  nbr|price|state|id0|id1| dist|  r|
> |  CA|  0|  1| 1000|  100|    A|  CA|  1|  2| 1010|   96|    A|  1|  2| 0.05|  1|
> |  CA|  0|  1| 1000|  100|    A|  CA|  1|  3| 1010|  195|    A|  1|  3| 0.96|  2|
> |  NY|  0|  4| 2000|  124|    B|  NY|  1|  5| 2001|  128|    B|  4|  5|0.041|  1|
> |  NY|  0|  6|    3|   24|    C|  NY|  1|  7|30100|   27|    C|  6|  7| 0.13|  1|
> |  NY|  0|  6|    3|   24|    C|  NY|  1|  9|33000|   39|    C|  6|  9| 3.15|  2|
> |  NY|  0|  8|30200|   29|    C|  NY|  1|  7|30100|   27|    C|  8|  7| 0.12|  1|
> |  NY|  0|  8|30200|   29|    C|  NY|  1|  9|33000|   39|    C|  8|  9|  2.9|  2|
>
> Wherever r=1, you got a match.
> --
> Best Regards,
> Ayan Guha
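For later readers, here is a self-contained PySpark sketch of Ayan's join-plus-rank approach, assuming Spark 2.x and the data.csv from the original post further down the thread; the column aliases and variable names are illustrative, not from the thread.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("nearest-neighbor").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Split by Flag and rename columns so the join result is unambiguous.
flg0 = df.filter(df.Flag == 0).select([F.col(c).alias(c + "0") for c in df.columns])
flg1 = df.filter(df.Flag == 1).select([F.col(c).alias(c + "1") for c in df.columns])

joined = flg0.join(flg1, (flg0.State0 == flg1.State1) & (flg0.City0 == flg1.City1))

# Ayan's distance uses fixed divisors (1000, 100); the metric in the original
# post instead divides by the Flag=0 row's own Number and Price.
dist = (F.abs(F.col("Number0") / 1000 - F.col("Number1") / 1000)
        + F.abs(F.col("Price0") / 100 - F.col("Price1") / 100))

w = Window.partitionBy("ID0").orderBy("dist")
result = (joined.withColumn("dist", dist)
                .withColumn("r", F.rank().over(w))
                .filter("r = 1")            # r=1 is the match, as Ayan notes
                .select("ID0", "ID1", "dist"))
result.show()

Renaming the columns before the join also avoids the duplicate column names that the plain "select *" produces in Ayan's SQL.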
Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?
Hi Sean,

Now let's assume columns C and D are already normalized, so the metric simplifies to

abs(C1 - C2) + abs(D1 - D2)

Could LSH then give us a performance benefit?

Thank you again!

Best,
Rex

On Tue, Sep 13, 2016 at 12:47 PM, Sean Owen wrote:

> Given the nature of your metric, I don't think you can use things like
> LSH, which more or less depend on a continuous metric space. This is
> too specific to fit into a general framework usefully, I think, but I
> think you can solve this directly with some code without much trouble.
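As a postscript to Sean's caveat: Spark 2.1, released after this thread, added LSH to MLlib. A hedged sketch follows, with hypothetical vectorized (C, D) features; note that BucketedRandomProjectionLSH buckets under Euclidean distance, not the L1 metric discussed here, so it is an approximation of an approximation.

from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("lsh-sketch").getOrCreate()

# Hypothetical normalized (C, D) features for the E=0 and E=1 rows.
flg0 = spark.createDataFrame(
    [(1, Vectors.dense([1.00, 1.000])), (4, Vectors.dense([1.24, 2.000]))],
    ["id", "features"])
flg1 = spark.createDataFrame(
    [(2, Vectors.dense([0.96, 1.010])), (5, Vectors.dense([1.28, 2.001]))],
    ["id", "features"])

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(flg0)

# Approximate similarity join: candidate pairs within the distance threshold.
pairs = model.approxSimilarityJoin(flg0, flg1, threshold=1.5, distCol="dist")
pairs.select("datasetA.id", "datasetB.id", "dist").show()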
Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?
Hi Sean,

Great!

Is there any sample code implementing Locality Sensitive Hashing with Spark, in either Scala or Python?

"However if your rule is really like 'must match column A and B and then closest value in column C' then just ordering everything by A, B, C lets you pretty much read off the answer from the result set directly. Everything is closest to one of its two neighbors."

This is interesting, since we could use the Lead/Lag windowing functions if we had only one continuous column. However, our rule is "must match column A and B, and then the closest values in columns C and D - for any ID with column E = 0, find the closest ID with column E = 1". The distance metric between ID1 (with column E = 0) and ID2 (with column E = 1) is defined as

abs(C1/C1 - C2/C1) + abs(D1/D1 - D2/D1)

One cannot do

abs((C1/C1 + D1/D1) - (C2/C1 + D2/D1))

Any further tips?

Best,
Rex

On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen wrote:

> The key is really to specify the distance metric that defines
> "closeness" for you. You have features that aren't on the same scale,
> and some that aren't continuous. You might look to clustering for
> ideas here, though mostly you just want to normalize the scale of
> dimensions to make them comparable.
>
> You can find nearest neighbors by brute force. If speed really matters,
> you can consider locality-sensitive hashing, which isn't that hard to
> implement and can give a lot of speed for a small cost in accuracy.
>
> However, if your rule is really like "must match column A and B and
> then closest value in column C" then just ordering everything by A, B,
> C lets you pretty much read off the answer from the result set
> directly. Everything is closest to one of its two neighbors.
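A minimal sketch of the Lead/Lag idea Rex alludes to, assuming exact-match keys A and B and a single continuous column C; all names and sample rows are illustrative, not from the thread.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("leadlag-sketch").getOrCreate()
df = spark.createDataFrame(
    [("CA", "A", 96.0), ("CA", "A", 100.0), ("CA", "A", 195.0)],
    ["A", "B", "C"])

w = Window.partitionBy("A", "B").orderBy("C")
nbrs = (df.withColumn("prev_C", F.lag("C").over(w))
          .withColumn("next_C", F.lead("C").over(w)))

# The nearest neighbor on C is whichever of prev_C / next_C is closer;
# at the ends of a partition one of the two is null.
nearest = nbrs.withColumn(
    "nearest_C",
    F.when(F.col("prev_C").isNull(), F.col("next_C"))
     .when(F.col("next_C").isNull(), F.col("prev_C"))
     .when(F.abs(F.col("C") - F.col("prev_C"))
           <= F.abs(F.col("C") - F.col("next_C")), F.col("prev_C"))
     .otherwise(F.col("next_C")))
nearest.show()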
What's the best way to find the nearest neighbor in Spark? Any windowing function?
Given a table

> $ cat data.csv
>
> ID,State,City,Price,Number,Flag
> 1,CA,A,100,1000,0
> 2,CA,A,96,1010,1
> 3,CA,A,195,1010,1
> 4,NY,B,124,2000,0
> 5,NY,B,128,2001,1
> 6,NY,C,24,3,0
> 7,NY,C,27,30100,1
> 8,NY,C,29,30200,0
> 9,NY,C,39,33000,1

Expected result:

ID0, ID1
1,2
4,5
6,7
8,7

For each ID with Flag=0 above, we want to find another ID with Flag=1 that has the same "State" and "City" and the nearest Price and Number, normalized by the corresponding values of the Flag=0 ID.

For example, ID=1 and ID=2 have the same State and City, but different Flag. After normalizing Price and Number (Price divided by 100, Number divided by 1000), the distance between ID=1 and ID=2 is defined as:

abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05

What's the best way to find such a nearest neighbor? Any valuable tips will be greatly appreciated!
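For concreteness, the stated metric as a plain Python function; the function and parameter names are illustrative, not the poster's. Each candidate pair is normalized by the Flag=0 row's own Price and Number.

def distance(price0, number0, price1, number1):
    """Distance from a Flag=0 row to a Flag=1 row, normalized by the
    Flag=0 row's own Price and Number."""
    return abs(price0 - price1) / price0 + abs(number0 - number1) / number0

# ID=1 vs ID=2: abs(100-96)/100 + abs(1000-1010)/1000 = 0.04 + 0.01 = 0.05
print(distance(100, 1000, 96, 1010))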
What's the best way to detect and remove outliers in a table?
Given a table with hundreds of columns, mixed categorical and numerical attributes, and unknown value distributions, what's the best way to detect outliers?

For example, given the table

Category  Price
A         1
A         1.3
A         100
C         1

if Category C appears rarely, for example in less than 0.1% of rows, then we should remove all rows with Category=C. Assuming a continuous distribution, if the Price of Category A is rarely above 1000, then the 100 above is another outlier.

What's the best scalable way to remove all outliers? It would be laborious to plot the distribution curve for each numerical column and a histogram for each categorical column.

Any tips would be greatly appreciated!

Regards,
Rex
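One scalable approach, sketched and hedged (not from the thread): a frequency filter for categorical columns and approximate-quantile Tukey fences for numerical ones, so nothing has to be plotted. Column names and thresholds (0.1%, 1.5*IQR) are illustrative conventions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("outlier-sketch").getOrCreate()
df = spark.createDataFrame(
    [("A", 1.0), ("A", 1.3), ("A", 100.0), ("C", 1.0)], ["Category", "Price"])

total = df.count()

# Categorical: keep only values that cover at least 0.1% of the rows.
common = (df.groupBy("Category").count()
            .filter(F.col("count") / total >= 0.001)
            .select("Category"))
df = df.join(common, on="Category", how="inner")

# Numerical: Tukey fences from approximate quantiles -- one pass, no plotting.
q1, q3 = df.approxQuantile("Price", [0.25, 0.75], 0.01)
iqr = q3 - q1
df = df.filter((F.col("Price") >= q1 - 1.5 * iqr) &
               (F.col("Price") <= q3 + 1.5 * iqr))
df.show()

On real data one would likely compute the fences per category (a groupBy plus a join) rather than globally, to match the "Price of Category A" framing in the question.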