Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
2000 124 B NY 1 5 2001 128 B > 4 5 0.041 1 > NY 0 6 3 24 C NY 1 7 30100 27 C > 6 7 0.13 1 > NY 0 6 3 24 C NY 1 9 33000 39 C 6 9 3.15 2 > NY 0 8 30200 29 C NY 1 7

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
On Tue, Sep 13, 2016 at 8:45 PM, Mobius ReX <aoi...@gmail.com> wrote: > > Hi Sean, > > Great! > > Is there any sample code implementing Locality Sensitive Hashing with Spark, in either Scala or Python? > > "However if your
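For reference, Spark ML later added built-in LSH estimators (MinHashLSH and BucketedRandomProjectionLSH) in Spark 2.1, i.e. after this thread. Below is only a minimal sketch of how they can be used, assuming df is the data.csv DataFrame from this thread and treating Price and Number as the numeric features; the bucket length and distance threshold are arbitrary placeholders.

// Sketch only: not code from this thread; Spark 2.1+ API.
import org.apache.spark.ml.feature.{BucketedRandomProjectionLSH, VectorAssembler}

// Pack the numeric columns into the single vector column the LSH estimator expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("Price", "Number"))
  .setOutputCol("features")
val vectorized = assembler.transform(df)

// Random-projection LSH approximates Euclidean nearest neighbors.
val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)      // tuning knob; value here is arbitrary
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")
val model = brp.fit(vectorized)

// Approximate join of Flag=0 rows against Flag=1 rows within a distance threshold.
val candidates = model.approxSimilarityJoin(
  vectorized.filter("Flag = 0"),
  vectorized.filter("Flag = 1"),
  100.0,                     // distance threshold, data-dependent
  "distCol")

For a single query point rather than a join, model.approxNearestNeighbors(vectorized, keyVector, k) returns the k approximate nearest rows for one org.apache.spark.ml.linalg.Vector key.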

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
give a lot of speed for a small cost in accuracy. > > However if your rule is really like "must match column A and B and then closest value in column C" then just ordering everything by A, B, C lets you pretty much read off the answer from the result set directly. Everything
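A minimal sketch of that "order by A, B, C" idea with a window function, assuming a DataFrame df whose columns are literally named A, B and C as in the quoted rule: within each exact-match (A, B) group sorted by C, the closest C for any row is either the previous or the next row in that order.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead}

// Partition by the exact-match columns, order by the numeric one.
val w = Window.partitionBy("A", "B").orderBy("C")

// The nearest C is whichever of prev_C / next_C is closer to C.
val withNeighbors = df
  .withColumn("prev_C", lag(col("C"), 1).over(w))
  .withColumn("next_C", lead(col("C"), 1).over(w))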

What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
Given a table

$ cat data.csv

ID,State,City,Price,Number,Flag
1,CA,A,100,1000,0
2,CA,A,96,1010,1
3,CA,A,195,1010,1
4,NY,B,124,2000,0
5,NY,B,128,2001,1
6,NY,C,24,3,0
7,NY,C,27,30100,1
8,NY,C,29,30200,0
9,NY,C,39,33000,1
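One way to express this with a window function, under the assumption (based on the sample data) that the goal is to pair each Flag=0 row with the Flag=1 row that shares its State and City and has the closest Price; the nn_* renames are only to avoid column ambiguity after the join, and spark is a SparkSession.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{abs, col, row_number}

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")

val targets = df.filter(col("Flag") === 0)
val candidates = df.filter(col("Flag") === 1)
  .withColumnRenamed("ID", "nn_ID")
  .withColumnRenamed("Price", "nn_Price")
  .withColumnRenamed("Number", "nn_Number")
  .withColumnRenamed("Flag", "nn_Flag")

// Exact match on State and City, then rank candidates by price distance per target row.
val joined = targets.join(candidates, Seq("State", "City"))
  .withColumn("price_diff", abs(col("Price") - col("nn_Price")))

val w = Window.partitionBy("ID").orderBy(col("price_diff"))
val nearest = joined.withColumn("rn", row_number().over(w)).filter(col("rn") === 1)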

What's the best way to detect and remove outliers in a table?

2016-09-01 Thread Mobius ReX
Given a table with hundreds of columns, mixing categorical and numerical attributes whose value distributions are unknown, what's the best way to detect outliers? For example, given a table

Category  Price
A         1
A         1.3
A         100
C
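For a single numeric column grouped by a categorical one, a simple per-group z-score screen is one place to start. This is only a sketch using the Category and Price names from the example above (df is assumed to be the input DataFrame), not a general answer to the mixed-type question.

import org.apache.spark.sql.functions.{abs, avg, col, stddev}

// Per-category mean and sample standard deviation of Price.
val stats = df.groupBy("Category")
  .agg(avg("Price").as("mean_price"), stddev("Price").as("std_price"))

// Flag rows more than 3 standard deviations from their category's mean.
val flagged = df.join(stats, Seq("Category"))
  .withColumn("z", abs(col("Price") - col("mean_price")) / col("std_price"))
  .filter(col("z") > 3)   // stddev is null for single-row categories, so those rows drop out

Since the distributions are unknown, a median/IQR cutoff (for example, quartiles from df.stat.approxQuantile) is usually more robust than mean/stddev, and categorical columns need a different treatment, such as flagging very rare category values by frequency.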