> NY 0 4 2000 124 B    NY 1 5 2001 128 B    4 5 0.041 1
> NY 0 6 3 24 C        NY 1 7 30100 27 C    6 7 0.13 1
> NY 0 6 3 24 C        NY 1 9 33000 39 C    6 9 3.15 2
> NY 0 8 30200 29 C    NY 1 7
On Tue, Sep 13, 2016 at 8:45 PM, Mobius ReX <aoi...@gmail.com> wrote:
> > Hi Sean,
> >
> > Great!
> >
> > Is there any sample code implementing Locality Sensitive Hashing with
> Spark,
> > in either scala or python?
> >
> > "However if your
ive a lot of speed for a small cost in accuracy.
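The speed-for-accuracy trade-off mentioned here is the sign-random-projection flavour of LSH. A minimal plain-Python sketch of the idea follows; it is illustrative only, not the Spark sample code asked about, and the hyperplanes are fixed here so the example is deterministic (real implementations draw them from a Gaussian):

```python
# Sign-random-projection LSH sketch (plain Python, illustrative only).
# Real LSH draws the hyperplanes at random; fixed ones are used here so
# the example is deterministic.

def lsh_signature(vec, planes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0)
                 for plane in planes)

planes = [(1, 0), (0, 1), (1, 1), (1, -1)]  # fixed 2-D "hyperplanes"

buckets = {}
for name, vec in [("a", (1, 2)), ("b", (1.2, 1.9)), ("c", (-2, 1))]:
    buckets.setdefault(lsh_signature(vec, planes), []).append(name)

# a and b point in nearly the same direction, so they share a signature
# (bucket); c lands in a different bucket. Only vectors that share a
# bucket need exact comparison - that is the speed gain, and an
# occasionally missed near-neighbour is the accuracy cost.
```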
>
> However if your rule is really like "must match column A and B and
> then closest value in column C" then just ordering everything by A, B,
> C lets you pretty much read off the answer from the result set
> directly. Everything
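The sort-and-read-off approach described above can be sketched in plain Python over the data.csv quoted elsewhere in this thread (illustrative only, not Spark code; the tuple column order is mine):

```python
from itertools import groupby

# Rows of the thread's data.csv: (ID, State, City, Price, Number, Flag)
rows = [
    (1, "CA", "A", 100, 1000, 0), (2, "CA", "A",  96, 1010, 1),
    (3, "CA", "A", 195, 1010, 1), (4, "NY", "B", 124, 2000, 0),
    (5, "NY", "B", 128, 2001, 1), (6, "NY", "C",  24,    3, 0),
    (7, "NY", "C",  27, 30100, 1), (8, "NY", "C",  29, 30200, 0),
    (9, "NY", "C",  39, 33000, 1),
]

# One sort by (State, City, Price); after that, each row's closest Price
# within its (State, City) group must be an immediate neighbour, so the
# answer can be read off in a single scan.
rows.sort(key=lambda r: (r[1], r[2], r[3]))

closest = {}  # ID -> ID of the nearest-Price row in the same (State, City)
for _, grp in groupby(rows, key=lambda r: (r[1], r[2])):
    grp = list(grp)
    for i, r in enumerate(grp):
        nbrs = [grp[j] for j in (i - 1, i + 1) if 0 <= j < len(grp)]
        if nbrs:
            closest[r[0]] = min(nbrs, key=lambda n: abs(n[3] - r[3]))[0]
```

For example, rows 4 and 5 (both NY/B) pair with each other, and row 7's nearest Price in NY/C is row 8.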
> Given a table
> $cat data.csv
>
> ID,State,City,Price,Number,Flag
> 1,CA,A,100,1000,0
> 2,CA,A,96,1010,1
> 3,CA,A,195,1010,1
> 4,NY,B,124,2000,0
> 5,NY,B,128,2001,1
> 6,NY,C,24,3,0
> 7,NY,C,27,30100,1
> 8,NY,C,29,30200,0
> 9,NY,C,39,33000,1
Given a table with hundreds of columns, mixing categorical and numerical
attributes, where the distribution of values is unknown, what's the best
way to detect outliers?
For example, given a table

Category  Price
A         1
A         1.3
A         100
C
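One simple, distribution-free baseline (my suggestion, not something from the thread) is to group by the categorical columns and flag numeric values with a large modified z-score; rare levels of the categorical columns themselves can be flagged separately by frequency. A minimal sketch on the Category/Price example above:

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    # Modified z-score (Iglewicz-Hoaglin): robust to unknown, skewed
    # distributions because it uses the median and the median absolute
    # deviation (MAD) instead of the mean and standard deviation.
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)  # degenerate group: no spread to measure
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

# The Category/Price example from the question: within category A,
# 100 is far from the other prices and gets flagged.
prices_by_category = {"A": [1, 1.3, 100]}
flags = {cat: mad_outliers(vals) for cat, vals in prices_by_category.items()}
```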