What's the best way to find the nearest neighbor in Hive? Any windowing function?

2016-09-13 Thread Mobius ReX
Given a table

> $cat data.csv
>
> ID,State,City,Price,Number,Flag
> 1,CA,A,100,1000,0
> 2,CA,A,96,1010,1
> 3,CA,A,195,1010,1
> 4,NY,B,124,2000,0
> 5,NY,B,128,2001,1
> 6,NY,C,24,3,0
> 7,NY,C,27,30100,1
> 8,NY,C,29,30200,0
> 9,NY,C,39,33000,1


Expected Result:

ID0, ID1
1,2
4,5
6,7
8,7

for each ID with Flag=0 above, we want to find another ID from Flag=1, with
the same "State" and "City", and the nearest Price and Number normalized by
the corresponding values of that ID with Flag=0.

For example, ID = 1 and ID=2, has the same State and City, but different
FLAG.
After normalized the Price and Number (Price divided by 100, Number divided
by 1000), the distance between ID=1 and ID=2 is:
abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05


What's the best way to find such nearest neighbor in Hive? Can we use
Lead/Lag or Rank for this case? Any valuable tips will be greatly
appreciated!


What's the best way to detect and remove outliers in a table?

2016-09-01 Thread Mobius ReX
Given a table with hundreds of columns mixed with both categorical and
numerical attributes, and the distribution of values is unknown, what's the
best way to detect outliers?

For example, given a table
Category  Price
A 1
A 1.3
A 100
C  1

If category C above appears rarely, for example less than 0.1%, then we
should remove all rows with Category=C.

Assuming continuous distribution, if Price of Category A is rarely above
1000, then 100 above is another outlier.

What's the best scalable way to remove all outliers? It would be laborious
to plot the distribution curve for each numerical column, and histogram for
each categorical column.

Any tips would be greatly appreciated!

Regards
Rex