What's the best way to detect and remove outliers in a table?

Mobius ReX Thu, 01 Sep 2016 10:58:13 -0700

Given a table with hundreds of columns, a mix of categorical and
numerical attributes, where the distribution of values is unknown, what's
the best way to detect outliers?


For example, given a table:
Category  Price
A         1
A         1.3
A         1000000
C         1

If category C above appears rarely, for example in fewer than 0.1% of
rows, then we should remove all rows with Category=C.
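
To make the rare-category rule concrete, here is a minimal single-machine
sketch in plain Python. The function name drop_rare_categories, the
min_fraction parameter, and the sample rows are all hypothetical and only
illustrate the 0.1% cutoff described above; it says nothing about how to
scale this to a large table.

```python
from collections import Counter


def drop_rare_categories(rows, column, min_fraction=0.001):
    """Keep rows whose value in `column` covers at least `min_fraction` of all rows."""
    counts = Counter(row[column] for row in rows)
    total = len(rows)
    return [row for row in rows if counts[row[column]] / total >= min_fraction]


# Hypothetical data: 2000 rows of category A and a single row of category C.
rows = [{"Category": "A", "Price": 1.0} for _ in range(2000)]
rows.append({"Category": "C", "Price": 1.0})

# C makes up 1 of 2001 rows, which is below the 0.1% cutoff, so it is dropped.
kept = drop_rare_categories(rows, "Category")
```

The same function could be applied to each categorical column in turn, which
avoids drawing a histogram per column.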

Assuming a continuous distribution, if the Price of Category A is rarely
above 1000, then the 1000000 above is another outlier.
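
One way to express this without knowing the distribution is a robust score
per category. The sketch below uses the median and the median absolute
deviation (MAD), which is a technique I am swapping in for illustration,
not something the rule above prescribes. The function name
flag_price_outliers is hypothetical, and the constants 0.6745 and 3.5 are
the conventional choices for the modified z-score, not values tuned to this
data.

```python
from collections import defaultdict
from statistics import median


def flag_price_outliers(rows, group_col, value_col, threshold=3.5):
    """Return rows whose modified z-score within their group exceeds `threshold`."""
    # Collect the numeric values of each group (here: each Category).
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])

    # Median and median absolute deviation per group.
    stats = {}
    for key, values in groups.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])
        stats[key] = (med, mad)

    outliers = []
    for row in rows:
        med, mad = stats[row[group_col]]
        if mad == 0:
            # No spread to compare against; this sketch simply skips the group.
            continue
        score = 0.6745 * abs(row[value_col] - med) / mad
        if score > threshold:
            outliers.append(row)
    return outliers


# The example table from this message.
rows = [
    {"Category": "A", "Price": 1.0},
    {"Category": "A", "Price": 1.3},
    {"Category": "A", "Price": 1000000.0},
    {"Category": "C", "Price": 1.0},
]
outliers = flag_price_outliers(rows, "Category", "Price")
```

On the example table this flags only the 1000000 row. Groups with a single
row, or with identical values, have a MAD of zero and are left alone here,
so they would need a separate rule.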

What's the best scalable way to remove all outliers? It would be laborious
to plot the distribution curve for each numerical column and a histogram
for each categorical column.

Any tips would be greatly appreciated!

Regards
Rex
