Given a table with hundreds of columns, a mix of categorical and numerical attributes whose value distributions are unknown, what's the best way to detect outliers?
For example, given this table:

Category    Price
A           1
A           1.3
A           1000000
C           1

If category C appears rarely, say in less than 0.1% of rows, then all rows with Category=C should be removed. Assuming Price is continuous, if the Price for Category A is rarely above 1000, then the value 1000000 is another outlier.

What's the best scalable way to remove all such outliers? It would be laborious to plot a distribution curve for each numerical column and a histogram for each categorical column.

Any tips would be greatly appreciated!

Regards
Rex
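To make the two rules above concrete, here is a minimal sketch in plain Python. It is one possible reading of the question, not an established method: rare categorical values are dropped by relative frequency, and numerical values are flagged with a median/MAD "modified z-score" instead of a hand-picked threshold such as 1000, since the distributions are unknown. The names `remove_outliers`, `min_freq`, `cutoff` and `group_by` are hypothetical, and the 3.5 cutoff is only a commonly quoted default, not something taken from the data.

```python
from collections import Counter, defaultdict
from statistics import median


def is_number(v):
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def remove_outliers(rows, min_freq=0.001, cutoff=3.5, group_by=None):
    """Return the rows (list of dicts) that survive both filters.

    Assumptions of this sketch: every row has the same keys, there are
    no missing values, and a column is numerical only if all of its
    values are int or float.
    """
    if not rows:
        return []
    columns = list(rows[0])
    numeric = [c for c in columns if all(is_number(r[c]) for r in rows)]
    categorical = [c for c in columns if c not in numeric]

    # Rule 1: drop rows holding a categorical value rarer than min_freq.
    n = len(rows)
    counts = {c: Counter(r[c] for r in rows) for c in categorical}
    kept = [
        r for r in rows
        if all(counts[c][r[c]] / n >= min_freq for c in categorical)
    ]

    # Rule 2: within each group (or the whole table when group_by is
    # None), flag numerical values far from the median, measured in
    # units of the median absolute deviation (MAD).
    groups = defaultdict(list)
    for i, r in enumerate(kept):
        groups[r[group_by] if group_by else None].append(i)

    flagged = set()
    for idx in groups.values():
        for col in numeric:
            values = [kept[i][col] for i in idx]
            med = median(values)
            mad = median(abs(v - med) for v in values)
            if mad == 0:
                # More than half the values are identical; this sketch
                # gives up on the column rather than flag everything.
                continue
            for i, v in zip(idx, values):
                if 0.6745 * abs(v - med) / mad > cutoff:
                    flagged.add(i)

    return [r for i, r in enumerate(kept) if i not in flagged]
```

Calling `remove_outliers(rows, group_by="Category")` applies the price check separately per category, which matches the "Price of Category A" wording. Nothing here needs a plot per column, but it does make one pass per column, so for very large tables the same counting and median logic would need to be moved into whatever engine holds the data.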