zhengruifeng created SPARK-30202:
------------------------------------

             Summary: impl QuantileTransform
                 Key: SPARK-30202
                 URL: https://issues.apache.org/jira/browse/SPARK-30202
             Project: Spark
          Issue Type: Improvement
          Components: ML, PySpark
    Affects Versions: 3.0.0
            Reporter: zhengruifeng

Recently, I encountered some practical scenarios that required mapping data to another distribution. I found that QuantileTransformer in sklearn was what I needed, so I fitted a model locally on a sampled dataset and broadcast it to transform the whole dataset in PySpark.

After that, I implemented QuantileTransform as a new Estimator atop Spark. The implementation follows scikit-learn's, but there are still several differences:

1, it uses QuantileSummaries for approximation, regardless of the size of the dataset;
2, it uses linear interpolation, with logic similar to the existing IsotonicRegression, while scikit-learn uses a bi-directional interpolation;
3, when skipZero=true, it treats sparse vectors just like dense ones, while scikit-learn has two different code paths for sparse and dense datasets.
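
For illustration only, here is a minimal pure-Python sketch of the linear-interpolation idea for a single feature mapped to a uniform [0, 1] output. It is not the code from this issue: the names fit_quantiles and transform are hypothetical, and it computes exact quantiles by sorting a small sample, whereas the proposed Estimator would rely on QuantileSummaries for approximate quantiles.

```python
from bisect import bisect_right


def fit_quantiles(sample, n_quantiles=5):
    """Compute evenly spaced empirical quantiles of `sample`.

    Hypothetical helper: uses an exact sort, not an approximate summary.
    Returns (refs, quantiles), where refs are the target probabilities.
    """
    xs = sorted(sample)
    n = len(xs)
    refs = [i / (n_quantiles - 1) for i in range(n_quantiles)]
    quantiles = []
    for p in refs:
        pos = p * (n - 1)          # fractional index into the sorted sample
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        quantiles.append(xs[lo] + frac * (xs[hi] - xs[lo]))
    return refs, quantiles


def transform(x, refs, quantiles):
    """Map x into [0, 1] by one-directional linear interpolation."""
    # Values outside the fitted range are clipped to the end points.
    if x <= quantiles[0]:
        return refs[0]
    if x >= quantiles[-1]:
        return refs[-1]
    # Rightmost segment start with quantiles[i] <= x, so quantiles[i + 1] > x
    # and the denominator below is never zero, even with tied quantiles.
    i = bisect_right(quantiles, x) - 1
    x0, x1 = quantiles[i], quantiles[i + 1]
    return refs[i] + (x - x0) / (x1 - x0) * (refs[i + 1] - refs[i])


refs, quantiles = fit_quantiles(range(101), n_quantiles=5)
mapped = [transform(v, refs, quantiles) for v in (-5, 10, 50, 200)]
```

In the workflow described above, the fitted (refs, quantiles) pair is the small model that would be computed on a sample and then broadcast to transform the full dataset. When quantiles are tied, this one-directional version always resolves to the upper reference value, which is the kind of case a bi-directional interpolation treats differently.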