[ https://issues.apache.org/jira/browse/SPARK-25782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Erik Erlandson updated SPARK-25782:
-----------------------------------
    Target Version/s: 3.0.0
         Component/s: ML
          Issue Type: New Feature  (was: Improvement)

> Add PCA Aggregator to support grouping
> --------------------------------------
>
>                 Key: SPARK-25782
>                 URL: https://issues.apache.org/jira/browse/SPARK-25782
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>    Affects Versions: 2.3.2
>            Reporter: Matt Saunders
>            Priority: Minor
>
> I built an Aggregator that computes PCA on grouped datasets. I wanted to use
> the PCA functions provided by MLlib, but they only work on a full dataset,
> and I needed to do it on a grouped dataset (like a RelationalGroupedDataset).
> So I built a little Aggregator that can do that, here's an example of how
> it's called:
> {noformat}
> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
> // For each grouping, compute a PCA matrix/vector
> val pcaModels = inputData
>   .groupBy(keys:_*)
>   .agg(pcaAggregation.as(pcaOutput)){noformat}
> I used the same algorithms under the hood as
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
> directly on Datasets without converting to RDD first.
> I've seen others who wanted this ability (for example on Stack Overflow) so
> I'd like to contribute it if it would be a benefit to the larger community.
> If there is interest, I will prepare the code for a pull request.
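
The issue does not include the PCAAggregator source, so the following is only a rough sketch of the kind of mergeable state such an aggregator could keep per group. It is plain Scala with no Spark dependency, and every name in it (CovBuffer, GroupedCovariance, byKey) is hypothetical. It assumes the buffer holds a row count, a per-column sum, and a sum of outer products, mirroring the zero/reduce/merge/finish shape of Spark's Aggregator, and it stops at the sample covariance matrix. Extracting principal components from that matrix (for example with an SVD) is left out.

```scala
// Hypothetical per-group buffer: row count, column sums, and the
// row-major d x d sum of outer products x * x^T.
final case class CovBuffer(n: Long, sum: Array[Double], outer: Array[Double])

object GroupedCovariance {
  // Empty buffer for d-dimensional input vectors.
  def zero(d: Int): CovBuffer =
    CovBuffer(0L, new Array[Double](d), new Array[Double](d * d))

  // Fold one input vector into the buffer.
  def reduce(b: CovBuffer, x: Array[Double]): CovBuffer = {
    val d = b.sum.length
    require(x.length == d, "dimension mismatch")
    CovBuffer(
      b.n + 1,
      Array.tabulate(d)(i => b.sum(i) + x(i)),
      Array.tabulate(d * d)(k => b.outer(k) + x(k / d) * x(k % d)))
  }

  // Combine two partial buffers (e.g. from different partitions).
  def merge(a: CovBuffer, b: CovBuffer): CovBuffer =
    CovBuffer(
      a.n + b.n,
      a.sum.zip(b.sum).map { case (p, q) => p + q },
      a.outer.zip(b.outer).map { case (p, q) => p + q })

  // Sample covariance (divides by n - 1), returned row-major.
  def finish(b: CovBuffer): Array[Double] = {
    require(b.n > 1, "need at least two rows per group")
    val d = b.sum.length
    val n = b.n.toDouble
    Array.tabulate(d * d) { k =>
      val i = k / d
      val j = k % d
      (b.outer(k) - b.sum(i) * b.sum(j) / n) / (n - 1)
    }
  }

  // Local stand-in for groupBy(...).agg(...): one covariance matrix per key.
  def byKey[K](rows: Seq[(K, Array[Double])], d: Int): Map[K, Array[Double]] =
    rows.groupBy(_._1).map { case (k, vs) =>
      k -> finish(vs.map(_._2).foldLeft(zero(d))((b, x) => reduce(b, x)))
    }
}
```

Because the buffer only ever adds counts and sums, merge is associative and commutative, which is what lets the computation be distributed per group. In a real Spark Aggregator the same four methods would be paired with buffer and output encoders, which this sketch omits.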