Ophir Yoktan created SPARK-34160:
------------------------------------

             Summary: pyspark.ml.stat.Summarizer should allow sparse vector results
                 Key: SPARK-34160
                 URL: https://issues.apache.org/jira/browse/SPARK-34160
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 3.0.1
            Reporter: Ophir Yoktan

currently pyspark.ml.stat.Summarizer will return a dense vector, even if the input is sparse. the Summarizer should either deduce the relevant type from the input, or add a parameter that forces sparse output code to reproduce: {{import pyspark}} {{from pyspark.sql.functions import col}} {{from pyspark.ml.stat import Summarizer}} {{from pyspark.ml.linalg import SparseVector, DenseVector}}{{sc = pyspark.SparkContext.getOrCreate()}} {{sql_context = pyspark.SQLContext(sc)}}{{df = sc.parallelize([ ( SparseVector(100, \{1: 1.0}),)]).toDF(['v'])}} {{print(df.head())}} {{print(df.select(Summarizer.mean(col('v'))).head())}} ouput: {{Row(v=SparseVector(100, \{1: 1.0})) }} {{Row(mean(v)=DenseVector([0.0, 1.0,}} {{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))}} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
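
Until such an option exists, one possible workaround is to re-sparsify the dense result after the fact. The following is only a sketch of that idea in plain Python, not Spark's implementation: the helper name `dense_to_sparse_parts` and its `tol` parameter are hypothetical and introduced here for illustration. It returns the size and an index-to-value mapping, which is the shape of input that `SparseVector(size, {index: value})` accepts.

```python
from typing import Dict, Sequence, Tuple


def dense_to_sparse_parts(
    values: Sequence[float], tol: float = 0.0
) -> Tuple[int, Dict[int, float]]:
    """Return (size, {index: value}), keeping only entries with |value| > tol.

    Hypothetical helper, not part of PySpark.
    """
    nonzero = {i: float(v) for i, v in enumerate(values) if abs(v) > tol}
    return len(values), nonzero


# Dense mean from the report: 100 entries, only index 1 is non-zero.
dense_mean = [0.0, 1.0] + [0.0] * 98
size, entries = dense_to_sparse_parts(dense_mean)
print(size, entries)  # 100 {1: 1.0}
```

With PySpark available, `SparseVector(size, entries)` would rebuild the sparse vector, and the conversion could presumably be applied to the `mean(v)` column through a UDF. Note that this still materializes the dense vector first, so it only reduces the size of the stored result and does not address the cost of the dense aggregation that the request is about.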