[ https://issues.apache.org/jira/browse/SPARK-34160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489518#comment-17489518 ]
zhengruifeng commented on SPARK-34160:
--------------------------------------

You can get a sparse vector by calling vector.compressed.

> pyspark.ml.stat.Summarizer should allow sparse vector results
> -------------------------------------------------------------
>
>                 Key: SPARK-34160
>                 URL: https://issues.apache.org/jira/browse/SPARK-34160
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 3.0.1
>            Reporter: Ophir Yoktan
>            Priority: Major
>
> Currently pyspark.ml.stat.Summarizer will return a dense vector, even if the
> input is sparse.
> The Summarizer should either deduce the relevant type from the input, or add
> a parameter that forces sparse output.
> Code to reproduce:
> import pyspark
> from pyspark.sql.functions import col
> from pyspark.ml.stat import Summarizer
> from pyspark.ml.linalg import SparseVector, DenseVector
> sc = pyspark.SparkContext.getOrCreate()
> sql_context = pyspark.SQLContext(sc)
> df = sc.parallelize([(SparseVector(100, {1: 1.0}),)]).toDF(['v'])
> print(df.head())
> print(df.select(Summarizer.mean(col('v'))).head())
> Output:
> Row(v=SparseVector(100, {1: 1.0}))
> Row(mean(v)=DenseVector([0.0, 1.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
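
The `compressed` suggestion refers to the vector API; whether the Python `DenseVector` exposes it is not confirmed in this thread, so the sketch below assumes it does not and builds the sparse form by hand. It only illustrates the idea of keeping the nonzero entries. The helper name `dense_to_sparse_parts` is hypothetical, and a plain list stands in for the dense mean so the snippet runs without a Spark installation.

```python
from typing import Dict, Sequence, Tuple


def dense_to_sparse_parts(values: Sequence[float]) -> Tuple[int, Dict[int, float]]:
    """Return (size, {index: value}) holding only the nonzero entries.

    Hypothetical helper for illustration; not part of PySpark.
    """
    nonzero = {i: float(v) for i, v in enumerate(values) if v != 0.0}
    return len(values), nonzero


# Stand-in for the dense mean from the report: 100 entries, only index 1 set.
# With a real result this would be something like row[0].toArray().
dense_mean = [0.0] * 100
dense_mean[1] = 1.0

size, entries = dense_to_sparse_parts(dense_mean)
print(size, entries)

# With PySpark installed, the parts can be passed to the sparse constructor,
# in the same (size, dict) form the reproduction code above uses:
#   from pyspark.ml.linalg import SparseVector
#   sparse_mean = SparseVector(size, entries)
```

This converts after the fact, so the aggregation itself still materializes a dense vector; it does not address the request for sparse output from the Summarizer.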