Ophir Yoktan created SPARK-34160:
------------------------------------

             Summary: pyspark.ml.stat.Summarizer should allow sparse vector 
results
                 Key: SPARK-34160
                 URL: https://issues.apache.org/jira/browse/SPARK-34160
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 3.0.1
            Reporter: Ophir Yoktan


Currently, pyspark.ml.stat.Summarizer returns a dense vector even if the 
input is sparse.

The Summarizer should either deduce the output vector type from the input or 
accept a parameter that forces sparse output.

code to reproduce:

{{import pyspark}}
{{from pyspark.sql.functions import col}}
{{from pyspark.ml.stat import Summarizer}}
{{from pyspark.ml.linalg import SparseVector, DenseVector}}
{{sc = pyspark.SparkContext.getOrCreate()}}
{{sql_context = pyspark.SQLContext(sc)}}
{{df = sc.parallelize([(SparseVector(100, \{1: 1.0}),)]).toDF(['v'])}}
{{print(df.head())}}
{{print(df.select(Summarizer.mean(col('v'))).head())}}

output:

{{Row(v=SparseVector(100, \{1: 1.0})) }}
{{Row(mean(v)=DenseVector([0.0, 1.0,}}
{{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0]))}}
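
Until the Summarizer offers such an option, one possible caller-side workaround 
is to post-process the dense result and keep only its non-zero entries. The 
sketch below is a minimal illustration in plain Python; the helper name 
dense_to_sparse_entries is hypothetical and is not part of PySpark.

```python
from typing import Dict, Sequence, Tuple


def dense_to_sparse_entries(values: Sequence[float]) -> Tuple[int, Dict[int, float]]:
    """Return (size, {index: value}) holding only the non-zero entries.

    Hypothetical helper for illustration; it is not a PySpark API.
    """
    entries = {i: float(v) for i, v in enumerate(values) if v != 0.0}
    return len(values), entries


# Same shape as the mean in the report: 100 slots, only index 1 is non-zero.
dense_mean = [0.0, 1.0] + [0.0] * 98
size, entries = dense_to_sparse_entries(dense_mean)
print(size, entries)
```

In PySpark, the returned pair could then be passed to SparseVector(size, entries), 
for example inside a udf declared with VectorUDT() as its return type and applied 
to the Summarizer.mean column (passing the dense vector's toArray() result to the 
helper). That wiring is an assumption and is not shown or tested here.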



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

