[ https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823701#comment-15823701 ]
Nick Pentreath commented on SPARK-19217:
----------------------------------------

I don't understand why

bq. You can't save these DataFrames to storage without converting the vector columns to array columns

Which storage exactly? Because using standard DF writers (such as parquet) works:

{code}
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(1, Vectors.dense(1, 2)), (2, Vectors.dense(3, 4)), (3, Vectors.dense(5, 6))],
    ["id", "vector"])

# The vector column round-trips through parquet via its UDT encoding
df.write.parquet("/tmp/spark/vecs")
df2 = spark.read.parquet("/tmp/spark/vecs/")
{code}


> Offer easy cast from vector to array
> ------------------------------------
>
>                 Key: SPARK-19217
>                 URL: https://issues.apache.org/jira/browse/SPARK-19217
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark, SQL
>    Affects Versions: 2.1.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You can't save these DataFrames to storage without converting the vector columns to array columns, and there doesn't appear to be an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do something like this instead:
> {code}
> (le_data
>     .select(
>         col('features').cast('array').alias('features')
>     ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?
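For reference, the UDF workaround listed in the description above could look roughly like this. It is a minimal sketch, not an endorsed API: it assumes the {{df}} with {{id}} and {{vector}} columns from the snippet in the comment, and it hard-codes {{DoubleType}} elements.

{code}
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, DoubleType

# UDF that converts an ML Vector into a plain Python list of doubles,
# which Spark SQL represents as array<double>
to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

df_arrays = df.select(col('id'), to_array(col('vector')).alias('vector'))
df_arrays.printSchema()  # the vector column is now an array of doubles
{code}

It works, but it pushes every row through Python, which is part of the friction a built-in {{cast()}} (or equivalent) would remove.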