[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-19217:
-------------------------------------
    Description: 
Working with ML often means working with DataFrames with vector columns. You 
can't save these DataFrames to storage (edit: at least as ORC) without 
converting the vector columns to array columns, and there doesn't appear to be 
an easy way to make that conversion.

This is a common enough problem that it is [documented on Stack 
Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions to 
making the conversion from a vector column to an array column are:
# Convert the DataFrame to an RDD and back
# Use a UDF
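
As a rough illustration of the UDF option, here is a minimal sketch. The helper names ({{vector_to_list}}, {{with_array_column}}) are made up for this example and are not part of Spark; it assumes the column holds {{pyspark.ml.linalg}} vectors, which expose {{toArray()}}.

```python
def vector_to_list(vector):
    """Convert an ML vector (dense or sparse) to a plain list of floats.

    Hypothetical helper for illustration; relies only on the vector's
    toArray() method, which DenseVector and SparseVector both provide.
    """
    if vector is None:
        # Preserve nulls instead of failing on them.
        return None
    return [float(x) for x in vector.toArray()]


def with_array_column(df, vector_col):
    """Return df with vector_col replaced by an array<double> column.

    Hypothetical helper for illustration. The Spark imports are local so
    that vector_to_list stays usable without Spark installed.
    """
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, DoubleType

    # Wrap the plain Python function as a UDF returning array<double>.
    to_array = udf(vector_to_list, ArrayType(DoubleType()))
    # withColumn with an existing name replaces that column in place.
    return df.withColumn(vector_col, to_array(col(vector_col)))
```

With the DataFrame from the snippet further down, usage would look like {{le_data = with_array_column(le_data, 'features')}}. This works, but every row passes through the Python worker, which is part of why a built-in cast would be nicer.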

Both approaches work fine, but it really seems like you should be able to do 
something like this instead:

{code}
(le_data
    .select(
        col('features').cast('array').alias('features')
    ))
{code}

We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears that 
{{cast()}} doesn't support this conversion.

Would this be an appropriate thing to add?

  was:
Working with ML often means working with DataFrames with vector columns. You 
can't save these DataFrames to storage without converting the vector columns to 
array columns, and there doesn't appear to be an easy way to make that conversion.

This is a common enough problem that it is [documented on Stack 
Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions to 
making the conversion from a vector column to an array column are:
# Convert the DataFrame to an RDD and back
# Use a UDF

Both approaches work fine, but it really seems like you should be able to do 
something like this instead:

{code}
(le_data
    .select(
        col('features').cast('array').alias('features')
    ))
{code}

We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears that 
{{cast()}} doesn't support this conversion.

Would this be an appropriate thing to add?


> Offer easy cast from vector to array
> ------------------------------------
>
>                 Key: SPARK-19217
>                 URL: https://issues.apache.org/jira/browse/SPARK-19217
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark, SQL
>    Affects Versions: 2.1.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to 
> be an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
>     .select(
>         col('features').cast('array').alias('features')
>     ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
