[ https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824175#comment-15824175 ]
Nicholas Chammas edited comment on SPARK-19217 at 1/16/17 3:41 PM:
-------------------------------------------------------------------

[~mlnick] - I'm seeing this when I try to write as ORC. The error I get is:

{code}
java.lang.ClassCastException: org.apache.spark.ml.linalg.VectorUDT cannot be cast to org.apache.spark.sql.types.StructType
{code}

The vector column in question is a column of raw probabilities appended to the source DataFrame by {{LogisticRegressionModel.transform()}}. It looks like Parquet can indeed write this to storage without issue. But given the problems with ORC (and perhaps other formats?), plus the simple API problem of needing a UDF to make such a direct conversion, I think this issue stands valid.

> Offer easy cast from vector to array
> ------------------------------------
>
>                 Key: SPARK-19217
>                 URL: https://issues.apache.org/jira/browse/SPARK-19217
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark, SQL
>    Affects Versions: 2.1.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You
> can't save these DataFrames to storage without converting the vector columns
> to array columns, and there doesn't appear to be an easy way to make that
> conversion.
> This is a common enough problem that it is [documented on Stack
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do
> something like this instead:
> {code}
> (le_data
>     .select(
>         col('features').cast('array').alias('features')
>     ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)