[ https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824175#comment-15824175 ]
Nicholas Chammas edited comment on SPARK-19217 at 1/16/17 3:41 PM:
-------------------------------------------------------------------

[~mlnick] - I'm seeing this when I try to write as ORC. The error I get is:

{code}
java.lang.ClassCastException: org.apache.spark.ml.linalg.VectorUDT cannot be cast to org.apache.spark.sql.types.StructType
{code}

The vector column in question is a column of raw probabilities appended to the source DataFrame by {{LogisticRegressionModel.transform()}}. It looks like Parquet can indeed write this to storage without issue. But given the problems with ORC (and perhaps other formats?), plus the simple API problem of needing a UDF to make such a direct conversion, I think this issue stands valid.

> Offer easy cast from vector to array
> ------------------------------------
>
>                 Key: SPARK-19217
>                 URL: https://issues.apache.org/jira/browse/SPARK-19217
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark, SQL
>    Affects Versions: 2.1.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You
> can't save these DataFrames to storage without converting the vector columns
> to array columns, and there doesn't appear to be an easy way to make that
> conversion.
> This is a common enough problem that it is [documented on Stack
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do
> something like this instead:
> {code}
> (le_data
>     .select(
>         col('features').cast('array').alias('features')
>     ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)