[ https://issues.apache.org/jira/browse/SPARK-22320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212640#comment-16212640 ]
Dongjoon Hyun commented on SPARK-22320:
---------------------------------------

Thank you, [~hyukjin.kwon]!

> ORC should support VectorUDT/MatrixUDT
> --------------------------------------
>
>                 Key: SPARK-22320
>                 URL: https://issues.apache.org/jira/browse/SPARK-22320
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.2, 2.2.0
>            Reporter: zhengruifeng
>
> I save dataframe containing vectors in ORC format, when I read it back, the
> format is changed.
> {code}
> scala> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.linalg._
> scala> val data = Seq((1,Vectors.dense(1.0,2.0)), (2,Vectors.sparse(8, Array(4), Array(1.0))))
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((1,[1.0,2.0]), (2,(8,[4],[1.0])))
> scala> val df = data.toDF("i", "vec")
> df: org.apache.spark.sql.DataFrame = [i: int, vec: vector]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,false), StructField(vec,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
> scala> df.write.orc("/tmp/123")
> scala> val df2 = spark.sqlContext.read.orc("/tmp/123")
> df2: org.apache.spark.sql.DataFrame = [i: int, vec: struct<type: tinyint, size: int ... 2 more fields>]
> scala> df2.schema
> res3: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(vec,StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), StructField(indices,ArrayType(IntegerType,true),true), StructField(values,ArrayType(DoubleType,true),true)),true))
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)