Steffen Herbold created SPARK-18301:
---------------------------------------
             Summary: VectorAssembler does not support StructTypes
                 Key: SPARK-18301
                 URL: https://issues.apache.org/jira/browse/SPARK-18301
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.0.1
         Environment: Windows Standalone Mode, Java
            Reporter: Steffen Herbold
            Priority: Minor

I tried to use the VectorAssembler on fields nested inside a struct column, as follows:
{code:java}
VectorAssembler va = new VectorAssembler().setInputCols(new String[] { "metrics.Line", "metrics.McCC" }).setOutputCol("features");
dataframe = va.transform(dataframe);
{code}

This yields the following exception:
{code:java}
Exception in thread "main" java.lang.IllegalArgumentException: Field "metrics.McCC" does not exist.
	at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
	at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
	at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
	at scala.collection.AbstractMap.getOrElse(Map.scala:59)
	at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:116)
	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:116)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:116)
	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
	at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
	at de.ugoe.cs.smartshark.jobs.DefectPredictionExample.main(DefectPredictionExample.java:53)
{code}

The schema of the dataframe is:
{noformat}
 |-- metrics: struct (nullable = true)
 |    |-- Line: double (nullable = true)
 |    |-- McCC: double (nullable = true)
...
{noformat}

The transformation works if I first use withColumn to turn "metrics.Line" and "metrics.McCC" into top-level columns of the dataframe and then pass the new column names to the VectorAssembler:
{code:java}
// Promote the nested fields to top-level columns; withColumn returns a new dataframe.
dataframe = dataframe.withColumn("Line", dataframe.col("metrics.Line"));
dataframe = dataframe.withColumn("McCC", dataframe.col("metrics.McCC"));
// Assemble from the top-level column names instead of the dotted paths.
VectorAssembler va = new VectorAssembler().setInputCols(new String[] { "Line", "McCC" }).setOutputCol("features");
dataframe = va.transform(dataframe);
{code}

However, this workaround is quite costly, and direct support for accessing the nested values would be very helpful.
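
To illustrate what the stack trace suggests is happening: the lookup in StructType.apply appears to treat the whole string "metrics.McCC" as the name of a single top-level field rather than as a path into the struct. The following is only a simplified model in plain Java, not Spark's actual code. The class NestedFieldLookup, its methods, and the use of nested maps to stand in for a schema are all hypothetical and exist only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a schema: a struct is a map from field name to
// either a type name (String) or a nested struct (another map).
class NestedFieldLookup {

    // Builds a model of the schema from the report: metrics { Line, McCC }.
    static Map<String, Object> exampleSchema() {
        Map<String, Object> metrics = new LinkedHashMap<>();
        metrics.put("Line", "double");
        metrics.put("McCC", "double");
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("metrics", metrics);
        return root;
    }

    // Flat lookup: the name must match a top-level field exactly, so a dotted
    // name such as "metrics.McCC" is not found.
    static Object flatLookup(Map<String, Object> struct, String name) {
        if (!struct.containsKey(name)) {
            throw new IllegalArgumentException("Field \"" + name + "\" does not exist.");
        }
        return struct.get(name);
    }

    // Path lookup: splits the name on '.' and descends one struct per segment.
    static Object pathLookup(Map<String, Object> struct, String path) {
        Object current = struct;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map)) {
                throw new IllegalArgumentException("Field \"" + path + "\" does not exist.");
            }
            Map<?, ?> fields = (Map<?, ?>) current;
            if (!fields.containsKey(part)) {
                throw new IllegalArgumentException("Field \"" + path + "\" does not exist.");
            }
            current = fields.get(part);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> schema = exampleSchema();
        // The path-aware lookup resolves the nested field.
        System.out.println(pathLookup(schema, "metrics.McCC"));
        // The flat lookup fails in the same way as the reported exception.
        try {
            flatLookup(schema, "metrics.McCC");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, direct support for nested values would amount to the schema check resolving dotted names the way pathLookup does, instead of the way flatLookup does.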