[ https://issues.apache.org/jira/browse/SPARK-18884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755778#comment-15755778 ]
Dongjoon Hyun commented on SPARK-18884:
---------------------------------------

+1 for the idea!

> Support Array[_] in ScalaUDF
> ----------------------------
>
>                 Key: SPARK-18884
>                 URL: https://issues.apache.org/jira/browse/SPARK-18884
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Takeshi Yamamuro
>            Priority: Minor
>
> An exception is thrown if we use `Array[_]` in `ScalaUDF`:
> {code}
> scala> import org.apache.spark.sql.execution.debug._
> scala> Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar")).write.mode("overwrite").parquet("/Users/maropu/Desktop/data/")
> scala> val df = spark.read.load("/Users/maropu/Desktop/data/")
> scala> val df = Seq((0, 1)).toDF("a", "b").select(array($"a", $"b").as("ar"))
> scala> val testArrayUdf = udf { (ar: Array[Int]) => ar.sum }
> scala> df.select(testArrayUdf($"ar")).show
> Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
>   at $anonfun$1.apply(<console>:23)
>   at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
>   at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
>   at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
>   ... 99 more
> {code}
> On the other hand, the query below passes:
> {code}
> scala> val testSeqUdf = udf { (ar: Seq[Int]) => ar.sum }
> scala> df.select(testSeqUdf($"ar")).show
> +-------+
> |UDF(ar)|
> +-------+
> |      1|
> +-------+
> {code}
> I'm not sure this behaviour is expected. The current implementation queries argument types (`DataType`) by reflection (`ScalaReflection.schemaFor`) in `sql.functions.udf`, and then creates type converters (`CatalystTypeConverters`) from those types. `Seq[_]` and `Array[_]` are both represented as `ArrayType` in `DataType`, and both are handled by `ArrayConverter`. However, since we cannot tell the two apart at the `DataType` level, ISTM it's difficult to support both array types under the current design. One idea (of course, not necessarily the best one) I have is to create the type converters directly from `TypeTag` in `sql.functions.udf`. This is a prototype (https://github.com/apache/spark/compare/master...maropu:ArrayTypeUdf#diff-89643554d9757dd3e91abff1cc6096c7R740) to support both array types in `ScalaUDF`. I'm not sure this is acceptable and welcome any suggestion.
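To make the mismatch concrete, here is a minimal spark-shell sketch (same session and `df` as the reproduction above; `ScalaReflection` is a Catalyst internal, so this assumes the 2.0.2 layout of that object). It shows that both signatures collapse to the same `DataType`, and a stopgap that declares the UDF over `Seq[Int]` and copies to an `Array` inside the closure:

{code}
scala> import org.apache.spark.sql.catalyst.ScalaReflection
scala> import org.apache.spark.sql.functions.udf

// Both erase to ArrayType(IntegerType, containsNull = false), so ScalaUDF
// cannot recover which runtime container the closure expects.
scala> ScalaReflection.schemaFor[Seq[Int]].dataType    // ArrayType(IntegerType,false)
scala> ScalaReflection.schemaFor[Array[Int]].dataType  // ArrayType(IntegerType,false)

// Stopgap until Array[_] is supported: take a Seq and convert inside.
scala> val testArrayUdf = udf { (ar: Seq[Int]) => ar.toArray.sum }
scala> df.select(testArrayUdf($"ar")).show  // works; prints 1 as above
{code}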
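As for the `TypeTag` idea, a much-simplified, hypothetical sketch of the direction (the names below are illustrative, not the prototype's actual code): branch on the `TypeTag` while `sql.functions.udf` still has it, before the argument type is erased to `ArrayType`:

{code}
import scala.reflect.runtime.universe.{TypeTag, typeOf}

// Hypothetical helper, not Spark API; hard-coded to Int elements for brevity.
// A real patch would also have to recover the element type from the TypeTag.
def inputConverterFor[T: TypeTag]: Any => Any = {
  if (typeOf[T] <:< typeOf[Array[Int]]) {
    // Catalyst hands the closure a WrappedArray (a Seq); copy it to an Array.
    { case s: Seq[_] => s.map(_.asInstanceOf[Int]).toArray
      case other     => other }
  } else {
    identity  // Seq[_] and scalar arguments keep the existing conversion path
  }
}
{code}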