Feng Liu created SPARK-22003:
--------------------------------

             Summary: vectorized reader does not work with UDF when the column is array
                 Key: SPARK-22003
                 URL: https://issues.apache.org/jira/browse/SPARK-22003
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Feng Liu

The UDF needs to deserialize the UnsafeRow. When the column type is Array, deserialization calls the `get` method of the ColumnVector used by the vectorized reader, but this method is not implemented.

Code to reproduce the issue:

{code:java}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

// Write a Parquet file containing a single array<string> column.
val fileName = "testfile"
val str = """{ "choices": ["key1", "key2", "key3"] }"""
val rdd = sc.parallelize(Seq(str))
val df = spark.read.json(rdd)
df.write.mode("overwrite").parquet(s"file:///tmp/$fileName")

// Register a UDF that takes the array column as its argument.
spark.udf.register("acf", (rows: Seq[Row]) => Option[String](null))

// Reading the file back and applying the UDF triggers the failure.
spark.read.parquet(s"file:///tmp/$fileName").select(expr("""acf(choices)""")).show
{code}
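
A possible workaround, not stated in the report: because the unimplemented `get` call is described as coming from the vectorized reader, switching that reader off for Parquet may avoid it. The fragment below is a sketch that assumes a spark-shell session, where `spark` is predefined as in the reproduction, and it uses the `spark.sql.parquet.enableVectorizedReader` setting. Whether this actually sidesteps the failure on 2.2.0 is an assumption and has not been verified here.

```scala
// Assumption: run in spark-shell, where `spark` (a SparkSession) already exists.
// Ask Spark SQL to use the non-vectorized Parquet reader for this session.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Then re-run the failing read + UDF query from the reproduction.
```

The setting applies only to the current session, so it can be set back to "true" once the query has run.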