[jira] [Assigned] (SPARK-22003) vectorized reader does not work with UDF when the column is array
[ https://issues.apache.org/jira/browse/SPARK-22003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22003:
------------------------------------

    Assignee: Apache Spark

> vectorized reader does not work with UDF when the column is array
> ------------------------------------------------------------------
>
>                 Key: SPARK-22003
>                 URL: https://issues.apache.org/jira/browse/SPARK-22003
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Feng Liu
>            Assignee: Apache Spark
>
> The UDF needs to deserialize the UnsafeRow. When the column type is Array,
> the `get` method of the ColumnVector, which is used by the vectorized
> reader, is called, but this method is not implemented.
> Code to reproduce the issue:
> {code:java}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
>
> // Write a one-row Parquet file whose only column is an array of strings.
> val fileName = "testfile"
> val str = """{ "choices": ["key1", "key2", "key3"] }"""
> val rdd = sc.parallelize(Seq(str))
> val df = spark.read.json(rdd)
> df.write.mode("overwrite").parquet(s"file:///tmp/$fileName")
>
> // Register a UDF that takes the array column, then read the file back
> // through the vectorized reader and apply the UDF.
> spark.udf.register("acf", (rows: Seq[Row]) => Option[String](null))
> spark.read.parquet(s"file:///tmp/$fileName").select(expr("""acf(choices)""")).show
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
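Since the report attributes the failure to the vectorized reader, one possible way to sidestep it is to switch that reader off for the session. The snippet below is only a sketch of that idea, not something taken from the ticket: it assumes a spark-shell session where `spark` is already defined, that the Parquet file from the reproduction exists at `/tmp/testfile`, and that the row-based Parquet reader does not hit the same unimplemented `get` call.

```scala
// Assumption: run inside spark-shell, so `spark` is the active SparkSession,
// and the file from the reproduction has already been written.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.expr

// Fall back to the non-vectorized (row-based) Parquet reader for this session.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Same UDF and query as in the reproduction.
spark.udf.register("acf", (rows: Seq[Row]) => Option[String](null))
spark.read.parquet("file:///tmp/testfile").select(expr("acf(choices)")).show()
```

Disabling the vectorized reader usually makes Parquet scans slower, so this would be a stopgap and not a fix for the underlying issue.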
[jira] [Assigned] (SPARK-22003) vectorized reader does not work with UDF when the column is array
[ https://issues.apache.org/jira/browse/SPARK-22003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22003:
------------------------------------

    Assignee:     (was: Apache Spark)

> vectorized reader does not work with UDF when the column is array
> ------------------------------------------------------------------
>
>                 Key: SPARK-22003
>                 URL: https://issues.apache.org/jira/browse/SPARK-22003
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Feng Liu