Xianjin YE created SPARK-27775: ---------------------------------- Summary: Support multiple return values for udf Key: SPARK-27775 URL: https://issues.apache.org/jira/browse/SPARK-27775 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3, 2.3.3 Reporter: Xianjin YE
Hi, I'd like to proposal one improvement to Spark SQL, namely multi alias for udf, which is inspired by one of our internal SQL systems. Current Spark SQL and Hive don't support multiple return values for one udf. One alternative would be returning StructType for UDF, and then select corresponding fields. Two downsides about that approach: * The SQL is more complex than multi alias, quite unreadable for multiple similar UDFs. * the UDF code is evaluated multiple times, one time per Projection. for example, suppose one udf is defined as below: {code:java} // Scala def myFunc: (String => (String, String)) = { s => println("xx"); (s.toLowerCase, s.toUpperCase)} val myUDF = udf(myFunc) {code} To select multiple fields of myUDF, I have to do: {code:java} // Scala spark.sql("select id, myUDF(id)._1, myUDF(id)._2 from t1").explain() == Physical Plan == *(1) Project [cast(id#12L as string) AS id#14, UDF(cast(id#12L as string))._1 AS UDF(id)._1#163, UDF(cast(id#12L as string))._2 AS UDF(id)._2#164] +- *(1) Range (0, 10, step=1, splits=48) {code} or {code:java} // Scala spark.sql("select id, id1._1, id1._2 from (select id, myUDF(id) as id1 from t1) t2").explain() == Physical Plan == *(1) Project [cast(id#12L as string) AS id#14, UDF(cast(id#12L as string))._1 AS _1#155, UDF(cast(id#12L as string))._2 AS _2#156] +- *(1) Range (0, 10, step=1, splits=48) {code} The udf `myUDF` has to be evaluated twice for two projection. If we can support multi alias for structure returned udf, we can simply do this, and extract multiple return values with only one evaluation of udf. {code:java} // Scala spark.sql("select id, myUDF(id) as (x, y) from t1"){code} [SPARK-5383|https://issues.apache.org/jira/browse/SPARK-5383] adds multi alias support for udtfs, the support for udfs is not. cc [~scwf] and [~cloud_fan] -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org