Xianjin YE created SPARK-27775:
----------------------------------

             Summary: Support multiple return values for udf
                 Key: SPARK-27775
                 URL: https://issues.apache.org/jira/browse/SPARK-27775
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.3, 2.3.3
            Reporter: Xianjin YE


Hi, I'd like to proposal one improvement to Spark SQL, namely multi alias for 
udf, which is inspired by one of our internal SQL systems.

 

Current Spark SQL and Hive don't support multiple return values  for one udf. 
One alternative would be returning StructType for UDF, and then select 
corresponding fields. Two downsides about that approach:
 * The SQL is more complex than multi alias, quite unreadable for multiple 
similar UDFs.
 * the UDF code is evaluated multiple times, one time per Projection.

for example, suppose one udf is defined as below:
{code:java}
// Scala
def myFunc: (String => (String, String)) = { s => println("xx"); 
(s.toLowerCase, s.toUpperCase)}
val myUDF = udf(myFunc)
{code}
To select multiple fields of myUDF,  I have to do:
{code:java}
// Scala
spark.sql("select id, myUDF(id)._1, myUDF(id)._2 from t1").explain()
== Physical Plan ==
*(1) Project [cast(id#12L as string) AS id#14, UDF(cast(id#12L as string))._1 
AS UDF(id)._1#163, UDF(cast(id#12L as string))._2 AS UDF(id)._2#164]
+- *(1) Range (0, 10, step=1, splits=48)
{code}
or 
{code:java}
// Scala
spark.sql("select id, id1._1, id1._2 from (select id, myUDF(id) as id1 from t1) 
t2").explain()
== Physical Plan == *(1) Project [cast(id#12L as string) AS id#14, 
UDF(cast(id#12L as string))._1 AS _1#155, UDF(cast(id#12L as string))._2 AS 
_2#156] +- *(1) Range (0, 10, step=1, splits=48)
{code}
 The udf `myUDF` has to be evaluated twice for two projection.

If we can support multi alias for structure returned udf, we can simply do 
this, and extract multiple return values with only one evaluation of udf.
{code:java}
// Scala
spark.sql("select id, myUDF(id) as (x, y) from t1"){code}
 

[SPARK-5383|https://issues.apache.org/jira/browse/SPARK-5383] adds multi alias 
support for udtfs, the support for udfs is not. cc [~scwf] and [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to