Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20163#discussion_r161365496

```diff
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
     }

     case StringType => (obj: Any) => nullSafeConvert(obj) {
+      case _: Calendar => null
       case _ => UTF8String.fromString(obj.toString)
```

--- End diff --

@cloud-fan, how about something like this then?

```scala
case StringType => (obj: Any) => nullSafeConvert(obj) {
  // Shortcut for string conversion
  case c: String => UTF8String.fromString(c)
  // Here, we return null for 'array', 'tuple', 'dict', 'list', 'datetime.datetime',
  // 'datetime.date' and 'datetime.time' because those string conversions are
  // not quite consistent with the SQL string representation of data.
  case _: java.util.Calendar | _: net.razorvine.pickle.objects.Time |
       _: java.util.List[_] | _: java.util.Map[_, _] => null
  case c if c.getClass.isArray => null
  // Here, we keep the string conversion fallback for compatibility.
  // TODO: We should revisit this and rewrite the type conversion logic in Spark 3.x.
  case other => UTF8String.fromString(other.toString)
}
```

A few tests:

`datetime.time`:

```python
from pyspark.sql.functions import udf
from datetime import time

f = udf(lambda x: time(0, 0), "string")
spark.range(1).select(f("id")).show()
```

```
+--------------------+
|        <lambda>(id)|
+--------------------+
|Time: 0 hours, 0 ...|
+--------------------+
```

`array`:

```python
from pyspark.sql.functions import udf
import array

f = udf(lambda x: array.array("c", "aaa"), "string")
spark.range(1).select(f("id")).show()
```

```
+------------+
|<lambda>(id)|
+------------+
| [C@11618d9e|
+------------+
```

`tuple`:

```python
from pyspark.sql.functions import udf

f = udf(lambda x: (x,), "string")
spark.range(1).select(f("id")).show()
```

```
+--------------------+
|        <lambda>(id)|
+--------------------+
|[Ljava.lang.Objec...|
+--------------------+
```

`list`:

```python
from pyspark.sql.functions import udf
from datetime import datetime

f = udf(lambda x: [datetime(1990, 1, 1)], "string")
spark.range(1).select(f("id")).show()
```

```
+--------------------+
|        <lambda>(id)|
+--------------------+
|[java.util.Gregor...|
+--------------------+
```
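For illustration only, here is a rough plain-Python analogue of the pattern match proposed above (the function name `to_sql_string` is hypothetical, and this is not Spark's actual code path — the real conversion happens on the JVM side after the pickled Python objects arrive as `Calendar`, `List`, `Map`, arrays, etc.):

```python
from datetime import date, datetime, time

def to_sql_string(obj):
    """Hypothetical sketch of the proposed StringType converter:
    short-circuit real strings, return None (SQL NULL) for container
    and date/time types whose default string form is inconsistent
    with the SQL string representation, and keep the str() fallback
    for everything else for compatibility."""
    if obj is None:
        return None
    # Shortcut for string conversion
    if isinstance(obj, str):
        return obj
    # Types whose string conversions are not consistent with SQL
    if isinstance(obj, (list, tuple, dict, datetime, date, time)):
        return None
    # Fallback string conversion, kept for compatibility
    return str(obj)
```

So, for example, an integer still falls back to its string form, while a list or a `datetime.time` yields NULL instead of an arbitrary `toString`-style rendering.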