[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20163

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20163#discussion_r161677327

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
```scala
@@ -144,6 +145,7 @@ object EvaluatePython {
     }
     case StringType => (obj: Any) => nullSafeConvert(obj) {
+      case _: Calendar => null
       case _ => UTF8String.fromString(obj.toString)
```
--- End diff --

btw, the array case seems a bit weird?
Github user cloud-fan commented on a diff in the pull request (the same EvaluatePython.scala hunk quoted above): https://github.com/apache/spark/pull/20163#discussion_r161455317

looks good
Github user HyukjinKwon commented on a diff in the pull request (the same EvaluatePython.scala hunk quoted above): https://github.com/apache/spark/pull/20163#discussion_r161365496

@cloud-fan, how about something like this then?

```scala
case StringType => (obj: Any) => nullSafeConvert(obj) {
  // Shortcut for string conversion
  case c: String => UTF8String.fromString(c)
  // Here, we return null for 'array', 'tuple', 'dict', 'list', 'datetime.datetime',
  // 'datetime.date' and 'datetime.time' because those string conversions are
  // not quite consistent with SQL string representation of data.
  case _: java.util.Calendar | _: net.razorvine.pickle.objects.Time |
       _: java.util.List[_] | _: java.util.Map[_, _] => null
  case c if c.getClass.isArray => null
  // Here, we keep the string conversion fall back for compatibility.
  // TODO: We should revisit this and rewrite the type conversion logic in Spark 3.x.
  case other => UTF8String.fromString(other.toString)
}
```

My few tests:

`datetime.time`:

```python
from pyspark.sql.functions import udf
from datetime import time
f = udf(lambda x: time(0, 0), "string")
spark.range(1).select(f("id")).show()
```

```
+--------------------+
|                (id)|
+--------------------+
|Time: 0 hours, 0 ...|
+--------------------+
```

`array`:

```python
from pyspark.sql.functions import udf
import array
f = udf(lambda x: array.array("c", "aaa"), "string")
spark.range(1).select(f("id")).show()
```

```
+-----------+
|       (id)|
+-----------+
|[C@11618d9e|
+-----------+
```

`tuple`:

```python
from pyspark.sql.functions import udf
f = udf(lambda x: (x,), "string")
spark.range(1).select(f("id")).show()
```

```
+--------------------+
|                (id)|
+--------------------+
|[Ljava.lang.Objec...|
+--------------------+
```

`list`:

```python
from pyspark.sql.functions import udf
from datetime import datetime
f = udf(lambda x: [datetime(1990, 1, 1)], "string")
spark.range(1).select(f("id")).show()
```

```
+--------------------+
|                (id)|
+--------------------+
|[java.util.Gregor...|
+--------------------+
```
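[Editorial note] The gist of the proposal above can be sketched in plain Python. This is an illustration only, not Spark code: the function name `to_sql_string` is made up, and the Python type checks stand in for the Java-side type patterns (`java.util.List`, `Object[]`, `java.util.Calendar`, etc.) that the Scala code matches on.

```python
from datetime import date, datetime, time

def to_sql_string(obj):
    """Mirror of the proposed StringType conversion: a fast path for
    strings, None for container and date/time values whose default string
    form is inconsistent with SQL's representation, and a str() fallback
    for everything else (kept for compatibility)."""
    if obj is None:
        return None
    # Shortcut for values that are already strings.
    if isinstance(obj, str):
        return obj
    # Containers and date/time values arrive on the JVM side as objects
    # whose toString() is inconsistent with SQL semantics, so return null.
    # (datetime is a subclass of date, so one check covers both.)
    if isinstance(obj, (list, tuple, dict, set, date, time)):
        return None
    # Keep the string-conversion fallback for compatibility.
    return str(obj)
```

For example, `to_sql_string(42)` falls through to the fallback and yields `"42"`, while `to_sql_string((1,))` hits the container branch and yields `None`, matching the proposed `null` behavior.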
Github user HyukjinKwon commented on a diff in the pull request (the same EvaluatePython.scala hunk quoted above): https://github.com/apache/spark/pull/20163#discussion_r161363813

I think there is no perfect solution .. I think https://github.com/apache/spark/pull/20163#discussion_r161363004 sounds good enough as a fix for this issue for now ..
Github user HyukjinKwon commented on a diff in the pull request (the same EvaluatePython.scala hunk quoted above): https://github.com/apache/spark/pull/20163#discussion_r161363630

BTW, seems there is another hole when we actually do the internal conversion with unexpected types:

```python
>>> from pyspark.sql.functions import udf
>>> f = udf(lambda x: x, "date")
>>> spark.range(1).select(f("id")).show()
```

```
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "./python/pyspark/worker.py", line 229, in main
    process()
  File "./python/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "./python/pyspark/worker.py", line 149, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "./python/pyspark/worker.py", line 72, in <lambda>
    return lambda *a: toInternal(f(*a))
  File "/.../pyspark/sql/types.py", line 175, in toInternal
    return d.toordinal() - self.EPOCH_ORDINAL
AttributeError: 'int' object has no attribute 'toordinal'
```
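[Editorial note] The failure above is reproducible outside Spark: PySpark's `DateType.toInternal` simply calls `toordinal()` on whatever the UDF returned, so any non-date value raises `AttributeError`. A minimal stand-in (the function name `date_to_internal` is made up for illustration; the epoch-ordinal expression mirrors the line in the traceback):

```python
from datetime import date

# Days from 0001-01-01 to the Unix epoch, as PySpark's DateType uses.
EPOCH_ORDINAL = date(1970, 1, 1).toordinal()

def date_to_internal(d):
    # Same expression as in the traceback above: days since the Unix epoch.
    return d.toordinal() - EPOCH_ORDINAL

# A real date converts fine:
print(date_to_internal(date(1970, 1, 2)))  # -> 1

# An int (e.g. the UDF returned `x` unchanged) raises AttributeError,
# because int has no toordinal() method:
try:
    date_to_internal(0)
except AttributeError as e:
    print(e)
```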
Github user HyukjinKwon commented on a diff in the pull request (the same EvaluatePython.scala hunk quoted above): https://github.com/apache/spark/pull/20163#discussion_r161363023

Oh, I didn't see the comment above when I wrote my comment.
Github user HyukjinKwon commented on a diff in the pull request (the same EvaluatePython.scala hunk quoted above): https://github.com/apache/spark/pull/20163#discussion_r161363004

So, for now .. I think it's fine as a small fix as is ... We are going to document that the return type and return value should be matched anyway .. So, expected return values will be:

```python
# Mapping Python types to Spark SQL DataType
_type_mappings = {
    type(None): NullType,
    bool: BooleanType,
    int: LongType,
    float: DoubleType,
    str: StringType,
    bytearray: BinaryType,
    decimal.Decimal: DecimalType,
    datetime.date: DateType,
    datetime.datetime: TimestampType,
    datetime.time: TimestampType,
}
```

Seems we can also check if the string conversion looks reasonable and then blacklist `net.razorvine.pickle.objects.Time` if not ... How does this sound to you @cloud-fan and @rednaxelafx?
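[Editorial note] A sketch of how such a mapping could be used to validate a returned value against a declared type. The dict below mirrors the mapping quoted above but uses SQL type names as plain strings, and `infer_sql_type` / `matches` are hypothetical helpers, not PySpark API:

```python
import datetime
import decimal

# Mirrors PySpark's _type_mappings, with SQL type names as plain strings.
_type_mappings = {
    type(None): "null",
    bool: "boolean",
    int: "bigint",
    float: "double",
    str: "string",
    bytearray: "binary",
    decimal.Decimal: "decimal",
    datetime.date: "date",
    datetime.datetime: "timestamp",
    datetime.time: "timestamp",
}

def infer_sql_type(value):
    """Return the SQL type name a Python value maps to, or None if unknown.

    Looking up the exact type (rather than isinstance checks) matters here:
    datetime.datetime subclasses datetime.date, and an isinstance walk
    could classify a datetime as a date depending on iteration order."""
    return _type_mappings.get(type(value))

def matches(value, declared):
    """True when the value's inferred SQL type agrees with the declared one."""
    return infer_sql_type(value) == declared
```

Under this scheme, `matches(1, "date")` is `False`, which is exactly the mismatch in the `udf(lambda x: x, "date")` traceback earlier in the thread.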
Github user cloud-fan commented on a diff in the pull request (the same EvaluatePython.scala hunk quoted above): https://github.com/apache/spark/pull/20163#discussion_r161362994

> check if the string conversion looks reasonably consistent by obj.toString. If not, we add it in the blacklist.

hmm, this seems weird, as the type mismatch is now defined by the Pyrolite object's `toString` behavior...
Github user HyukjinKwon commented on a diff in the pull request (the same EvaluatePython.scala hunk quoted above): https://github.com/apache/spark/pull/20163#discussion_r161362902

To be perfect, I think we should check all the types from https://github.com/irmen/Pyrolite:

```
PYTHON                 ->  JAVA
------                     ----
None                       null
bool                       boolean
int                        int
long                       long or BigInteger (depending on size)
string                     String
unicode                    String
complex                    net.razorvine.pickle.objects.ComplexNumber
datetime.date              java.util.Calendar
datetime.datetime          java.util.Calendar
datetime.time              net.razorvine.pickle.objects.Time
datetime.timedelta         net.razorvine.pickle.objects.TimeDelta
float                      double (float isn't used)
array.array                array of appropriate primitive type (char, int, short, long, float, double)
list                       java.util.List
tuple                      Object[]
set                        java.util.Set
dict                       java.util.Map
bytes                      byte[]
bytearray                  byte[]
decimal                    BigDecimal
custom class               Map (dict with class attributes including its name in "__class__")
Pyro4.core.URI             net.razorvine.pyro.PyroURI
Pyro4.core.Proxy           net.razorvine.pyro.PyroProxy
Pyro4.errors.*             net.razorvine.pyro.PyroException
Pyro4.utils.flame.FlameBuiltin              net.razorvine.pyro.FlameBuiltin
Pyro4.utils.flame.FlameModule               net.razorvine.pyro.FlameModule
Pyro4.utils.flame.RemoteInteractiveConsole  net.razorvine.pyro.FlameRemoteConsole
```

and then check if the string conversion looks reasonably consistent by `obj.toString`. If not, we add it in the blacklist. Another possibility is to whitelist `String`, but then I guess this is rather a radical change.
Github user rednaxelafx commented on a diff in the pull request (the same EvaluatePython.scala hunk quoted above): https://github.com/apache/spark/pull/20163#discussion_r161287717

I was pondering that yesterday, too... somehow I have this feeling that no matter which direction we take, there's no good answer to type mismatch situations. Let's say we blacklist more types: should we document the list so that it's clear what will definitely NOT work?
Github user cloud-fan commented on a diff in the pull request (the same EvaluatePython.scala hunk quoted above): https://github.com/apache/spark/pull/20163#discussion_r161268321

shall we blacklist more types? e.g. if a udf returns a decimal but marks the return type as string type, is it a mismatch?