This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new bda5b51  [SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)`
bda5b51 is described below

commit bda5b51576e525724315d4892e34c8fa7e27f0c7
Author: Anton Yanchenko <simplyl...@gmail.com>
AuthorDate: Thu Aug 8 11:47:25 2019 +0900

    [SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)`
    
    ## What changes were proposed in this pull request?
    
    Add missing validation for `LongType` in `pyspark.sql.types._make_type_verifier`.
    
    ## How was this patch tested?
    
    Doctests / unittests / manual tests.
    
    Unpatched version:
    ```
    In [23]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()
    Out[23]: [Row(x=None)]
    ```
    
    Patched:
    ```
    In [5]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-5-c1740fcadbf9> in <module>
    ----> 1 s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()

    /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
        689             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
        690         else:
    --> 691             rdd, schema = self._createFromLocal(map(prepare, data), schema)
        692         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
        693         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

    /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in _createFromLocal(self, data, schema)
        405         # make sure data could consumed multiple times
        406         if not isinstance(data, list):
    --> 407             data = list(data)
        408
        409         if schema is None or isinstance(schema, (list, tuple)):

    /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in prepare(obj)
        671
        672             def prepare(obj):
    --> 673                 verify_func(obj)
        674                 return obj
        675         elif isinstance(schema, DataType):

    /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj)
       1427     def verify(obj):
       1428         if not verify_nullability(obj):
    -> 1429             verify_value(obj)
       1430
       1431     return verify

    /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_struct(obj)
       1397             if isinstance(obj, dict):
       1398                 for f, verifier in verifiers:
    -> 1399                     verifier(obj.get(f))
       1400             elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
       1401                 # the order in obj could be different than dataType.fields

    /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj)
       1427     def verify(obj):
       1428         if not verify_nullability(obj):
    -> 1429             verify_value(obj)
       1430
       1431     return verify

    /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_long(obj)
       1356             if obj < -9223372036854775808 or obj > 9223372036854775807:
       1357                 raise ValueError(
    -> 1358                     new_msg("object of LongType out of range, got: %s" % obj))
       1359
       1360         verify_value = verify_long

    ValueError: field x: object of LongType out of range, got: 18446744073709551616
    ```
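
    The migration note below states that the previous behavior can be restored with `verifySchema=False`. A minimal sketch of that escape hatch (assuming a local `SparkSession`; not part of this patch):

    ```
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    schema = StructType([StructField('x', LongType())])

    # verifySchema=False skips the Python-side verifier entirely, so the
    # overflowing value is not rejected and comes back as None, as before.
    rows = spark.createDataFrame([{'x': 1 << 64}], schema, verifySchema=False).collect()
    print(rows)  # [Row(x=None)]
    ```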
    
    Closes #25117 from simplylizz/master.
    
    Authored-by: Anton Yanchenko <simplyl...@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
---
 docs/sql-migration-guide-upgrade.md    |  2 ++
 python/pyspark/sql/tests/test_types.py |  3 ++-
 python/pyspark/sql/types.py            | 14 ++++++++++++++
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md
index efd88d0..b2bd8ce 100644
--- a/docs/sql-migration-guide-upgrade.md
+++ b/docs/sql-migration-guide-upgrade.md
@@ -149,6 +149,8 @@ license: |
 
   - Since Spark 3.0, if files or subdirectories disappear during recursive directory listing (i.e. they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless `spark.sql.files.ignoreMissingFiles` is `true` (default `false`). In previous versions, these missing files or subdirectories would be i [...]
 
+  - Since Spark 3.0, `createDataFrame(..., verifySchema=True)` validates `LongType` as well in PySpark. Previously, `LongType` was not verified, so an overflowing value silently resulted in `None`. To restore the previous behavior, set `verifySchema` to `False` to disable the validation.
+
   - Since Spark 3.0, substitution order of nested WITH clauses is changed and an inner CTE definition takes precedence over an outer. In version 2.4 and earlier, `WITH t AS (SELECT 1), t2 AS (WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2` returns `1` while in version 3.0 it returns `2`. The previous behaviour can be restored by setting `spark.sql.legacy.ctePrecedence.enabled` to `true`.
 
   - Since Spark 3.0, the `add_months` function does not adjust the resulting date to a last day of month if the original date is a last day of months. For example, `select add_months(DATE'2019-02-28', 1)` results `2019-03-28`. In Spark version 2.4 and earlier, the resulting date is adjusted when the original date is a last day of months. For example, adding a month to `2019-02-28` results in `2019-03-31`.
diff --git a/python/pyspark/sql/tests/test_types.py b/python/pyspark/sql/tests/test_types.py
index 5132eec..1cd84e0 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -830,7 +830,8 @@ class DataTypeVerificationTests(unittest.TestCase):
             (2**31 - 1, IntegerType()),
 
             # Long
-            (2**64, LongType()),
+            (-(2**63), LongType()),
+            (2**63 - 1, LongType()),
 
             # Float & Double
             (1.0, FloatType()),
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index da84fc1..0c7f4ce 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -1211,6 +1211,10 @@ def _make_type_verifier(dataType, nullable=True, name=None):
     >>> _make_type_verifier(StructType([]))(None)
     >>> _make_type_verifier(StringType())("")
     >>> _make_type_verifier(LongType())(0)
+    >>> _make_type_verifier(LongType())(1 << 64) # doctest: +IGNORE_EXCEPTION_DETAIL
+    Traceback (most recent call last):
+        ...
+    ValueError:...
     >>> _make_type_verifier(ArrayType(ShortType()))(list(range(3)))
     >>> _make_type_verifier(ArrayType(StringType()))(set()) # doctest: +IGNORE_EXCEPTION_DETAIL
     Traceback (most recent call last):
@@ -1319,6 +1323,16 @@ def _make_type_verifier(dataType, nullable=True, name=None):
 
         verify_value = verify_integer
 
+    elif isinstance(dataType, LongType):
+        def verify_long(obj):
+            assert_acceptable_types(obj)
+            verify_acceptable_types(obj)
+            if obj < -9223372036854775808 or obj > 9223372036854775807:
+                raise ValueError(
+                    new_msg("object of LongType out of range, got: %s" % obj))
+
+        verify_value = verify_long
+
     elif isinstance(dataType, ArrayType):
         element_verifier = _make_type_verifier(
             dataType.elementType, dataType.containsNull, name="element in array %s" % name)

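As a quick illustration of the new check, a sketch that exercises the private helper `_make_type_verifier` this patch extends (a private API, so subject to change; behavior as shown in the added doctest and tests):

```
from pyspark.sql.types import LongType, _make_type_verifier

verify = _make_type_verifier(LongType())

# Boundary values of a 64-bit signed long pass silently (as in the updated tests).
verify(2**63 - 1)
verify(-(2**63))

# An out-of-range value now raises instead of slipping through to the JVM.
try:
    verify(1 << 64)
except ValueError as e:
    print(e)  # object of LongType out of range, got: 18446744073709551616
```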

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
