This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 1217996  [SPARK-27995][PYTHON] Note the difference between str of Python 2 and 3 at Arrow optimized
1217996 is described below

commit 1217996f1574f758d8cccc1c4e3846452d24b35b
Author: HyukjinKwon <gurwls...@apache.org>
AuthorDate: Tue Jun 11 18:43:59 2019 +0900

    [SPARK-27995][PYTHON] Note the difference between str of Python 2 and 3 at Arrow optimized
    
    ## What changes were proposed in this pull request?
    
    When Arrow optimization is enabled in Python 2.7,
    
    ```python
    import pandas
    pdf = pandas.DataFrame(["test1", "test2"])
    df = spark.createDataFrame(pdf)
    df.show()
    ```
    
    I got the following output:
    
    ```
    +----------------+
    |               0|
    +----------------+
    |[74 65 73 74 31]|
    |[74 65 73 74 32]|
    +----------------+
    ```
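
    As a side note, those cell values are simply the hex bytes of the original strings, e.g.:

    ```python
    >>> # "test1" as individual byte values in hex; matches the first cell above.
    >>> [hex(ord(c)) for c in "test1"]
    ['0x74', '0x65', '0x73', '0x74', '0x31']
    ```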
    
    This is because Python 2's `str` and `bytes` are the same type, which is easy to verify:
    
    ```python
    >>> str == bytes
    True
    >>> isinstance("a", bytes)
    True
    ```
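
    For contrast, the same checks in Python 3 (shown here only for illustration), where `str` and `bytes` are distinct types:

    ```python
    >>> # Python 3: str and bytes are separate types.
    >>> str == bytes
    False
    >>> isinstance("a", bytes)
    False
    >>> isinstance(b"a", bytes)
    True
    ```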
    
    To cut it short:
    
    1. Python 2 treats `str` as `bytes`.
    2. PySpark added some special code and hacks to recognize `str` as a string type.
    3. PyArrow / Pandas follow this Python 2 behaviour, treating `str` values as bytes.
    
    To fix, we have two options:
    
    1. Fix the behaviour to match PySpark's
    2. Note the differences
    
    Python 2 is deprecated anyway, so I think it's better to just note it and go for option 2.
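
    For reference, a minimal sketch of the workaround the added note recommends, assuming the same Python 2.7 session with Arrow optimization enabled as in the example above (expected output shown as comments; a sketch, not a verified run):

    ```python
    # Sketch only: assumes the same Python 2.7 PySpark session with Arrow
    # optimization enabled as in the example above.
    import pandas
    # Unicode literals are recognized as strings rather than bytes.
    pdf = pandas.DataFrame([u"test1", u"test2"])
    df = spark.createDataFrame(pdf)
    df.show()
    # Expected output:
    # +-----+
    # |    0|
    # +-----+
    # |test1|
    # |test2|
    # +-----+
    ```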
    
    ## How was this patch tested?
    
    Manually tested.
    
    Doc was checked too:
    
    ![Screen Shot 2019-06-11 at 6 40 07 PM](https://user-images.githubusercontent.com/6477701/59261402-59ad3b00-8c78-11e9-94a6-3236a2c338d4.png)
    
    Closes #24838 from HyukjinKwon/SPARK-27995.
    
    Authored-by: HyukjinKwon <gurwls...@apache.org>
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
---
 python/pyspark/sql/session.py | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 72d2d99..cdab840 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -653,6 +653,11 @@ class SparkSession(object):
 
         .. note:: Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
 
+        .. note:: When Arrow optimization is enabled, strings inside Pandas DataFrame in Python
+            2 are converted into bytes as they are bytes in Python 2 whereas regular strings are
+            left as strings. When using strings in Python 2, use unicode `u""` as Python standard
+            practice.
+
         >>> l = [('Alice', 1)]
         >>> spark.createDataFrame(l).collect()
         [Row(_1=u'Alice', _2=1)]


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
