Github user edlee123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18378#discussion_r181565606

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1750,6 +1761,24 @@ def _to_scala_map(sc, jm):
     return sc._jvm.PythonUtils.toScalaMap(jm)
 
 
+def _to_corrected_pandas_type(dt):
+    """
+    When converting Spark SQL records to a Pandas DataFrame, the inferred data type may be wrong.
+    This method gets the corrected data type for Pandas if that type may be inferred incorrectly.
+    """
+    import numpy as np
+    if type(dt) == ByteType:
+        return np.int8
+    elif type(dt) == ShortType:
+        return np.int16
+    elif type(dt) == IntegerType:
+        return np.int32
+    elif type(dt) == FloatType:
+        return np.float32
+    else:
--- End diff ---

Had a question: in Spark 2.2.1, if I do a `.toPandas()` on a Spark DataFrame with an integer-typed column, the resulting pandas dtype is int64, whereas in Spark 2.3.0 the ints are converted to int32. I ran the code below in both Spark 2.2.1 and 2.3.0:

```
from pyspark.sql import functions as sf

df = (spark.sparkContext.parallelize([(i,) for i in [1, 2, 3]])
      .toDF(["a"])
      .select(sf.col('a').cast('int'))
      .toPandas())
df.dtypes
```

Is this intended? We ran into it because we have unit tests in a project that passed in Spark 2.2.1 but fail in Spark 2.3.0.
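In case it helps others hitting the same test failures, here is a minimal sketch of two possible workarounds (not from the pull request itself; it assumes `spark` is a live SparkSession and that the tests expect int64 dtypes):

```
from pyspark.sql import functions as sf

# Workaround 1: cast the Spark column to 'long' before toPandas().
# LongType is not narrowed by _to_corrected_pandas_type, so pandas
# infers int64 for a non-null column.
pdf64 = (spark.sparkContext.parallelize([(i,) for i in [1, 2, 3]])
         .toDF(["a"])
         .select(sf.col('a').cast('long'))
         .toPandas())
assert str(pdf64.dtypes['a']) == 'int64'

# Workaround 2: upcast on the pandas side after the conversion.
pdf32 = (spark.sparkContext.parallelize([(i,) for i in [1, 2, 3]])
         .toDF(["a"])
         .select(sf.col('a').cast('int'))
         .toPandas())
pdf64 = pdf32.astype({'a': 'int64'})
assert str(pdf64.dtypes['a']) == 'int64'
```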