[GitHub] spark pull request #14198: Fix bugs about types that result an array of null...

zasdfgbnm Thu, 14 Jul 2016 01:58:21 -0700

GitHub user zasdfgbnm opened a pull request:

    https://github.com/apache/spark/pull/14198


    Fix bugs about types that result an array of null when creating dataframe 
using python

    ## What changes were proposed in this pull request?
    
    Fix type-related bugs that produce an array of null values when creating a 
DataFrame from Python.
    Python's array.array supports a richer set of element types than Python 
itself, e.g. we can have array('f',[1,2,3]) and array('d',[1,2,3]). The code in 
spark-sql did not take this into account, so an array('f') in your rows can 
come back as an array of null values.
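
To illustrate the distinction described above, here is a minimal sketch using only the standard library (no Spark involved). The variable names are illustrative. It shows that the element width lives on the array object through its typecode, while the items read back out are ordinary Python floats either way.

```python
from array import array

floats = array('f', [1, 2, 3])   # elements stored as C float (single precision)
doubles = array('d', [1, 2, 3])  # elements stored as C double (double precision)

# The storage type is recorded on the array itself.
print(floats.typecode, floats.itemsize)
print(doubles.typecode, doubles.itemsize)

# Items read back out are plain Python floats in both cases, so looking at
# an element alone cannot tell an 'f' array from a 'd' array.
print(type(floats[0]) is type(doubles[0]))
```

Since the elements alone do not carry the distinction, any inference has to consult array.typecode to tell the two apart.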
    
    A simple script to reproduce this:
    
    `from pyspark import SparkContext`
    `from pyspark.sql import SQLContext,Row,DataFrame`
    `from array import array`
    
    `sc = SparkContext()`
    `sqlContext = SQLContext(sc)`
    
    `row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))`
    `rows = sc.parallelize([ row1 ])`
    `df = sqlContext.createDataFrame(rows)`
    `df.show()`
    
    which produces the output
    `+---------------+------------------+`
    `|    doublearray|        floatarray|`
    `+---------------+------------------+`
    `|[1.0, 2.0, 3.0]|[null, null, null]|`
    `+---------------+------------------+`
    
    
    ## How was this patch tested?
    Tested manually.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zasdfgbnm/spark fix_array_infer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14198.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14198
    
----
commit a127486d59528eae452dcbcc2ccfb68fdd7769b7
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date:   2016-07-09T00:58:14Z

    use array.typecode to infer type
    
    Python's array has more types than Python itself; for example,
    Python only has float, while array supports 'f' (float) and 'd' (double).
    Switching to array.typecode helps Spark make a better inference.
    
    For example, for the code:
    
    from pyspark.sql.types import _infer_type
    from array import array
    a = array('f',[1,2,3,4,5,6])
    _infer_type(a)
    
    We will get ArrayType(DoubleType,true) before change,
    but ArrayType(FloatType,true) after change
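
The idea of inferring from array.typecode can be sketched roughly as follows. This is not the code from this pull request and it does not import pyspark: the mapping table and the function name infer_element_type_name are hypothetical, type names are plain strings, and only the two typecodes mentioned in the commit message are covered.

```python
from array import array

# Hypothetical lookup table: array typecode -> Spark SQL element type name.
# Only the 'f' and 'd' entries are taken from the commit message above.
_TYPECODE_TO_SPARK = {
    'f': 'FloatType',
    'd': 'DoubleType',
}


def infer_element_type_name(arr):
    """Return the element type name for an array.array based on its typecode."""
    try:
        return _TYPECODE_TO_SPARK[arr.typecode]
    except KeyError:
        # Typecodes outside the table are rejected rather than guessed.
        raise TypeError("unsupported typecode: %r" % arr.typecode)


a = array('f', [1, 2, 3, 4, 5, 6])
print(infer_element_type_name(a))
```

Keying on the typecode rather than on the type of the first element is what lets 'f' and 'd' arrays be told apart.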

commit 70131f3b81575edf9073d5be72553730d6316bd6
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date:   2016-07-09T06:21:31Z

    Merge branch 'master' into fix_array_infer

commit 505e819f415c2f754b5147908516ace6f6ddfe78
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date:   2016-07-13T12:53:18Z

    sync with upstream

commit 05979ca6eabf723cf3849ec2bf6f6e9de26cb138
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date:   2016-07-14T08:07:12Z

    add case (c: Float, FloatType) to fromJava

commit 5cd817a4e7ec68a693ee2a878a2e36b09b1965b6
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date:   2016-07-14T08:09:25Z

    sync with upstream

----


