GitHub user JasonMWhite opened a pull request:

    https://github.com/apache/spark/pull/9392

    [SPARK-11437] [PySpark] Don't .take when converting RDD to DataFrame with 
provided schema

    When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls 
`.take(10)` to verify that the first 10 rows of the RDD match the provided 
schema. This is similar to https://issues.apache.org/jira/browse/SPARK-8070, 
except that issue affected the case where no schema was provided.
    
    Verifying only the first 10 rows is of limited utility, and it forces the 
DAG to be executed eagerly rather than lazily. If verification is necessary at 
all, I believe it should be done lazily across all rows. However, since the 
caller is providing a schema to follow, I think it's acceptable to simply fail 
if the schema is incorrect.
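    The lazy per-row verification suggested above can be sketched in plain 
Python with a generator (the names and the length-only check here are 
illustrative, not Spark's actual API or validation logic):

    ```python
    def verify_rows(rows, schema_len):
        """Yield each row after checking it against the schema.

        Hypothetical sketch: rows are validated one at a time as they are
        consumed downstream, instead of eagerly pulling the first 10 rows
        up front the way .take(10) does.
        """
        for row in rows:
            if len(row) != schema_len:
                raise ValueError(
                    "row %r does not match schema of length %d" % (row, schema_len)
                )
            yield row

    # Nothing is validated yet -- the generator is lazy.
    checked = verify_rows(iter([(1, "a"), (2, "b")]), schema_len=2)

    # Verification happens only here, row by row, as the data is consumed.
    result = list(checked)
    ```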
    
    @marmbrus We chatted about this at Spark Summit EU. @davies, you made a 
similar change for the infer-schema path in 
https://github.com/apache/spark/pull/6606.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JasonMWhite/spark createDataFrame_without_take

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9392
    
----
commit a7c395f9fe7bf43b1e63af060b425fa6047b25f9
Author: Jason White <jason.wh...@shopify.com>
Date:   2015-10-31T21:17:37Z

    don't .take when converting RDD to DataFrame with provided schema

----

