[ https://issues.apache.org/jira/browse/SPARK-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15455238#comment-15455238 ]
Apache Spark commented on SPARK-17360:
--------------------------------------

User 'Stibbons' has created a pull request for this issue:
https://github.com/apache/spark/pull/14918

> PySpark can create dataframe from a Python generator
> ----------------------------------------------------
>
>                 Key: SPARK-17360
>                 URL: https://issues.apache.org/jira/browse/SPARK-17360
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Semet
>            Priority: Trivial
>
> It looks like one can create a DataFrame from a Python generator, which might
> be more efficient than building a list of rows and passing it to createDataFrame:
> {code}
> >>> # On Python 3, use "range" instead of "xrange" on the following line
> >>> d = ({'name': 'Alice-{}'.format(i), 'age': i} for i in xrange(0, 10000000))
> >>> d  # Note that 'd' is a generator, not a structure holding the 10000000 elements.
> <generator object <genexpr> at 0x7f1234b92af0>
> >>> sqlContext.createDataFrame(d).take(5)
> [Row(age=1, name=u'Alice-1')]
> [Row(age=2, name=u'Alice-2')]
> [Row(age=3, name=u'Alice-3')]
> [Row(age=4, name=u'Alice-4')]
> [Row(age=5, name=u'Alice-5')]
> {code}
> Looking at the code, nothing substantial needs to change in the implementation;
> only the documentation and unit tests need updating.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
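
For readers following the issue, here is a minimal sketch of the pattern it describes, written for Python 3. The helper names `make_rows` and `to_dataframe` are hypothetical and appear neither in the issue nor in the pull request. The `session` argument is assumed to be a SparkSession or SQLContext created elsewhere. The sketch only shows that the generator itself is lazy. It does not show that Spark streams it, since PySpark may well materialize a local generator into a list on the driver before building the DataFrame.

```python
from itertools import islice


def make_rows(n):
    """Return a generator of row dicts; no row is built until it is iterated."""
    return ({'name': 'Alice-{}'.format(i), 'age': i} for i in range(n))


def to_dataframe(session, rows):
    """Pass an iterable of rows to createDataFrame.

    Assumption: `session` is a SparkSession or SQLContext created elsewhere.
    """
    return session.createDataFrame(rows)


# Peek at the first rows without Spark: islice pulls only five items
# from the generator and leaves the rest unevaluated.
rows = make_rows(10000000)
first_five = list(islice(rows, 5))
print(first_five[0])
```

With a live session, `to_dataframe(session, make_rows(10000000)).take(5)` would correspond to the call quoted in the issue. Whether that saves driver memory compared with passing a list depends on how the installed PySpark version handles non-list input.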