Semet created SPARK-17360:
-----------------------------

             Summary: PySpark can create dataframe from a Python generator
                 Key: SPARK-17360
                 URL: https://issues.apache.org/jira/browse/SPARK-17360
             Project: Spark
          Issue Type: Improvement
            Reporter: Semet
            Priority: Trivial

It looks like one can create a DataFrame from a Python generator, which might be more efficient than building a list of rows and passing it to createDataFrame:

{code}
>>> # On Python 3, use "range" instead of "xrange" on the following line
>>> d = ({'name': 'Alice-{}'.format(i), 'age': i} for i in xrange(0, 10000000))
>>> d  # Note that 'd' is a generator, not a structure holding the 10000000 elements.
<generator object <genexpr> at 0x7f1234b92af0>
>>> sqlContext.createDataFrame(d).take(5)
[Row(age=1, name=u'Alice-1')]
[Row(age=2, name=u'Alice-2')]
[Row(age=3, name=u'Alice-3')]
[Row(age=4, name=u'Alice-4')]
[Row(age=5, name=u'Alice-5')]
{code}

Looking at the code, nothing significant needs to change in the implementation; only the documentation and unit tests need updating.
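
To illustrate the point that 'd' is a generator and not a structure holding every element, here is a minimal pure-Python sketch (Python 3, no Spark involved). The helper name make_records and the reduced record count are assumptions made for illustration only. The sketch only compares the generator with a fully built list on the Python side; it does not check how createDataFrame consumes the generator internally.

```python
import sys
from itertools import islice


def make_records(n):
    """Yield one dict per record, lazily (hypothetical helper, not a PySpark API)."""
    for i in range(n):
        yield {'name': 'Alice-{}'.format(i), 'age': i}


# Smaller than the 10000000 used in the report, to keep the sketch quick.
n = 100000

gen = make_records(n)            # nothing is computed yet
as_list = list(make_records(n))  # every record is built and kept in memory

# getsizeof reports the container only (for the list: its pointer array,
# not the dicts it references), so this understates the real difference.
gen_size = sys.getsizeof(gen)
list_size = sys.getsizeof(as_list)

# Pull only the first five records; the rest of the generator is untouched.
first_five = list(islice(gen, 5))

print(gen_size, list_size)
print(first_five)
```

The generator object stays small regardless of n, while the list grows with the number of records, which is the potential saving the report refers to.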