Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/21546 ### Performance Tests - createDataFrame Tests run on a 4 node standalone cluster with 32 cores total, 14.04.1-Ubuntu and OpenJDK 8 measured wall clock time to execute `createDataFrame()` and get the first record. Took the average best time of 5 runs/5 loops each. Test code ```python def run(): pdf = pd.DataFrame(np.random.rand(10000000, 10)) spark.createDataFrame(pdf).first() for i in range(6): start = time.time() run() elapsed = time.time() - start gc.collect() print("Run %d: %f" % (i, elapsed)) ``` | Current Master | This PR | |---------------------|------------| 6.234608 | 5.665641 6.32144 | 5.3475 6.527859 | 5.370803 6.95089 | 5.479151 6.235046 | 5.529167 | Avg Master | Avg This PR -----------|-------------- 6.4539686 | 5.4784524 #### Speedup of 1.178064192
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org