Hi,

A common pattern in my work is querying large tables as Spark DataFrames and then doing more detailed analysis locally once the data can fit into memory. However, I've hit a few blockers: in Scala no well-developed DataFrame library exists, and in Python the `toPandas` method is very slow. As Pandas is one of the best DataFrame libraries out there, it may be worth spending some time on making `toPandas` more efficient.
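For context, the workflow looks roughly like this (the table and column names are made up, and I'm assuming an existing SparkSession called `spark`):

    from pyspark.sql import functions as F

    # Heavy aggregation runs in Spark against the large table...
    sdf = (spark.table("events")                      # hypothetical table
               .where(F.col("event_date") == "2016-01-01")
               .groupBy("user_id")
               .agg(F.count("*").alias("n_events")))

    # ...and once the result is small enough, it is pulled to the driver
    # for local analysis. This toPandas() call is the slow step.
    pdf = sdf.toPandas()
    print(pdf.describe())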
Having a quick look at the code, it looks like a lot of iteration is occurring on the Python side. Python is really slow at iterating over large loops, so this should be avoided; if iteration does have to happen, it's best done in Cython. Has anyone looked at Cythonising the process? Or, even better, serialising the data directly into NumPy arrays instead of going through the intermediate lists of Rows? A toy sketch of what I mean is pasted below my signature.

Here are some links to the current code:

toPandas: https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L1342
collect: https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L233
_load_from_socket: https://github.com/apache/spark/blob/a60f91284ceee64de13f04559ec19c13a820a133/python/pyspark/rdd.py#L123

Josh Levy-Kramer
Data Scientist @ Starcount
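To illustrate the difference, here is a self-contained toy (not Spark code; the buffers below just stand in for per-column bytes that the JVM could write to the socket):

    import numpy as np
    import pandas as pd

    n = 1000000

    # Roughly what happens today: a list of per-row Python objects is
    # materialised on the driver and pandas iterates over it in Python.
    rows = [(i, i * 0.5) for i in range(n)]
    df_rows = pd.DataFrame.from_records(rows, columns=["id", "value"])

    # What direct NumPy serialisation could look like: one contiguous binary
    # buffer per column, turned into an array with no Python-level iteration.
    id_buf = np.arange(n, dtype=np.int64).tobytes()            # stand-in for bytes off the socket
    value_buf = (np.arange(n, dtype=np.float64) * 0.5).tobytes()

    df_cols = pd.DataFrame(
        {"id": np.frombuffer(id_buf, dtype=np.int64),
         "value": np.frombuffer(value_buf, dtype=np.float64)},
        columns=["id", "value"])

    assert df_rows.equals(df_cols)

In the real thing the buffers would come straight from _load_from_socket rather than being built with np.arange, but the point is that the per-row Python objects (and the Python-level loop that boxes them) disappear, which is exactly the cost Cython or direct serialisation would remove.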