hi all,

I recently did an analysis of the performance of toPandas:

summary: http://wesmckinney.com/blog/pandas-and-apache-arrow/
ipython notebook: https://gist.github.com/wesm/0cb5531b1c2e346a0007

One solution I'm planning for this is an alternate serializer for
Spark DataFrames, with an optimized (C++ / Cython) conversion to a
pandas.DataFrame on the Python side:

https://issues.apache.org/jira/browse/SPARK-13534

I'm happy to discuss in more detail with those interested. The basic
idea is that deserializing binary data directly into NumPy arrays is
what you want, but you need some array-oriented / columnar memory
representation to push over the wire. Apache Arrow is designed
specifically for this use case.
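
To make this concrete, here's a rough sketch of what the receiving end
could look like with a pyarrow-style API (illustrative only; the actual
Spark integration is what SPARK-13534 is about):

    import pyarrow as pa

    # Stand-in for columnar data produced on the JVM side.
    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    # Serialize with the Arrow streaming IPC format; this is the kind of
    # payload that would be pushed over the wire instead of pickled Rows.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()

    # Deserializing into pandas is then a column-at-a-time conversion
    # done in native code, with no Python-level iteration over rows.
    df = pa.ipc.open_stream(buf).read_all().to_pandas()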

best,
Wes

On Tue, Mar 22, 2016 at 7:11 AM, Mark Vervuurt <m.a.vervu...@gmail.com> wrote:
> Hi Josh,
>
> The workaround we figured out for the network latency and out-of-memory
> problems with the toPandas method was to create pandas DataFrames or NumPy
> arrays per partition using mapPartitions. Maybe a standard solution along
> this line of thought could be built; the integration is quite tedious ;)
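>
> For reference, a minimal sketch of that pattern (the function name is our
> own, and the exact memory behaviour depends on your data):
>
>     import pandas as pd
>
>     def to_pandas_partitions(df):
>         """Collect a Spark DataFrame as one pandas DataFrame per
>         partition, avoiding one giant list of Rows on the driver."""
>         columns = df.columns
>
>         def partition_to_pdf(rows):
>             # Build the pandas DataFrame on the executor, so only
>             # already-packed frames travel back to the driver.
>             yield pd.DataFrame([r.asDict() for r in rows],
>                                columns=columns)
>
>         return df.rdd.mapPartitions(partition_to_pdf).collect()
>
>     # pdfs = to_pandas_partitions(spark_df)
>     # local = pd.concat(pdfs, ignore_index=True)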
>
> I hope this helps.
>
> Regards,
> Mark
>
> On 22 Mar 2016, at 13:40, Josh Levy-Kramer <j...@starcount.com> wrote:
>
> Hi,
>
> A common pattern in my work is querying large tables as Spark DataFrames and
> then doing more detailed analysis locally once the data fits into memory.
> However, I've hit a few blockers: in Scala no well-developed DataFrame
> library exists, and in Python the `toPandas` function is very slow. As
> pandas is one of the best DataFrame libraries out there, it may be worth
> spending some time making the `toPandas` method more efficient.
>
> Having a quick look at the code, it looks like a lot of iteration occurs on
> the Python side. Python is really slow at iterating over large loops and
> this should be avoided; if iteration does have to occur, it's best done in
> Cython. Has anyone looked at Cythonising the process, or, even better,
> serialising directly into NumPy arrays instead of the intermediate lists of
> Rows?
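>
> To illustrate the difference (toy example, not the actual Spark wire
> format):
>
>     import numpy as np
>
>     n = 1000000
>
>     # Shape of the problem today: a list of Row-like tuples that must
>     # be walked in pure Python to build each column.
>     rows = [(i, float(i)) for i in range(n)]
>     ids_from_rows = np.array([r[0] for r in rows], dtype=np.int64)
>
>     # If a column instead arrived as one contiguous binary buffer, the
>     # conversion is a single zero-copy view with no per-row iteration.
>     wire_bytes = np.arange(n, dtype=np.int64).tobytes()  # fake payload
>     ids_from_buffer = np.frombuffer(wire_bytes, dtype=np.int64)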
>
> Here are some links to the current code:
>
> topandas:
> https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L1342
>
> collect:
> https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L233
>
> _load_from_socket:
> https://github.com/apache/spark/blob/a60f91284ceee64de13f04559ec19c13a820a133/python/pyspark/rdd.py#L123
>
> Josh Levy-Kramer
> Data Scientist @ Starcount
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
