Hi all,

Wes, I read your thread earlier today, after I sent my message, and it's
exciting to have someone of your caliber working on the issue :)

As a short-term solution, I've created a Gist which performs the toPandas
operation using the mapPartitions method suggested by Mark:
https://gist.github.com/joshlk/871d58e01417478176e7
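
The core of the approach looks roughly like this (the Gist has the full
version; the names here are illustrative rather than its exact code, and
`df` is assumed to be an existing Spark DataFrame):

    import pandas as pd

    def _map_to_pandas(rows):
        # Build one pandas DataFrame per partition, on the executors.
        return [pd.DataFrame(list(rows))]

    def to_pandas(df, n_partitions=None):
        # Optionally repartition so each chunk comfortably fits in memory.
        if n_partitions is not None:
            df = df.repartition(n_partitions)
        # Collect a handful of DataFrames instead of millions of Rows,
        # then stitch them together on the driver.
        parts = df.rdd.mapPartitions(_map_to_pandas).collect()
        result = pd.concat(parts)
        # Rows arrive as plain tuples, so restore the column names.
        result.columns = df.columns
        return result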

Regards,
Josh

On 22 March 2016 at 14:55, Wes McKinney <w...@cloudera.com> wrote:

> hi all,
>
> I recently did an analysis of the performance of toPandas:
>
> summary: http://wesmckinney.com/blog/pandas-and-apache-arrow/
> ipython notebook: https://gist.github.com/wesm/0cb5531b1c2e346a0007
>
> One solution I'm planning for this is an alternate serializer for
> Spark DataFrames, with an optimized (C++ / Cython) conversion to a
> pandas.DataFrame on the Python side:
>
> https://issues.apache.org/jira/browse/SPARK-13534
>
> I'm happy to discuss in more detail with those interested. The basic
> idea is that deserializing binary data directly into NumPy arrays is
> what you want, but you need some array-oriented / columnar memory
> representation to push over the wire. Apache Arrow is designed
> specifically for this use case.
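>
> As a toy illustration of the principle (plain NumPy here, not Arrow's
> actual API): if two int64 columns arrive over the wire as one contiguous
> columnar buffer, the Python side can wrap them with no per-row loop at
> all:
>
> import numpy as np
> import pandas as pd
>
> n = 1000
> # Pretend this payload arrived from the JVM: column 'a' followed by
> # column 'b', each one n contiguous int64 values.
> payload = np.arange(2 * n, dtype='int64').tobytes()
>
> # "Deserializing" is just reinterpreting bytes -- no Python iteration.
> a = np.frombuffer(payload, dtype='int64', count=n)
> b = np.frombuffer(payload, dtype='int64', count=n, offset=8 * n)
> pdf = pd.DataFrame({'a': a, 'b': b})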
>
> best,
> Wes
>
> On Tue, Mar 22, 2016 at 7:11 AM, Mark Vervuurt <m.a.vervu...@gmail.com>
> wrote:
> > Hi Josh,
> >
> > The workaround we figured out to solve network latency and out-of-memory
> > problems with the toPandas method was to create pandas DataFrames or
> > NumPy arrays for each partition using mapPartitions. Maybe a standard
> > solution could be built along this line of thought. The integration is
> > quite tedious ;)
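> >
> > For what it's worth, the NumPy flavour can be sketched roughly like
> > this (assuming a purely numeric Spark DataFrame `df`; the pandas
> > flavour is in the Gist Josh linked above):
> >
> > import numpy as np
> >
> > def _partition_to_array(rows):
> >     # One 2-D array per partition, built on the executors.
> >     rows = list(rows)
> >     if rows:  # skip empty partitions
> >         yield np.array(rows, dtype='float64')
> >
> > arrays = df.rdd.mapPartitions(_partition_to_array).collect()
> > data = np.concatenate(arrays)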
> >
> > I hope this helps.
> >
> > Regards,
> > Mark
> >
> > On 22 Mar 2016, at 13:40, Josh Levy-Kramer <j...@starcount.com> wrote:
> >
> > Hi,
> >
> > A common pattern in my work is querying large tables with Spark
> > DataFrames and then needing to do more detailed analysis locally once
> > the data can fit into memory. However, I've hit a few blockers. In
> > Scala no well-developed DataFrame library exists, and in Python the
> > `toPandas` function is very slow. As pandas is one of the best
> > DataFrame libraries out there, it may be worth spending some time on
> > making the `toPandas` method more efficient.
> >
> > Having a quick look at the code, it looks like a lot of iteration is
> > occurring on the Python side. Python is really slow at iterating over
> > large loops and this should be avoided. If iteration does have to
> > occur, it's best done in Cython. Has anyone looked at Cythonising the
> > process? Or, even better, at serialising directly to NumPy arrays
> > instead of building intermediate lists of Rows?
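> >
> > To make the contrast concrete, here is a rough, made-up illustration
> > (not Spark's actual code path): feeding pandas a list of Row-like
> > tuples forces a per-row Python pass, while column-oriented arrays are
> > wrapped in one step per column:
> >
> > import numpy as np
> > import pandas as pd
> >
> > n = 1000000
> > rows = [(i, i) for i in range(n)]   # stand-in for collected Rows
> >
> > # Row-oriented: pandas walks the Python list row by row.
> > slow = pd.DataFrame(rows, columns=['a', 'b'])
> >
> > # Column-oriented: each column is one C-level wrap, no Python loop.
> > a = np.arange(n, dtype='int64')
> > fast = pd.DataFrame({'a': a, 'b': a})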
> >
> > Here are some links to the current code:
> >
> > toPandas:
> > https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L1342
> >
> > collect:
> > https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L233
> >
> > _load_from_socket:
> > https://github.com/apache/spark/blob/a60f91284ceee64de13f04559ec19c13a820a133/python/pyspark/rdd.py#L123
> >
> > Josh Levy-Kramer
> > Data Scientist @ Starcount
> >
> >
>



-- 


Josh Levy-Kramer
Data Scientist


+44 (0) 203 770 7554
+44 (0) 781 797 0736
Henry Wood House, 2 Riding House Street, W1W 7FA London

www.starcount.com

