Re: Python UDF performance at large scale

2015-06-25 Thread Davies Liu
I'm thinking that the batched synchronous version will either be too slow (with a small batch size) or prone to OOM (with a large batch size). If it's not that hard, you can give it a try.
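A toy sketch (plain Python, not Spark internals; all names are made up) of the trade-off being described: in a batched synchronous scheme each batch is one blocking round trip to the Python worker, so a small batch size means many round trips while a large one means many rows buffered at once. The worker is simulated here by calling the UDF in-process.

```python
def evaluate_in_batches(rows, udf, batch_size):
    # One synchronous round trip to the worker per batch; peak buffering
    # is one batch of rows, round trips are roughly len(rows) / batch_size.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            for result in [udf(r) for r in batch]:  # blocking round trip
                yield result
            batch = []
    if batch:
        for result in [udf(r) for r in batch]:
            yield result

# batch_size=1: one round trip per row (slow); batch_size=len(rows):
# the whole partition buffered at once (memory risk).
print(list(evaluate_in_batches(range(10), lambda x: x * x, batch_size=4)))
```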

Re: Python UDF performance at large scale

2015-06-25 Thread Justin Uang
Sweet, filed here: https://issues.apache.org/jira/browse/SPARK-8632

Re: Python UDF performance at large scale

2015-06-24 Thread Davies Liu
Fair points, I also like simpler solutions. The overhead of a Python task could be a few milliseconds, which means we should also evaluate them as batches (one Python task per batch). Decreasing the batch size for UDFs sounds reasonable to me, together with other tricks to reduce the data in
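A back-of-the-envelope illustration of why that per-call overhead forces batching; the only number taken from the message above is the few-milliseconds figure, the rest are assumptions for the sake of the arithmetic.

```python
# Rough arithmetic: a few milliseconds of overhead per Python call-out
# dominates unless many rows share one call.
overhead_per_call_s = 0.003          # assumed ~3 ms per round trip
rows_per_partition = 10 * 1000 * 1000

for batch_size in (1, 100, 10000):
    calls = rows_per_partition / float(batch_size)
    print(batch_size, calls * overhead_per_call_s, "seconds of overhead")
# batch_size=1 spends ~30,000 s on overhead alone; 100 cuts that to ~300 s.
```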

Re: Python UDF performance at large scale

2015-06-24 Thread Punyashloka Biswal
Hi Davies, In general, do we expect people to use CPython only for heavyweight UDFs that invoke an external library? Are there any examples of using Jython, especially performance comparisons to Java/Scala and CPython? When using Jython, do you expect the driver to send code to the executor as a
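For concreteness, a sketch of the "heavyweight UDF that invokes an external library" case (the column name, data, and numpy computation are made up; `sqlContext` is the usual shell context of that era). Because numpy is a CPython extension module, a UDF like this cannot run under Jython; only pure-Python UDFs would be Jython candidates.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import numpy as np

# A UDF that needs a C-extension library, so it is tied to CPython.
log_norm = udf(lambda xs: float(np.linalg.norm(np.log1p(np.array(xs)))),
               DoubleType())

df = sqlContext.createDataFrame([([1.0, 2.0, 3.0],)], ["features"])
df.select(log_norm(df["features"]).alias("log_norm")).show()
```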

Re: Python UDF performance at large scale

2015-06-24 Thread Justin Uang
Correct, I was running with a batch size of about 100 when I did the tests, because I was worried about deadlocks. Do you have any concerns regarding the batched synchronous version of communication between the Java and Python processes, and if not, should I file a ticket and start writing it?

Re: Python UDF performance at large scale

2015-06-24 Thread Davies Liu
From your comment, the 2x improvement only happens when the batch size is 1, right? On Wed, Jun 24, 2015 at 12:11 PM, Justin Uang justin.u...@gmail.com wrote: FYI, I just submitted a PR to Pyrolite to remove their StopException: https://github.com/irmen/Pyrolite/pull/30 With my

Re: Python UDF performance at large scale

2015-06-23 Thread Justin Uang
// + punya Thanks for your quick response! I'm not sure that using an unbounded buffer is a good solution to the locking problem. For example, in the situation where I had 500 columns, I am in fact storing 499 extra columns on the Java side, which might make me OOM if I have to store many rows.
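A sketch of the shape of query being described (column names and the UDF are made up; `sqlContext` assumed): the UDF reads only one of the 500 columns, so only that column is shipped to the Python worker, while the other 499 have to be held on the JVM side until the results come back to be joined with them.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# 500-column DataFrame; the UDF touches only c0.
wide = sqlContext.range(0, 1000).selectExpr(
    *["id + {0} as c{0}".format(i) for i in range(500)])

double_c0 = udf(lambda x: x * 2, LongType())
wide.withColumn("c0_doubled", double_c0(wide["c0"])).show(5)
```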

Python UDF performance at large scale

2015-06-23 Thread Justin Uang
BLUF: BatchPythonEvaluation's implementation is unusable at large scale, but I have a proof-of-concept implementation that avoids caching the entire dataset. Hi, We have been running into performance problems using Python UDFs with DataFrames at large scale. From the implementation of
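For context, the operator in question shows up whenever a Python UDF appears in a DataFrame query; a minimal way to see it is below (exact plan formatting depends on the Spark version, and `sqlContext` is assumed).

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

shout = udf(lambda s: s.upper(), StringType())
df = sqlContext.createDataFrame([("spark",), ("python",)], ["word"])
# The physical plan printed here should contain a BatchPythonEvaluation
# node for the UDF (naming per the 1.4-era planner discussed in the thread).
df.select(shout(df["word"])).explain()
```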

Re: Python UDF performance at large scale

2015-06-23 Thread Davies Liu
Thanks for looking into it, I like the idea of having a ForkingIterator. If we have an unlimited buffer in it, then we will not have the deadlock problem, I think. The writing thread will be blocked by the Python process, so there will not be many rows buffered (though that could still be a reason to OOM). At least,
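A hypothetical sketch of the ForkingIterator idea in plain Python (the real implementation would live on the JVM side in Scala; all names here are made up): a writer thread streams only the UDF inputs toward the worker while parking the original rows in an unbounded queue, and the consumer zips the worker's results back with the buffered rows instead of materializing the whole partition.

```python
import threading
from queue import Queue

def forking_evaluate(rows, udf, column):
    to_worker = Queue(maxsize=16)    # bounded, like an OS pipe buffer
    from_worker = Queue(maxsize=16)
    buffered = Queue()               # unbounded: rows waiting for results
    SENTINEL = object()

    def worker():
        # Stand-in for the external Python worker process.
        while True:
            value = to_worker.get()
            if value is SENTINEL:
                from_worker.put(SENTINEL)
                return
            from_worker.put(udf(value))

    def writer():
        # Forks the row iterator: park the full row, ship only the UDF
        # input. Blocks when the "pipe" to the worker is full, which
        # limits how far it runs ahead of the consumer.
        for row in rows:
            buffered.put(row)
            to_worker.put(row[column])
        to_worker.put(SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    threading.Thread(target=writer, daemon=True).start()

    # Zip results back with the buffered rows, preserving order.
    while True:
        result = from_worker.get()
        if result is SENTINEL:
            return
        yield dict(buffered.get(), udf_result=result)

rows = [{"id": i, "payload": "x" * 8} for i in range(5)]
for out in forking_evaluate(rows, lambda v: v * 2, "id"):
    print(out)
```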