[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-29 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-144187766 Thanks for the reminder! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-29 Thread justinuang
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/8662

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-29 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-144180936 @justinuang #8835 is already merged, can you close this PR? thanks!

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-19 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141724819 I think this should work? https://github.com/apache/spark/pull/8835

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-18 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141623393 I took the idea from @rxin to simplify CachedOnceRDD and created #8833; @justinuang, please help review it.

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140994101 Actually never mind - it does. This seems a lot more complicated still - we would need to handle spilling, and then in that case it would strictly be slower due to the

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140993762 If I'm reading it correctly, this is not "appends all the rows into an array when compute() is called for the first time, then pull and remove the rows when compute() is

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140989401 Here are some benchmarks with different approaches: ``` Without fix: Number of udfs: 0 - 0.142887830734 Number of udfs: 1 - 0.948309898376 Number of

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141004821 OK I finally understood what's going on here. If I understand your intention correctly, you are assuming: 1. The 1st read goes into Python 2. The 2nd read can

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141005883 BTW with all of these - I'm starting to think we should just do a quick fix for 1.5.1, and then for 1.6 we would need to rewrite a lot of these in order to use local

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141121225 @rxin what do you mean by local iterators? =) I feel like I'm missing some context that you guys have

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141117878 The solution with the iterator wrapper was my first approach that I prototyped

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141151543 This is the ticket: https://issues.apache.org/jira/browse/SPARK-9983

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141158021 @rxin The reason CachedOnceRDD looks a little more complicated than expected is that the order of the two callers (zip and writer thread) is undefined (they are in two
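
To illustrate the ordering problem described in this comment, here is a minimal pure-Python sketch of a buffer shared by two consumers whose relative order is not known in advance. It is not Spark's CachedOnceRDD; the class name `SharedOnce` and its `reader` method are hypothetical, and the sketch assumes the buffer is allowed to grow without bound (no spilling).

```python
import threading
from collections import deque


class SharedOnce:
    """Hypothetical helper: feed one source iterator to two consumers,
    keeping a row buffered only until the slower consumer has read it."""

    def __init__(self, source):
        self._source = iter(source)
        self._lock = threading.Lock()
        # _pending[i] holds rows the other side pulled that side i has not read yet.
        self._pending = (deque(), deque())

    def reader(self, side):
        other = 1 - side
        while True:
            with self._lock:
                if self._pending[side]:
                    # The other consumer got here first; reuse its row and drop it.
                    row = self._pending[side].popleft()
                else:
                    try:
                        row = next(self._source)
                    except StopIteration:
                        return
                    # Remember the row for the consumer that is behind.
                    self._pending[other].append(row)
            yield row


shared = SharedOnce(range(5))
a, b = shared.reader(0), shared.reader(1)
first = [next(a), next(a)]  # side 0 runs ahead by two rows
rest_b = list(b)            # side 1 then drains the whole input
rest_a = list(a)            # side 0 picks up the rows side 1 pulled for it
```

Either side may lead at any moment, which is why each side needs its own pending queue. If one consumer runs far ahead, the buffer grows with it, which is where the spilling concern raised elsewhere in the thread would come in.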

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141232211 I'm not sure there is a solution that satisfies all the requirements. I can say that this approach addresses 1,2,4 by design. Would you guys support a 1.6.0

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140892897 [Test build #1765 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1765/console) for PR 8662 at commit

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140925636 @justinuang This patch works pretty well on multiple UDFs, but I have two concerns before reviewing the details: 1) it has some overhead for each batch, cause some

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140930587 Hey @davies - I was not involved with Python UDFs earlier, but why does calling udfs require caching? Isn't it really bad if the partition is large? It doesn't

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140941158 [Test build #1766 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1766/console) for PR 8662 at commit

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140920743 @davies how do I have a private class in python? In addition, is it possible that the failing unit test is flaky? I ran ./run-tests

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140934813 After offline discussion with @rxin @marmbrus , we can try another approach by passing the entire row into python worker (it increases the IO between JVM and python, but
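
The comment above is truncated, so the following is only one reading of "passing the entire row into python worker": the worker receives whole rows and returns each row with the UDF result appended, so the caller does not have to keep the input around to pair it with the output. The function name `eval_udf_with_full_rows` and its parameters are hypothetical, and rows are modelled as plain tuples rather than Spark rows.

```python
def eval_udf_with_full_rows(rows, udf, arg_indices):
    """Hypothetical sketch: yield each input row with the UDF result
    appended as one extra trailing column."""
    for row in rows:
        # Pick out only the columns the UDF actually needs as arguments.
        args = [row[i] for i in arg_indices]
        yield tuple(row) + (udf(*args),)


rows = [(1, "a"), (2, "b")]
out = list(eval_udf_with_full_rows(rows, lambda x: float(x) * 2, [0]))
```

The trade-off is the one the comment names: every column crosses the JVM/Python boundary, including those the UDF never reads.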

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140936346 Hey davies, I think the performance regression for a single UDF may be because there were multiple threads per task that could potentially be taking up CPU time. I

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140922077 [Test build #1766 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1766/consoleFull) for PR 8662 at commit

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140839951 [Test build #1765 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1765/consoleFull) for PR 8662 at commit

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-15 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140413982 Jenkins, retest this please

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-15 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140416126 @rxin or @davies why isn't this automatically retriggering when I push a new commit? Also, looks like the "retest this please" only works with committers.

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-14 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140223207 Looks like your intuition was right. The second time it's slightly faster, so I ran the loop twice and took the second run's numbers. Here are the updated numbers

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-14 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140187364 @justinuang Thanks for the numbers, they look interesting. The result with no udfs looks strange, could you run each query twice and use the best (or second) result?
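
A minimal sketch of the "run twice and use the best result" measurement suggested here, so that one-off warm-up cost does not distort the numbers. The helper name `best_of` and the stand-in workload `fake_query` are hypothetical; in the thread the workload would be a DataFrame query with some number of Python UDFs.

```python
import time


def best_of(run_query, repeats=2):
    """Hypothetical helper: call run_query `repeats` times and return the
    fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return min(timings)


def fake_query():
    # Stand-in workload so the sketch runs without a Spark cluster.
    return sum(i * i for i in range(10_000))


elapsed = best_of(fake_query, repeats=2)
```

Taking the minimum reports the least-disturbed run; using only the second run, the other option mentioned, would instead mean returning `timings[-1]`.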

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-14 Thread punya
Github user punya commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140181697 :+1:

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-14 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140181530 Sorry for the delay, here is the code I ran and here are the results from pyspark.sql.functions import udf from pyspark.sql.types import

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139330946 @justinuang Thanks for working on this, do you have any numbers on the performance improvements from this?

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139331466 Hey davies, I don't have any numbers. Are there any benchmarks that we can just rerun?

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139332197 Unfortunately, we haven't done any benchmarks for Python UDFs yet, can you do one or two simple cases?

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139330955 [Test build #1737 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1737/consoleFull) for PR 8662 at commit

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139345675 I think something like this could be enough: ``` from pyspark.sql.functions import udf from pyspark.sql.types import DoubleType tofloat = udf(lambda x: float(x), DoubleType()) df =

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139372040 [Test build #1737 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1737/console) for PR 8662 at commit

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139332688 Is there an example of another benchmark? I'm not sure where they're stored for python

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-09 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-138797144 Jenkins, ok to test.

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-09 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139023500 Should the build have started by now?

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-09 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139024190 Yes - Mr. Jenkins doesn't like me anymore.

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-09 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139028639 [Test build #1731 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1731/console) for PR 8662 at commit

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-09 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139027986 [Test build #1731 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1731/consoleFull) for PR 8662 at commit

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-138758512 Can one of the admins verify this patch?

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-08 Thread justinuang
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/8662 [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF performance because of RDD caching - I wanted to reuse most of the logic from PythonRDD, so I pulled out two methods,

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-08 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-138758861 @davies @JoshRosen @rxin