Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-144187766
Thanks for the reminder!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is
Github user justinuang closed the pull request at:
https://github.com/apache/spark/pull/8662
---
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-144180936
@justinuang #8835 is already merged, can you close this PR? thanks!
---
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141724819
I think this should work? https://github.com/apache/spark/pull/8835
---
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141623393
I took the idea from @rxin to simplify CachedOnceRDD and created #8833;
@justinuang, please help review it.
---
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140994101
Actually never mind - it does. This seems a lot more complicated still - we
would need to handle spilling, and then in that case it would strictly be
slower due to the
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140993762
If I'm reading it correctly, this is not "appends all the rows into an
array when compute() is called for the first time, then pull and remove the
rows when compute() is
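For reference, the buffering scheme quoted here (append every row to a buffer the first time compute() is called, then pull and remove rows on the next pass) can be sketched in plain Python. All names below are hypothetical, and the sketch ignores Spark's actual compute() contract:

```python
from collections import deque

# Hypothetical sketch of the buffer-once pattern under discussion: the first
# pass appends every row it emits into a deque, and the second pass drains
# that deque instead of recomputing the partition.
class BufferOnce:
    def __init__(self, rows):
        self._rows = iter(rows)
        self._buffer = deque()

    def first_pass(self):
        for row in self._rows:
            self._buffer.append(row)  # remember the row for the second pass
            yield row

    def second_pass(self):
        while self._buffer:
            yield self._buffer.popleft()  # pull and remove, freeing memory

buf = BufferOnce(range(3))
print(list(buf.first_pass()))   # → [0, 1, 2]
print(list(buf.second_pass()))  # → [0, 1, 2]
```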
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140989401
Here are some benchmarks with different approaches:
```
Without fix:
Number of udfs: 0 - 0.142887830734
Number of udfs: 1 - 0.948309898376
Number of
```
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141004821
OK I finally understood what's going on here. If I understand your
intention correctly, you are assuming:
1. The 1st read goes into Python
2. The 2nd read can
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141005883
BTW with all of these - I'm starting to think we should just do a quick fix
for 1.5.1, and then for 1.6 we would need to rewrite a lot of these in order to
use local
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141121225
@rxin what do you mean by local iterators =) I feel like I'm missing some
context that you guys have.
---
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141117878
The solution with the iterator wrapper was my first approach that I
prototyped
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141151543
This is the ticket: https://issues.apache.org/jira/browse/SPARK-9983
---
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141158021
@rxin The reason CachedOnceRDD looks a little bit more complicated than expected
is that the order of the two callers (zip and writer thread) is undefined (they are
in two
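The coordination issue @davies describes (a writer thread and the zip consumer arriving in an undefined order) can be illustrated with a minimal, hypothetical Python sketch; the real CachedOnceRDD lives in Spark's Scala code and is not reproduced here:

```python
import threading
from queue import Queue

SENTINEL = object()

# Hypothetical sketch: a writer thread pushes each row into a synchronized
# queue as it streams the partition, and the consumer (e.g. the zip side)
# pulls rows back out. The queue makes the handoff safe no matter which of
# the two callers runs first.
def cache_once(rows):
    q = Queue()

    def writer():
        for row in rows:
            q.put(row)      # buffer the row for the second reader
        q.put(SENTINEL)     # signal end of partition

    threading.Thread(target=writer).start()

    def reader():
        while True:
            row = q.get()   # blocks until the writer catches up
            if row is SENTINEL:
                return
            yield row

    return reader()

print(list(cache_once(range(5))))  # → [0, 1, 2, 3, 4]
```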
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141232211
I'm not sure there is a solution that satisfies all the requirements. I can
say that this approach addresses 1, 2, and 4 by design.
Would you guys support a 1.6.0
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140892897
[Test build #1765 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1765/console)
for PR 8662 at commit
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140925636
@justinuang This patch works pretty well with multiple UDFs, but I have two
concerns before reviewing the details: 1) it has some overhead for each batch,
causing some
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140930587
Hey @davies - I was not involved with Python UDFs earlier, but why does
calling UDFs require caching? Isn't it really bad if the partition is large?
It doesn't
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140941158
[Test build #1766 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1766/console)
for PR 8662 at commit
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140920743
@davies how do I make a private class in Python?
In addition, is it possible that the failing unit test is flaky? I ran
./run-tests
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140934813
After offline discussion with @rxin @marmbrus, we can try another approach
by passing the entire row into the Python worker (it increases the IO between
the JVM and Python, but
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140936346
Hey davies, I think the performance regression for a single UDF may be
because there were multiple threads per task that could potentially be taking
up CPU time. I
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140922077
[Test build #1766 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1766/consoleFull)
for PR 8662 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140839951
[Test build #1765 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1765/consoleFull)
for PR 8662 at commit
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140413982
Jenkins, retest this please
---
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140416126
@rxin or @davies why is this not automatically retriggering when I push a
new commit? Also, it looks like "retest this please" only works for
committers.
---
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140223207
Looks like your intuition was right. The second run is slightly faster,
so I ran the loop twice and took the second run's numbers.
Here are the updated numbers
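A timing harness along these lines (a hypothetical pure-Python stand-in; the actual benchmark ran queries through Spark) illustrates the run-twice, keep-the-second-pass pattern:

```python
import time

# Hypothetical stand-in for the benchmarked query: apply a chain of trivial
# Python UDFs over a list of rows (the real runs went through a DataFrame).
def run_query(rows, num_udfs):
    out = rows
    for _ in range(num_udfs):
        out = [float(x) for x in out]
    return out

def benchmark(rows, num_udfs, passes=2):
    # Run the loop `passes` times and keep the last timing, so the first
    # (warm-up) pass does not skew the numbers.
    timings = []
    for _ in range(passes):
        start = time.time()
        run_query(rows, num_udfs)
        timings.append(time.time() - start)
    return timings[-1]

rows = list(range(100000))
for n in range(3):
    print("Number of udfs: %d - %s" % (n, benchmark(rows, n)))
```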
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140187364
@justinuang Thanks for the numbers, they look interesting. The result with no
UDFs looks strange; could you run each query twice and use the best (or second)
result?
---
Github user punya commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140181697
:+1:
---
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140181530
Sorry for the delay, here is the code I ran and here are the results:
```
from pyspark.sql.functions import udf
from pyspark.sql.types import
```
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139330946
@justinuang Thanks for working on this, do you have some numbers on the
performance improvement from this?
---
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139331466
Hey davies, I don't have any numbers. Are there any benchmarks that we can
just rerun?
---
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139332197
Unfortunately, we haven't done any benchmarks for Python UDFs yet; can you do
one or two simple cases?
---
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139330955
[Test build #1737 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1737/consoleFull)
for PR 8662 at commit
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139345675
I think something like this could be enough:
```
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

tofloat = udf(lambda x: float(x), DoubleType())
df =
```
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139372040
[Test build #1737 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1737/console)
for PR 8662 at commit
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139332688
Is there an example of another benchmark? I'm not sure where they're stored
for Python.
---
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-138797144
Jenkins, ok to test.
---
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139023500
Should the build have started by now?
---
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139024190
Yes - Mr. Jenkins doesn't like me anymore.
---
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139028639
[Test build #1731 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1731/console)
for PR 8662 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-139027986
[Test build #1731 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1731/consoleFull)
for PR 8662 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-138758512
Can one of the admins verify this patch?
---
GitHub user justinuang opened a pull request:
https://github.com/apache/spark/pull/8662
[SPARK-8632] [SQL] [PYSPARK] Poor Python UDF performance because of RDD caching
- I wanted to reuse most of the logic from PythonRDD, so I pulled out
two methods,
Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-138758861
@davies @JoshRosen @rxin
---