Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@wzhfy @scwf Thanks for the comments.
Until we figure out a way to avoid the defect, I will close this.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@wzhfy Thanks for the comment.
I meant that we can deal with just the cases where we are confident that the total rows in all partitions must be much larger than the limit number. I am not sure if cbo
Github user wzhfy commented on the issue:
https://github.com/apache/spark/pull/16633
Hi @viirya , the main concern of @scwf is that we can't afford a performance regression in any customer scenario. I think you can understand that :)
I went through the discussion above, it
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
@viirya I suggest fixing case 2 in this PR; let's wait for some comments on case 1. /cc @rxin and @wzhfy, who may comment on the first case.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
We don't need an accurate number; we can have a confidence margin.
The downside of breaking the RDD chain is re-processing the rows. Anything else?
I don't think it is worth changing core and
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
For case 1, my idea is not to use the proposal in this PR.
1. How do you determine `total rows in all partitions are (much) more than limit number` and then go into this code path, and how do you decide the
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
Ok. I think it is clearer now. We have two cases to solve:
1. After local limit, total rows in all partitions are (much) more than the limit number.
2. After local limit, total rows
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
all partitions after local limit are about/nearly 100,000,000 rows
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
Do you mean the total rows in all partitions after local limit are about/nearly 100,000,000 rows? Or that each partition after local limit has about/nearly 100,000,000 rows?
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
Again, to be clear, I am against the performance regression in the following case:
0. the limit number is 100,000,000
1. the original table's row count is very big, much larger than 100,000,000 rows
2. after
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
That is why I propose to avoid shuffling to a single partition. We can save shuffling and keep parallelism. So I don't see what you are against.
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
I think shuffle is ok, but shuffling to one partition leads to the performance issue.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf So it sounds like the problem is the shuffling.
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
Assume local limit outputs 100,000,000 rows; then in global limit they will all be taken in a single partition, so it is very slow and cannot use other free cores to improve parallelism.
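To make the single-partition bottleneck concrete, here is a toy cost model (illustrative only, not Spark code; the core count is an assumption) where task time is taken as proportional to the rows a task processes:

```python
# Toy cost model: shuffling all local-limit output into one reduce partition
# serializes the whole global limit on a single task, however many cores exist.
def task_time_units(rows_per_task):
    return rows_per_task  # assume 1 time unit per row processed

local_limit_output = 100_000_000  # rows, as in the example above
cores = 32                        # assumed number of free cores

single_partition = task_time_units(local_limit_output)       # one task does it all
spread = task_time_units(local_limit_output // cores)        # work split evenly

# Spreading the rows over all cores cuts the critical path by the core count.
print(single_partition, spread)
```

Under this (simplistic) model, the single-partition plan's critical path is `cores` times longer, which is exactly the regression being discussed.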
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf I am not sure you have really thought this through. Can you describe the single-partition issue based on your understanding?
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
@viirya my team member posted to the mailing list; actually we mean the case I listed above. The main issue is the single-partition issue in global limit; if in that case you fall back to the old global limit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
That case only happens when the total row count across all partitions is less than or (nearly) equal to the limit number. So it needs to scan (almost) all partitions.
One possible way to deal
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
I think the local limit cost is important. Assume the number of recomputed partitions is m and the total number of partitions is n.
m = 1, n = 100 is a positive case, but there are also cases where m is very close to n (even m = n).
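The m-versus-n trade-off can be sketched with a toy model (assumptions: uniform per-partition cost, and the bad case pays a full second pass over the m scanned partitions):

```python
# Old plan: compute all n partitions once.
def old_plan_cost(n, c=1.0):
    return n * c

# Incremental plan: compute m partitions; in the bad case, recompute those m.
def incremental_cost(m, c=1.0, recompute=False):
    return (2 * m if recompute else m) * c

# m = 1, n = 100: the positive case -- far cheaper even with recompute.
assert incremental_cost(1, recompute=True) < old_plan_cost(100)
# m = n = 100: the bad case -- recompute makes it costlier than the old plan.
assert incremental_cost(100, recompute=True) > old_plan_cost(100)
```

This is the crux of the disagreement: whether m is usually small relative to n, which neither side can decide without statistics.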
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
Your proposal avoids the cost of computing and shuffling all partitions for local limit, but introduces recomputation of some partitions in the local limit stage.
We cannot decide which cost is cheaper (in
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf You still don't get the point. Although a few partitions need to be recomputed in local limit, most of the other partitions are saved from computation. In most cases it is worth it. You can refer to my
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf That is fair. Anyway, I don't think a proposal that can't improve any of the issues is worth so many requested changes...
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
I think before comparing our proposals, we should first make sure our proposal will not bring a performance regression.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf I understand your point. But the main issue is that you can't save both the local limit cost and the shuffling cost. You still need to process all rows in all partitions and shuffle (some of) them to
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
I don't get you, but let me explain more.
If we use map output statistics to decide how many elements each global limit should take:
1. local limit shuffles with the mailing-list partitioner and returns
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf Even if it works in the end, I don't think it is better in performance.
Simply calculate it. Assume the limit number is `n`, the partition number is `N`, and each partition has `n / r` rows in
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
We need to define a new map output statistics to do this.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf I don't think it would work. Map output statistics is just an approximate number of output bytes. You can't use it to get a correct row number.
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
Yes, you are right, we cannot ensure a uniform distribution for global limit.
An idea is to not use a special partitioner; after the shuffle we should get the map output statistics for the row number of
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16633
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71633/
Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16633
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16633
**[Test build #71633 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)**
for PR 16633 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf No. A simple example: there are 5 local limits which produce 1, 2, 1, 1, 1 rows when the limit is 10. If you shuffle to 5 partitions, the distributions for each local limit look like:
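A small sketch of how such a shuffle could skew (assuming each map task assigns its i-th row to partition `i % k`, a round-robin done independently per map task, which is my reading of the proposal):

```python
# Count the rows landing in each reduce partition when every map task
# round-robins its own output independently.
def reduce_counts(rows_per_map, k):
    counts = [0] * k
    for n in rows_per_map:
        for i in range(n):
            counts[i % k] += 1
    return counts

# 5 local limits producing 1, 2, 1, 1, 1 rows, shuffled to 5 partitions:
counts = reduce_counts([1, 2, 1, 1, 1], 5)
assert counts == [5, 1, 0, 0, 0]  # heavily skewed, not uniform
```

Because every map task starts numbering from 0, tiny map outputs pile onto the low-numbered partitions instead of spreading evenly.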
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
Refer to the mailing list:
> One issue left is how to decide the shuffle partition number.
We can have a config of the maximum number of elements for each GlobalLimit task to process,
then do a
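The capped-rows-per-task idea could be sketched as follows (the config name in the comment is made up for illustration; it is not a real Spark config):

```python
import math

# Derive the shuffle partition count from the limit and a per-task row cap.
def global_limit_partitions(limit, max_rows_per_task):
    return max(1, math.ceil(limit / max_rows_per_task))

MAX_ROWS_PER_TASK = 10_000_000  # hypothetical "spark.sql.limit.maxRowsPerTask"
assert global_limit_partitions(100_000_000, MAX_ROWS_PER_TASK) == 10
assert global_limit_partitions(5_000, MAX_ROWS_PER_TASK) == 1  # small limit: one task
```

A large limit then fans out across several GlobalLimit tasks instead of a single partition, while small limits keep the cheap single-task plan.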
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16633
**[Test build #71633 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)**
for PR 16633 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf
> it uses a special partitioner to do this; the partitioner, like row_number in SQL, gives each row a uniform partition id, so in the reduce task, each task handles a number of rows very
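The quoted "row_number-like" partitioner idea might look like this (illustrative only, not Spark's Partitioner API): a task numbers its output rows and assigns row i to partition `i % k`, aiming for near-equal reduce-side row counts.

```python
class RowNumberPartitioner:
    """Assigns consecutive row numbers to partitions round-robin."""
    def __init__(self, k):
        self.k = k
        self.i = -1  # row counter

    def next_partition(self):
        self.i += 1
        return self.i % self.k

p = RowNumberPartitioner(4)
assigned = [p.next_partition() for _ in range(8)]
assert assigned == [0, 1, 2, 3, 0, 1, 2, 3]  # uniform within a single task
```

Note this is only uniform within one task's output; as the 1, 2, 1, 1, 1 example elsewhere in the thread shows, it can skew across many small map outputs.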
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
To be clear, we now have these issues:
1. Local limit computes all partitions; that means it launches many tasks, but actually maybe a few small tasks would be enough.
2. Global limit's single partition
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16633
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71627/
Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16633
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16633
**[Test build #71627 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71627/testReport)**
for PR 16633 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf The main issue the user posted on the mailing list is that the limit is big enough, or the partition number is big enough, to cause a performance bottleneck in shuffling the data of the local limit. But
Github user scwf commented on the issue:
https://github.com/apache/spark/pull/16633
@viirya @rxin I support the idea of @wzhfy on the mailing list
http://apache-spark-developers-list.1001551.n3.nabble.com/Limit-Query-Performance-Suggestion-td20570.html,
it solves the single partition
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@rxin Even if it breaks the RDD job chain, I think it is still useful in some cases, for example, when the number of partitions is big and you only need one or a few partitions to satisfy the limit.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@rxin OK. I see what you mean about breaking the RDD job chain.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16633
**[Test build #71627 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71627/testReport)**
for PR 16633 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@rxin Can you explain it more? I don't get it. Why does it break the RDD job chain?
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/16633
This breaks the RDD job chain, doesn't it?
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16633
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16633
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71605/
Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16633
**[Test build #71605 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71605/testReport)**
for PR 16633 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16633
**[Test build #71605 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71605/testReport)**
for PR 16633 at commit