[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-20 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @wzhfy @scwf Thanks for comment. Until we have a way to figure out how to avoid the defect, I will close this. --- If your project is set up for it, you can reply to this email and have

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-20 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @wzhfy Thanks for comment. I meant that we can just deal with the cases we are confident that the all rows in all partitions must be much larger than the limit number. I am not sure if cbo

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/16633 Hi @viirya , the main concern of @scwf is that, we can't afford performance regression in any customer scenarios. I think you can understand that :) I went through the discussion above, it

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 @viirya i suggest fix the 2 in this pr, let's wait some comment on 1. /cc @rxin and @wzhfy who may comment on the first case. --- If your project is set up for it, you can reply to this email and

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 we don't need accurate number. we can have a confident margin. the bad with broken rdd chain is re-processing the rows. anything else? I don't think it is worth changing core and

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 For 1, my idea is not use the proposal in this PR, 1. how you determine `total rows in all partitions are (much) more than limit number.` and then go into this code path and how to decide the

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 Ok. I think it is clearer now. We have two cases needed to solve: 1. After local limit, total rows in all partitions are (much) more than limit number. 2. After local limit, total rows

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 all partitions after local limit are about/nearly 100,000,000 rows --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 Do you mean totally rows in all partitions after local limit are about/nearly 100,000,000 rows? Or each partition after local limit has about/nearly 100,000,000 rows? --- If your project is set up

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 Again, to clean, I am against the performance regression in flowing case 0. limit num is 100,000,000 1. the original table rows is very big, much larger than 100,000,000 rows 2. after

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 That is why I propose to avoid shuffling to single partition. We can save shuffling and keep parallelism. So I don't know what you are against? --- If your project is set up for it, you can reply

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 I think shuffle is ok, but shuffle to one partition leads to the performance issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf So sounds like it is the problem of shuffling. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 Assume local limit output 100,000,000 rows, then in global limit it will be take in a single partition, so it is very slow and can not use other free cores to improve the parallelism. --- If your

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf I am not sure if you really think about this. Can you describe the single partition issue based on your understanding? --- If your project is set up for it, you can reply to this email and

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 @viirya my team member post the mail list, actually we mean the case i listed above, the main issue is the single partition issue in global limit, if in that case you fall back to old global limit

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 That case only happens when the all row counts in all partitions are less than or (nearly) equal to the limit number. So it needs to scan (almost) all partitions. One possible way to deal

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 I think the local limit cost is important, we assume recompute partions number: m, all the partitions: n m = 1, n =100 is a positive case, but there also cases that m very close to n(even m = n).

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 Your proposal avoid the cost of all partitions compute and shuffle for local limit but introduce some partitions recompute for local limit stage. We can not decide which cost is cheaper(in

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf You still don't get the point. Although few partitions need to recompute in local limit, most of other partitions are saved from computation. In most cases, it is worth. You can refer to my

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf it is fair. anyway, i don't think a proposal can't improve any point of the issues is worth so many requested changing... --- If your project is set up for it, you can reply to this email and

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 I think before compare our proposals , we should first make sure our proposal will not bring performance regression. --- If your project is set up for it, you can reply to this email and have your

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf I understand your point. But the main issue is, you can't save the local limit cost and the shuffling cost. You still need to process all rows in all partitions and shuffle (some of) them to

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 Not get you, but let me explain more, If we use map output statistics to decide each global limit should take how many element. 1. local limit shuffle with the maillist partitioner and return

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf Even it works finally, I don't think it is better in performance. Simply calculate it. Assume the limit number is `n`, partition number is `N`, and each partition has `n / r` rows in

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 need define a new map output statistics to do this --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf I don't think it would work. map output statistics is just approximate number of output bytes. You can't use it to get correct row number. --- If your project is set up for it, you can reply

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 Yes, you are right, we can not ensure the uniform distribution for global limit. An idea is not use a special partitioner, after the shuffle we should get the mapoutput statistics for row num of

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71633/ Test PASSed. ---

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71633 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)** for PR 16633 at commit

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf No. A simple example: if there are 5 local limit which produce 1, 2, 1, 1, 1 rows when limit is 10. If you shuffle to 5 partitions, the distributions for each local limit look like:

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 refer to the maillist >One issue left is how to decide shuffle partition number. We can have a config of the maximum number of elements for each GlobalLimit task to process, then do a

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71633 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)** for PR 16633 at commit

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf > it use a special partitioner to do this, the partitioner like the row_numer in sql it give each row a uniform partitionid, so in the reduce task, each task handle num of rows very

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 To clear, now we have these issues: 1. local limit compute all partitions, that means it launch many tasks but actually maybe very small tasks is enough. 2. global limit single partition

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71627/ Test FAILed. ---

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71627/testReport)** for PR 16633 at commit

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf The main issue the user posted in the mailing list is, the limit is big enough or partition number is big enough to cause performance bottleneck in shuffling the data of local limit. But

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 @viirya @rxin i support the idea of @wzhfy in the maillist http://apache-spark-developers-list.1001551.n3.nabble.com/Limit-Query-Performance-Suggestion-td20570.html, it solved the single partition

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @rxin even it breaks the RDD job chain. I think it is still useful in some cases, for example, the number of partitions is big and you only need to get one or few partitions to satisfy the limit.

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @rxin ok. I see what you mean breaking the RDD job chain. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71627/testReport)** for PR 16633 at commit

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @rxin Can you explain it more? I don't get it. Why it breaks the RDD job chain? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16633 This breaks the RDD job chain doesn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71605/ Test FAILed. ---

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71605 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71605/testReport)** for PR 16633 at commit

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71605 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71605/testReport)** for PR 16633 at commit