[GitHub] spark pull request #21444: Branch 2.3

2018-06-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21444


---



[GitHub] spark pull request #21444: Branch 2.3

2018-05-28 Thread mozammal
GitHub user mozammal opened a pull request:

https://github.com/apache/spark/pull/21444

Branch 2.3

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21444.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21444


commit f5f21e8c4261c0dfe8e3e788a30b38b188a18f67
Author: Glen Takahashi 
Date:   2018-01-31T17:14:01Z

[SPARK-23249][SQL] Improved block merging logic for partitions

## What changes were proposed in this pull request?

Change DataSourceScanExec so that, when grouping blocks together into 
partitions, it also checks the end of the sorted list of splits in order to 
fill out partitions more efficiently.

## How was this patch tested?

Updated the old test to reflect the new logic, which causes the number of 
partitions to drop from 4 to 3.
An existing test also covers large non-splittable files at 
https://github.com/glentakahashi/spark/blob/c575977a5952bf50b605be8079c9be1e30f3bd36/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala#L346

## Rationale

The current bin-packing method, next-fit decreasing (NFD), packs blocks into 
partitions sub-optimally in many cases: it produces extra partitions, an 
uneven distribution of block counts across partitions, and uneven partition 
sizes.

As an example, 128 files of sizes 1 MB, 2 MB, ..., 127 MB, 128 MB result in 
82 partitions with the current algorithm, but only 64 with this one. In the 
same example, the maximum number of blocks per partition is 13 under NFD, 
versus 2 under this algorithm.

More generally, a simulation of 1000 runs with a 128 MB block size and 
1-1000 normally distributed file sizes between 1 and 500 MB shows an 
approximately 5% reduction in partition counts, and a large reduction in the 
standard deviation of blocks per partition.

This algorithm also runs in O(n) time, as NFD does, and in every case 
produces strictly better results than NFD.

Overall, the more even distribution of blocks across partitions and the 
resulting lower partition counts should yield a small but significant 
performance increase across the board.

Author: Glen Takahashi 

Closes #20372 from glentakahashi/feature/improved-block-merging.

(cherry picked from commit 8c21170decfb9ca4d3233e1ea13bd1b6e3199ed9)
Signed-off-by: Wenchen Fan 

commit 8ee3a71c9c1b8ed51c5916635d008fdd49cf891a
Author: Dilip Biswal 
Date:   2018-01-31T21:52:47Z

[SPARK-23281][SQL] Query produces results in incorrect order when a 
composite order by clause refers to both original columns and aliases

## What changes were proposed in this pull request?
Here is the test snippet.
```scala
scala> Seq[(Integer, Integer)](
 | (1, 1),
 | (1, 3),
 | (2, 3),
 | (3, 3),
 | (4, null),
 | (5, null)
 |   ).toDF("key", "value").createOrReplaceTempView("src")

scala> sql(
 | """
 |   |SELECT MAX(value) as value, key as col2
 |   |FROM src
 |   |GROUP BY key
 |   |ORDER BY value desc, key
 | """.stripMargin).show
+-----+----+
|value|col2|
+-----+----+
|    3|   3|
|    3|   2|
|    3|   1|
| null|   5|
| null|   4|
+-----+----+
```
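The effect of the bug can be mimicked outside Spark. Below is a minimal Python sketch (a hypothetical stand-in, not Spark code) of the intended ordering, value DESC with key ASC as tie-break, versus the ordering the mis-resolved plan actually applies, where the key direction is flipped to DESC:

```python
# (key, MAX(value)) rows after the GROUP BY, mirroring the snippet above
rows = [(1, 3), (2, 3), (3, 3), (4, None), (5, None)]

def order_by(rows, key_direction):
    """Sort by value DESC NULLS LAST, then by key in the given direction."""
    def k(row):
        key, value = row
        is_null = value is None                # nulls sort last under DESC
        v = 0 if value is None else -value     # negate for descending order
        return (is_null, v, -key if key_direction == "desc" else key)
    return sorted(rows, key=k)

intended = order_by(rows, "asc")    # what the query asks for
buggy    = order_by(rows, "desc")   # what the mis-resolved plan does
```

Here `buggy` reproduces the col2 sequence 3, 2, 1, 5, 4 shown in the output above, while `intended` would give 1, 2, 3, 4, 5.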
Here is the explain output :

```SQL
== Parsed Logical Plan ==
'Sort ['value DESC NULLS LAST, 'key ASC NULLS FIRST], true
+- 'Aggregate ['key], ['MAX('value) AS value#9, 'key AS col2#10]
   +- 'UnresolvedRelation `src`

== Analyzed Logical Plan ==
value: int, col2: int
Project [value#9, col2#10]
+- Sort [value#9 DESC NULLS LAST, col2#10 DESC NULLS LAST], true
   +- Aggregate [key#5], [max(value#6) AS value#9, key#5 AS col2#10]
      +- SubqueryAlias src
         +- Project [_1#2 AS key#5, _2#3 AS value#6]
            +- LocalRelation [_1#2, _2#3]
```
The sort direction is being wrongly changed from ASC to DESC while resolving 
```Sort``` in