from:"wangyum"

[GitHub] spark issue #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicates and R...

2018-12-06 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23211
  
This file is generated by 
[TPCDSQueryOptimizerTracker.scala](https://github.com/wangyum/spark/blob/SPARK-25872/sql/core/src/test/scala/org/apache/spark/sql/TPCDSQueryOptimizerTracker.scala).
Runtimes can be generated by 
[TPCDSQueryBenchmark.scala](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23199: [SPARK-26245][SQL] Add Float literal

2018-12-06 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/23199


---


[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-12-06 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22683
  
retest this please


---


[GitHub] spark issue #23211: [SPARK-19712][SQL] Move PullupCorrelatedPredicates and R...

2018-12-04 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23211
  
To help with review, I generated the TPC-DS plans so the differences after this 
patch can be compared: 

https://github.com/wangyum/spark/commit/7e7a1fe24e8970830c67f80604ce238caa035b85#diff-1a4e6beba801fa647e1dcbd61ed7e5bf


---


[GitHub] spark pull request #22721: [SPARK-19784][SPARK-25403][SQL] Refresh the table...

2018-12-04 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22721#discussion_r238891454
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -45,6 +45,8 @@ object CommandUtils extends Logging {
   } else {
 catalog.alterTableStats(table.identifier, None)
   }
+} else {
+  catalog.refreshTable(table.identifier)
--- End diff --

Sure. Moving it to the DDLs is better.


---


[GitHub] spark issue #23216: [SPARK-26264][CORE]It is better to add @transient to fie...

2018-12-04 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23216
  
retest this please


---


[GitHub] spark issue #23219: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-04 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23219
  
@srowen Sorry.


---


[GitHub] spark pull request #23219: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-04 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/23219


---


[GitHub] spark pull request #23219: [SPARK-26266][BUILD] Update to Scala 2.12.8

2018-12-04 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23219

[SPARK-26266][BUILD] Update to Scala 2.12.8

## What changes were proposed in this pull request?
This PR updates Scala to 2.12.8, which fixes two regressions that appeared in 
2.12.7:
```
Don't reject views with result types which are TypeVars (#7295)
Don't emit static forwarders (which simplify the use of methods in 
top-level objects from Java) for bridge methods (#7469)
```
More details: https://github.com/scala/scala/releases/tag/v2.12.8
## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-26266

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23219.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23219


commit 94f76e543c1b146d4d25d3e15b6efd4777af7652
Author: Yuming Wang 
Date:   2018-12-04T15:03:26Z

Upgrade Scala to 2.12.8




---


[GitHub] spark issue #22600: [SPARK-25578][BUILD] Update to Scala 2.12.7

2018-12-04 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22600
  
2.12.8 is out. Do we need to upgrade to 2.12.8?
2.12.8 fixes two regressions that appeared in 2.12.7:
```
Don't reject views with result types which are TypeVars (#7295)
Don't emit static forwarders (which simplify the use of methods in 
top-level objects from Java) for bridge methods (#7469)
```
More details: https://github.com/scala/scala/releases/tag/v2.12.8


---


[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...

2018-12-02 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23164
  
I think we should back-port this because we can already call 
[`putNull`](https://github.com/apache/spark/blob/630e25e35506c02a0b1e202ef82b1b0f69e50966/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala#L246),
 but cannot serialize the result without this patch.


---


[GitHub] spark issue #23199: [SPARK-26245][SQL] Add Float literal

2018-12-02 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23199
  
retest this please


---


[GitHub] spark issue #23199: [SPARK-26245][SQL] Add Float literal

2018-12-02 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23199
  
retest this please


---


[GitHub] spark pull request #23199: [SPARK-26245][SQL] Add Float literal

2018-12-02 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23199

[SPARK-26245][SQL] Add Float literal

## What changes were proposed in this pull request?

This PR adds parser support for `Float` literals.

Hive supports this feature:

![image](https://user-images.githubusercontent.com/5399861/49337788-e20cac80-f653-11e8-89c4-0fafa4037a82.png)

## How was this patch tested?

unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-26245

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23199.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23199


commit 57faa4160365e4ec7ef9847861882b469d4953d6
Author: Yuming Wang 
Date:   2018-12-02T08:56:56Z

Add Float literal




---


[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...

2018-12-01 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23164
  
I used it here: 
https://github.com/apache/spark/compare/master...wangyum:default-value?expand=1#diff-9847f5cef7cf7fbc5830fbc6b779ee10R1827


---


[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-12-01 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22683
  
retest this please


---


[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-12-01 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22683
  
retest this please


---


[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...

2018-11-29 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23164
  
cc @srowen


---


[GitHub] spark issue #23170: [SPARK-24423][FOLLOW-UP][SQL] Fix error example

2018-11-29 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23170
  
It's not a regression. The first check already exists in 
[2.1.0](https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L99-L102)
 and the second check was added in 
[2.4.0](https://github.com/apache/spark/blob/v2.4.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L133-L143).

cc @dilipbiswal 


---


[GitHub] spark pull request #23170: [SPARK-24423][FOLLOW-UP][SQL] Fix error example

2018-11-28 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23170

[SPARK-24423][FOLLOW-UP][SQL] Fix error example

## What changes were proposed in this pull request?

![image](https://user-images.githubusercontent.com/5399861/49172173-42ad9800-f37b-11e8-8135-7adc323357ae.png)
It will throw:
```
requirement failed: When reading JDBC data sources, users need to specify 
all or none for the following options: 'partitionColumn', 'lowerBound', 
'upperBound', and 'numPartitions'
```
and 
```
User-defined partition column subq.c1 not found in the JDBC relation ...
```

This PR fixes the erroneous example.
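
The first message reads like an all-or-none requirement on the partitioning options. A minimal sketch of such a check in plain Scala, assuming the options arrive as a simple `Map[String, String]`. The `validatePartitionOptions` helper is hypothetical and follows only the wording of the message; Spark's actual check in `JDBCOptions` may treat `numPartitions` differently.

```scala
// Hypothetical helper: illustrates an all-or-none check, not Spark's JDBCOptions code.
object PartitionOptionsCheck {
  private val keys = Seq("partitionColumn", "lowerBound", "upperBound", "numPartitions")

  def validatePartitionOptions(options: Map[String, String]): Unit = {
    // Count how many of the four partitioning options were supplied.
    val present = keys.count(options.contains)
    // require throws IllegalArgumentException("requirement failed: ...") when the condition is false.
    require(
      present == 0 || present == keys.size,
      "When reading JDBC data sources, users need to specify all or none for the " +
        "following options: 'partitionColumn', 'lowerBound', 'upperBound', and 'numPartitions'")
  }
}
```

Under this sketch, passing only `partitionColumn` fails, while passing none or all four options succeeds.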

## How was this patch tested?

manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-24499

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23170.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23170


commit e2d8229614ed232d1089c304cbc4bdb88292d213
Author: Yuming Wang 
Date:   2018-11-28T17:59:23Z

Fix error example




---


[GitHub] spark pull request #23164: [SPARK-26198][SQL] Fix Metadata serialize null va...

2018-11-28 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23164

[SPARK-26198][SQL] Fix Metadata serialize null values throw NPE

## What changes were proposed in this pull request?
How to reproduce this issue:
```scala
scala> val meta = new org.apache.spark.sql.types.MetadataBuilder().putNull("key").build()
java.lang.NullPointerException
  at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
  at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
```

This PR fixes the `NullPointerException` thrown when `Metadata` serializes `null` values.
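
One way such an NPE can arise, sketched in plain Scala without Spark: a pattern match over `Any` with no explicit `null` case falls through to a catch-all that dereferences the value. The `render` function below is hypothetical and only illustrates handling `null` first. It is not the code in `Metadata.scala`, and that this is the exact cause of the trace above is an assumption.

```scala
// Hypothetical sketch, not Spark's Metadata implementation. No string escaping is done.
object NullSafeJson {
  def render(value: Any): String = value match {
    case null => "null" // handle null before any case that dereferences the value
    case s: String => "\"" + s + "\""
    case b: Boolean => b.toString
    case n: Long => n.toString
    case d: Double => d.toString
    case m: Map[_, _] =>
      m.map { case (k, v) => "\"" + k + "\":" + render(v) }.mkString("{", ",", "}")
    case other =>
      // Without the null case above, other.getClass would throw a NullPointerException for null.
      throw new IllegalArgumentException(s"Unsupported type ${other.getClass}")
  }
}
```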

## How was this patch tested?

unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-26198

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23164


commit 03101868a72b5ae68bf6324e627f1874af32f040
Author: Yuming Wang 
Date:   2018-11-28T12:22:09Z

Fix Metadata serialize null values throw NPE




---


[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark ...

2018-11-25 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/23027


---


[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark ...

2018-11-21 Thread wangyum

GitHub user wangyum reopened a pull request:

https://github.com/apache/spark/pull/23027

[SPARK-26049][SQL][TEST] FilterPushdownBenchmark add InMemoryTable case

## What changes were proposed in this pull request?

Add an InMemoryTable case to `FilterPushdownBenchmark`.

## How was this patch tested?

manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-26049

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23027.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23027


commit d0a2a3f4af492fbf69f7774e03d67d4af39cd5c7
Author: Yuming Wang 
Date:   2018-11-14T00:35:35Z

Add InMemoryTable filter benchmark

commit 01d01e7995ecb72705d0d610892dc99a6c3f4621
Author: Yuming Wang 
Date:   2018-11-19T07:59:19Z

cache inMemoryTable from file

commit b8c54ea5048524f7df0b750a11a8fb109b43f479
Author: Yuming Wang 
Date:   2018-11-19T12:44:46Z

Fix path




---


[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark ...

2018-11-20 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23027#discussion_r235259340
  
--- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
@@ -2,669 +2,809 @@
 Pushdown for many distinct value case
 

 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
 Select 0 string row (value IS NULL):                  Best/Avg Time(ms)  Rate(M/s)   Per Row(ns)   Relative

-Parquet Vectorized                                        11405 / 11485        1.4         725.1       1.0X
-Parquet Vectorized (Pushdown)                               675 /   690       23.3          42.9      16.9X
-Native ORC Vectorized                                      7127 /  7170        2.2         453.1       1.6X
-Native ORC Vectorized (Pushdown)                            519 /   541       30.3          33.0      22.0X
+Parquet Vectorized                                         7823 /  7996        2.0         497.4       1.0X
+Parquet Vectorized (Pushdown)                               460 /   468       34.2          29.2      17.0X
+Native ORC Vectorized                                      5412 /  5550        2.9         344.1       1.4X
+Native ORC Vectorized (Pushdown)                            551 /   563       28.6          35.0      14.2X
+InMemoryTable Vectorized                                      6 /     6     2859.1           0.3    1422.0X
+InMemoryTable Vectorized (Pushdown)                           5 /     6     3023.0           0.3    1503.6X
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
 Select 0 string row ('7864320' < value < '7864320'):  Best/Avg Time(ms)  Rate(M/s)   Per Row(ns)   Relative

-Parquet Vectorized                                        11457 / 11473        1.4         728.4       1.0X
-Parquet Vectorized (Pushdown)                               656 /   686       24.0          41.7      17.5X
-Native ORC Vectorized                                      7328 /  7342        2.1         465.9       1.6X
-Native ORC Vectorized (Pushdown)                            539 /   565       29.2          34.2      21.3X
+Parquet Vectorized                                         8322 / 11160        1.9         529.1       1.0X
+Parquet Vectorized (Pushdown)                               463 /   472       34.0          29.4      18.0X
+Native ORC Vectorized                                      5622 /  5635        2.8         357.4       1.5X
+Native ORC Vectorized (Pushdown)                            563 /   595       27.9          35.8      14.8X
+InMemoryTable Vectorized                                   4831 /  4881        3.3         307.2       1.7X
+InMemoryTable Vectorized (Pushdown)                        1980 /  2027        7.9         125.9       4.2X
--- End diff --

I think the reason is 
[SPARK-22599](https://issues.apache.org/jira/browse/SPARK-22599). However, if we 
cache all the data in memory, the result is:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Select 0 string row ('7864320' < value < '7864320'):  Best/Avg Time(ms)  Rate(M/s)   Per Row(ns)   Relative

Parquet Vectorized                                         6491 /  6716        2.4         412.7       1.0X
Parquet Vectorized (Pushdown)                               491 /   496       32.0          31.2      13.2X
Native ORC Vectorized                                      5849 /  6103        2.7         371.9       1.1X
Native ORC Vectorized (Pushdown)                            533 /   572       29.5          33.9      12.2X
InMemoryTable Vectorized                                   2788 /  2854        5.6         177.2       2.3X
InMemoryTable Vectorized (Pushdown)                         370 /   408       42.5          23.5      17.5X
```
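
The `Relative` column in this output appears to be the baseline's best time divided by each case's best time. A small sketch that recomputes it from the best times above; the `relative` helper is hypothetical, and that this is how the benchmark harness derives the column is an assumption based only on the numbers shown.

```scala
import java.util.Locale

object RelativeSpeedup {
  // Best times in ms, copied from the cached-data run above.
  val bestMs: Seq[(String, Double)] = Seq(
    "Parquet Vectorized" -> 6491.0,
    "Parquet Vectorized (Pushdown)" -> 491.0,
    "Native ORC Vectorized" -> 5849.0,
    "Native ORC Vectorized (Pushdown)" -> 533.0,
    "InMemoryTable Vectorized" -> 2788.0,
    "InMemoryTable Vectorized (Pushdown)" -> 370.0)

  // Speedup of a case relative to the baseline, formatted with one decimal place.
  def relative(baselineMs: Double, caseMs: Double): String =
    "%.1fX".formatLocal(Locale.ROOT, baselineMs / caseMs)

  def main(args: Array[String]): Unit = {
    val baseline = bestMs.head._2
    bestMs.foreach { case (name, ms) => println(s"$name: ${relative(baseline, ms)}") }
  }
}
```

For instance, 6491 / 370 is about 17.54, which matches the `17.5X` reported for the pushdown case on the cached table.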


---


[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark ...

2018-11-20 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/23027


---


[GitHub] spark pull request #22879: [SPARK-25872][SQL][TEST] Add an optimizer tracker...

2018-11-20 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/22879


---


[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...

2018-11-19 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23004#discussion_r234857747
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
   if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
   list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
 l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+
+case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
--- End diff --

Added to the PR description.
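
The idea behind pruning cached batches for a `StartsWith` predicate can be sketched in plain Scala: a batch whose values lie between a lower and an upper bound can contain a string with a given prefix only if the prefix falls between the equally truncated bounds. The `mayContainPrefix` helper is hypothetical, and it uses `String.compareTo` ordering, which is an assumption and may differ from the byte-wise ordering Spark applies to its strings.

```scala
// Hypothetical sketch of min/max pruning for a prefix predicate; not the code in this PR.
object PrefixPruning {
  // Returns false only when no value in [lower, upper] can start with `prefix`.
  def mayContainPrefix(lower: String, upper: String, prefix: String): Boolean = {
    val n = prefix.length
    // Truncate both bounds to the prefix length before comparing.
    lower.take(n).compareTo(prefix) <= 0 && prefix.compareTo(upper.take(n)) <= 0
  }
}
```

The check is conservative: a `true` result means the batch still has to be scanned, while `false` means it can be skipped.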


---


[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-11-19 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22683
  
@srowen Thanks for pinging me. I agree with you. If @httfighter is willing, 
all usages in the UI could be updated to indicate units like `KiB`, `MiB`, and `GiB`. 
@httfighter, you can check the details here: https://en.wikipedia.org/wiki/Kilobyte
https://en.wikipedia.org/wiki/Kibibyte
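
For reference, `KiB`, `MiB`, and `GiB` are powers of 1024, while `kB`, `MB`, and `GB` are powers of 1000. A minimal sketch of a formatter that labels binary units explicitly; the `formatBinary` helper is hypothetical and is not Spark's UI code.

```scala
import java.util.Locale

// Hypothetical helper for illustration only.
object ByteUnits {
  private val units = Seq("B", "KiB", "MiB", "GiB", "TiB")

  def formatBinary(bytes: Long): String = {
    var value = bytes.toDouble
    var idx = 0
    // Divide by 1024 until the value fits the current unit or the units run out.
    while (value >= 1024 && idx < units.size - 1) {
      value /= 1024
      idx += 1
    }
    "%.1f %s".formatLocal(Locale.ROOT, value, units(idx))
  }
}
```

With this sketch, 1000 bytes stays in `B` because the next unit starts at 1024, which is exactly the distinction the `KiB` label makes visible.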


---


[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark ...

2018-11-19 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23027#discussion_r234521489
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -104,6 +107,10 @@ object FilterPushdownBenchmark extends BenchmarkBase with SQLHelper {
 df.write.mode("overwrite")
   .option("parquet.block.size", blockSize).parquet(parquetPath)
 spark.read.parquet(parquetPath).createOrReplaceTempView("parquetTable")
+
+df.write.mode("overwrite").save(inMemoryTablePath)
--- End diff --

Cache `inMemoryTable` from file to avoid the performance issue: 
https://github.com/apache/spark/pull/23027#pullrequestreview-175054485


---


[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark ...

2018-11-18 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23027#discussion_r234482766
  
--- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
@@ -2,669 +2,809 @@
 Pushdown for many distinct value case
 

 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
 Select 0 string row (value IS NULL):                  Best/Avg Time(ms)  Rate(M/s)   Per Row(ns)   Relative

-Parquet Vectorized                                        11405 / 11485        1.4         725.1       1.0X
-Parquet Vectorized (Pushdown)                               675 /   690       23.3          42.9      16.9X
-Native ORC Vectorized                                      7127 /  7170        2.2         453.1       1.6X
-Native ORC Vectorized (Pushdown)                            519 /   541       30.3          33.0      22.0X
+Parquet Vectorized                                         7823 /  7996        2.0         497.4       1.0X
+Parquet Vectorized (Pushdown)                               460 /   468       34.2          29.2      17.0X
+Native ORC Vectorized                                      5412 /  5550        2.9         344.1       1.4X
+Native ORC Vectorized (Pushdown)                            551 /   563       28.6          35.0      14.2X
+InMemoryTable Vectorized                                      6 /     6     2859.1           0.3    1422.0X
+InMemoryTable Vectorized (Pushdown)                           5 /     6     3023.0           0.3    1503.6X
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
 Select 0 string row ('7864320' < value < '7864320'):  Best/Avg Time(ms)  Rate(M/s)   Per Row(ns)   Relative

-Parquet Vectorized                                        11457 / 11473        1.4         728.4       1.0X
-Parquet Vectorized (Pushdown)                               656 /   686       24.0          41.7      17.5X
-Native ORC Vectorized                                      7328 /  7342        2.1         465.9       1.6X
-Native ORC Vectorized (Pushdown)                            539 /   565       29.2          34.2      21.3X
+Parquet Vectorized                                         8322 / 11160        1.9         529.1       1.0X
+Parquet Vectorized (Pushdown)                               463 /   472       34.0          29.4      18.0X
+Native ORC Vectorized                                      5622 /  5635        2.8         357.4       1.5X
+Native ORC Vectorized (Pushdown)                            563 /   595       27.9          35.8      14.8X
+InMemoryTable Vectorized                                   4831 /  4881        3.3         307.2       1.7X
+InMemoryTable Vectorized (Pushdown)                        1980 /  2027        7.9         125.9       4.2X
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
 Select 1 string row (value = '7864320'):              Best/Avg Time(ms)  Rate(M/s)   Per Row(ns)   Relative

-Parquet Vectorized                                        11878 / 11888        1.3         755.2       1.0X
-Parquet Vectorized (Pushdown)                               630 /   654       25.0          40.1      18.9X
-Native ORC Vectorized                                      7342 /  7362        2.1         466.8       1.6X
-Native ORC Vectorized (Pushdown)                            519 /   537       30.3          33.0      22.9X
+Parquet Vectorized                                         8322 /  8386        1.9         529.1       1.0X
+Parquet Vectorized (Pushdown)                               434 /   441       36.2          27.6      19.2X
+Native ORC Vectorized                                      5659 /  5944        2.8         359.8       1.5X
+Native ORC Vectorized (Pushdown)                            535 /   567       29.4          34.0      15.6X
+InMemoryTable Vectorized                                   4784 /  4879        3.3         304.1       1.7X
+InMemoryTable Vectorized (Pushdown)                        1950 /  1985        8.1         124.0       4.3X
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Ser

[GitHub] spark pull request #23030: [MINOR][YARN] Make memLimitExceededLogMessage mor...

2018-11-16 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23030#discussion_r234386679
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -598,13 +597,25 @@ private[yarn] class YarnAllocator(
 (false, s"Container ${containerId}${onHostStr} was preempted.")
   // Should probably still count memory exceeded exit codes towards task failures
   case VMEM_EXCEEDED_EXIT_CODE =>
-(true, memLimitExceededLogMessage(
-  completedContainer.getDiagnostics,
-  VMEM_EXCEEDED_PATTERN))
+val vmemExceededPattern = raw"$MEM_REGEX of $MEM_REGEX virtual memory used".r
+val diag = vmemExceededPattern.findFirstIn(completedContainer.getDiagnostics)
+  .map(_.concat(".")).getOrElse("")
+val additional = if (conf.getBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED,
--- End diff --

I see. 


---


[GitHub] spark pull request #23030: [MINOR][YARN] Make memLimitExceededLogMessage mor...

2018-11-16 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23030#discussion_r234233444
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -735,18 +744,8 @@ private[yarn] class YarnAllocator(

 private object YarnAllocator {
   val MEM_REGEX = "[0-9.]+ [KMG]B"
-  val PMEM_EXCEEDED_PATTERN =
-Pattern.compile(s"$MEM_REGEX of $MEM_REGEX physical memory used")
-  val VMEM_EXCEEDED_PATTERN =
-Pattern.compile(s"$MEM_REGEX of $MEM_REGEX virtual memory used")
+  val PMEM_EXCEEDED_PATTERN = raw"$MEM_REGEX of $MEM_REGEX physical memory used".r
--- End diff --

Yes. I will do it.
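
The change in this diff relies on Scala's `raw` interpolator and `.r` to build a `Regex`, then `findFirstIn` to pull out the matching fragment. A standalone sketch of that approach; the `extract` helper and the diagnostics string in the usage note are made up for illustration and are not real YARN output.

```scala
// Standalone illustration of the raw"...".r and findFirstIn idiom from the diff.
object MemDiagnostics {
  val MEM_REGEX = "[0-9.]+ [KMG]B"
  // raw"..." substitutes $MEM_REGEX without processing escapes; .r compiles it to a Regex.
  val PMEM_EXCEEDED_PATTERN = raw"$MEM_REGEX of $MEM_REGEX physical memory used".r

  // Returns the matched fragment with a trailing period, or "" when nothing matches.
  def extract(diagnostics: String): String =
    PMEM_EXCEEDED_PATTERN.findFirstIn(diagnostics).map(_.concat(".")).getOrElse("")
}
```

For a hypothetical input such as `"Container killed: 2.1 GB of 2 GB physical memory used; exiting"`, `extract` returns `"2.1 GB of 2 GB physical memory used."`.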


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22999: [SPARK-20319][SQL] Already quoted identifiers are gettin...

2018-11-15 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22999
  
cc @gatorsmile @maropu


---


[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark ...

2018-11-14 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23027#discussion_r233689556
  
--- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---

[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark ...

2018-11-14 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23027#discussion_r233687968
  
--- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
@@ -2,669 +2,809 @@
 Pushdown for many distinct value case
 

 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
 Select 0 string row (value IS NULL):                  Best/Avg Time(ms)  Rate(M/s)   Per Row(ns)   Relative

-Parquet Vectorized                                        11405 / 11485        1.4         725.1       1.0X
-Parquet Vectorized (Pushdown)                               675 /   690       23.3          42.9      16.9X
-Native ORC Vectorized                                      7127 /  7170        2.2         453.1       1.6X
-Native ORC Vectorized (Pushdown)                            519 /   541       30.3          33.0      22.0X
+Parquet Vectorized                                         7823 /  7996        2.0         497.4       1.0X
+Parquet Vectorized (Pushdown)                               460 /   468       34.2          29.2      17.0X
+Native ORC Vectorized                                      5412 /  5550        2.9         344.1       1.4X
+Native ORC Vectorized (Pushdown)                            551 /   563       28.6          35.0      14.2X
+InMemoryTable Vectorized                                      6 /     6     2859.1           0.3    1422.0X
+InMemoryTable Vectorized (Pushdown)                           5 /     6     3023.0           0.3    1503.6X
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
 Select 0 string row ('7864320' < value < '7864320'):  Best/Avg Time(ms)  Rate(M/s)   Per Row(ns)   Relative

-Parquet Vectorized                                        11457 / 11473        1.4         728.4       1.0X
-Parquet Vectorized (Pushdown)                               656 /   686       24.0          41.7      17.5X
-Native ORC Vectorized                                      7328 /  7342        2.1         465.9       1.6X
-Native ORC Vectorized (Pushdown)                            539 /   565       29.2          34.2      21.3X
+Parquet Vectorized                                         8322 / 11160        1.9         529.1       1.0X
+Parquet Vectorized (Pushdown)                               463 /   472       34.0          29.4      18.0X
+Native ORC Vectorized                                      5622 /  5635        2.8         357.4       1.5X
+Native ORC Vectorized (Pushdown)                            563 /   595       27.9          35.8      14.8X
+InMemoryTable Vectorized                                   4831 /  4881        3.3         307.2       1.7X
+InMemoryTable Vectorized (Pushdown)                        1980 /  2027        7.9         125.9       4.2X
--- End diff --

Yes. This is the current benchmark result. I plan to improve it step by 
step. Example: [SPARK-26004](https://issues.apache.org/jira/browse/SPARK-26004)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark ...

2018-11-14 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23027#discussion_r233686986
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -131,6 +134,15 @@ object FilterPushdownBenchmark extends BenchmarkBase with SQLHelper {
   }
 }
 
+Seq(false, true).foreach { pushDownEnabled =>
+  val name = s"InMemoryTable Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
+  benchmark.addCase(name) { _ =>
+withSQLConf(SQLConf.IN_MEMORY_PARTITION_PRUNING.key -> s"$pushDownEnabled") {
--- End diff --

I think an InMemoryTable partition is analogous to a Parquet RowGroup (@kiszk, 
please correct me if I'm wrong). Putting them together makes it easy to compare 
performance.


---


[GitHub] spark issue #23030: [MINOR][YARN] Make memLimitExceededLogMessage more clean

2018-11-14 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23030
  
cc @vanzin 


---


[GitHub] spark pull request #23030: [MINOR][YARN] Make memLimitExceededLogMessage mor...

2018-11-14 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23030

[MINOR][YARN] Make memLimitExceededLogMessage more clean

## What changes were proposed in this pull request?
Current `memLimitExceededLogMessage`:

![image](https://user-images.githubusercontent.com/5399861/48467789-ec8e1000-e824-11e8-91fc-280d342e1bf3.png)

It's not very clear, because physical memory is exceeded but the suggestion 
mentions a virtual memory config. This PR makes the message clearer and replaces 
the deprecated config: ```spark.yarn.executor.memoryOverhead```.
## How was this patch tested?

manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark EXECUTOR_MEMORY_OVERHEAD

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23030.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23030


commit e7ff43d0056e38033ae2a2edf81b296797bb46e0
Author: Yuming Wang 
Date:   2018-11-14T07:46:58Z

Make memLimitExceededLogMessage more clean




---


[GitHub] spark pull request #23028: [SPARK-26053][SQL] Enhance LikeSimplification

2018-11-13 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23028#discussion_r233317201
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -504,6 +504,19 @@ object LikeSimplification extends Rule[LogicalPlan] {
 Like(input, Literal.create(pattern, StringType))
 }
   }
+
+case Like(Literal(pattern, StringType), input) =>
+  if (pattern == null) {
+// If pattern is null, return null value directly, since "null like col" == null.
+Literal(null, BooleanType)
+  } else {
+pattern.toString match {
+  case equalTo(str) =>
+EqualTo(Literal(str), input)
--- End diff --

Yes. 
```
select  "abc" like "%abc%" -> true
```


---


[GitHub] spark pull request #23028: [SPARK-26053][SQL] Enhance LikeSimplification

2018-11-13 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/23028


---


[GitHub] spark issue #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark add InM...

2018-11-13 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23027
  
cc @HyukjinKwon @dongjoon-hyun 


---


[GitHub] spark pull request #23028: [SPARK-26053][SQL] Enhance LikeSimplification

2018-11-13 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23028

[SPARK-26053][SQL] Enhance LikeSimplification

## What changes were proposed in this pull request?

This PR enhances `LikeSimplification` in 2 cases:

1.  null like col -> null
2. 'str' like col -> col = 'str'

It is difficult to handle these cases:
1. 'str%' like col
2. '%str' like col
3. 'str%str' like col
4. '%' like col

For example:
```sql
select '8%' like '8%';  -- true
select '8%' like '%8%';  -- true
select '8%' like '%%8%%';  -- true
select '8%' like '%%5%%8%%'; --false

select '%8' like '%8%';  -- true
select '%8' like '%8%';  -- true
select '%8' like '%%8%';  -- true
select '%8' like '%%5%%8%'; -- false

select '%' like '%';  -- true
select '%' like '%%';  -- true
select '%' like '%%8%';  -- false
```
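
To make the examples above easier to check by hand, here is a minimal Scala sketch of LIKE evaluation with a string on the left and a pattern on the right. This is not Spark's implementation: `LikeSketch`, `likeToRegex` and `like` are hypothetical names, and escape characters are ignored for brevity.

```scala
import java.util.regex.Pattern

object LikeSketch {
  // Translate a SQL LIKE pattern into a Java regex:
  // '%' matches any sequence, '_' matches one character, anything else is literal.
  // Assumption of this sketch: no escape character support.
  def likeToRegex(pattern: String): String =
    pattern.toSeq.map {
      case '%' => ".*"
      case '_' => "."
      case c   => Pattern.quote(c.toString)
    }.mkString

  // Evaluate `input LIKE pattern` against the whole input string.
  def like(input: String, pattern: String): Boolean =
    Pattern.compile(likeToRegex(pattern), Pattern.DOTALL).matcher(input).matches()

  def main(args: Array[String]): Unit = {
    println(like("8%", "%8%"))      // true
    println(like("8%", "%%5%%8%%")) // false
    println(like("%", "%%"))        // true
    println(like("%", "%%8%"))      // false
  }
}
```

The left operand is always treated as plain text here; only the right operand is interpreted as a pattern, which is why a `%` inside the left literal does not act as a wildcard.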

## How was this patch tested?

unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-26053

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23028.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23028


commit 56a02eaaa63f297d3dbaf0ca183e4248d4882834
Author: Yuming Wang 
Date:   2018-11-14T02:55:25Z

Enhance LikeSimplification




---


[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] Add InMemoryTable filter...

2018-11-13 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23027

[SPARK-26049][SQL][TEST] Add InMemoryTable filter benchmark to 
FilterPushdownBenchmark

## What changes were proposed in this pull request?

Add InMemoryTable filter benchmark to `FilterPushdownBenchmark`.

## How was this patch tested?

manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-26049

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23027.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23027


commit d0a2a3f4af492fbf69f7774e03d67d4af39cd5c7
Author: Yuming Wang 
Date:   2018-11-14T00:35:35Z

Add InMemoryTable filter benchmark




---


[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...

2018-11-13 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23004#discussion_r233272718
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -237,6 +237,13 @@ case class InMemoryTableScanExec(
   if list.forall(ExtractableLiteral.unapply(_).isDefined) && list.nonEmpty =>
   list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
 l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+
+case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
+  statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
+l <= statsFor(a).upperBound.substr(0, Length(l))
+case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>
--- End diff --

Good question. The last one should be removed; `DataSourceStrategy` has the 
same logic:  
https://github.com/apache/spark/blob/3d6b68b030ee85a0f639dd8e9b68aedf5f27b46f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L512-L513


---


[GitHub] spark issue #23014: [MINOR][SQL] Add disable bucketedRead workaround when th...

2018-11-13 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23014
  
retest this please


---


[GitHub] spark issue #23004: [SPARK-26004][SQL] InMemoryTable support StartsWith pred...

2018-11-13 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23004
  
cc @cloud-fan @HyukjinKwon @kiszk


---


[GitHub] spark issue #23014: [MINOR][SQL] Add disable bucketedRead workaround when th...

2018-11-13 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23014
  
retest this please


---


[GitHub] spark issue #23014: [MINOR][SQL] Add disable bucketedRead workaround when th...

2018-11-12 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23014
  
Yes. The number of `filePartitions` is the same as the number of buckets when 
`BucketedRead` is used:

https://github.com/apache/spark/blob/ab5752cb952e6536a68a988289e57100fdbba142/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L382-L414


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23018: [SPARK-26023][SQL] Dumping truncated plans and ge...

2018-11-12 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/23018#discussion_r232871356
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala 
---
@@ -469,7 +471,21 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
   def treeString: String = treeString(verbose = true)
 
   def treeString(verbose: Boolean, addSuffix: Boolean = false): String = {
-generateTreeString(0, Nil, new StringBuilder, verbose = verbose, addSuffix = addSuffix).toString
+val writer = new StringBuilderWriter()
+try {
+  treeString(writer, verbose, addSuffix, None)
+  writer.toString
+} finally {
+  writer.close()
+}
+  }
+
+  def treeString(
+  writer: Writer,
+  verbose: Boolean,
+  addSuffix: Boolean,
+  maxFields: Option[Int]): Unit = {
+generateTreeString(0, Nil, writer, verbose, "", addSuffix)
--- End diff --

How about adding another function that only saves `nodeName`? I'll use it in 
another PR: https://github.com/apache/spark/pull/22879


---


[GitHub] spark pull request #22419: [SPARK-23906][SQL] Add built-in UDF TRUNCATE(numb...

2018-11-12 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/22419


---


[GitHub] spark pull request #22941: [SPARK-25936][SQL] Fix InsertIntoDataSourceComman...

2018-11-12 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/22941


---


[GitHub] spark pull request #22930: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand'...

2018-11-12 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/22930


---


[GitHub] spark pull request #23014: [MINOR][SQL] Add disable bucketedRead workaround ...

2018-11-12 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23014

[MINOR][SQL] Add disable bucketedRead workaround when throw RuntimeException

## What changes were proposed in this pull request?
A `RuntimeException` is thrown when reading from a bucketed table (about 1.7G 
per bucket file):

![image](https://user-images.githubusercontent.com/5399861/48346889-8041ce00-e6b7-11e8-83b0-ead83fb15821.png)

Default (bucketed read enabled):

![image](https://user-images.githubusercontent.com/5399861/48347084-2c83b480-e6b8-11e8-913a-9cafc043e9e4.png)

Disable bucket read:

![image](https://user-images.githubusercontent.com/5399861/48347099-3a393a00-e6b8-11e8-94af-cb814e1ba277.png)


## How was this patch tested?

manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark anotherWorkaround

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23014.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23014


commit a41551efd667f3ed6c30b0a2b262818e37d00884
Author: Yuming Wang 
Date:   2018-11-12T12:06:35Z

Add new workaround




---


[GitHub] spark issue #22721: [SPARK-19784][SPARK-25403][SQL] Refresh the table even t...

2018-11-12 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22721
  
retest this please


---


[GitHub] spark pull request #23004: [SPARK-26004][SQL] InMemoryTable support StartsWi...

2018-11-10 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23004

[SPARK-26004][SQL] InMemoryTable support StartsWith predicate push down

## What changes were proposed in this pull request?

[SPARK-24638](https://issues.apache.org/jira/browse/SPARK-24638) adds 
support for Parquet file `StartsWith` predicate push down.
`InMemoryTable` can also support this feature.
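
The pruning idea can be sketched in plain Scala as follows. This is only an illustration under stated assumptions, not the code in this PR: `StringStats` and `mightStartWith` are hypothetical names, each cached batch is assumed to keep a lower and an upper bound for the string column, and comparison uses Scala's default `String` ordering rather than Spark's `UTF8String` ordering.

```scala
object PrefixPruningSketch {
  // Hypothetical per-batch min/max statistics for one string column.
  final case class StringStats(lowerBound: String, upperBound: String)

  // Returns true when the batch may contain a value starting with `prefix`,
  // and false when the statistics prove that no such value can exist.
  def mightStartWith(stats: StringStats, prefix: String): Boolean = {
    val n = prefix.length
    // take(n) returns the whole string when it is shorter than n characters.
    stats.lowerBound.take(n) <= prefix && prefix <= stats.upperBound.take(n)
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(StringStats("1000", "1999"), StringStats("2000", "2999"))
    // Only batches for which the check returns true need to be scanned.
    batches.foreach(b => println(s"$b -> scan: ${mightStartWith(b, "10")}"))
  }
}
```

Truncating both bounds to the prefix length keeps the check conservative: a batch is skipped only when every value in it is provably outside the prefix range.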


## How was this patch tested?

 unit tests and benchmark tests

benchmark test result:
```


Pushdown benchmark for StringStartsWith

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '10%'):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
InMemoryTable Vectorized                             12068 / 14198          1.3         767.3       1.0X
InMemoryTable Vectorized (Pushdown)                    5457 / 8662          2.9         347.0       2.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '1000%'):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
InMemoryTable Vectorized                               5246 / 5355          3.0         333.5       1.0X
InMemoryTable Vectorized (Pushdown)                    2185 / 2346          7.2         138.9       2.4X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
InMemoryTable Vectorized                               5112 / 5312          3.1         325.0       1.0X
InMemoryTable Vectorized (Pushdown)                    2292 / 2522          6.9         145.7       2.2X
```



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-26004

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23004.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23004


commit 7bbdb0713056f387e49cf3921a226554e9af5557
Author: Yuming Wang 
Date:   2018-11-11T03:56:36Z

InMemoryTable support StartsWith predicate push down




---


[GitHub] spark issue #23001: [INFRA] Close stale PRs

2018-11-10 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/23001
  
cc @srowen @dongjoon-hyun  @HyukjinKwon 


---


[GitHub] spark pull request #23001: [INFRA] Close stale PRs

2018-11-10 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23001

[INFRA] Close stale PRs

Closes https://github.com/apache/spark/pull/21766
Closes https://github.com/apache/spark/pull/21679
Closes https://github.com/apache/spark/pull/21161
Closes https://github.com/apache/spark/pull/20846
Closes https://github.com/apache/spark/pull/19434
Closes https://github.com/apache/spark/pull/18080
Closes https://github.com/apache/spark/pull/17648
Closes https://github.com/apache/spark/pull/17169



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark CloseStalePRs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23001.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23001


commit 45855463c6c5c0ee1183cc8d011b691add948f9f
Author: Yuming Wang 
Date:   2018-11-10T13:01:45Z

Close stale PRs




---


[GitHub] spark issue #17169: [SPARK-19714][ML] Bucketizer.handleInvalid docs improved

2018-11-10 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/17169
  
It seems this has already been improved:

https://github.com/apache/spark/blob/v2.4.0/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L91-L104


---


[GitHub] spark pull request #18323: [SPARK-21117][SQL] Built-in SQL Function Support ...

2018-11-10 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/18323


---


[GitHub] spark issue #21339: [SPARK-24287][Core] Spark -packages option should suppor...

2018-11-10 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/21339
  
ping @fangshil 


---


[GitHub] spark issue #21783: [SPARK-24799]A solution of dealing with data skew in lef...

2018-11-10 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/21783
  
@marymwu Could you make a PR against the `master` branch?


---


[GitHub] spark pull request #22092: [SPARK-25101][CORE]Creating leaderLatch with id f...

2018-11-10 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22092#discussion_r232449802
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/master/ZooKeeperLeaderElectionAgent.scala
 ---
@@ -19,7 +19,6 @@ package org.apache.spark.deploy.master
 
 import org.apache.curator.framework.CuratorFramework
 import org.apache.curator.framework.recipes.leader.{LeaderLatch, LeaderLatchListener}
-
--- End diff --

Do not remove this line.


---


[GitHub] spark pull request #22999: [SPARK-20319][SQL] Already quoted identifiers are...

2018-11-10 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/22999

[SPARK-20319][SQL] Already quoted identifiers are getting wrapped with 
additional quotes

## What changes were proposed in this pull request?

Currently the JDBC writer uses dialects to quote field names, but when column 
names are already explicitly wrapped in quotes, the JDBC driver fails to parse 
the resulting columns with two double quotes, 
e.g. ```""columnName""```. 
This PR fixes this issue.
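
As a rough illustration of the idea (not the actual change in this PR), a quoting helper can check whether the identifier is already wrapped before adding quotes. `QuoteSketch` and `quoteIfNeeded` are hypothetical names, and the sketch assumes a dialect that quotes identifiers with double quotes.

```scala
object QuoteSketch {
  // Wrap `name` in double quotes unless it is already wrapped in them.
  // Assumption: the dialect's quote character is the double quote.
  def quoteIfNeeded(name: String): String = {
    val alreadyQuoted =
      name.length >= 2 && name.startsWith("\"") && name.endsWith("\"")
    if (alreadyQuoted) name else "\"" + name + "\""
  }

  def main(args: Array[String]): Unit = {
    println(quoteIfNeeded("columnName"))     // quoted once
    println(quoteIfNeeded("\"columnName\"")) // left unchanged
  }
}
```

A real dialect would also have to deal with quote characters embedded inside the name, which this sketch deliberately leaves out.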

## How was this patch tested?

unit tests

Closes https://github.com/apache/spark/pull/17631


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-20319

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22999.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22999


commit b2af6b33e5289c07b1c70d4eb96c721cc7db8cea
Author: Yuming Wang 
Date:   2018-11-10T11:05:38Z

SPARK-20319 Already quoted identifiers are getting wrapped with additional 
quotes




---


[GitHub] spark issue #17631: [SPARK-20319][SQL] Already quoted identifiers are gettin...

2018-11-10 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/17631
  
@gatorsmile @maropu I have taken this over.


---


[GitHub] spark issue #21161: [SPARK-21645]left outer join synchronize the condition f...

2018-11-10 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/21161
  
I think it has been fixed by 
[SPARK-21479](https://issues.apache.org/jira/browse/SPARK-21479).


---


[GitHub] spark pull request #22855: [SPARK-25839] [Core] Implement use of KryoPool in...

2018-11-09 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22855#discussion_r232203634
  
--- Diff: 
core/src/test/scala/org/apache/spark/serializer/KryoSerializerBenchmark.scala 
---
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.serializer
+
+import scala.concurrent._
+import scala.concurrent.ExecutionContext.Implicits.global
+import scala.concurrent.duration._
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
+import org.apache.spark.serializer.KryoTest._
+import org.apache.spark.util.ThreadUtils
+
+/**
+ * Benchmark for KryoPool vs old "pool of 1".
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *  bin/spark-submit --class  --jars 
+ *   2. build/sbt "core/test:runMain "
+ *   3. generate result:
+ *  SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "core/test:runMain 
"
+ *  Results will be written to 
"benchmarks/KryoSerializerBenchmark-results.txt".
+ * }}}
+ */
+object KryoSerializerBenchmark extends BenchmarkBase {
--- End diff --

cc @dongjoon-hyun for Benchmark change.


---


[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-11-09 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22683
  
cc @srowen Could we merge this PR?


---


[GitHub] spark issue #22982: Spark 25973 - Spark History Main page performance improv...

2018-11-08 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22982
  
@Willymontaz Please update the PR title to ```[SPARK-25973][CORE] Spark History Main page performance improvement```.


---


[GitHub] spark pull request #22985: [SPARK-25510][SQL][TEST][FOLLOW-UP] Remove Benchm...

2018-11-08 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/22985

[SPARK-25510][SQL][TEST][FOLLOW-UP] Remove BenchmarkWithCodegen

## What changes were proposed in this pull request?

Remove `BenchmarkWithCodegen` as we don't use it anymore.
More details: 
https://github.com/apache/spark/pull/22484#discussion_r221397904

## How was this patch tested?

N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-25510

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22985.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22985


commit 9ced28d6b28c631440865d3510394a0e95cc2da8
Author: Yuming Wang 
Date:   2018-11-08T15:11:18Z

Remove BenchmarkWithCodegen




---


[GitHub] spark issue #22981: SPARK-25975 - Spark History does not display necessarily...

2018-11-08 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22981
  
Do you know which PR fixed this issue?


---


[GitHub] spark issue #22981: SPARK-25975 - Spark History does not display necessarily...

2018-11-08 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22981
  
@Willymontaz Could you make a PR against the `master` branch?


---


[GitHub] spark issue #22980: Spark 25973 - Spark History Main page performance improv...

2018-11-08 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22980
  
@Willymontaz Could you make a PR against the `master` branch?


---


[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...

2018-11-08 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/20944
  
This fix is for testing only; production deployments won't use Derby as their 
metastore database.


---


[GitHub] spark pull request #22721: [SPARK-25403][SQL] Refreshes the table after inse...

2018-11-08 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22721#discussion_r231791936
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
@@ -183,13 +183,14 @@ case class InsertIntoHadoopFsRelationCommand(
 refreshUpdatedPartitions(updatedPartitionPaths)
   }
 
-  // refresh cached files in FileIndex
-  fileIndex.foreach(_.refresh())
-  // refresh data cache if table is cached
-  sparkSession.catalog.refreshByPath(outputPath.toString)
-
   if (catalogTable.nonEmpty) {
+sparkSession.sessionState.catalog.refreshTable(catalogTable.get.identifier)
--- End diff --

Maybe we should fix it via: https://github.com/apache/spark/pull/20430


---


[GitHub] spark pull request #22721: [SPARK-25403][SQL] Refreshes the table after inse...

2018-11-07 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22721#discussion_r231789027
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
@@ -183,13 +183,14 @@ case class InsertIntoHadoopFsRelationCommand(
 refreshUpdatedPartitions(updatedPartitionPaths)
   }
 
-  // refresh cached files in FileIndex
-  fileIndex.foreach(_.refresh())
-  // refresh data cache if table is cached
-  sparkSession.catalog.refreshByPath(outputPath.toString)
-
   if (catalogTable.nonEmpty) {
+sparkSession.sessionState.catalog.refreshTable(catalogTable.get.identifier)
--- End diff --

Good catch. A newly created table's stats are empty, right?


---


[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...

2018-11-07 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/20944
  
Sorry @HyukjinKwon, it's difficult to reproduce. I am not sure whether it is 
caused by multithreading.
But you can verify it with:

https://github.com/apache/spark/blob/a75571b46f813005a6d4b076ec39081ffab11844/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala#L117-L120




---


[GitHub] spark issue #22617: [SPARK-25484][SQL][TEST] Refactor ExternalAppendOnlyUnsa...

2018-11-06 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22617
  
Retest this please


---


[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...

2018-11-05 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22943#discussion_r230970713
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ---
@@ -359,7 +359,7 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
   // TimestampConverter
   private[this] def castToTimestamp(from: DataType): Any => Any = from 
match {
 case StringType =>
-  buildCast[UTF8String](_, utfs => 
DateTimeUtils.stringToTimestamp(utfs, timeZone).orNull)
+  buildCast[UTF8String](_, s => 
DateTimeUtils.stringToTimestamp(s.trim(), timeZone).orNull)
--- End diff --

How about changing `stringToDate` to `trimStringToDate` and updating 
`trimStringToDate` to:

![image](https://user-images.githubusercontent.com/5399861/48036353-ec49a100-e1a2-11e8-80e6-b52b4a007493.png)



---


[GitHub] spark issue #22949: [minor] update known_translations

2018-11-05 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22949
  
Update line 132:
```
wangyum - Yuming Wang
```
How about adding:
```
lipzhu - Lipeng Zhu
daviddingly - David Ding
laskfla - Keith Sun
```



---


[GitHub] spark pull request #22502: [SPARK-25474][SQL]When the "fallBackToHdfsForStat...

2018-11-05 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22502#discussion_r230734089
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
 ---
@@ -86,10 +89,28 @@ case class HadoopFsRelation(
   }
 
   override def sizeInBytes: Long = {
--- End diff --

Maybe you need to implement a rule similar to `DetermineTableStats` for 
datasource tables?


---


[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...

2018-11-05 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22590
  
retest this please


---


[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...

2018-11-05 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22943
  
retest this please


---


[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...

2018-11-05 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22943
  
retest this please


---


[GitHub] spark pull request #22693: [SPARK-25701][SQL] Supports calculation of table ...

2018-11-04 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22693#discussion_r230639634
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -115,26 +116,45 @@ class ResolveHiveSerdeTable(session: SparkSession) 
extends Rule[LogicalPlan] {
 
 class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] 
{
   override def apply(plan: LogicalPlan): LogicalPlan = plan 
resolveOperators {
+case filterPlan @ Filter(_, SubqueryAlias(_, relation: 
HiveTableRelation)) =>
+  val predicates = 
PhysicalOperation.unapply(filterPlan).map(_._2).getOrElse(Nil)
+  computeTableStats(relation, predicates)
 case relation: HiveTableRelation
 if DDLUtils.isHiveTable(relation.tableMeta) && 
relation.tableMeta.stats.isEmpty =>
-  val table = relation.tableMeta
-  val sizeInBytes = if 
(session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
-try {
-  val hadoopConf = session.sessionState.newHadoopConf()
-  val tablePath = new Path(table.location)
-  val fs: FileSystem = tablePath.getFileSystem(hadoopConf)
-  fs.getContentSummary(tablePath).getLength
-} catch {
-  case e: IOException =>
-logWarning("Failed to get table size from hdfs.", e)
-session.sessionState.conf.defaultSizeInBytes
-}
-  } else {
-session.sessionState.conf.defaultSizeInBytes
+  computeTableStats(relation)
+  }
+
+  private def computeTableStats(
+  relation: HiveTableRelation,
+  predicates: Seq[Expression] = Nil): LogicalPlan = {
+val table = relation.tableMeta
+val sizeInBytes = if 
(session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
+  try {
+val hadoopConf = session.sessionState.newHadoopConf()
+val tablePath = new Path(table.location)
+val fs: FileSystem = tablePath.getFileSystem(hadoopConf)
+BigInt(fs.getContentSummary(tablePath).getLength)
+  } catch {
+case e: IOException =>
+  logWarning("Failed to get table size from hdfs.", e)
+  getSizeInBytesFromTablePartitions(table.identifier, predicates)
   }
+} else {
+  getSizeInBytesFromTablePartitions(table.identifier, predicates)
+}
+val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes 
= sizeInBytes)))
+relation.copy(tableMeta = withStats)
+  }
 
-  val withStats = table.copy(stats = 
Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes
-  relation.copy(tableMeta = withStats)
+  private def getSizeInBytesFromTablePartitions(
+  tableIdentifier: TableIdentifier,
+  predicates: Seq[Expression] = Nil): BigInt = {
+session.sessionState.catalog.listPartitionsByFilter(tableIdentifier, 
predicates) match {
--- End diff --

After [this refactor](https://github.com/apache/spark/pull/22743), we can 
avoid computing stats if the `LogicalRelation` is already cached, because the 
computed stats will not take effect.
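
For reference, the fallback order in the diff quoted above can be sketched roughly as follows. This is a simplified stand-in, not the rule itself: `estimateSizeInBytes`, `sizeFromFileSystem` and `sizeFromPartitions` are hypothetical names, and the real code works on `HiveTableRelation` and the session catalog.

```scala
import java.io.IOException

// Hypothetical sketch of the fallback order only; all names here are illustrative.
object TableSizeFallbackSketch {
  def estimateSizeInBytes(
      fallBackToHdfs: Boolean,
      sizeFromFileSystem: () => Long,
      sizeFromPartitions: () => BigInt): BigInt = {
    if (fallBackToHdfs) {
      try {
        // Preferred source: total length reported by the file system.
        BigInt(sizeFromFileSystem())
      } catch {
        // The file system lookup failed, so use the partition-based estimate.
        case _: IOException => sizeFromPartitions()
      }
    } else {
      // File system fallback is disabled: use the partition-based estimate directly.
      sizeFromPartitions()
    }
  }
}
```

Passing the two lookups as functions keeps the sketch free of any Spark or Hadoop dependency.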


---


[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...

2018-11-04 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/22943

[SPARK-25098][SQL] Trim the string when cast stringToTimestamp and 
stringToDate

## What changes were proposed in this pull request?

**Hive** and **Oracle** trim the string when casting a string to a timestamp 
or a date. This PR supports the same behavior in `stringToTimestamp` and 
`stringToDate`:

![image](https://user-images.githubusercontent.com/5399861/47979721-793b1e80-e0ff-11e8-97c8-24b10950ee9e.png)

![image](https://user-images.githubusercontent.com/5399861/47979725-7dffd280-e0ff-11e8-87d4-5767a00ed46e.png)
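
As a rough illustration of the intended behavior (not the code in this PR), the sketch below uses `java.time.LocalDate` as a stand-in for `DateTimeUtils`. `TrimCastSketch`, `strictToDate` and `trimToDate` are hypothetical names, and `None` models a SQL NULL result.

```scala
import java.time.LocalDate
import scala.util.Try

// Hypothetical stand-in for the cast; this is not Spark's DateTimeUtils.
object TrimCastSketch {
  /** Parses an ISO-8601 date (yyyy-MM-dd) as-is; None models a NULL result. */
  def strictToDate(s: String): Option[LocalDate] =
    Try(LocalDate.parse(s)).toOption

  /** Strips leading and trailing whitespace before parsing. */
  def trimToDate(s: String): Option[LocalDate] =
    Try(LocalDate.parse(s.trim)).toOption
}
```

With this sketch, a padded input such as `" 2018-11-05 "` fails the strict parse but succeeds once trimmed.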


## How was this patch tested?

unit tests

Closes https://github.com/apache/spark/pull/22089


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-25098

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22943.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22943


commit d297817b7457fef40eb78b803542aed213afb7fc
Author: Yuming Wang 
Date:   2018-11-05T05:31:22Z

trim() the string when cast stringToTimestamp and stringToDate




---


[GitHub] spark pull request #22941: [SPARK-25936][SQL] Fix InsertIntoDataSourceComman...

2018-11-04 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22941#discussion_r230622708
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala ---
@@ -589,4 +590,33 @@ class InsertSuite extends DataSourceTest with 
SharedSQLContext {
   sql("INSERT INTO TABLE test_table SELECT 2, null")
 }
   }
+
+  test("SPARK-25936 InsertIntoDataSourceCommand does not use Cached Data") 
{
--- End diff --

It works. Do we need to fix this plan issue?


---


[GitHub] spark issue #22903: [SPARK-24196][SQL] Implement Spark's own GetSchemasOpera...

2018-11-04 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22903
  
cc @gatorsmile


---


[GitHub] spark issue #22089: [SPARK-25098][SQL]‘Cast’ will return NULL when input...

2018-11-04 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22089
  
Sure, @gatorsmile .


---


[GitHub] spark pull request #22908: [MINOR][SQL] Replace all TreeNode's node name in ...

2018-11-04 Thread wangyum

Github user wangyum closed the pull request at:

https://github.com/apache/spark/pull/22908


---


[GitHub] spark pull request #22941: [SPARK-25936][SQL] Fix InsertIntoDataSourceComman...

2018-11-04 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/22941

[SPARK-25936][SQL] Fix InsertIntoDataSourceCommand does not use Cached Data

## What changes were proposed in this pull request?

```scala
spark.sql("""
  CREATE TABLE jdbcTable
  USING org.apache.spark.sql.jdbc
  OPTIONS (
    url "jdbc:mysql://localhost:3306/test",
    dbtable "test.InsertIntoDataSourceCommand",
    user "hive",
    password "hive"
  )""")

spark.range(2).createTempView("test_view")
spark.catalog.cacheTable("test_view")
spark.sql("INSERT INTO TABLE jdbcTable SELECT * FROM test_view").explain
```

Before this PR:
```
== Physical Plan == 

Execute InsertIntoDataSourceCommand
   +- InsertIntoDataSourceCommand
 +- Project
+- SubqueryAlias
   +- Range (0, 2, step=1, splits=Some(8))
```

After this PR:
```
== Physical Plan == 

Execute InsertIntoDataSourceCommand InsertIntoDataSourceCommand 
Relation[id#8L] JDBCRelation(test.InsertIntoDataSourceCommand) 
[numPartitions=1], false, [id]
+- *(1) InMemoryTableScan [id#0L]
  +- InMemoryRelation [id#0L], StorageLevel(disk, memory, deserialized, 
1 replicas)
+- *(1) Range (0, 2, step=1, splits=8)
```

## How was this patch tested?

unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-25936

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22941.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22941


commit 2968b2c34f42f6b0bcb5e373a400377abfd09e86
Author: Yuming Wang 
Date:   2018-11-04T10:36:20Z

Fix InsertIntoDataSourceCommand does not use Cached Data




---


[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230570164
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -883,6 +883,31 @@ class Dataset[T] private[sql](
 join(right, Seq(usingColumn))
   }
 
+  /**
+* Equi-join with another `DataFrame` using the given column.
+*
+* Different from other join functions, the join column will only 
appear once in the output,
+* i.e. similar to SQL's `JOIN USING` syntax.
+*
+* {{{
+*   // Left join of df1 and df2 using the column "user_id"
+*   df1.join(df2, "user_id", "left")
+* }}}
+*
+* @param right Right side of the join operation.
+* @param usingColumn Name of the column to join on. This column must 
exist on both sides.
+* @param joinType Type of join to perform. Default `inner`. Must be 
one of:
+* `inner`, `cross`, `outer`, `full`, `full_outer`, 
`left`, `left_outer`,
+* `right`, `right_outer`, `left_semi`, `left_anti`.
+* @note If you perform a self-join using this function without 
aliasing the input
+* `DataFrame`s, you will NOT be able to reference any columns after 
the join, since
+* there is no way to disambiguate which side of the join you would 
like to reference.
+* @group untypedrel
+*/
+  def join(right: Dataset[_], usingColumn: String, joinType: String): 
DataFrame = {
--- End diff --

cc @dongjoon-hyun


---


[GitHub] spark issue #22930: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...

2018-11-03 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22930
  
cc @gatorsmile @gengliangwang @maropu


---


[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230559647
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -883,6 +883,31 @@ class Dataset[T] private[sql](
 join(right, Seq(usingColumn))
   }
 
+  /**
+* Equi-join with another `DataFrame` using the given column.
+*
+* Different from other join functions, the join column will only 
appear once in the output,
+* i.e. similar to SQL's `JOIN USING` syntax.
+*
+* {{{
+*   // Left join of df1 and df2 using the column "user_id"
+*   df1.join(df2, "user_id", "left")
+* }}}
+*
+* @param right Right side of the join operation.
+* @param usingColumn Name of the column to join on. This column must 
exist on both sides.
+* @param joinType Type of join to perform. Default `inner`. Must be 
one of:
+* `inner`, `cross`, `outer`, `full`, `full_outer`, 
`left`, `left_outer`,
+* `right`, `right_outer`, `left_semi`, `left_anti`.
+* @note If you perform a self-join using this function without 
aliasing the input
+* `DataFrame`s, you will NOT be able to reference any columns after 
the join, since
+* there is no way to disambiguate which side of the join you would 
like to reference.
+* @group untypedrel
+*/
+  def join(right: Dataset[_], usingColumn: String, joinType: String): 
DataFrame = {
--- End diff --

Could we close this?


---


[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230557937
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -883,6 +883,31 @@ class Dataset[T] private[sql](
 join(right, Seq(usingColumn))
   }
 
+  /**
+* Equi-join with another `DataFrame` using the given column.
+*
+* Different from other join functions, the join column will only 
appear once in the output,
+* i.e. similar to SQL's `JOIN USING` syntax.
+*
+* {{{
+*   // Left join of df1 and df2 using the column "user_id"
+*   df1.join(df2, "user_id", "left")
+* }}}
+*
+* @param right Right side of the join operation.
+* @param usingColumn Name of the column to join on. This column must 
exist on both sides.
+* @param joinType Type of join to perform. Default `inner`. Must be 
one of:
+* `inner`, `cross`, `outer`, `full`, `full_outer`, 
`left`, `left_outer`,
+* `right`, `right_outer`, `left_semi`, `left_anti`.
+* @note If you perform a self-join using this function without 
aliasing the input
+* `DataFrame`s, you will NOT be able to reference any columns after 
the join, since
+* there is no way to disambiguate which side of the join you would 
like to reference.
+* @group untypedrel
+*/
+  def join(right: Dataset[_], usingColumn: String, joinType: String): 
DataFrame = {
--- End diff --

@arman1371 What do you think? 
```def join(right: Dataset[_], usingColumn: String, joinType: String)``` 
only supports one column, right?


---


[GitHub] spark issue #22935: Branch 2.2

2018-11-03 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22935
  
@litao1223 Please close this.


---


[GitHub] spark issue #22089: [SPARK-25098][SQL]‘Cast’ will return NULL when input...

2018-11-03 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22089
  
ping @bingbai0912 


---


[GitHub] spark issue #22088: [SPARK-24931][CORE]CoarseGrainedExecutorBackend send wro...

2018-11-03 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22088
  
cc @jiangxb1987 


---


[GitHub] spark pull request #22934: [BUILD] Close stale PRs

2018-11-03 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/22934

[BUILD] Close stale PRs

Closes https://github.com/apache/spark/pull/22859
Closes https://github.com/apache/spark/pull/22591
Closes https://github.com/apache/spark/pull/22322
Closes https://github.com/apache/spark/pull/22312
Closes https://github.com/apache/spark/pull/19590

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark CloseStalePRs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22934.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22934


commit 322e21c29919cb7dcfc2e088cd5d605e1f4bb5a7
Author: Yuming Wang 
Date:   2018-11-03T12:34:05Z

Close stale PRs




---


[GitHub] spark pull request #22889: [SPARK-25882][SQL] Added a function to join two d...

2018-11-03 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22889#discussion_r230555316
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -883,6 +883,31 @@ class Dataset[T] private[sql](
 join(right, Seq(usingColumn))
   }
 
+  /**
+* Equi-join with another `DataFrame` using the given column.
+*
+* Different from other join functions, the join column will only 
appear once in the output,
+* i.e. similar to SQL's `JOIN USING` syntax.
+*
+* {{{
+*   // Left join of df1 and df2 using the column "user_id"
+*   df1.join(df2, "user_id", "left")
+* }}}
+*
+* @param right Right side of the join operation.
+* @param usingColumn Name of the column to join on. This column must 
exist on both sides.
+* @param joinType Type of join to perform. Default `inner`. Must be 
one of:
+* `inner`, `cross`, `outer`, `full`, `full_outer`, 
`left`, `left_outer`,
+* `right`, `right_outer`, `left_semi`, `left_anti`.
+* @note If you perform a self-join using this function without 
aliasing the input
+* `DataFrame`s, you will NOT be able to reference any columns after 
the join, since
+* there is no way to disambiguate which side of the join you would 
like to reference.
+* @group untypedrel
+*/
+  def join(right: Dataset[_], usingColumn: String, joinType: String): 
DataFrame = {
--- End diff --

So in your case, could you replace ```df1.join(df2, "user_id", "left")``` 
with ```df1.join(df2, Seq("user_id"), "left")```?
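
A minimal sketch of that replacement, assuming `spark-sql` is on the classpath and a local session is acceptable. The object name, data, and the `name` and `city` columns are made up for illustration; only the existing `join(right, usingColumns: Seq[String], joinType: String)` overload is used.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative only: the data and column names other than "user_id" are assumptions.
object JoinUsingSketch {
  /** Left join on "user_id" via the existing Seq[String] overload of Dataset.join. */
  def leftJoinOnUserId(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(df2, Seq("user_id"), "left")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("join-using-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "alice"), (2, "bob")).toDF("user_id", "name")
    val df2 = Seq((1, "NYC")).toDF("user_id", "city")

    // The join column appears once in the output, as with SQL's JOIN USING.
    leftJoinOnUserId(df1, df2).printSchema()

    spark.stop()
  }
}
```

Since the `Seq` overload already accepts a join type, the single-column case is covered by wrapping the column name in a one-element `Seq`.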


---


[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...

2018-11-03 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22683
  
cc @ajbozarth


---

