[GitHub] spark pull request #18544: [SPARK-21318][SQL]Improve exception message throw...

2018-09-21 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/18544#discussion_r219485843
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/UDFSuite.scala ---
@@ -193,4 +193,29 @@ class UDFSuite
   }
 }
   }
+
+  test("SPARK-21318: The correct exception message should be thrown " +
+"if a UDF/UDAF has already been registered") {
+val UDAFName = "empty"
+val UDAFClassName = classOf[org.apache.spark.sql.hive.execution.UDAFEmpty].getCanonicalName
+
+withTempDatabase { dbName =>
--- End diff --

@cloud-fan I just copied and modified the code from another test case, the 
default database works well.

The test case has been simplified now.


---




[GitHub] spark pull request #18544: [SPARK-21318][SQL]Improve exception message throw...

2018-09-21 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/18544#discussion_r219468948
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala ---
@@ -1440,6 +1441,8 @@ abstract class SessionCatalogSuite extends 
AnalysisTest {
 }
 
 assert(cause.getMessage.contains("Undefined function: 'undefined_fn'"))
+// SPARK-21318: the error message should contain the current database name
--- End diff --

org.apache.spark.sql.AnalysisException: Undefined function: 'undefined_fn'. 
This function is neither a registered temporary function nor a permanent 
function registered in the database 'db1'.;


---




[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2018-09-21 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
@cloud-fan 

A user's Hive UDFs are registered in the externalCatalog and do not exist in the functionRegistry.

Currently a NoSuchFunctionException is thrown when an exception is encountered while loading a Hive UDF.

But we should throw the original exception.

So, I just fixed the issue by changing:

```
if (functionRegistry.functionExists(funcName)) {
  throw error
} else {
  ...
}
```

to:

```
if (super.functionExists(name)) {
  throw error
} else {
  ...
}
```

The following is the implementation of `super.functionExists`:

```
def functionExists(name: FunctionIdentifier): Boolean = {
  val db = formatDatabaseName(name.database.getOrElse(getCurrentDatabase))
  requireDbExists(db)
  functionRegistry.functionExists(name) ||
externalCatalog.functionExists(db, name.funcName)
}
```


---




[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2018-09-18 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
This issue was addressed a long time ago. @cloud-fan @maropu


---




[GitHub] spark pull request #22051: [SPARK-25064][WEBUI] Add killed tasks count info ...

2018-08-09 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/22051

[SPARK-25064][WEBUI] Add killed tasks count info to WebUI

## What changes were proposed in this pull request?

Add the missing killed tasks count to the WebUI.

Total tasks = Active + Failed + Killed + Complete tasks.

## How was this patch tested?

Manual tests + Unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark fix-webui-task-count

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22051.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22051


commit 4317d0c578e3dd1bd3182325bf7089fa380420e5
Author: Stan Zhai 
Date:   2018-08-03T07:21:45Z

add killed task count info to webui




---




[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2018-07-18 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
It's not reasonable: `failFunctionLookup` throws `NoSuchFunctionException`.
The function actually exists in the currently selected database, so we should throw the exception caused by the initialization failure, not `NoSuchFunctionException`.


---




[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2018-07-17 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
cc @gatorsmile The changes in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala` have been reverted.


---




[GitHub] spark pull request #18544: [SPARK-21318][SQL]Improve exception message throw...

2018-07-16 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/18544#discussion_r202607295
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala ---
@@ -129,14 +129,14 @@ private[sql] class HiveSessionCatalog(
 Try(super.lookupFunction(funcName, children)) match {
   case Success(expr) => expr
   case Failure(error) =>
-if (functionRegistry.functionExists(funcName)) {
-  // If the function actually exists in functionRegistry, it means 
that there is an
-  // error when we create the Expression using the given children.
+if (super.functionExists(name)) {
+  // If the function actually exists in functionRegistry or 
externalCatalog,
+  // it means that there is an error when we create the Expression 
using the given children.
   // We need to throw the original exception.
   throw error
 } else {
-  // This function is not in functionRegistry, let's try to load 
it as a Hive's
-  // built-in function.
+  // This function is not in functionRegistry or externalCatalog,
+  // let's try to load it as a Hive's built-in function.
   // Hive is case insensitive.
   val functionName = 
funcName.unquotedString.toLowerCase(Locale.ROOT)
   if (!hiveFunctions.contains(functionName)) {
--- End diff --

Yes, that's right.


---




[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2018-07-11 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
cc @gatorsmile Addressed. Review this please. Thanks!


---




[GitHub] spark pull request #18544: [SPARK-21318][SQL]Improve exception message throw...

2018-07-11 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/18544#discussion_r201579348
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala ---
@@ -129,14 +129,14 @@ private[sql] class HiveSessionCatalog(
 Try(super.lookupFunction(funcName, children)) match {
   case Success(expr) => expr
   case Failure(error) =>
-if (functionRegistry.functionExists(funcName)) {
-  // If the function actually exists in functionRegistry, it means 
that there is an
-  // error when we create the Expression using the given children.
+if (super.functionExists(name)) {
--- End diff --

We should keep using `super.functionExists(name)`. If it is replaced by `functionExists(name)`, we cannot load a Hive built-in function, and `org.apache.spark.sql.AnalysisException: Undefined function: 'histogram_numeric'` will be thrown.


---




[GitHub] spark issue #21663: [SPARK-24680][Deploy]Support spark.executorEnv.JAVA_HOME...

2018-07-04 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/21663
  
@jerryshao My Spark application is built on top of JDK 10, but the standalone cluster manager is running with JDK 8, which does not support JDK 10.

Java 7 support has been removed since Spark 2.2.

I've verified that serialized messages from JDK 10 executors can be read by a JDK 8 worker.

Aside from that, I think we should make the spark.executorEnv.JAVA_HOME configuration work, and leave the decision about whether to use it to the user.
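For context, a minimal sketch of how this setting would be supplied on the application side once it takes effect (the host and JDK path here are hypothetical):

```scala
import org.apache.spark.SparkConf

// Hypothetical paths/hosts; the point is that the JDK used by executors should be
// selectable per application through spark.executorEnv.JAVA_HOME.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // standalone cluster manager
  .setAppName("jdk10-app")
  .set("spark.executorEnv.JAVA_HOME", "/opt/jdk-10")
```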


---




[GitHub] spark pull request #21680: [SPARK-24704][WebUI] Fix the order of stages in t...

2018-06-30 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/21680

[SPARK-24704][WebUI] Fix the order of stages in the DAG graph

## What changes were proposed in this pull request?

Before:


![wx20180630-155537](https://user-images.githubusercontent.com/1438757/42123357-2c2e2d84-7c83-11e8-8abd-1c2860f38783.png)

After:


![wx20180630-155604](https://user-images.githubusercontent.com/1438757/42123359-32fae990-7c83-11e8-8a7b-cdcee94f9123.png)

## How was this patch tested?

Manual tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark fix-dag-graph

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21680.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21680


commit b3420d61025f7bb9e17160dfb586bc54fba1a51d
Author: Stan Zhai 
Date:   2018-06-30T07:57:08Z

fix stage order in job graph




---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-29 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r199062132
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,14 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED =
+buildConf("spark.sql.parquet.filterPushdown.string.startsWith")
--- End diff --

It would be better if we added an `.enabled` suffix.
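For illustration, a sketch of the suggested name, following the same `buildConf` pattern as the diff above (the doc string is made up; this assumes SQLConf's `buildConf` helper):

```scala
  val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED =
    buildConf("spark.sql.parquet.filterPushdown.string.startsWith.enabled")
      .doc("Enables Parquet filter push-down for string startsWith predicates.")
      .booleanConf
      .createWithDefault(true)
```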


---




[GitHub] spark pull request #21663: [SPARK-24680][Deploy]Support spark.executorEnv.JA...

2018-06-28 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/21663

[SPARK-24680][Deploy]Support spark.executorEnv.JAVA_HOME in Standalone mode

## What changes were proposed in this pull request?

spark.executorEnv.JAVA_HOME does not take effect when a Worker starts an Executor process in Standalone mode.

This PR fixes that.

## How was this patch tested?

Manual tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark fix-executor-env-java-home

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21663.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21663


commit b46c5357746880d420b208733443cb8b49164e81
Author: Stan Zhai 
Date:   2018-06-28T13:44:01Z

fix spark.executorEnv.JAVA_HOME




---




[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...

2018-01-19 Thread stanzhai
Github user stanzhai closed the pull request at:

https://github.com/apache/spark/pull/19301


---




[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2017-10-28 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
fixed @gatorsmile . retest this please


---




[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2017-10-25 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
Hi @gatorsmile, I've added some test cases, and they pass on my machine.


---




[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...

2017-09-25 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19301#discussion_r140699522
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala ---
@@ -72,11 +74,19 @@ object AggregateExpression {
   aggregateFunction: AggregateFunction,
   mode: AggregateMode,
   isDistinct: Boolean): AggregateExpression = {
+val state = if (aggregateFunction.resolved) {
+  Seq(aggregateFunction.toString, aggregateFunction.dataType,
+aggregateFunction.nullable, mode, isDistinct)
+} else {
+  Seq(aggregateFunction.toString, mode, isDistinct)
+}
+val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * 
a + b)
+
 AggregateExpression(
   aggregateFunction,
   mode,
   isDistinct,
-  NamedExpression.newExprId)
+  ExprId(hashCode))
--- End diff --

I've tried to optimize this in the aggregate planner (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L211):

```scala
  // A single aggregate expression might appear multiple times in 
resultExpressions.
  // In order to avoid evaluating an individual aggregate function 
multiple times, we'll
  // build a set of the distinct aggregate expressions and build a 
function which can
  // be used to re-write expressions so that they reference the single 
copy of the
  // aggregate function which actually gets computed.
  val aggregateExpressions = resultExpressions.flatMap { expr =>
expr.collect {
  case agg: AggregateExpression =>
val aggregateFunction = agg.aggregateFunction
val state = if (aggregateFunction.resolved) {
  Seq(aggregateFunction.toString, aggregateFunction.dataType,
aggregateFunction.nullable, agg.mode, agg.isDistinct)
} else {
  Seq(aggregateFunction.toString, agg.mode, agg.isDistinct)
}
val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) 
=> 31 * a + b)
(hashCode, agg)
}
  }.groupBy(_._1).map { case (_, values) =>
values.head._2
  }.toSeq
```

But it's difficult to distinguish between different typed aggregators without an expr id. The current solution works well for all aggregate functions.

I'm not familiar with typed aggregators; any suggestions would be appreciated.


---




[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...

2017-09-22 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/19301
  
@viirya 
Benchmark code:

```scala
val N = 500L << 22
val benchmark = new Benchmark("agg", N)
val expressions = (0 until 50).map(i => s"sum(id) as r$i")

benchmark.addCase("agg with optimize", numIters = 2) { iter =>
  sparkSession.range(N).selectExpr(expressions: _*).collect()
}

benchmark.run()
```

Result:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.12.6
Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz

agg:                                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------
agg with optimize                          1306 / 1354       1605.7           0.6       1.0X
agg without optimize                    121799 / 148115        17.2          58.1       1.0X

```



---




[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...

2017-09-22 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/19301
  
@viirya The problem is already obvious: the same aggregate expression will be computed multiple times. I will provide a benchmark result later.


---




[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...

2017-09-21 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/19301
  
@cenyuhai This is an optimization of the physical plan, and your case can be optimized.

```SQL
select dt,
geohash_of_latlng,
sum(mt_cnt),
sum(ele_cnt),
round(sum(mt_cnt) * 1.0 * 100 / sum(mt_cnt_all), 2),
round(sum(ele_cnt) * 1.0 * 100 / sum(ele_cnt_all), 2)
from values(1, 2, 3, 4, 5, 6) as (dt, geohash_of_latlng, mt_cnt, ele_cnt, 
mt_cnt_all, ele_cnt_all)
group by dt, geohash_of_latlng
order by dt, geohash_of_latlng limit 10
```

Before:

```
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[dt#26 ASC NULLS 
FIRST,geohash_of_latlng#27 ASC NULLS FIRST], 
output=[dt#26,geohash_of_latlng#27,sum(mt_cnt)#38L,sum(ele_cnt)#39L,round((CAST((CAST((CAST(CAST(sum(CAST(mt_cnt
 AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) 
AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS 
DECIMAL(38,2)) / CAST(CAST(sum(CAST(mt_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS 
DECIMAL(38,2))), 2)#40,round((CAST((CAST((CAST(CAST(sum(CAST(ele_cnt AS 
BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS 
DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS 
DECIMAL(38,2)) / CAST(CAST(sum(CAST(ele_cnt_all AS BIGINT)) AS DECIMAL(20,0)) 
AS DECIMAL(38,2))), 2)#41])
+- *HashAggregate(keys=[dt#26, geohash_of_latlng#27], 
functions=[sum(cast(mt_cnt#28 as bigint)), sum(cast(ele_cnt#29 as bigint)), 
sum(cast(mt_cnt#28 as bigint)), sum(cast(mt_cnt_all#30 as bigint)), 
sum(cast(ele_cnt#29 as bigint)), sum(cast(ele_cnt_all#31 as bigint))])
   +- Exchange hashpartitioning(dt#26, geohash_of_latlng#27, 200)
  +- *HashAggregate(keys=[dt#26, geohash_of_latlng#27], 
functions=[partial_sum(cast(mt_cnt#28 as bigint)), partial_sum(cast(ele_cnt#29 
as bigint)), partial_sum(cast(mt_cnt#28 as bigint)), 
partial_sum(cast(mt_cnt_all#30 as bigint)), partial_sum(cast(ele_cnt#29 as 
bigint)), partial_sum(cast(ele_cnt_all#31 as bigint))])
 +- LocalTableScan [dt#26, geohash_of_latlng#27, mt_cnt#28, 
ele_cnt#29, mt_cnt_all#30, ele_cnt_all#31]
```

After:

```
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[dt#28 ASC NULLS 
FIRST,geohash_of_latlng#29 ASC NULLS FIRST], 
output=[dt#28,geohash_of_latlng#29,sum(mt_cnt)#34L,sum(ele_cnt)#35L,round((CAST((CAST((CAST(CAST(sum(CAST(mt_cnt
 AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) 
AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS 
DECIMAL(38,2)) / CAST(CAST(sum(CAST(mt_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS 
DECIMAL(38,2))), 2)#36,round((CAST((CAST((CAST(CAST(sum(CAST(ele_cnt AS 
BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS 
DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS 
DECIMAL(38,2)) / CAST(CAST(sum(CAST(ele_cnt_all AS BIGINT)) AS DECIMAL(20,0)) 
AS DECIMAL(38,2))), 2)#37])
+- *HashAggregate(keys=[dt#28, geohash_of_latlng#29], 
functions=[sum(cast(mt_cnt#30 as bigint)), sum(cast(ele_cnt#31 as bigint)), 
sum(cast(mt_cnt_all#32 as bigint)), sum(cast(ele_cnt_all#33 as bigint))])
   +- Exchange hashpartitioning(dt#28, geohash_of_latlng#29, 200)
  +- *HashAggregate(keys=[dt#28, geohash_of_latlng#29], 
functions=[partial_sum(cast(mt_cnt#30 as bigint)), partial_sum(cast(ele_cnt#31 
as bigint)), partial_sum(cast(mt_cnt_all#32 as bigint)), 
partial_sum(cast(ele_cnt_all#33 as bigint))])
 +- LocalTableScan [dt#28, geohash_of_latlng#29, mt_cnt#30, 
ele_cnt#31, mt_cnt_all#32, ele_cnt_all#33]
```


---




[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...

2017-09-21 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/19301
  

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L211

```scala
  val aggregateExpressions = resultExpressions.flatMap { expr =>
expr.collect {
  case agg: AggregateExpression => agg
}
  }.distinct
```

Before the fix, the exprId of each aggregate expression is different, which causes `distinct` to fail.
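As a self-contained illustration (plain Scala, not Spark internals) of why `.distinct` fails here: two values that represent the same aggregate but carry freshly generated ids are not equal, so both are kept and the aggregate is computed twice:

```scala
// Simplified stand-in for AggregateExpression: `fn` plays the role of the aggregate
// function, `exprId` the role of NamedExpression.newExprId.
case class Agg(fn: String, exprId: Long)

object DistinctDemo extends App {
  private var counter = 0L
  def newExprId(): Long = { counter += 1; counter }

  // sum(b) referenced twice in the result expressions, each with its own exprId:
  val exprs = Seq(Agg("sum(b)", newExprId()), Agg("sum(b)", newExprId()))

  println(exprs.distinct.size) // 2 -- the ids differ, so nothing is deduplicated
}
```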


---




[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...

2017-09-21 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/19301#discussion_r140155475
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala ---
@@ -38,7 +38,7 @@ import org.apache.spark.sql.internal.SQLConf
  * view resolution, in this way, we are able to get the correct 
view column ordering and
  * omit the extra columns that we don't require);
  *1.2. Else set the child output attributes to `queryOutput`.
- * 2. Map the `queryQutput` to view output by index, if the corresponding 
attributes don't match,
+ * 2. Map the `queryOutput` to view output by index, if the corresponding 
attributes don't match,
--- End diff --

Q -> O


---




[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...

2017-09-21 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/19301

[SPARK-22084][SQL] Fix performance regression in aggregation strategy

## What changes were proposed in this pull request?

This PR fixes a performance regression in the aggregation strategy which was introduced in Spark 2.0.

For the following SQL:

```SQL
SELECT a, SUM(b) AS b0, SUM(b) AS b1 
FROM VALUES(1, 1), (2, 2) AS (a, b) 
GROUP BY a
```

Before the fix:

```
== Physical Plan ==
*HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), 
sum(cast(b#12 as bigint))])
+- Exchange hashpartitioning(a#11, 200)
   +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as 
bigint)), partial_sum(cast(b#12 as bigint))])
  +- LocalTableScan [a#11, b#12]
```

After

```
== Physical Plan ==
*HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint))])
+- Exchange hashpartitioning(a#11, 2)
   +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as 
bigint))])
  +- LocalTableScan [a#11, b#12]
```

## How was this patch tested?

WIP

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark improve-aggregate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19301.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19301


commit 6f555c20c5c6d2821410aff671758ba73cd8f300
Author: Stan Zhai <m...@stanzhai.site>
Date:   2017-09-19T09:27:35Z

use hashCode as exprId

commit 5aaae4caa6225ecc6d174afb2eefa8d68af5471a
Author: Stan Zhai <m...@stanzhai.site>
Date:   2017-09-19T09:53:56Z

typo

commit adce4740c3c41000215f5d7cc0285701d15bb7cf
Author: Stan Zhai <m...@stanzhai.site>
Date:   2017-09-20T07:12:23Z

Merge branch 'master' of https://github.com/apache/spark into 
improve-aggregate

commit bf7d2cf103e2a0caf1538e3df5c174df173cfc56
Author: Stan Zhai <m...@stanzhai.site>
Date:   2017-09-21T05:19:20Z

Merge branch 'master' of https://github.com/apache/spark into 
improve-aggregate




---




[GitHub] spark issue #18986: [SPARK-21774][SQL] The rule PromoteStrings should cast a...

2017-08-21 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18986
  
@gatorsmile @DonnyZone When comparing a string to an int, Hive will cast the string type to double.

```
hive> select * from tb;
0   0
0.1 0
true0
19157170390056971   0
hive> select * from tb where a = 0;
0   0
hive> select * from tb where a = 19157170390056973L;
WARNING: Comparing a bigint and a string may result in a loss of precision.
19157170390056973   0
hive> select 1 = 'true';
NULL
hive> select 19157170390056973L = '19157170390056971';
WARNING: Comparing a bigint and a string may result in a loss of precision.
true
```

So, I think that casting a string to double type when comparing it with a numeric is more reasonable.

Actually, my usage scenario is about Spark compatibility. I found the problem when I upgraded Spark to 2.2.0, and lots of SQL queries returned wrong results.




---



[GitHub] spark issue #18986: [SPARK-21774][SQL] The rule PromoteStrings should cast a...

2017-08-21 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18986
  
@DonnyZone @gatorsmile @cloud-fan PostgreSQL will throw an error when comparing a string to an int.

```
postgres=# select * from tb;
  a   | b
--+---
 0.1  | 1
 a| 1
 true | 1
(3 rows)

postgres=# select * from tb where a>0;
ERROR:  operator does not exist: character varying > integer
LINE 1: select * from tb where a>0;
^
HINT:  No operator matches the given name and argument type(s). You might 
need to add explicit type casts.
```


---



[GitHub] spark issue #18986: [SPARK-21774][SQL] The rule PromoteStrings should cast a...

2017-08-18 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18986
  
In MySQL, when a value of string type is compared with a numeric value, the values are compared as floating-point (real) numbers.

See https://dev.mysql.com/doc/refman/5.7/en/type-conversion.html

The following rules describe how conversion occurs for comparison 
operations:

- If one or both arguments are NULL, the result of the comparison is NULL, 
except for the NULL-safe <=> equality comparison operator. For NULL <=> NULL, 
the result is true. No conversion is needed.
- If both arguments in a comparison operation are strings, they are 
compared as strings.
- If both arguments are integers, they are compared as integers.
- Hexadecimal values are treated as binary strings if not compared to a 
number.
- If one of the arguments is a TIMESTAMP or DATETIME column and the other 
argument is a constant, the constant is converted to a timestamp before the 
comparison is performed. This is done to be more ODBC-friendly. Note that 
this is not done for the arguments to IN()! To be safe, always use complete 
datetime, date, or time strings when doing comparisons. For example, to achieve 
best results when using BETWEEN with date or time values, use CAST() to 
explicitly convert the values to the desired data type.
- A single-row subquery from a table or tables is not considered a 
constant. For example, if a subquery returns an integer to be compared to a 
DATETIME value, the comparison is done as two integers. The integer is not 
converted to a temporal value. To compare the operands as DATETIME values, use 
CAST() to explicitly convert the subquery value to DATETIME.
- If one of the arguments is a decimal value, comparison depends on the 
other argument. The arguments are compared as decimal values if the other 
argument is a decimal or integer value, or as floating-point values if the 
other argument is a floating-point value.
- In all other cases, the arguments are compared as floating-point (real) 
numbers.


---



[GitHub] spark pull request #18986: [SPARK-21774][SQL] The rule PromoteStrings should...

2017-08-17 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/18986

[SPARK-21774][SQL] The rule PromoteStrings should cast a string to double 
type when compare with a int

## What changes were proposed in this pull request?

The rule PromoteStrings should cast a string to double type when comparing it with an int.

This PR fixes this.
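As a hedged illustration of the kind of query this rule affects (assuming a `SparkSession` named `spark`; column `a` is a string, as in the Hive examples in the review discussion):

```scala
// Whether rows like '0.1' or 'true' match `a = 0` depends on the type the string
// side is promoted to; this PR proposes promoting to double for such comparisons.
spark.sql("SELECT * FROM VALUES ('0'), ('0.1'), ('true') AS t(a) WHERE a = 0").show()
```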

## How was this patch tested?

Existing test cases updated.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark fix-type-coercion

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18986.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18986


commit 1a289a5a1b0756e86e225d43de73d9b42afb0a0e
Author: Stan Zhai <m...@stanzhai.site>
Date:   2017-08-18T02:17:20Z

fix a bug of TypeCoercion




---



[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2017-08-08 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
@gatorsmile 
Some test cases have been added.
Thanks for reviewing.


---



[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...

2017-07-05 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/18544
  
cc @liancheng 


---



[GitHub] spark pull request #18544: [SPARK-21318][SQL]Improve exception message throw...

2017-07-05 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/18544

[SPARK-21318][SQL]Improve exception message thrown by `lookupFunction`

## What changes were proposed in this pull request?

The function actually exists in the currently selected database, but it failed to initialize during `lookupFunction`, and the exception message is:
```
This function is neither a registered temporary function nor a permanent 
function registered in the database 'default'.
```

This is not conducive to locating the real problem. This PR fixes it.

## How was this patch tested?

Existing tests + manual tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark fix-udf-error-message

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18544.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18544


commit 373fc5cacb77bb6e6be02eb3608497cbcaa7edef
Author: Stan Zhai <m...@stanzhai.site>
Date:   2017-07-05T14:41:02Z

optimized udf lookup exception message




---



[GitHub] spark pull request #17529: [SPARK-20211][SQL]Fix a bug in FLOOR and CEIL whe...

2017-06-11 Thread stanzhai
Github user stanzhai closed the pull request at:

https://github.com/apache/spark/pull/17529


---



[GitHub] spark pull request #18244: [SPARK-20211][SQL] Fix the Precision and Scale of...

2017-06-09 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/18244#discussion_r121060627
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala ---
@@ -126,7 +126,15 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
   def set(decimal: BigDecimal): Decimal = {
 this.decimalVal = decimal
 this.longVal = 0L
-this._precision = decimal.precision
+if (decimal.precision <= decimal.scale) {
--- End diff --

Got it, thanks for the fix!


---



[GitHub] spark pull request #18244: [SPARK-20211][SQL] Fix the Precision and Scale of...

2017-06-09 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/18244#discussion_r121058323
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala ---
@@ -126,7 +126,15 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
   def set(decimal: BigDecimal): Decimal = {
 this.decimalVal = decimal
 this.longVal = 0L
-this._precision = decimal.precision
+if (decimal.precision <= decimal.scale) {
--- End diff --

But the comment says `// For Decimal, we expect the precision is equal to or large than the scale`.

The `=` case has already been handled within the `floor` and `ceil` functions.

<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L387>

This is the reason I think we should use `if (decimal.precision < decimal.scale)`, and it works fine for `0.90`.
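For a quick, self-contained check of the `precision < scale` case under discussion (plain Scala `BigDecimal`, which delegates to `java.math.BigDecimal`):

```scala
object PrecisionScaleDemo extends App {
  val small = BigDecimal("0.0001")
  println(small.precision) // 1 -- a single digit in the unscaled value
  println(small.scale)     // 4 -- four digits after the decimal point, so precision < scale

  val ninety = BigDecimal("0.90")
  println(ninety.precision) // 2
  println(ninety.scale)     // 2 -- precision == scale, the `=` case handled by floor/ceil
}
```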




---



[GitHub] spark pull request #18244: [SPARK-20211][SQL] Fix the Precision and Scale of...

2017-06-08 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/18244#discussion_r121053165
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala ---
@@ -126,7 +126,15 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
   def set(decimal: BigDecimal): Decimal = {
 this.decimalVal = decimal
 this.longVal = 0L
-this._precision = decimal.precision
+if (decimal.compare(BigDecimal(1.0)) == -1 && 
decimal.compare(BigDecimal(-1.0)) == 1) {
--- End diff --

just `if (decimal.precision < decimal.scale) {`


https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L387


---



[GitHub] spark issue #10991: [SPARK-12299][CORE] Remove history serving functionality...

2017-05-15 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/10991
  
We've just upgraded our Spark cluster from 1.6.x to 2.x, and I found that the REST APIs of the Spark Master UI are unavailable.

It's important for us to use the REST APIs to monitor our applications. I believe some other people rely on this functionality too.

Right now, the only way to get this information is through the Spark Master WebUI, which is unfortunate.

It would be great to have REST APIs to access Master, Worker, and Application information from the Master.

@BryanCutler @andrewor14 @JoshRosen 


---



[GitHub] spark issue #17529: [SPARK-20211][SQL]Fix a bug in FLOOR and CEIL when a dec...

2017-04-25 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/17529
  
cc @gatorsmile 


---



[GitHub] spark issue #17529: [SPARK-20211][SQL]floor or ceil with a decimal that its ...

2017-04-04 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/17529
  
cc @chenghao-intel 


---



[GitHub] spark pull request #17529: [SPARK-20211][SQL]floor or ceil with a decimal th...

2017-04-04 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/17529

[SPARK-20211][SQL]floor or ceil with a decimal that its `precision < scale` 
should be supported

## What changes were proposed in this pull request?

`precision` in a decimal indicates the length of the arbitrary precision 
integer. Here are a few examples of numbers with the same scale, but different 
precision:

- 12345 / 10^5 = 0.12345 // scale = 5, precision = 5
- 12340 / 10^5 = 0.1234 // scale = 5, precision = 4
- 1 / 10^5 = 0.00001 // scale = 5, precision = 1

This PR fixes a bug in `floor` and `ceil` in `org.apache.spark.sql.types.Decimal` that throws a `Decimal scale (0) cannot be greater than precision (-2)` exception when `precision < scale`.

Before the fix, the following SQL statements will throw an exception:

```
select 1 > 0.0001 from tb
select floor(0.0001) from tb
select ceil(0.0001) from tb
```

## How was this patch tested?

Added unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark fix_decimal_precision

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17529.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17529


commit 2b094b6e8fb1b0b8ae8bc89782305ac44d172ec3
Author: stanzhai <stanz...@outlook.com>
Date:   2017-04-04T12:48:53Z

fix decimal floor/ceil precision bug

commit 2d60230b8344b391c3edfeec7c19ad1717e93710
Author: stanzhai <stanz...@outlook.com>
Date:   2017-04-04T14:28:03Z

add test case

commit 61058b6e69802312bda35cdaf04a5b2af7dcd827
Author: stanzhai <stanz...@outlook.com>
Date:   2017-04-04T15:02:54Z

update test case




---



[GitHub] spark pull request #17131: [SPARK-19766][SQL][BRANCH-2.0] Constant alias col...

2017-03-02 Thread stanzhai
Github user stanzhai closed the pull request at:

https://github.com/apache/spark/pull/17131


---



[GitHub] spark pull request #17131: [SPARK-19766][SQL][BRANCH-2.0] Constant alias col...

2017-03-01 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/17131

[SPARK-19766][SQL][BRANCH-2.0] Constant alias columns in INNER JOIN should 
not be folded by FoldablePropagation rule

This PR is the fix for branch-2.0.

Refer to #17099.

@gatorsmile 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark fix-inner-join-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17131


commit 4975ac7f3a6a714c80e5f875ab54dd60f4aa22a5
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-03-02T05:56:07Z

fix innner join




---



[GitHub] spark pull request #17099: [SPARK-19766][SQL] Constant alias columns in INNE...

2017-03-01 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17099#discussion_r103848391
  
--- Diff: sql/core/src/test/resources/sql-tests/results/inner-join.sql.out ---
@@ -0,0 +1,68 @@
+-- Automatically generated by SQLQueryTestSuite
+-- Number of queries: 13
--- End diff --

Thanks!
I will pay attention to this next time.


---



[GitHub] spark issue #17099: [SPARK-19766][SQL] Constant alias columns in INNER JOIN ...

2017-03-01 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/17099
  
ok


---



[GitHub] spark issue #17099: [SPARK-19766][SQL] Constant alias columns in INNER JOIN ...

2017-02-28 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/17099
  
Thanks for @gatorsmile 's help.

`ConstantFolding` will affect other test cases in 
`FoldablePropagationSuite`.

It's fine without adding `ConstantFolding`.

Before fix:
```
[info]   !'Join Inner, ((a#0 = a#0) && (1 = 1))'Join Inner, (('tb.a 
= 'ta.a) && ('tb.tag = 'ta.tag))
[info]   !:- Union :- 'SubqueryAlias ta
[info]   !:  :- Project [a#0, 1 AS tag#0]  :  +- 'Union
[info]   !:  :  +- LocalRelation , [a#0, b#0]   : :- 'Project 
['a, 1 AS tag#0]
[info]   !:  +- Project [a#0, 2 AS tag#0]  : :  +- 
LocalRelation , [a#0, b#0]
[info]   !: +- LocalRelation , [a#0, b#0]   : +- 'Project 
['a, 2 AS tag#0]
[info]   !+- Union :+- 
LocalRelation , [a#0, b#0]
[info]   !   :- Project [a#0, 1 AS tag#0]  +- 'SubqueryAlias tb
[info]   !   :  +- LocalRelation , [a#0, b#0]  +- 'Union
[info]   !   +- Project [a#0, 2 AS tag#0]:- 'Project 
['a, 1 AS tag#0]
[info]   !  +- LocalRelation , [a#0, b#0] :  +- 
LocalRelation , [a#0, b#0]
[info]   !   +- 'Project 
['a, 2 AS tag#0]
[info]   !  +- 
LocalRelation , [a#0, b#0] (PlanTest.scala:99)
```

After fix:
```
[info]   !'Join Inner, ((a#0 = a#0) && (tag#0 = tag#0))   'Join Inner, 
(('tb.a = 'ta.a) && ('tb.tag = 'ta.tag))
[info]   !:- Union:- 'SubqueryAlias 
ta
[info]   !:  :- Project [a#0, 1 AS tag#0] :  +- 'Union
[info]   !:  :  +- LocalRelation , [a#0, b#0]  : :- 'Project 
['a, 1 AS tag#0]
[info]   !:  +- Project [a#0, 2 AS tag#0] : :  +- 
LocalRelation , [a#0, b#0]
[info]   !: +- LocalRelation , [a#0, b#0]  : +- 'Project 
['a, 2 AS tag#0]
[info]   !+- Union:+- 
LocalRelation , [a#0, b#0]
[info]   !   :- Project [a#0, 1 AS tag#0] +- 'SubqueryAlias 
tb
[info]   !   :  +- LocalRelation , [a#0, b#0] +- 'Union
[info]   !   +- Project [a#0, 2 AS tag#0]   :- 'Project 
['a, 1 AS tag#0]
[info]   !  +- LocalRelation , [a#0, b#0]:  +- 
LocalRelation , [a#0, b#0]
[info]   !  +- 'Project 
['a, 2 AS tag#0]
[info]   ! +- 
LocalRelation , [a#0, b#0] (PlanTest.scala:99)
```

I just fixed the test case (`"tb.tag" -> "tb.tag".attr`).


---



[GitHub] spark issue #17099: [SPARK-19766][SQL] Constant alias columns in INNER JOIN ...

2017-02-28 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/17099
  
@hvanhovell 


---



[GitHub] spark pull request #17099: Constant alias columns in INNER JOIN should not b...

2017-02-28 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/17099

Constant alias columns in INNER JOIN should not be folded by 
FoldablePropagation rule

## What changes were proposed in this pull request?
This PR fixes the code in the Optimizer phase where the constant alias columns of an `INNER JOIN` query are folded by the rule `FoldablePropagation`.

For the following query:

```
val sqlA =
  """
|create temporary view ta as
|select a, 'a' as tag from t1 union all
|select a, 'b' as tag from t2
  """.stripMargin

val sqlB =
  """
|create temporary view tb as
|select a, 'a' as tag from t3 union all
|select a, 'b' as tag from t4
  """.stripMargin

val sql =
  """
|select tb.* from ta inner join tb on
|ta.a = tb.a and
|ta.tag = tb.tag
  """.stripMargin
```

The tag column is a constant alias column; it's folded by `FoldablePropagation` like this:

```
TRACE SparkOptimizer: 
=== Applying Rule 
org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
 Project [a#4, tag#14]  Project [a#4, tag#14]
!+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, ((a#0 = 
a#4) && (a = a))
:- Union   :- Union
:  :- Project [a#0, a AS tag#8]:  :- Project [a#0, 
a AS tag#8]
:  :  +- LocalRelation [a#0]   :  :  +- 
LocalRelation [a#0]
:  +- Project [a#2, b AS tag#9]:  +- Project [a#2, 
b AS tag#9]
: +- LocalRelation [a#2]   : +- 
LocalRelation [a#2]
+- Union   +- Union
   :- Project [a#4, a AS tag#14]  :- Project [a#4, 
a AS tag#14]
   :  +- LocalRelation [a#4]  :  +- 
LocalRelation [a#4]
   +- Project [a#6, b AS tag#15]  +- Project [a#6, 
b AS tag#15]
  +- LocalRelation [a#6] +- 
LocalRelation [a#6]
```

Finally, the result of the batch Operator Optimizations is:

```
Project [a#4, tag#14]  Project [a#4, tag#14]
!+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, (a#0 = 
a#4)
!   :- SubqueryAlias ta, `ta`  :- Union
!   :  +- Union:  :- LocalRelation 
[a#0]
!   : :- Project [a#0, a AS tag#8] :  +- LocalRelation 
[a#2]
!   : :  +- SubqueryAlias t1, `t1` +- Union
!   : : +- Project [a#0]  :- LocalRelation 
[a#4, tag#14]
!   : :+- SubqueryAlias grouping  +- LocalRelation 
[a#6, tag#15]
!   : :   +- LocalRelation [a#0]
!   : +- Project [a#2, b AS tag#9]  
!   :+- SubqueryAlias t2, `t2`  
!   :   +- Project [a#2]
!   :  +- SubqueryAlias grouping
!   : +- LocalRelation [a#2]
!   +- SubqueryAlias tb, `tb`   
!  +- Union 
! :- Project [a#4, a AS tag#14] 
! :  +- SubqueryAlias t3, `t3`  
! : +- Project [a#4]
! :+- SubqueryAlias grouping
! :   +- LocalRelation [a#4]
! +- Project [a#6, b AS tag#15] 
!+- SubqueryAlias t4, `t4`  
!   +- Project [a#6]
!  +- SubqueryAlias grouping
! +- LocalRelation [a#6]
```

The condition `tag#8 = tag#14` of the INNER JOIN has been removed. This causes the inner join to produce wrong data.

After fix:

```
=== Result of Batch LocalRelation ===
 GlobalLimit 21   GlobalLimit 21
 +- LocalLimit 21 +- LocalLimit 21
+- Project [a#4, tag#11] +- Project 
[a#4, tag#11]
   +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11)) +- Join 
Inner, ((a#0 = a#4) && (tag#8 = tag#11))
! :- SubqueryAlias ta  :- Union
! :  +- Union  :  :- 
LocalRelation [a#0, tag#8]
! : :- Project [a#0, a AS tag#8]   :  +- 
LocalRelation [a#2, tag#9]
! : :  +- SubqueryAlias t

[GitHub] spark pull request #16953: [SPARK-19622][WebUI]Fix a http error in a paged t...

2017-02-15 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/16953

[SPARK-19622][WebUI]Fix a http error in a paged table when using a `Go` 
button to search.

## What changes were proposed in this pull request?

The search function of the paged table is not available because we don't skip the hash data of the request path.


![](https://issues.apache.org/jira/secure/attachment/12852996/screenshot-1.png)
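A minimal sketch of the idea behind the fix (the helper below is hypothetical, not the actual patch): the fragment (`#...`) of the current location must be dropped before the `Go` button appends the pagination parameters, otherwise it ends up inside the query string and the request fails:

```scala
// Hypothetical helper illustrating the intent of the change.
def pageUrlWithoutHash(requestPath: String): String = {
  val hashIndex = requestPath.indexOf('#')
  if (hashIndex >= 0) requestPath.substring(0, hashIndex) else requestPath
}

// pageUrlWithoutHash("/stages/stage/?id=3&attempt=0#tasksTitle") == "/stages/stage/?id=3&attempt=0"
```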

## How was this patch tested?

Tested manually with my browser.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark fix-webui-paged-table

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16953.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16953


commit a4364dace3a8305f5ef7627ce68973bf7b7f7c6b
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-02-16T06:17:54Z

fixed a pagination bug of paged table.




---



[GitHub] spark pull request #16874: [SPARK-19509][SQL]Fix a NPE problem in grouping s...

2017-02-09 Thread stanzhai
Github user stanzhai closed the pull request at:

https://github.com/apache/spark/pull/16874


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16874: [SPARK-19509][SQL][branch-2.1]Fix a NPE problem i...

2017-02-09 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/16874

[SPARK-19509][SQL][branch-2.1]Fix a NPE problem in grouping sets when using 
an empty column

## What changes were proposed in this pull request?

If a column of a table contains only null values, the following SQL will throw an 
NPE: `select count(1) from test group by e grouping sets(e)`.

The reason is that when `ResolveGroupingAnalytics` transforms a `GroupingSets` node 
with `transformUp`, it uses a `nullBitmask` to set the nullability of attributes, 
so the nullability of an attribute may be modified unexpectedly.

This PR fixes the problem by setting the nullability of all attributes in the 
group-by expressions to `true`.

PR #15484 has already fixed this problem in the master branch.
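
For reference, a minimal reproduction sketch of the scenario described above (assumes a SparkSession named `spark`; the original report used a table whose column is entirely null, here simulated with a temp view):

```
import spark.implicits._

// Column `e` contains only null values.
Seq((1, None: Option[Int]), (2, None: Option[Int]))
  .toDF("a", "e")
  .createOrReplaceTempView("test")

// Without the fix this query can fail with a NullPointerException, because
// the nullability of the grouping attribute is modified incorrectly.
spark.sql("select count(1) from test group by e grouping sets(e)").show()
```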

## How was this patch tested?

Tested with Hive in my environment.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark fix-grouping-sets

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16874.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16874


commit 3690cb29a3c15903dd6290502fb736daa99157b4
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-02-09T13:22:02Z

fix a NPE issue of grouping sets




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16617: [SPARK-19261][SQL]Support `ALTER TABLE table_name ADD CO...

2017-01-17 Thread stanzhai
Github user stanzhai commented on the issue:

https://github.com/apache/spark/pull/16617
  
Good job!
I will review your PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16617: [SPARK-19261][SQL]Support `ALTER TABLE table_name...

2017-01-17 Thread stanzhai
Github user stanzhai closed the pull request at:

https://github.com/apache/spark/pull/16617


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16617: [SPARK-19261][SQL]Support `ALTER TABLE table_name...

2017-01-17 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/16617

[SPARK-19261][SQL]Support `ALTER TABLE table_name ADD COLUMNS(..)` 
statement 

## What changes were proposed in this pull request?

We should support the `ALTER TABLE table_name ADD COLUMNS(..)` statement, which 
was already supported in versions before 2.x.
This is very useful for those who want to upgrade their Spark version to 2.x.
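
For reference, a usage sketch of the statement (assumes a SparkSession named `spark`; table and column names are made up):

```
// Create a table, then append new columns to it with the statement added by this PR.
spark.sql("CREATE TABLE test_table (id INT, name STRING)")
spark.sql("ALTER TABLE test_table ADD COLUMNS (age INT, city STRING)")
spark.sql("DESC test_table").show()
```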

## How was this patch tested?

Added some test cases in `DDLCommandSuite`, and tested with Hive in my 
environment.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16617.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16617


commit 69729ef083f152eb91a80e1ea7f1481234766c7c
Author: zhaishidan <zhaishi...@haizhi.com>
Date:   2015-07-14T07:47:36Z

fix document error about spark.kryoserializer.buffer.max.mb

commit f7f5c77194492c41eaa63efc1516e1cb73603c3f
Author: zhaishidan <zhaishi...@haizhi.com>
Date:   2016-03-17T10:00:02Z

Merge branch 'master' of https://github.com/apache/spark

commit e0b6a807a6374553a81a8a07d37fdd643e9fcbc0
Author: StanZhai <stan@stanzhaidemac-mini.local>
Date:   2016-10-08T15:22:46Z

Merge branch 'master' of https://github.com/apache/spark

commit f50377b9e4d5c3ae1a7b232fffe96015319a32af
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-01-16T07:45:34Z

Merge branch 'master' of https://github.com/apache/spark into stan-master

commit 2e1e53a2bd28decef6dbef3af16b10512b26a664
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-01-16T08:43:46Z

support `alter table add columns`

commit e55350a1876e4b46584476795cfce6184248d66d
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-01-17T10:16:22Z

Merge branch 'master' of https://github.com/apache/spark into stan-master

commit ba7373256a9deeefcb22a7facf006ec85403afb9
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-01-17T12:09:42Z

update test

commit 3cafe2c0d54b0bb2a9d9ff8814e5183571deff26
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-01-17T13:09:58Z

revert pom.xml




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9010][Documentation]Improve the Spark C...

2015-07-15 Thread stanzhai
Github user stanzhai closed the pull request at:

https://github.com/apache/spark/pull/7393


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9010][Documentation]Improve the Spark C...

2015-07-14 Thread stanzhai
Github user stanzhai closed the pull request at:

https://github.com/apache/spark/pull/7368


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9010][Documentation]Improve the Spark C...

2015-07-14 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/7393

[SPARK-9010][Documentation]Improve the Spark Configuration document about 
`spark.kryoserializer.buffer`

The documented meaning of `spark.kryoserializer.buffer` should be: "Initial size of 
Kryo's serialization buffer. Note that there will be one buffer per core on 
each worker. This buffer will grow up to `spark.kryoserializer.buffer.max` if 
needed."

The `spark.kryoserializer.buffer.max.mb` setting is out of date in Spark 1.4.
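
For reference, a sketch of how these settings fit together (the values are examples only, not recommendations):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Initial size of Kryo's serialization buffer; one buffer per core on each worker.
  .set("spark.kryoserializer.buffer", "64k")
  // Upper bound the per-core buffer is allowed to grow to.
  .set("spark.kryoserializer.buffer.max", "64m")
```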

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7393.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7393


commit 69729ef083f152eb91a80e1ea7f1481234766c7c
Author: zhaishidan zhaishi...@haizhi.com
Date:   2015-07-14T07:47:36Z

fix document error about spark.kryoserializer.buffer.max.mb




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9010][Documentation]Improve the Spark C...

2015-07-13 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/7368

[SPARK-9010][Documentation]Improve the Spark Configuration document about 
`spark.kryoserializer.buffer`

The documented meaning of `spark.kryoserializer.buffer` should be: "Initial size of 
Kryo's serialization buffer. Note that there will be one buffer per core on 
each worker. This buffer will grow up to `spark.kryoserializer.buffer.max` if 
needed."

The `spark.kryoserializer.buffer.max.mb` setting is out of date in Spark 1.4.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark branch-1.4

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7368.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7368


commit bfb80d44f90fe4d538907309dfa55d9ec8703ff5
Author: zhaishidan zhaishi...@haizhi.com
Date:   2015-07-13T07:53:11Z

fix document error about spark.kryoserializer.buffer.max.mb




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org