[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-18 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r157666054
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile 
---
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

@foxish Adding libc6-compat did not fix our issue, because libiomp5.so still 
cannot be loaded.





[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-07 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r155707713
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile 
---
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

The base image **openjdk:8-alpine** uses **musl libc** instead of **glibc**, 
so certain libraries (like Intel MKL, used for native BLAS support in MLlib) 
cannot run in the container.
There are also reports of an OpenJDK compatibility issue on Alpine 
(https://github.com/apache-spark-on-k8s/spark/issues/326). 
Should we change to a base image that supports glibc?
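
For context, a hedged Scala sketch of how the musl/glibc mismatch surfaces at runtime (the install path below is hypothetical; MLlib reaches native BLAS by loading glibc-linked shared objects through JNI):

```scala
object NativeBlasProbe {
  def main(args: Array[String]): Unit = {
    try {
      // libiomp5.so is linked against glibc; on a musl-based image such as
      // openjdk:8-alpine the dynamic loader cannot resolve it.
      System.load("/opt/intel/lib/libiomp5.so") // hypothetical install path
      println("native library loaded")
    } catch {
      case e: UnsatisfiedLinkError =>
        println(s"native BLAS path unavailable: ${e.getMessage}")
    }
  }
}
```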





[GitHub] spark pull request #11929: [SPARK-13934][SQL] fixed table identifier

2017-06-13 Thread yangw1234
Github user yangw1234 closed the pull request at:

https://github.com/apache/spark/pull/11929





[GitHub] spark issue #11929: [SPARK-13934][SQL] fixed table identifier

2017-06-13 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/11929
  
sure





[GitHub] spark issue #16820: [SPARK-19471] AggregationIterator does not initialize th...

2017-06-05 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/16820
  
Sorry, I could not find the time to finish this PR recently, so I am closing 
it for now. If you need this fix, please feel free to build on it and finish it.





[GitHub] spark pull request #16820: [SPARK-19471] AggregationIterator does not initia...

2017-06-05 Thread yangw1234
Github user yangw1234 closed the pull request at:

https://github.com/apache/spark/pull/16820





[GitHub] spark issue #16820: [SPARK-19471] AggregationIterator does not initialize th...

2017-05-23 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/16820
  
@gatorsmile Sorry, I had totally forgotten about this PR. I will try to address 
the comments this week (I need a little time to re-familiarize myself with the context).





[GitHub] spark issue #16820: [SPARK-19471] AggregationIterator does not initialize th...

2017-02-06 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/16820
  
@hvanhovell thanks for your review. Whole-stage code generation seems fine, 
and a unit test has been added.





[GitHub] spark issue #16820: [SPARK-19471] AggregationIterator does not initialize th...

2017-02-06 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/16820
  
@mengxr  @rxin





[GitHub] spark pull request #16820: [SPARK-19471] AggregationIterator does not initia...

2017-02-06 Thread yangw1234
GitHub user yangw1234 opened a pull request:

https://github.com/apache/spark/pull/16820

[SPARK-19471] AggregationIterator does not initialize the generated result 
projection before using it

## What changes were proposed in this pull request?

When AggregationIterator generates the result projection, it does not call the 
initialize method of the Projection class. This causes a runtime 
NullPointerException when the projection involves nondeterministic expressions.

This problem was introduced by #15567.
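
For reference, a minimal sketch of the contract the fix restores (simplified from what AggregationIterator does; the parameter names here are placeholders, not the exact code):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, UnsafeProjection}

// Generated projections must be initialized once per partition before the
// first use; nondeterministic expressions set up their internal state there.
def makeResultProjection(
    resultExpressions: Seq[Expression],
    inputAttributes: Seq[Attribute],
    partitionIndex: Int): InternalRow => InternalRow = {
  val projection = UnsafeProjection.create(resultExpressions, inputAttributes)
  projection.initialize(partitionIndex) // without this call, rand() etc. NPE at runtime
  projection
}
```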

## How was this patch tested?

manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yangw1234/spark proj

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16820.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16820


commit 7ec5ebf867110f1c0105faee36ee6776ad36aab1
Author: wangyang <wangy...@haizhi.com>
Date:   2017-02-06T12:07:57Z

SPARK-19471







[GitHub] spark issue #15416: [SPARK-17849] [SQL] Fix NPE problem when using grouping ...

2016-11-05 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/15416
  
@rxin @davies Will this patch be merged into 2.0.2? We need it in order to 
upgrade our production environment. Thanks. 





[GitHub] spark issue #15416: [SPARK-17849] [SQL] Fix NPE problem when using grouping ...

2016-10-14 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/15416
  
Scala style fixed. I hadn't noticed it earlier. Sorry for the delay. @rxin 





[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...

2016-10-10 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15416#discussion_r82726593
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -298,10 +298,14 @@ class Analyzer(
   case other => Alias(other, other.toString)()
 }
 
-val nonNullBitmask = x.bitmasks.reduce(_ & _)
+// The rightmost bit in the bitmasks corresponds to the last expression in groupByAliases with 0
+// indicating this expression is in the grouping set. The following line of code calculates the
+// bitmask representing the expressions that exist in all the grouping sets (also indicated by 0).
+val nonNullBitmask = x.bitmasks.reduce(_ | _)
--- End diff --

done @davies 





[GitHub] spark issue #15416: [SPARK-17849] [SQL] Fix NPE problem when using grouping ...

2016-10-10 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/15416
  
@davies All other places seem to be correct.





[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...

2016-10-10 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15416#discussion_r82721927
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -298,10 +298,14 @@ class Analyzer(
   case other => Alias(other, other.toString)()
 }
 
-val nonNullBitmask = x.bitmasks.reduce(_ & _)
+// The left most bit in the bitmasks corresponds to the last expression in groupByAliases
+// with 0 indicating this expression is in the grouping set. The following line of code
+// calculates the bit mask representing the expressions that exist in all the grouping sets.
+val nonNullBitmask = ~ x.bitmasks.reduce(_ | _)
--- End diff --

Do you mean `((nonNullBitmask >> (attrLength - idx - 1)) & 1) == 1`? We can 
only test against `0` if we left-shift `1`, right? @davies 
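
A tiny sketch of the two equivalent bit tests under discussion (all names here are placeholders echoing the review thread, not the actual Analyzer code):

```scala
val attrLength = 3                              // number of grouping expressions
val nonNullBitmask = Integer.parseInt("100", 2) // only the first column is always present
val idx = 0
// Right-shift the mask and compare the low bit to 1...
val viaShiftedMask = ((nonNullBitmask >> (attrLength - idx - 1)) & 1) == 1
// ...or left-shift 1 and compare the masked value to 0.
val viaShiftedOne = (nonNullBitmask & (1 << (attrLength - idx - 1))) != 0
assert(viaShiftedMask == viaShiftedOne)
```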





[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...

2016-10-10 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15416#discussion_r82715694
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -298,10 +298,11 @@ class Analyzer(
   case other => Alias(other, other.toString)()
 }
 
-val nonNullBitmask = x.bitmasks.reduce(_ & _)
+val nonNullBitmask = ~ x.bitmasks.reduce(_ | _)
--- End diff --

@hvanhovell @rxin comments have been added





[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...

2016-10-10 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15416#discussion_r82715416
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -298,10 +298,11 @@ class Analyzer(
   case other => Alias(other, other.toString)()
 }
 
-val nonNullBitmask = x.bitmasks.reduce(_ & _)
+val nonNullBitmask = ~ x.bitmasks.reduce(_ | _)
--- End diff --

Ok, I'll do it.





[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...

2016-10-10 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15416#discussion_r82562070
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2189,6 +2189,24 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
 }
   }
 
+  test("SPARK-17849: grouping set throws NPE") {
--- End diff --

@rxin done





[GitHub] spark issue #15416: [SPARK-17849] Fix NPE problem when using grouping sets

2016-10-10 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/15416
  
also cc @hvanhovell 





[GitHub] spark issue #15416: [SPARK-17849] Fix NPE problem when using grouping sets

2016-10-10 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/15416
  
cc @davies Would you help reviewing this?





[GitHub] spark pull request #15416: [SPARK-17849] Fix NPE problem when using grouping...

2016-10-10 Thread yangw1234
GitHub user yangw1234 opened a pull request:

https://github.com/apache/spark/pull/15416

[SPARK-17849] Fix NPE problem when using grouping sets

## What changes were proposed in this pull request?
Prior to this PR, the following code would cause an NPE:
```scala
case class point(a: String, b: String, c: String, d: Int)

val data = Seq(
  point("1", "2", "3", 1),
  point("4", "5", "6", 1),
  point("7", "8", "9", 1)
)
sc.parallelize(data).toDF().registerTempTable("table")
spark.sql("select a, b, c, count(d) from table group by a, b, c GROUPING SETS ((a))").show()
```

The reason is that when the grouping_id() behavior was changed in #10677, 
some code that should have been updated was left out. 

Take the above code as an example: prior to #10677, the bitmask for the set 
"(a)" was `001`, while after #10677 it was changed to `011`. However, the 
`nonNullBitmask` computation was not updated accordingly.

This PR fixes the problem.
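
To make the bitmask arithmetic concrete, a hedged worked sketch (the literals come from the description above; `reduce(_ | _)` mirrors the new code in this PR):

```scala
// GROUP BY a, b, c GROUPING SETS ((a)): after #10677 the set's bitmask is
// 011 (b and c nulled out; the rightmost bit maps to the last expression).
val bitmasks = Seq(Integer.parseInt("011", 2))
// A column is present in every grouping set iff its bit is 0 in the OR of
// all bitmasks, hence reduce(_ | _) followed by bitwise negation.
val nonNullBitmask = ~bitmasks.reduce(_ | _)
assert((nonNullBitmask & Integer.parseInt("100", 2)) != 0) // "a" is never null
```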


## How was this patch tested?
add integration tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yangw1234/spark groupingid

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15416.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15416


commit a6d4651d63c5d31dcd7b1ca6d48be9707316e33b
Author: wangyang <wangy...@haizhi.com>
Date:   2016-10-10T04:19:31Z

fix grouping id bug

commit ab82b211afc9c9d2030c25b4f953db6f4b1bb62e
Author: wangyang <wangy...@haizhi.com>
Date:   2016-10-10T07:03:35Z

add test

commit c849b6a02ee65e3a818abc03f301d82b36ec7f43
Author: wangyang <wangy...@haizhi.com>
Date:   2016-10-10T07:31:51Z

add test







[GitHub] spark pull request #15096: [SPARK-17537] [SQL] Reading parquet schema from d...

2016-09-15 Thread yangw1234
Github user yangw1234 closed the pull request at:

https://github.com/apache/spark/pull/15096





[GitHub] spark issue #15096: [SPARK-17537] [SQL] Reading parquet schema from driver d...

2016-09-15 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/15096
  
Yes, this is a duplicate, closing it.





[GitHub] spark pull request #15096: [SPARK-17537] [SQL] Reading parquet schema from d...

2016-09-14 Thread yangw1234
GitHub user yangw1234 opened a pull request:

https://github.com/apache/spark/pull/15096

[SPARK-17537] [SQL] Reading parquet schema from driver directly when there 
is only one file to touch

## What changes were proposed in this pull request?

`spark.read.parquet("parquet/dir")` issues a Spark job to read the Parquet 
schema. When `spark.sql.parquet.mergeSchema` is set to false (the default 
value), there is often only one file whose footer needs to be read, so there is 
no need to launch a job for it. In that case, we can read the schema directly on 
the driver instead. This reduces schema-inference latency from several hundred 
milliseconds to around ten milliseconds in my environment.
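
A usage sketch of the scenario described above (`spark.sql.parquet.mergeSchema` is the existing SQL option; the path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-inference").getOrCreate()
// With schema merging off (the default), schema inference for a
// single-file directory can happen on the driver, with no Spark job.
spark.conf.set("spark.sql.parquet.mergeSchema", "false")
val df = spark.read.parquet("parquet/dir") // placeholder path
df.printSchema()
```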

## How was this patch tested?
manually tested



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yangw1234/spark mergeSchema_2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15096.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15096


commit 91081bccf4326e8261dd41035c199fbd8f12c69d
Author: wangyang <wangy...@haizhi.com>
Date:   2016-09-14T13:12:24Z

opt merge schema

commit 9b610c22e8fedbf61b98943ca9a8b3ba08320c94
Author: wangyang <wangy...@haizhi.com>
Date:   2016-09-14T13:20:49Z

fix typo







[GitHub] spark issue #11929: [SPARK-13934][SQL] fixed table identifier

2016-06-15 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/11929
  
Hi @hvanhovell, I just checked: the problem still exists in the latest 
branch-1.6 code. Master and branch-2.0 do not have this problem.





[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...

2016-06-13 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13585#discussion_r66743506
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -65,15 +65,20 @@ private[hive] trait HiveStrategies {
         // hive table scan operator to be used for partition pruning.
         val partitionKeyIds = AttributeSet(relation.partitionKeys)
         val (pruningPredicates, otherPredicates) = predicates.partition { predicate =>
-          !predicate.references.isEmpty &&
+          predicate.references.nonEmpty &&
             predicate.references.subsetOf(partitionKeyIds)
         }
+        val additionalPartPredicates =
+          PhysicalOperation.partitionPrunningFromDisjunction(
+            otherPredicates.foldLeft[Expression](Literal(true))(And(_, _)), partitionKeyIds)

         pruneFilterProject(
           projectList,
           otherPredicates,
           identity[Seq[Expression]],
-          HiveTableScanExec(_, relation, pruningPredicates)(sparkSession)) :: Nil
+          HiveTableScanExec(_,
+            relation,
+            pruningPredicates ++ additionalPartPredicates)(sparkSession)) :: Nil
--- End diff --

glad I could help ^_^





[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...

2016-06-12 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13585#discussion_r66742485
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -92,6 +92,36 @@ object PhysicalOperation extends PredicateHelper {
           .map(Alias(_, a.name)(a.exprId, a.qualifier, isGenerated = a.isGenerated)).getOrElse(a)
       }
   }
+
+  /**
+   * Drop the non-partition key expression in the disjunctions, to optimize the partition pruning.
+   * For instance: (we assume part1 & part2 are the partition keys)
+   * (part1 == 1 and a > 3) or (part2 == 2 and a < 5)  ==> (part1 == 1 or part2 == 2)
+   * (part1 == 1 and a > 3) or (a < 100) => None
+   * (a > 100 && b < 100) or (part1 = 10) => None
+   * (a > 100 && b < 100 and part1 = 10) or (part1 == 2) => (part1 = 10 or part1 == 2)
+   * @param predicate disjunctions
+   * @param partitionKeyIds partition keys in attribute set
+   * @return
+   */
+  def partitionPrunningFromDisjunction(
+      predicate: Expression, partitionKeyIds: AttributeSet): Option[Expression] = {
+    // ignore the pure non-partition key expression in conjunction of the expression tree
+    val additionalPartPredicate = predicate transformUp {
+      case a @ And(left, right) if a.deterministic &&
+        left.references.intersect(partitionKeyIds).isEmpty => right
+      case a @ And(left, right) if a.deterministic &&
+        right.references.intersect(partitionKeyIds).isEmpty => left
--- End diff --

The problem is here. Imagine a record with `a = 2` in `partition = 1`. Such a 
record satisfies the expression above (`!(partition = 1 && a > 3)`), but if we 
simply drop `a > 3` and push `!(partition = 1)` down to the table scan, 
partition = 1 will be discarded and the record won't appear in the result. 

The test case passed because the `BooleanSimplification` optimizer rule 
transforms `!(partition = 1 && a > 3)` into `!(partition = 1) || (a <= 3)`; 
that expression is dropped entirely by your 
`partitionPrunningFromDisjunction`, in which case partition = 1 is not 
discarded.
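
The counterexample can be checked mechanically; a plain-boolean sketch (illustrative values, not Catalyst expressions):

```scala
// Record: a = 2 living in partition = 1.
val partition = 1
val a = 2
val original = !(partition == 1 && a > 3) // true: the record should survive
val naivelyPruned = !(partition == 1)     // false: partition 1 wrongly skipped
assert(original && !naivelyPruned)        // dropping `a > 3` under NOT is unsound
```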





[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...

2016-06-11 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13585#discussion_r66714468
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -92,6 +92,36 @@ object PhysicalOperation extends PredicateHelper {
           .map(Alias(_, a.name)(a.exprId, a.qualifier, isGenerated = a.isGenerated)).getOrElse(a)
       }
   }
+
+  /**
+   * Drop the non-partition key expression in the disjunctions, to optimize the partition pruning.
--- End diff --

It is `(part=1 conjunction a=1) disjunction (part=2 conjunction a=4)`, 
right? But the expressions that get dropped are `a=1`, which is in 
"conjunction" with `part=1`, and `a=4`, which is in "conjunction" with 
`part=2`. So I thought it should be "conjunctions".

Or maybe we can phrase it another way to avoid the confusion? ^_^







[GitHub] spark issue #13585: [SPARK-15859][SQL] Optimize the partition pruning within...

2016-06-11 Thread yangw1234
Github user yangw1234 commented on the issue:

https://github.com/apache/spark/pull/13585
  
Hi @liancheng, CNF is truly a more systematic way to deal with this 
problem. 

I'm not entirely sure, but I think that as long as we push the `not` operator 
down to the lowest level of the expression tree, the approach proposed by 
@chenghao-intel will work. Taking the above example, the expression `!(partition = 
1 && a > 3)` would be transformed into `(!(partition = 1)) || (!(a > 3))`, and 
according to the second example given by @chenghao-intel in the doc, that 
expression should be dropped entirely, so `partition = 1` will not be pruned. 
(However, this rule does not appear in the code yet; maybe it is still a work 
in progress. I don't know for sure.) 

What do you think?
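
For reference, the NOT-push-down step being discussed is just De Morgan's laws; a plain-boolean check (not Catalyst code):

```scala
// !(p && q) == !p || !q and !(p || q) == !p && !q, for all boolean p, q.
// After that rewrite, !(partition = 1 && a > 3) becomes
// !(partition = 1) || !(a > 3), whose non-partition branch makes
// partitionPrunningFromDisjunction drop the whole predicate (no pruning).
for (p <- Seq(true, false); q <- Seq(true, false)) {
  assert(!(p && q) == (!p || !q))
  assert(!(p || q) == (!p && !q))
}
```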





[GitHub] spark pull request #13601: [SPARK-15875] Try to use Seq.isEmpty and Seq.nonE...

2016-06-10 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13601#discussion_r66630258
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1656,7 +1656,7 @@ class Analyzer(
 
 // We do a final check and see if we only have a single Window Spec defined in an
 // expressions.
-if (distinctWindowSpec.length == 0 ) {
+if (distinctWindowSpec.isEmpty ) {
--- End diff --

@srowen fixed it.





[GitHub] spark pull request #13601: [SPARK-15875] Try to use Seq.isEmpty and Seq.nonE...

2016-06-10 Thread yangw1234
GitHub user yangw1234 opened a pull request:

https://github.com/apache/spark/pull/13601

[SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length 
== 0 and Seq.length > 0

## What changes were proposed in this pull request?

In Scala, `immutable.List.length` is an expensive, O(n) operation, so we should
avoid `Seq.length == 0` and `Seq.length > 0` and use `Seq.isEmpty` and 
`Seq.nonEmpty` instead.
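
A small sketch of why the rewrite matters for `immutable.List` (a hedged illustration, not from the patch):

```scala
// List.length walks every cons cell (O(n)); isEmpty inspects only the
// head (O(1)), so the two checks differ sharply in cost on long lists.
val xs = List.range(0, 1000000)
assert((xs.length == 0) == xs.isEmpty) // same answer, very different cost
```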

## How was this patch tested?
existing tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yangw1234/spark isEmpty

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13601.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13601


commit 3fa193942fabd03e2ac35f9fb27dc862931a5690
Author: wangyang <wangy...@haizhi.com>
Date:   2016-06-10T14:25:22Z

avoid using seq.length



