[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r157666054 --- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile --- @@ -0,0 +1,47 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +FROM openjdk:8-alpine --- End diff -- @foxish The libc6-compat did not fix our issue because libiomp5.so can not be loaded. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r155707713 --- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile --- @@ -0,0 +1,47 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +FROM openjdk:8-alpine --- End diff -- The base image **openjdk:8-alpine** uses **musl libc** instead of **glibc**, so certain library (like Intel MKL for native BLAS support in mllib) cannot run on the container. There is also report on openjdk compatibility issue on alpine (https://github.com/apache-spark-on-k8s/spark/issues/326). Should we change to a base image that supports glibc? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11929: [SPARK-13934][SQL] fixed table identifier
Github user yangw1234 closed the pull request at: https://github.com/apache/spark/pull/11929 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11929: [SPARK-13934][SQL] fixed table identifier
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/11929 sure --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16820: [SPARK-19471] AggregationIterator does not initialize th...
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/16820 Sorry I could not find time to finish this pr recently. Close it for now. If you need this fix, please feel free to base on it and finish it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16820: [SPARK-19471] AggregationIterator does not initia...
Github user yangw1234 closed the pull request at: https://github.com/apache/spark/pull/16820 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16820: [SPARK-19471] AggregationIterator does not initialize th...
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/16820 @gatorsmile Sorry, I totally forget this pr. I will try to address the comment this week (need a little time to re-familiarize the context). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16820: [SPARK-19471] AggregationIterator does not initialize th...
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/16820 @hvanhovell thanks for your review. Whole stage code generation seems fine and unit test is added. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16820: [SPARK-19471] AggregationIterator does not initialize th...
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/16820 @mengxr @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16820: [SPARK-19471] AggregationIterator does not initia...
GitHub user yangw1234 opened a pull request: https://github.com/apache/spark/pull/16820 [SPARK-19471] AggregationIterator does not initialize the generated result projection before using it ## What changes were proposed in this pull request? When AggregationIterator generates result projection, it does not call the initialize method of the Projection class. This will cause a runtime NullPointerException when the projection involves nondeterministic expressions. This problem was introduced by #15567. ## How was this patch tested? manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/yangw1234/spark proj Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16820.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16820 commit 7ec5ebf867110f1c0105faee36ee6776ad36aab1 Author: wangyang <wangy...@haizhi.com> Date: 2017-02-06T12:07:57Z SPARK-19471 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15416: [SPARK-17849] [SQL] Fix NPE problem when using grouping ...
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/15416 @rxin @davies Will this patch be merged in 2.0.2? Kind of need this to upgrade our production environment. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15416: [SPARK-17849] [SQL] Fix NPE problem when using grouping ...
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/15416 scala style fixed. I didn't notice. Sorry for the delay. @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/15416#discussion_r82726593 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -298,10 +298,14 @@ class Analyzer( case other => Alias(other, other.toString)() } -val nonNullBitmask = x.bitmasks.reduce(_ & _) +// The rightmost bit in the bitmasks corresponds to the last expression in groupByAliases with 0 +// indicating this expression is in the grouping set. The following line of code calculates the +// bitmask representing the expressions that exist in all the grouping sets (also indicated by 0). +val nonNullBitmask = x.bitmasks.reduce(_ | _) --- End diff -- done @davies --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15416: [SPARK-17849] [SQL] Fix NPE problem when using grouping ...
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/15416 @davies Other places all seem to be correct. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/15416#discussion_r82721927 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -298,10 +298,14 @@ class Analyzer( case other => Alias(other, other.toString)() } -val nonNullBitmask = x.bitmasks.reduce(_ & _) +// The left most bit in the bitmasks corresponds to the last expression in groupByAliases +// with 0 indicating this expression is in the grouping set. The following line of code +// calculates the bit mask representing the expressions that exist in all the grouping sets. +val nonNullBitmask = ~ x.bitmasks.reduce(_ | _) --- End diff -- Do you mean `((nonNullBitmask >> (attrLength - idx - 1)) & 1) == 1`? We can only test on `0` if we left shift `1`, right? @davies --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/15416#discussion_r82715694 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -298,10 +298,11 @@ class Analyzer( case other => Alias(other, other.toString)() } -val nonNullBitmask = x.bitmasks.reduce(_ & _) +val nonNullBitmask = ~ x.bitmasks.reduce(_ | _) --- End diff -- @hvanhovell @rxin comments are added --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/15416#discussion_r82715416 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -298,10 +298,11 @@ class Analyzer( case other => Alias(other, other.toString)() } -val nonNullBitmask = x.bitmasks.reduce(_ & _) +val nonNullBitmask = ~ x.bitmasks.reduce(_ | _) --- End diff -- Ok, I'll do it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15416: [SPARK-17849] [SQL] Fix NPE problem when using gr...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/15416#discussion_r82562070 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2189,6 +2189,24 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { } } + test("SPARK-17849: grouping set throws NPE") { --- End diff -- @rxin done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15416: [SPARK-17849] Fix NPE problem when using grouping sets
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/15416 also cc @hvanhovell --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15416: [SPARK-17849] Fix NPE problem when using grouping sets
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/15416 cc @davies Would you help reviewing this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15416: [SPARK-17849] Fix NPE problem when using grouping...
GitHub user yangw1234 opened a pull request: https://github.com/apache/spark/pull/15416 [SPARK-17849] Fix NPE problem when using grouping sets ## What changes were proposed in this pull request? Prior this pr, the following code would cause an NPE: `case class point(a:String, b:String, c:String, d: Int)` `val data = Seq( point("1","2","3", 1), point("4","5","6", 1), point("7","8","9", 1) )` `sc.parallelize(data).toDF().registerTempTable("table")` `spark.sql("select a, b, c, count(d) from table group by a, b, c GROUPING SETS ((a)) ").show()` The reason is that when the grouping_id() behavior was changed in #10677, some code (which should be changed) was left out. Take the above code for example, prior #10677, the bit mask for set "(a)" was `001`, while after #10677 the bit mask was changed to `011`. However, the `nonNullBitmask` was not changed accordingly. This pr will fix this problem. ## How was this patch tested? add integration tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/yangw1234/spark groupingid Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15416.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15416 commit a6d4651d63c5d31dcd7b1ca6d48be9707316e33b Author: wangyang <wangy...@haizhi.com> Date: 2016-10-10T04:19:31Z fix grouping id bug commit ab82b211afc9c9d2030c25b4f953db6f4b1bb62e Author: wangyang <wangy...@haizhi.com> Date: 2016-10-10T07:03:35Z add test commit c849b6a02ee65e3a818abc03f301d82b36ec7f43 Author: wangyang <wangy...@haizhi.com> Date: 2016-10-10T07:31:51Z add test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15096: [SPARK-17537] [SQL] Reading parquet schema from d...
Github user yangw1234 closed the pull request at: https://github.com/apache/spark/pull/15096 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15096: [SPARK-17537] [SQL] Reading parquet schema from driver d...
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/15096 Yes, this is a duplicate, closing it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15096: [SPARK-17537] [SQL] Reading parquet schema from d...
GitHub user yangw1234 opened a pull request: https://github.com/apache/spark/pull/15096 [SPARK-17537] [SQL] Reading parquet schema from driver directly when there is only one file to touch ## What changes were proposed in this pull request? `spark.read.parquet("parquet/dir")` would issue a spark job to read parquet schema. When `spark.sql.parquet.mergeSchema` are set to false (the default value), there is often only one file to read, so there is no need to issue a spark job to do it. In this case, we can read it from driver directly instead of issuing a spark job. This could reduce the infer schema latency from several hundreds milliseconds to around ten milliseconds in my environment. ## How was this patch tested? manually tested You can merge this pull request into a Git repository by running: $ git pull https://github.com/yangw1234/spark mergeSchema_2.0 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15096.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15096 commit 91081bccf4326e8261dd41035c199fbd8f12c69d Author: wangyang <wangy...@haizhi.com> Date: 2016-09-14T13:12:24Z opt merge schema commit 9b610c22e8fedbf61b98943ca9a8b3ba08320c94 Author: wangyang <wangy...@haizhi.com> Date: 2016-09-14T13:20:49Z fix typo --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11929: [SPARK-13934][SQL] fixed table identifier
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/11929 Hi @hvanhovell , just checked. In branch-1.6 latest code, yes this problem still exists. Branch master and branch-2.0 don't have this problem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/13585#discussion_r66743506 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -65,15 +65,20 @@ private[hive] trait HiveStrategies { // hive table scan operator to be used for partition pruning. val partitionKeyIds = AttributeSet(relation.partitionKeys) val (pruningPredicates, otherPredicates) = predicates.partition { predicate => - !predicate.references.isEmpty && + predicate.references.nonEmpty && predicate.references.subsetOf(partitionKeyIds) } +val additionalPartPredicates = + PhysicalOperation.partitionPrunningFromDisjunction( +otherPredicates.foldLeft[Expression](Literal(true))(And(_, _)), partitionKeyIds) pruneFilterProject( projectList, otherPredicates, identity[Seq[Expression]], - HiveTableScanExec(_, relation, pruningPredicates)(sparkSession)) :: Nil +HiveTableScanExec(_, +relation, +pruningPredicates ++ additionalPartPredicates)(sparkSession)) :: Nil --- End diff -- glad I could help ^_^ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/13585#discussion_r66742485 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -92,6 +92,36 @@ object PhysicalOperation extends PredicateHelper { .map(Alias(_, a.name)(a.exprId, a.qualifier, isGenerated = a.isGenerated)).getOrElse(a) } } + + /** + * Drop the non-partition key expression in the disjunctions, to optimize the partition pruning. + * For instances: (We assume part1 & part2 are the partition keys) + * (part1 == 1 and a > 3) or (part2 == 2 and a < 5) ==> (part1 == 1 or part1 == 2) + * (part1 == 1 and a > 3) or (a < 100) => None + * (a > 100 && b < 100) or (part1 = 10) => None + * (a > 100 && b < 100 and part1 = 10) or (part1 == 2) => (part1 = 10 or part1 == 2) + * @param predicate disjunctions + * @param partitionKeyIds partition keys in attribute set + * @return + */ + def partitionPrunningFromDisjunction( +predicate: Expression, partitionKeyIds: AttributeSet): Option[Expression] = { +// ignore the pure non-partition key expression in conjunction of the expression tree +val additionalPartPredicate = predicate transformUp { + case a @ And(left, right) if a.deterministic && +left.references.intersect(partitionKeyIds).isEmpty => right + case a @ And(left, right) if a.deterministic && +right.references.intersect(partitionKeyIds).isEmpty => left --- End diff -- The problem is here. Imagine a record `a = 2` in `partition = 1`. Such a record satisfies the above expression (`!(partition = 1 && a > 3)`), but if we simply drop `a > 3` and push `!(partition = 1)` down to the table scan, partition =1 will be discarded and the record won't appear in the result. The test case passed because the `BooleanSimplification` optimizer rule will transform `!(partition =1 && a > 3)` to `(!(partition=1) || (a <= 3))`, such an expression will be dropped entirely by your `partitionPrunningFromDisjunction`, in which case "partition = 1" will not be discarded. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/13585#discussion_r66714468 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -92,6 +92,36 @@ object PhysicalOperation extends PredicateHelper { .map(Alias(_, a.name)(a.exprId, a.qualifier, isGenerated = a.isGenerated)).getOrElse(a) } } + + /** + * Drop the non-partition key expression in the disjunctions, to optimize the partition pruning. --- End diff -- It is `(part=1 conjunction a=1) disjunction (part=2 conjunction a=4)`, right? But the expression get dropped is `a=1` which is in "conjunction" with `part=1` and `a=4` which is in "conjunction" with `part=2`. So I thought it should be conjunctions. Or maybe we can phrase it in another way to avoid the confusion? ^_^ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13585: [SPARK-15859][SQL] Optimize the partition pruning within...
Github user yangw1234 commented on the issue: https://github.com/apache/spark/pull/13585 Hi @liancheng , CNF is truly a more systematic way to deal with this problem. Not really sure I am right or not, but I think as long as we push the `not` operator down to the lowest level of the expression tree, the approach proposed by @chenghao-intel will work. Take the above example, expression `!(partition = 1 && a > 3)` will be transformed to `(!(partition = 1)) || (! (a > 3))`, and according to the second example given by @chenghao-intel in the doc, the expression should be dropped entirely, so `partition = 1` will not be pruned. (But this rule is not appeared in the code, maybe he is working in progress to implemented this rule. I don't know for sure.) What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13601: [SPARK-15875] Try to use Seq.isEmpty and Seq.nonE...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/13601#discussion_r66630258 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1656,7 +1656,7 @@ class Analyzer( // We do a final check and see if we only have a single Window Spec defined in an // expressions. -if (distinctWindowSpec.length == 0 ) { +if (distinctWindowSpec.isEmpty ) { --- End diff -- @srowen fixed it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13601: [SPARK-15875] Try to use Seq.isEmpty and Seq.nonE...
GitHub user yangw1234 opened a pull request: https://github.com/apache/spark/pull/13601 [SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0 ## What changes were proposed in this pull request? In scala, immutable.List.length is an expensive operation so we should avoid using Seq.length == 0 and Seq.lenth > 0 and use Seq.isEmpty and Seq.nonEmpty instead. ## How was this patch tested? existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/yangw1234/spark isEmpty Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13601.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13601 commit 3fa193942fabd03e2ac35f9fb27dc862931a5690 Author: wangyang <wangy...@haizhi.com> Date: 2016-06-10T14:25:22Z avoid using seq.length --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org