[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54262405
  
@marmbrus Addressed all the comments, thanks for the detailed review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54262762
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19650/consoleFull)
 for   PR 2188 at commit 
[`d2a1d66`](https://github.com/apache/spark/commit/d2a1d6621a4b28bc9dc351204963659abab89d75).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54269670
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19650/consoleFull)
 for   PR 2188 at commit 
[`d2a1d66`](https://github.com/apache/spark/commit/d2a1d6621a4b28bc9dc351204963659abab89d75).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54362420
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19684/consoleFull)
 for   PR 2188 at commit 
[`4254f6c`](https://github.com/apache/spark/commit/4254f6c2e05e523c051442edcd183e2cefc32f7e).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54373129
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19684/consoleFull)
 for   PR 2188 at commit 
[`4254f6c`](https://github.com/apache/spark/commit/4254f6c2e05e523c051442edcd183e2cefc32f7e).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54374907
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54382609
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54382647
  
ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54383012
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19705/consoleFull)
 for   PR 2188 at commit 
[`68cf019`](https://github.com/apache/spark/commit/68cf0197a9b3159ce93909d5d66c35e84507e802).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54391806
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19705/consoleFull)
 for   PR 2188 at commit 
[`68cf019`](https://github.com/apache/spark/commit/68cf0197a9b3159ce93909d5d66c35e84507e802).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class SparkListenerBlockManagerAdded(time: Long, blockManagerId: 
BlockManagerId, maxMem: Long)`
  * `case class SparkListenerBlockManagerRemoved(time: Long, 
blockManagerId: BlockManagerId)`
  * `case class SparkListenerApplicationStart(appName: String, appId: 
Option[String], time: Long,`
  * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2188


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/2188#discussion_r17030789
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
 ---
@@ -96,13 +103,95 @@ private[sql] case class InMemoryRelation(
 
 private[sql] case class InMemoryColumnarTableScan(
 attributes: Seq[Attribute],
+predicates: Seq[Expression],
 relation: InMemoryRelation)
   extends LeafNode {
 
   override def output: Seq[Attribute] = attributes
 
+  import org.apache.spark.sql.catalyst.dsl.expressions._
+  import org.apache.spark.sql.catalyst.expressions._
+
+  val buildFilter: PartialFunction[Expression, Expression] = {
--- End diff --

Can we document the contract for these filters?  Returns false iff it is 
impossible for the input expression to evaluate to true based on statistics 
collected about this partition?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/2188#discussion_r17030853
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
 ---
@@ -110,8 +199,26 @@ private[sql] case class InMemoryColumnarTableScan(
 attributes.map(a = relation.output.indexWhere(_.exprId == 
a.exprId))
   }
 
-  iterator
-.map(batch = 
requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
+  val rows = iterator
+// Skip pruned batches
+.filter { cachedBatch =
+  if (!partitionFilter(cachedBatch.stats)) {
--- End diff --

How risky do you think this is?  I'm wondering if we should add a flag that 
turns off the partition skipping.  We don't even have to document it...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/2188#discussion_r17030908
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -30,31 +30,30 @@ import org.apache.spark.sql.hive.test.TestHive
  */
 class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter 
{
   // TODO: bundle in jar files... get from classpath
-  private lazy val hiveQueryDir = TestHive.getHiveFile(
--- End diff --

Why did you change the visibility here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/2188#discussion_r17030988
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
 ---
@@ -96,13 +103,95 @@ private[sql] case class InMemoryRelation(
 
 private[sql] case class InMemoryColumnarTableScan(
 attributes: Seq[Attribute],
+predicates: Seq[Expression],
 relation: InMemoryRelation)
   extends LeafNode {
 
   override def output: Seq[Attribute] = attributes
 
+  import org.apache.spark.sql.catalyst.dsl.expressions._
+  import org.apache.spark.sql.catalyst.expressions._
--- End diff --

These probably don't belong here...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/2188#discussion_r17031043
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala
 ---
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+/**
+ * Builds a map that is keyed by an Attribute's expression id. Using the 
expression id allows values
+ * to be looked up even when the attributes used differ cosmetically 
(i.e., the capitalization
+ * of the name, or the expected nullability).
+ */
+object AttributeMap {
--- End diff --

Perhaps not in this PR, but we should search for all instances where we 
create Maps keyed by Attributes or expressionId and switch it to use this class 
instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-54250488
  
Thanks for taking this over!  A few minor comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2188#discussion_r17031879
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -30,31 +30,30 @@ import org.apache.spark.sql.hive.test.TestHive
  */
 class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter 
{
   // TODO: bundle in jar files... get from classpath
-  private lazy val hiveQueryDir = TestHive.getHiveFile(
--- End diff --

It's changed by mistake. Will make it `private` again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2188#discussion_r17032019
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
 ---
@@ -110,8 +199,26 @@ private[sql] case class InMemoryColumnarTableScan(
 attributes.map(a = relation.output.indexWhere(_.exprId == 
a.exprId))
   }
 
-  iterator
-.map(batch = 
requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
+  val rows = iterator
+// Skip pruned batches
+.filter { cachedBatch =
+  if (!partitionFilter(cachedBatch.stats)) {
--- End diff --

Do you mean the risk of generating wrong query results? Hm... agree to add 
a flag. I did encounter several cases that I didn't covered at first when 
testing this branch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/2188#discussion_r17032061
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
 ---
@@ -110,8 +199,26 @@ private[sql] case class InMemoryColumnarTableScan(
 attributes.map(a = relation.output.indexWhere(_.exprId == 
a.exprId))
   }
 
-  iterator
-.map(batch = 
requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
+  val rows = iterator
+// Skip pruned batches
+.filter { cachedBatch =
+  if (!partitionFilter(cachedBatch.stats)) {
--- End diff --

Yeah exactly.  I think in general it might be a good idea to guard these 
kind of optimizations behind feature flags in case bugs are found in the 
future.  We can remove them after enough time has passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2188#discussion_r17032472
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/columnar/compression/BooleanBitSetSuite.scala
 ---
@@ -31,7 +31,7 @@ class BooleanBitSetSuite extends FunSuite {
 // Tests encoder
 // -
 
-val builder = TestCompressibleColumnBuilder(new BooleanColumnStats, 
BOOLEAN, BooleanBitSet)
+val builder = TestCompressibleColumnBuilder(new NoopColumnStats, 
BOOLEAN, BooleanBitSet)
--- End diff --

@marmbrus Would you mind to elaborate a bit on why you changed 
`BooleanColumnStats` to `NoopColumnStats` in #1883?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-09-02 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/2188#discussion_r17032500
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/columnar/compression/BooleanBitSetSuite.scala
 ---
@@ -31,7 +31,7 @@ class BooleanBitSetSuite extends FunSuite {
 // Tests encoder
 // -
 
-val builder = TestCompressibleColumnBuilder(new BooleanColumnStats, 
BOOLEAN, BooleanBitSet)
+val builder = TestCompressibleColumnBuilder(new NoopColumnStats, 
BOOLEAN, BooleanBitSet)
--- End diff --

laziness :) We should probably implement the statistics that make sense 
here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53952977
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19517/consoleFull)
 for   PR 2188 at commit 
[`270ca61`](https://github.com/apache/spark/commit/270ca6122be1e4a32c40edc66298b9bc7ed81a6c).
 * This patch **does not** merge cleanly!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53953073
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19518/consoleFull)
 for   PR 2188 at commit 
[`062c315`](https://github.com/apache/spark/commit/062c3157f0aaebe8bd6e92b15b0efa1ac1983041).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53954582
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19517/consoleFull)
 for   PR 2188 at commit 
[`270ca61`](https://github.com/apache/spark/commit/270ca6122be1e4a32c40edc66298b9bc7ed81a6c).
 * This patch **passes** unit tests.
 * This patch **does not** merge cleanly!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53954660
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19518/consoleFull)
 for   PR 2188 at commit 
[`062c315`](https://github.com/apache/spark/commit/062c3157f0aaebe8bd6e92b15b0efa1ac1983041).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53851776
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19463/consoleFull)
 for   PR 2188 at commit 
[`b6f9f6c`](https://github.com/apache/spark/commit/b6f9f6c1e91bada438f1de1440de2719899f88e6).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53858144
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19463/consoleFull)
 for   PR 2188 at commit 
[`b6f9f6c`](https://github.com/apache/spark/commit/b6f9f6c1e91bada438f1de1440de2719899f88e6).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-28 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/2188

[SPARK-2961][SQL] Use statistics to prune batches within cached partitions

This PR is based on #1883 authored by @marmbrus. Key differences:

1. Batch pruning instead of partition pruning

   When #1883 was authored, batched column buffer building (#1880) hadn't 
been introduced. This PR combines these two and provide partition batch level 
pruning, which leads to smaller memory footprints and can generally skip more 
elements. The cost is that the pruning predicates are evaluated more frequently 
(partition number multiplies batch number per partition).

1. More filters are supported

   Filter predicates consist of `=`, ``, `=`, ``, `=` and their 
conjunctions are supported. Disjunctions are not supported yet. In short, 

   ```sql
   SELECT * FROM t WHERE i  0 AND i  10
   ```

   is executed in a more smart way, while

   ```sql
   SELECT * FROM t WHERE i  0 OR i  10
   ```

   still scans the whole table.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark in-mem-batch-pruning

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2188.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2188


commit 1a31cd519b999848f0b76c94a9e258b17d977cc6
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-08-16T09:49:34Z

Merge branch 'inMemStats' into in-mem-batch-pruning

Tried to combine Michael's partition pruning branch and the batched
column buffer building. In this way, we actually got batch pruning
rather than partition pruning.

Conflicts:

sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala

Conflicts:

sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala

commit 520f41a0618a9529dfa86e2f748db405a4802f4f
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-08-18T02:44:38Z

Added more predicate filters, fixed table scan stats for testing purposes

commit 77a2203ceb18546ddee2373af29036971bebd572
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-08-28T23:00:44Z

Code cleanup, bugfix, and adding tests

* Bugfix: gatherStats() should be called in NullableColumnBuilder,
  otherwise null values are skipped
* Bugfix: fixed lower bound comparison in StringColumnStats and
  TimestampColumnStats

commit 46890319002be21061bcb1931ea9ccb43063fcf0
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-08-29T00:17:21Z

More test cases

commit 04a956d6ed415b0ecfa717f7067231026ba25414
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-08-29T00:18:02Z

Renamed PartitionSkippingSuite to PartitionBatchPruningSuite

commit ef057747fa3d9f85615c5650873abc374e21e16d
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-08-29T00:20:20Z

Minor code cleanup

commit b4a82810b3567a9a7ba085750f6f03103ed83ae7
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-08-29T00:22:32Z

Minor code cleanup




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53825215
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19439/consoleFull)
 for   PR 2188 at commit 
[`b4a8281`](https://github.com/apache/spark/commit/b4a82810b3567a9a7ba085750f6f03103ed83ae7).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53825264
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19439/consoleFull)
 for   PR 2188 at commit 
[`b4a8281`](https://github.com/apache/spark/commit/b4a82810b3567a9a7ba085750f6f03103ed83ae7).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-28 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53825846
  
Scala style check failed, although the code style is actually OK...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53826091
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19441/consoleFull)
 for   PR 2188 at commit 
[`9bd234b`](https://github.com/apache/spark/commit/9bd234bed6c2550218a997fb20684a0c750372c1).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2961][SQL] Use statistics to prune batc...

2014-08-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2188#issuecomment-53830460
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19441/consoleFull)
 for   PR 2188 at commit 
[`9bd234b`](https://github.com/apache/spark/commit/9bd234bed6c2550218a997fb20684a0c750372c1).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org