[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16243#discussion_r91831031
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -180,35 +214,28 @@ abstract class StatisticsCollectionTestBase extends QueryTest with SQLTestUtils
     "ctimestamp" -> ColumnStat(2, Some(t1), Some(t2), 1, 8, 8)
   )
 
-  test("column stats round trip serialization") {
-    // Make sure we serialize and then deserialize and we will get the result data
-    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
-    stats.zip(df.schema).foreach { case ((k, v), field) =>
-      withClue(s"column $k with type ${field.dataType}") {
-        val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
-        assert(roundtrip == Some(v))
-      }
-    }
-  }
-
-  test("analyze column command - result verification") {
-    val tableName = "column_stats_test2"
-    // (data.head.productArity - 1) because the last column does not support stats collection.
-    assert(stats.size == data.head.productArity - 1)
-    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+  private val randomName = new Random(31)
 
+  /**
+   * Compute column stats for the given DataFrame and compare it with colStats.
+   */
+  def checkColStats(
+      df: DataFrame,
+      colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = {
+    val tableName = "column_stats_test_" + randomName.nextInt(1000)
--- End diff --

why not just use a fixed name like `stats_test_tbl`? It will be used in 
`withTable`, which means we will drop this table every time.
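
A minimal sketch of that suggestion, assuming the surrounding test base already provides `withTable`, `DataFrame`, `ColumnStat`, and the verification logic from this PR (the fixed name is only illustrative):

```scala
// Sketch only: with a fixed table name, withTable drops the table after each
// run, so a random suffix is unnecessary.
def checkColStats(
    df: DataFrame,
    colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = {
  val tableName = "stats_test_tbl"  // fixed name instead of randomName.nextInt(...)
  withTable(tableName) {
    df.write.saveAsTable(tableName)
    // ... collect and validate statistics as in the existing checkColStats body ...
  }
}
```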





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-09 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13909
  
Ah, I see my misunderstanding. In this discussion, by "intermediate array" you mean `new int[]`. Yes, let me do it.





[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16193
  
**[Test build #69961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69961/consoleFull)** for PR 16193 at commit [`2c8e593`](https://github.com/apache/spark/commit/2c8e593a2a705f536a284581f33c469574695015).





[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16244
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69955/
Test PASSed.





[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16244
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16244
  
**[Test build #69955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69955/consoleFull)** for PR 16244 at commit [`2983832`](https://github.com/apache/spark/commit/2983832c39d1cd0fcb71ddd76d3feaa7a810cd11).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16243
  
**[Test build #69960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69960/consoleFull)** for PR 16243 at commit [`618c998`](https://github.com/apache/spark/commit/618c998b44b5ead0df193fc052e8aa59cf221e90).





[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16243#discussion_r91830908
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -180,35 +214,28 @@ abstract class StatisticsCollectionTestBase extends 
QueryTest with SQLTestUtils
 "ctimestamp" -> ColumnStat(2, Some(t1), Some(t2), 1, 8, 8)
   )
 
-  test("column stats round trip serialization") {
-// Make sure we serialize and then deserialize and we will get the 
result data
-val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
-stats.zip(df.schema).foreach { case ((k, v), field) =>
-  withClue(s"column $k with type ${field.dataType}") {
-val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
-assert(roundtrip == Some(v))
-  }
-}
-  }
-
-  test("analyze column command - result verification") {
-val tableName = "column_stats_test2"
-// (data.head.productArity - 1) because the last column does not 
support stats collection.
-assert(stats.size == data.head.productArity - 1)
-val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+  private val randomName = new Random(31)
 
+  /**
+   * Compute column stats for the given DataFrame and compare it with 
colStats.
+   */
+  def checkColStats(
+  df: DataFrame,
+  colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = {
+val tableName = "column_stats_test_" + randomName.nextInt(10)
--- End diff --

ok, done.





[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16243#discussion_r91830853
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -180,35 +214,28 @@ abstract class StatisticsCollectionTestBase extends 
QueryTest with SQLTestUtils
 "ctimestamp" -> ColumnStat(2, Some(t1), Some(t2), 1, 8, 8)
   )
 
-  test("column stats round trip serialization") {
-// Make sure we serialize and then deserialize and we will get the 
result data
-val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
-stats.zip(df.schema).foreach { case ((k, v), field) =>
-  withClue(s"column $k with type ${field.dataType}") {
-val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
-assert(roundtrip == Some(v))
-  }
-}
-  }
-
-  test("analyze column command - result verification") {
-val tableName = "column_stats_test2"
-// (data.head.productArity - 1) because the last column does not 
support stats collection.
-assert(stats.size == data.head.productArity - 1)
-val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+  private val randomName = new Random(31)
 
+  /**
+   * Compute column stats for the given DataFrame and compare it with 
colStats.
+   */
+  def checkColStats(
+  df: DataFrame,
+  colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = {
+val tableName = "column_stats_test_" + randomName.nextInt(10)
--- End diff --

The chance of collision here is not low; just use a broader key space for `nextInt`.
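
For illustration, a sketch of the widened key space inside the test base class; the follow-up commit in this thread uses `nextInt(1000)`, and the seed and name prefix are as in the diff:

```scala
// Sketch: widen the bound so repeated or parallel runs rarely pick the same name.
private val randomName = new scala.util.Random(31)

val tableName = "column_stats_test_" + randomName.nextInt(1000)  // was nextInt(10)
```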





[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16243
  
**[Test build #69959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69959/consoleFull)** for PR 16243 at commit [`153e382`](https://github.com/apache/spark/commit/153e382efa54b1cc6da9eb697f04fe734d038d7d).





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13909
  
I think it has nothing to do with physical operators; we are talking about `CreateArray`, right? Avoiding the creation of the intermediate primitive array should generally be faster.





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-09 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13909
  
I understand what you want to do, and I agree with it in this case, because the child of `Project` is an array creation.

== Physical Plan ==
*Project [array((value#2 + 1.1), (value#2 + 2.2)) AS array((value + 1.1), (value + 2.2))#10]
+- *SerializeFromObject [input[0, double, false] AS value#2]
   +- Scan ExternalRDDScan[obj#1]


I still have two questions.
1. Is this optimization general? If we have additional operations between `Project` and an array creation, what physical plan is generated?
2. How can we detect this case at the point of array creation? The evaluation order of a physical plan tree is bottom-up.

Let me clarify these questions.





[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16243#discussion_r91830643
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -133,6 +133,79 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }
 
+  test("column stats round trip serialization") {
+    // Make sure we serialize and then deserialize and we will get the result data
+    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+    stats.zip(df.schema).foreach { case ((k, v), field) =>
+      withClue(s"column $k with type ${field.dataType}") {
+        val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
+        assert(roundtrip == Some(v))
+      }
+    }
+  }
+
+  test("analyze column command - result verification") {
+    val tableName = "column_stats_test2"
+    // (data.head.productArity - 1) because the last column does not support stats collection.
+    assert(stats.size == data.head.productArity - 1)
+    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+    checkColStats(df, tableName, stats)
+  }
+
+  private def checkColStats(
+      df: DataFrame,
+      tableName: String,
+      colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = {
+    withTable(tableName) {
+      df.write.saveAsTable(tableName)
+
+      // Collect statistics
+      sql(s"analyze table $tableName compute STATISTICS FOR COLUMNS " +
+        colStats.keys.mkString(", "))
+
+      // Validate statistics
+      val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))
+      assert(table.stats.isDefined)
+      assert(table.stats.get.colStats.size == colStats.size)
+
+      colStats.foreach { case (k, v) =>
+        withClue(s"column $k") {
+          assert(table.stats.get.colStats(k) == v)
+        }
+      }
+    }
+  }
+
+  test("column stats collection for null columns") {
+    def nullColumnStat(dataType: DataType): ColumnStat = {
+      ColumnStat(0, None, None, 1, dataType.defaultSize.toLong, dataType.defaultSize.toLong)
+    }
+
+    val tableName = "column_stats_test3"
--- End diff --

That's much better, thanks!





[GitHub] spark issue #16220: [SPARK-18796][SS]StreamingQueryManager should not block ...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16220
  
**[Test build #69958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69958/consoleFull)** for PR 16220 at commit [`4be4149`](https://github.com/apache/spark/commit/4be4149d81d9860445ce4b53ae5951c1467632f4).





[GitHub] spark pull request #16238: [SPARK-18811] StreamSource resolution should happ...

2016-12-09 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16238#discussion_r91830440
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -214,6 +228,10 @@ class StreamExecution(
       // While active, repeatedly attempt to run batches.
       SparkSession.setActiveSession(sparkSession)
 
+      updateStatusMessage("Initializing sources")
--- End diff --

NVM. My suggestion is not right since it will block `StreamingQuery.start`.





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/13909
  
Yea we can avoid the intermediate primitive array. Maybe we can benchmark 
against it also.





[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16244
  
Actually I can't reproduce this issue locally, but from looking at the logs I'm 90% sure this is the cause. The only way to verify it may be to merge and check the Jenkins Maven status again.





[GitHub] spark issue #15071: [SPARK-17517][SQL]Improve generated Code for BroadcastHa...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15071
  
Can one of the admins verify this patch?





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13909
  
In `CreateArray` we can just evaluate each expression and write its result to the unsafe array with the writer; no intermediate data needs to be created.
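
To make the trade-off concrete, a small, purely illustrative Scala sketch; `ElementWriter`, `createArrayViaTemp`, and `createArrayDirect` are hypothetical names, not Spark's codegen or the real `UnsafeArrayWriter` API:

```scala
// Hypothetical stand-in for an unsafe array writer.
trait ElementWriter {
  def write(ordinal: Int, value: Double): Unit
}

// With an intermediate array: element-wise writes into `tmp`, then a bulk copy.
def createArrayViaTemp(children: Seq[() => Double], bulkCopy: Array[Double] => Unit): Unit = {
  val tmp = new Array[Double](children.length)
  var i = 0
  while (i < children.length) { tmp(i) = children(i)(); i += 1 }
  bulkCopy(tmp)  // extra copy of the whole array into the unsafe buffer
}

// Writing directly: the same element-wise loop, but no temporary array or bulk copy.
def createArrayDirect(children: Seq[() => Double], writer: ElementWriter): Unit = {
  var i = 0
  while (i < children.length) { writer.write(i, children(i)()); i += 1 }
}
```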





[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...

2016-12-09 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/16193
  
Pushing predicates down into the data source also happens during optimization in the planner, so I think this is not the first case of doing optimization outside the Optimizer.





[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...

2016-12-09 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/16193
  
The reason we moved the PythonUDFEvaluator from the logical plan into the physical plan is that this one-off node broke many things; many rules needed to treat it specially.





[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16213
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69956/
Test PASSed.





[GitHub] spark pull request #16238: [SPARK-18811] StreamSource resolution should happ...

2016-12-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16238





[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16213
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16213
  
**[Test build #69956 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69956/consoleFull)** for PR 16213 at commit [`adf4dd6`](https://github.com/apache/spark/commit/adf4dd6688f373bb3de26055face210f8265c0fb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16238: [SPARK-18811] StreamSource resolution should happen in s...

2016-12-09 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/16238
  
Merging to master and 2.1.





[GitHub] spark pull request #16238: [SPARK-18811] StreamSource resolution should happ...

2016-12-09 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16238#discussion_r91830357
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -214,6 +228,10 @@ class StreamExecution(
       // While active, repeatedly attempt to run batches.
       SparkSession.setActiveSession(sparkSession)
 
+      updateStatusMessage("Initializing sources")
--- End diff --

Okay. I'm going to merge this PR since it passed tests. I will fix this in my PR.





[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...

2016-12-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16193
  
If we add a logical node for the Python evaluator, we'd push the Filter down through it, so the optimizer rule won't combine the two Filters into one again?





[GitHub] spark pull request #16238: [SPARK-18811] StreamSource resolution should happ...

2016-12-09 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16238#discussion_r91830324
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -214,6 +228,10 @@ class StreamExecution(
       // While active, repeatedly attempt to run batches.
       SparkSession.setActiveSession(sparkSession)
 
+      updateStatusMessage("Initializing sources")
--- End diff --

Could you move `SparkSession.setActiveSession(sparkSession)`, 
`updateStatusMessage("Initializing sources")` and `logicalPlan` above 
`postEvent(new QueryStartedEvent(id, runId, name))`? Then we can initialize the 
logical plan before returning `StreamingQuery` to the user and ensure 
`logicalPlan` is always initialized in the stream thread. And you can also add 
`assert(Thread.currentThread().isInstanceOf[StreamExecutionThread])` when 
creating `logicalPlan`, and remove the unnecessary test you added since this 
assertion will be tested in all current stream tests.
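
As a hedged, standalone analogue of that ordering (toy class names, not the actual `StreamExecution`/`StreamExecutionThread` types): force the lazy plan on the dedicated stream thread before the query object is handed back, and assert which thread performs the initialization.

```scala
class StreamThread(body: () => Unit) extends Thread {
  override def run(): Unit = body()
}

class Query {
  lazy val logicalPlan: String = {
    assert(Thread.currentThread().isInstanceOf[StreamThread],
      "logicalPlan must be initialized on the stream thread")
    "resolved plan"
  }
}

object Demo extends App {
  val q = new Query
  val streamThread = new StreamThread(() => {
    q.logicalPlan  // initialized here, before any "query started" event is posted
  })
  streamThread.start()
  streamThread.join()
  println(q.logicalPlan)  // already initialized; safe for the caller to use now
}
```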





[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...

2016-12-09 Thread davies
Github user davies commented on the issue:

https://github.com/apache/spark/pull/16193
  
@cloud-fan It's not trivial to do this in the optimizer. For example, we would have to split one Filter into two, which conflicts with another optimizer rule that combines two Filters into one.
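
A toy Scala sketch of that conflict (hypothetical `Plan`/`Filter` classes, not catalyst's): the split is undone by a combine-filters style rule on the very next pass.

```scala
sealed trait Plan
final case class Scan(table: String) extends Plan
final case class Filter(predicate: String, child: Plan) extends Plan

// Split the UDF predicate from the rest so the non-UDF part could be pushed further.
def splitUdfFilter(f: Filter): Plan = f.predicate.split(" AND ").toList match {
  case udfPred :: rest if udfPred.contains("udf") && rest.nonEmpty =>
    Filter(udfPred, Filter(rest.mkString(" AND "), f.child))
  case _ => f
}

// A combine-filters style rule merges adjacent Filters back together.
def combineFilters(p: Plan): Plan = p match {
  case Filter(p1, Filter(p2, child)) => Filter(s"$p1 AND $p2", child)
  case other => other
}

// combineFilters(splitUdfFilter(Filter("udf(a) AND b > 4", Scan("t")))) ==
//   Filter("udf(a) AND b > 4", Scan("t"))
```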





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-09 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13909
  
@cloud-fan Yeah, keeping the primitive array and using `UnsafeArrayWriter` can avoid the extra bulk copy.
To achieve this, I think we need to create `GenericArrayData` (after enabling https://github.com/apache/spark/pull/13758) instead of `UnsafeArrayData`. Creating `GenericArrayData` with https://github.com/apache/spark/pull/13758 can keep a primitive array inside it.
Is that what you want to do?





[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests

2016-12-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16244
  
How do we make sure the fixed test passes Maven-based tests?





[GitHub] spark pull request #16134: [SPARK-18703] [SQL] Drop Staging Directories and ...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16134#discussion_r91829975
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
@@ -166,6 +166,30 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef
     sql("DROP TABLE tmp_table")
   }
 
+  test("Delete the temporary staging directory and files after each insert") {
+    withTempDir { tmpDir =>
+      withTable("tab") {
+        sql(
+          s"""
+             |CREATE TABLE tab(c1 string)
+             |location '${tmpDir.toURI.toString}'
+           """.stripMargin)
+
+        (1 to 3).map { i =>
+          sql(s"INSERT OVERWRITE TABLE tab SELECT '$i'")
+        }
+        def listFiles(path: File): List[String] = {
+          val dir = path.listFiles()
+          val folders = dir.filter(_.isDirectory).toList
+          val filePaths = dir.map(_.getName).toList
+          filePaths ::: folders.flatMap(listFiles)
--- End diff --

what's `:::`? streaming collection?
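
For reference, `:::` is Scala's list-concatenation operator (not a streaming collection); a minimal standalone example:

```scala
val filePaths = List("a.txt", "b.txt")
val nested = List("sub/c.txt")

// ::: concatenates two Lists, keeping order: files first, then the recursed results.
val all = filePaths ::: nested
assert(all == List("a.txt", "b.txt", "sub/c.txt"))
```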





[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16243
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69953/
Test PASSed.





[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16243
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #16134: [SPARK-18703] [SQL] Drop Staging Directories and ...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16134#discussion_r91829943
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -85,6 +87,7 @@ case class InsertIntoHiveTable(
   def output: Seq[Attribute] = Seq.empty
 
   val hadoopConf = sessionState.newHadoopConf()
+  val createdTempDir = new scala.collection.mutable.ArrayBuffer[Path]
--- End diff --

use `Option`? It looks to me that we will only create one temp dir during 
one insertion.
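
A minimal sketch of the `Option`-based alternative, assumed to live inside `InsertIntoHiveTable` (the field and helper names are illustrative; `Path` is org.apache.hadoop.fs.Path as in the diff):

```scala
import org.apache.hadoop.fs.Path

// At most one staging dir is created per insertion, so an Option is enough.
var createdTempDir: Option[Path] = None

def rememberTempDir(dir: Path): Unit = {
  createdTempDir = Some(dir)
}

// Called once the insert finishes: delete the staging dir if one was created.
def deleteTempDirIfAny(delete: Path => Unit): Unit = {
  createdTempDir.foreach(delete)
  createdTempDir = None
}
```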





[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16243
  
**[Test build #69953 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69953/consoleFull)** for PR 16243 at commit [`674cd7c`](https://github.com/apache/spark/commit/674cd7cac3d542f1b1379fef416e3fce2bbc8cbc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16244: [SQL][minor] simplify a test to fix the maven tes...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16244#discussion_r91829910
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
@@ -98,20 +98,15 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
   }
 
   test("SPARK-18091: split large if expressions into blocks due to JVM code size limit") {
--- End diff --

Yea, of course. I reverted SPARK-18091 and ran this test locally, and it failed.





[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16222
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16193
  
It's a little hacky to me that we do optimization in the planner. How hard would it be to introduce a logical node for the Python evaluator? We could define an interface in catalyst, e.g. `ExternalUDFEvaluator`, so that R (or other languages in the future) UDFs can also benefit from it.
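
A purely hypothetical sketch of such an interface; neither the trait nor the node below exists in Spark, and the names simply follow the comment's suggestion:

```scala
// A language-agnostic hook that a logical "evaluate external UDFs" node could carry,
// so Python, R, or other runtimes share the same pushdown-friendly logical node.
trait ExternalUDFEvaluator {
  def language: String                                // e.g. "python", "r"
  def evaluate(batch: Seq[Seq[Any]]): Seq[Seq[Any]]   // evaluate UDFs over a batch of rows
}

// Hypothetical logical node: Plan is a type parameter standing in for LogicalPlan.
final case class EvalExternalUDFs[Plan](evaluator: ExternalUDFEvaluator, child: Plan)
```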





[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16222
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69954/
Test PASSed.





[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16222
  
**[Test build #69954 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69954/consoleFull)** for PR 16222 at commit [`5fe125f`](https://github.com/apache/spark/commit/5fe125f11f04d481507cae246c33bc4969c43e2e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16193: [SPARK-18766] [SQL] Push Down Filter Through Batc...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16193#discussion_r91829876
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExecSuite.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.api.python.PythonFunction
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, In}
+import org.apache.spark.sql.execution.{FilterExec, SparkPlanTest}
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.BooleanType
+
+class BatchEvalPythonExecSuite extends SparkPlanTest with SharedSQLContext {
+  import testImplicits.newProductEncoder
+  import testImplicits.localSeqToDatasetHolder
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    spark.udf.registerPython("dummyPythonUDF", new MyDummyPythonUDF)
+  }
+
+  override def afterAll(): Unit = {
+    spark.sessionState.functionRegistry.dropFunction("dummyPythonUDF")
+    super.afterAll()
+  }
+
+  test("Python UDF: push down deterministic FilterExec predicates") {
+    val df = Seq(("Hello", 4)).toDF("a", "b")
+      .where("dummyPythonUDF(b) and dummyPythonUDF(a) and a in (3, 4)")
+    val qualifiedPlanNodes = df.queryExecution.executedPlan.collect {
+      case f @ FilterExec(And(_: AttributeReference, _: AttributeReference), _) => f
+      case b: BatchEvalPythonExec => b
+      case f @ FilterExec(_: In, _) => f
+    }
+    assert(qualifiedPlanNodes.size == 3)
+  }
+
+  test("Nested Python UDF: push down deterministic FilterExec predicates") {
+    val df = Seq(("Hello", 4)).toDF("a", "b")
+      .where("dummyPythonUDF(a, dummyPythonUDF(a, b)) and a in (3, 4)")
+    val qualifiedPlanNodes = df.queryExecution.executedPlan.collect {
+      case f @ FilterExec(_: AttributeReference, _) => f
+      case b: BatchEvalPythonExec => b
+      case f @ FilterExec(_: In, _) => f
+    }
+    assert(qualifiedPlanNodes.size == 4)
+  }
+
+  test("Python UDF: no push down on non-deterministic") {
+    val df = Seq(("Hello", 4)).toDF("a", "b")
+      .where("b > 4 and dummyPythonUDF(a) and rand() > 3")
+    val qualifiedPlanNodes = df.queryExecution.executedPlan.collect {
+      case f: FilterExec => f
+      case b: BatchEvalPythonExec => b
+    }
+    assert(qualifiedPlanNodes.size == 3)
--- End diff --

It's really hard to judge correctness just by checking the number of plan nodes...





[GitHub] spark issue #16175: [SPARK-17460][SQL]Make sure sizeInBytes in Statistics wi...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16175
  
**[Test build #69957 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69957/consoleFull)** for PR 16175 at commit [`f516613`](https://github.com/apache/spark/commit/f516613f21581dd583d6deb34579fe51adf984c8).





[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...

2016-12-09 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/16213
  
Thanks for these comments! Yeah, I don't like this approach either. But since anyone who reshards streams always hits this issue, and resharding is important for load balancing in Kinesis streams (recently a new API, [`UpdateShardCount`](http://docs.aws.amazon.com/kinesis/latest/APIReference/API_UpdateShardCount.html), was added for resharding), I'd like to fix this as soon as possible, and I couldn't find a better approach than this one.







[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16213
  
**[Test build #69956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69956/consoleFull)** for PR 16213 at commit [`adf4dd6`](https://github.com/apache/spark/commit/adf4dd6688f373bb3de26055face210f8265c0fb).





[GitHub] spark pull request #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHAR...

2016-12-09 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16213#discussion_r91829798
  
--- Diff: external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala ---
@@ -225,6 +225,74 @@ abstract class KinesisStreamTests(aggregateTestData: Boolean) extends KinesisFun
     ssc.stop(stopSparkContext = false)
   }
 
+  testIfEnabled("split and merge shards in a stream") {
+    // Since this test tries to split and merge shards in a stream, we create another
+    // temporary stream and then remove it when finished.
+    val localAppName = s"KinesisStreamSuite-${math.abs(Random.nextLong())}"
+    val localTestUtils = new KPLBasedKinesisTestUtils(1)
+    localTestUtils.createStream()
+    try {
+      val awsCredentials = KinesisTestUtils.getAWSCredentials()
+      val stream = KinesisUtils.createStream(ssc, localAppName, localTestUtils.streamName,
+        localTestUtils.endpointUrl, localTestUtils.regionName, InitialPositionInStream.LATEST,
+        Seconds(10), StorageLevel.MEMORY_ONLY,
+        awsCredentials.getAWSAccessKeyId, awsCredentials.getAWSSecretKey)
+
+      val collected = new mutable.HashSet[Int]
+      stream.map { bytes => new String(bytes).toInt }.foreachRDD { rdd =>
+        collected.synchronized {
+          collected ++= rdd.collect()
+          logInfo("Collected = " + collected.mkString(", "))
+        }
+      }
+      ssc.start()
+
+      val (testData1, testData2, testData3) = (1 to 10, 11 to 20, 21 to 30)
+
+      eventually(timeout(60 seconds), interval(10 second)) {
+        localTestUtils.pushData(testData1, aggregateTestData)
+        assert(collected.synchronized { collected === testData1.toSet },
+          "\nData received does not match data sent")
+      }
+
+      val shardToSplit = localTestUtils.getShards().head
+      localTestUtils.splitShard(shardToSplit.getShardId)
+      val (splitOpenShards, splitCloseShards) = localTestUtils.getShards().partition { shard =>
+        shard.getSequenceNumberRange.getEndingSequenceNumber == null
+      }
+
+      // We should have one closed shard and two open shards
+      assert(splitCloseShards.size == 1)
+      assert(splitOpenShards.size == 2)
+
+      eventually(timeout(60 seconds), interval(10 second)) {
+        localTestUtils.pushData(testData2, aggregateTestData)
+        assert(collected.synchronized { collected === (testData1 ++ testData2).toSet },
+          "\nData received does not match data sent after splitting a shard")
+      }
+
+      val (shardToMerge, adjShared) = splitOpenShards match { case Seq(e1, e2) => (e1, e2) }
--- End diff --

Yea, I tried to use `adjacentShard`, but it returned `null`. I don't know why it behaved like this, though; do you know?





[GitHub] spark issue #16175: [SPARK-17460][SQL]Make sure sizeInBytes in Statistics wi...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16175
  
retest this please





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13909
  
Creating a primitive array and setting element values is also an element-wise copy, right? And we need an extra bulk copy to write the primitive array to the unsafe array. By using the unsafe array writer, we can save that extra bulk copy.





[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...

2016-12-09 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16222
  
@mengxr 

I used the following R code and glmnet to check whether `regParam = 0.5` fits a good model.

> iris2 <- iris[iris$Species %in% c("versicolor", "virginica"), ]
> iris.x = as.matrix(iris2[, 1:4])
> iris.y = as.factor(as.character(iris2[, 5]))
> cvfit = cv.glmnet(iris.x, iris.y, family = "binomial", type.measure = "class")
> cvfit$lambda.min
[1] 0.000423808
> cvfit = cv.glmnet(iris.x, iris.y, family = "multinomial", type.measure = "class")
> cvfit$lambda.min
[1] 0.01324703

If I understand correctly, `regParam = 0.5` doesn't fit a good model in either the binomial or the multinomial case, as the minimal lambda is < 0.1 for both.






[GitHub] spark pull request #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHAR...

2016-12-09 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16213#discussion_r91829740
  
--- Diff: external/kinesis-asl/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/CheckpointerShim.java ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package com.amazonaws.services.kinesis.clientlibrary.lib.worker;
+
+import com.amazonaws.services.kinesis.clientlibrary.exceptions.InvalidStateException;
+import com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException;
+import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
+import com.amazonaws.services.kinesis.clientlibrary.types.ExtendedSequenceNumber;
+
+
+/**
+ * This is a class to fix an issue in SPARK-18020. When resharding Kinesis streams,
+ * the KCL throws an exception because Spark does not checkpoint `SHARD_END` to finish reading
+ * closed shards in `KinesisRecordProcessor#shutdown`. This bug finally leads to stopping
+ * subscribing new split (or merged) shards. However, trying checkpoints with `SHARD_END` throws
+ * `IllegalArgumentException` exceptions with messages like "Sequence number must be numeric,
+ * but was SHARD_END". This fix is a workaround and the class will be removed in future
+ * if we find other better solutions.
+ */
+public class CheckpointerShim {
+
+  public static void shutdown(IRecordProcessorCheckpointer checkpointer)
--- End diff --

Yea, I tried with [this version](https://github.com/maropu/spark-kinesis-sql-asl/blob/SPARK-18020-tests/spark/spark-2.0/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisCheckpointer.scala#L78), but I got the same exception:
```
16/12/10 05:49:00 ERROR ShutdownTask: Application exception.
java.lang.IllegalArgumentException: Sequence number must be numeric, but was SHARD_END
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.SequenceNumberValidator.validateSequenceNumber(SequenceNumberValidator.java:75)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:120)
    at org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:77)
    at org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:258)
```





[GitHub] spark pull request #16244: [SQL][minor] simplify a test to fix the maven tes...

2016-12-09 Thread kapilsingh5050
Github user kapilsingh5050 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16244#discussion_r91829731
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
 ---
@@ -98,20 +98,15 @@ class CodeGenerationSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("SPARK-18091: split large if expressions into blocks due to JVM 
code size limit") {
--- End diff --

This test simulates the scenario where large generated code for the components of an
If expression causes the JVM's method code size limit to be hit. So the point of
having this test is that it fails before the fix for SPARK-18091 was committed and
passes afterwards.





[GitHub] spark pull request #16228: [WIP] [SPARK-17076] [SQL] Cardinality estimation ...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16228#discussion_r91829681
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/JoinEstimation.scala
 ---
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.estimation
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
Expression}
+import org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Join, 
Statistics}
+import org.apache.spark.sql.types.DataType
+
+
+object JoinEstimation {
+  import EstimationUtils._
+
+  // scalastyle:off
+  /**
+   * Estimate output size and number of rows after a join operator, and 
propogate updated column
+   * statsitics.
+   * The number of rows of A inner join B on A.k1 = B.k1 is estimated by 
this basic formula:
+   * T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), where V is the 
number of distinct values of
+   * that column. The underlying assumption for this formula is: each 
value of the smaller domain
+   * is included in the larger domain.
+   * Generally, inner join with multiple join keys can also be estimated 
based on the above
+   * formula:
+   * T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), 
V(B.k2)) * ... * max(V(A.kn), V(B.kn)))
+   * However, the denominator can become very large and excessively reduce 
the result, so we use a
+   * conservative strategy to take only the largest max(V(A.ki), V(B.ki)) 
as the denominator.
+   *
+   * @return Return the updated statistics after join. Return `None` if 
the join type is not
+   * supported, or we don't have enough statistics for estimation.
+   */
+  // scalastyle:on
+  def estimate(join: Join): Option[Statistics] = join match {
+case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, 
left, right)
+if supportsJoinType(joinType) && hasRowCountStat(left, right) =>
+
+  // 1. Compute the denominator
+  var ndvDenom: BigInt = -1
+  val keyPairs = extractJoinKeys(leftKeys, rightKeys)
+  val leftStats = left.statistics
+  val rightStats = right.statistics
+  val intersectedStats = new mutable.HashMap[String, ColumnStat]()
+  var i = 0
+  while(i < keyPairs.length && ndvDenom != 0) {
+val (leftKey, rightKey) = keyPairs(i)
+// Do estimation if we have enough statistics
+if (hasColumnStat((left, leftKey), (right, rightKey))) {
+  val leftKeyStats = leftStats.colStats(leftKey.name)
+  val rightKeyStats = rightStats.colStats(rightKey.name)
+
+  // Check if the two sides are disjoint
+  val lRange = Range(leftKeyStats.min, leftKeyStats.max, 
leftKey.dataType)
+  val rRange = Range(rightKeyStats.min, rightKeyStats.max, 
rightKey.dataType)
+  if (Range.isIntersected(lRange, rRange)) {
+// Get the largest ndv among pairs of join keys
+val maxNdv = 
leftKeyStats.distinctCount.max(rightKeyStats.distinctCount)
+if (maxNdv > ndvDenom) ndvDenom = maxNdv
+
+// Update intersected column stats
+val minNdv = 
leftKeyStats.distinctCount.min(rightKeyStats.distinctCount)
+val (newMin1, newMax1, newMin2, newMax2) =
+  Range.intersect(lRange, rRange, leftKey.dataType, 
rightKey.dataType)
+intersectedStats.put(leftKey.name, 
intersectedColumnStat(leftKeyStats, minNdv,
+  newMin1, newMax1))
+intersectedStats.put(rightKey.name, 
intersectedColumnStat(rightKeyStats, minNdv,
+  newMin2, newMax2))
+  } else {
+// Set ndvDenom to zero to indicate that this join should have 
no output
+ndvDenom = 

[GitHub] spark pull request #16244: [SQL][minor] simplify a test to fix the maven tes...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16244#discussion_r91829627
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
 ---
@@ -98,20 +98,15 @@ class CodeGenerationSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("SPARK-18091: split large if expressions into blocks due to JVM 
code size limit") {
--- End diff --

what do you mean? this test was added by SPARK-18091





[GitHub] spark pull request #16244: [SQL][minor] simplify a test to fix the maven tes...

2016-12-09 Thread kapilsingh5050
Github user kapilsingh5050 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16244#discussion_r91829600
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
 ---
@@ -98,20 +98,15 @@ class CodeGenerationSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   }
 
   test("SPARK-18091: split large if expressions into blocks due to JVM 
code size limit") {
--- End diff --

@cloud-fan Just confirming, have you made sure that this test fails without 
the fix for SPARK-18091?





[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16243#discussion_r91829574
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala
 ---
@@ -213,7 +214,8 @@ object ColumnStat extends Logging {
 val nullLit = Literal(null, col.dataType)
 struct(
   ndv, nullLit, nullLit, numNulls,
-  Ceil(Average(Length(col))), Cast(Max(Length(col)), LongType))
+  Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)),
--- End diff --

ok





[GitHub] spark pull request #16179: [SPARK-18752][hive] "isSrcLocal" value should be ...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16179#discussion_r91829517
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala
 ---
@@ -418,4 +431,19 @@ class HiveCommandSuite extends QueryTest with 
SQLTestUtils with TestHiveSingleto
   assert(sql("SHOW PARTITIONS part_datasrc").count() == 3)
 }
   }
+
+  /**
+   * Run a function with a copy of the input file. Use this for tests that 
use "LOAD DATA"
+   * (instead of "LOAD DATA LOCAL") since, according to Hive's semantics, 
files are moved
--- End diff --

then can we test `LOAD DATA` and `LOAD DATA LOCAL` separately? We can add 
comments to explain the semantic difference between them and why we need to 
copy the file





[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16243#discussion_r91829511
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -133,6 +133,79 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
 }
   }
 
+  test("column stats round trip serialization") {
+// Make sure we serialize and then deserialize and we will get the 
result data
+val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+stats.zip(df.schema).foreach { case ((k, v), field) =>
+  withClue(s"column $k with type ${field.dataType}") {
+val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
+assert(roundtrip == Some(v))
+  }
+}
+  }
+
+  test("analyze column command - result verification") {
+val tableName = "column_stats_test2"
+// (data.head.productArity - 1) because the last column does not 
support stats collection.
+assert(stats.size == data.head.productArity - 1)
+val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+checkColStats(df, tableName, stats)
+  }
+
+  private def checkColStats(
+  df: DataFrame,
+  tableName: String,
+  colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = {
+withTable(tableName) {
+  df.write.saveAsTable(tableName)
+
+  // Collect statistics
+  sql(s"analyze table $tableName compute STATISTICS FOR COLUMNS " +
+colStats.keys.mkString(", "))
+
+  // Validate statistics
+  val table = 
spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))
+  assert(table.stats.isDefined)
+  assert(table.stats.get.colStats.size == colStats.size)
+
+  colStats.foreach { case (k, v) =>
+withClue(s"column $k") {
+  assert(table.stats.get.colStats(k) == v)
+}
+  }
+}
+  }
+
+  test("column stats collection for null columns") {
+def nullColumnStat(dataType: DataType): ColumnStat = {
+  ColumnStat(0, None, None, 1, dataType.defaultSize.toLong, 
dataType.defaultSize.toLong)
+}
+
+val tableName = "column_stats_test3"
--- End diff --

You can simplify this a lot by doing it programmatically, e.g.

```

val dataTypes = Seq(BooleanType, ByteType, ...)
val df = sql("select " + dataTypes.map(tpe => s"cast(null as ${tpe.sql})").mkString(", "))

val expectedStats = dataTypes.map { tpe =>
  ColumnStat(0, None, None, 1, tpe.defaultSize.toLong, tpe.defaultSize.toLong)
}

...
```





[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16244
  
**[Test build #69955 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69955/consoleFull)**
 for PR 16244 at commit 
[`2983832`](https://github.com/apache/spark/commit/2983832c39d1cd0fcb71ddd76d3feaa7a810cd11).





[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16244
  
cc @srowen @kapilsingh5050 @ueshin @viirya 





[GitHub] spark pull request #16244: [SQL][minor] simplify a test to fix the maven tes...

2016-12-09 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/16244

[SQL][minor] simplify a test to fix the maven tests

## What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/15620 , all of the Maven-based 2.0 Jenkins
jobs time out consistently. As I pointed out in
https://github.com/apache/spark/pull/15620#discussion_r91829129 , the regression test
is overkill and may hit the constant pool size limitation, which is a known issue
that hasn't been fixed yet.

Since #15620 only fixes the code size limitation, we can simplify the test to avoid
hitting the constant pool size limitation.

## How was this patch tested?

test only change

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16244


commit 2983832c39d1cd0fcb71ddd76d3feaa7a810cd11
Author: Wenchen Fan 
Date:   2016-12-10T05:14:00Z

simplify a test







[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16222
  
**[Test build #69954 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69954/consoleFull)**
 for PR 16222 at commit 
[`5fe125f`](https://github.com/apache/spark/commit/5fe125f11f04d481507cae246c33bc4969c43e2e).





[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16243#discussion_r91829428
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -133,6 +133,79 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
 }
   }
 
+  test("column stats round trip serialization") {
+// Make sure we serialize and then deserialize and we will get the 
result data
+val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+stats.zip(df.schema).foreach { case ((k, v), field) =>
+  withClue(s"column $k with type ${field.dataType}") {
+val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
+assert(roundtrip == Some(v))
+  }
+}
+  }
+
+  test("analyze column command - result verification") {
+val tableName = "column_stats_test2"
+// (data.head.productArity - 1) because the last column does not 
support stats collection.
+assert(stats.size == data.head.productArity - 1)
+val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+checkColStats(df, tableName, stats)
+  }
+
+  private def checkColStats(
--- End diff --

Need to add a line of comment documenting this function and change the 
signature to 

```
/**
  * Compute stats for the given DataFrame and compare it with colStats.
  */
private def checkStats(df: DataFrame, colStats: LinkedHashMap[String, ColumnStat]): Unit = {
}
```

Just generate a random tableName.





[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16243#discussion_r91829432
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -133,6 +133,79 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
 }
   }
 
+  test("column stats round trip serialization") {
+// Make sure we serialize and then deserialize and we will get the 
result data
+val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+stats.zip(df.schema).foreach { case ((k, v), field) =>
+  withClue(s"column $k with type ${field.dataType}") {
+val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
+assert(roundtrip == Some(v))
+  }
+}
+  }
+
+  test("analyze column command - result verification") {
+val tableName = "column_stats_test2"
+// (data.head.productArity - 1) because the last column does not 
support stats collection.
+assert(stats.size == data.head.productArity - 1)
+val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+checkColStats(df, tableName, stats)
+  }
+
+  private def checkColStats(
--- End diff --

Also don't put this in the middle of the function. Move it into the 
abstract class StatisticsCollectionTestBase





[GitHub] spark pull request #16179: [SPARK-18752][hive] "isSrcLocal" value should be ...

2016-12-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16179#discussion_r91829413
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala
 ---
@@ -418,4 +431,19 @@ class HiveCommandSuite extends QueryTest with 
SQLTestUtils with TestHiveSingleto
   assert(sql("SHOW PARTITIONS part_datasrc").count() == 3)
 }
   }
+
+  /**
+   * Run a function with a copy of the input file. Use this for tests that 
use "LOAD DATA"
+   * (instead of "LOAD DATA LOCAL") since, according to Hive's semantics, 
files are moved
--- End diff --

Ah, the tests need to be updated because `loadTable` is now being called with
`isSrcLocal = false`. That causes the source file to be moved instead of copied,
which makes subsequent unit tests fail. (That's the cause of the initial test
failures in this PR.)
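
For illustration, a self-contained sketch (hypothetical object and helper names,
plain JDK file APIs) of the "run against a copy" idea the new test helper
implements, so that a moved source file does not break later tests:

```
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

object WithCopiedFileSketch {
  // Run `f` against a throw-away copy of `src`. Plain "LOAD DATA" moves the file it
  // is given, so tests for it should consume a copy; "LOAD DATA LOCAL" copies the
  // file itself and needs no such help.
  def withCopiedFile[T](src: File)(f: File => T): T = {
    val copy = File.createTempFile("load_data_", "_" + src.getName)
    Files.copy(src.toPath, copy.toPath, StandardCopyOption.REPLACE_EXISTING)
    try f(copy) finally { if (copy.exists()) copy.delete() }
  }

  def main(args: Array[String]): Unit = {
    val src = File.createTempFile("employee_", ".dat")
    Files.write(src.toPath, "1|alice\n2|bob\n".getBytes("UTF-8"))
    withCopiedFile(src) { copy =>
      // A test would issue s"LOAD DATA INPATH '${copy.getAbsolutePath}' INTO TABLE t"
      // here; even if that moves `copy`, the original `src` survives for later tests.
      println(s"would LOAD DATA from ${copy.getAbsolutePath}")
    }
    assert(src.exists())
  }
}
```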





[GitHub] spark issue #16179: [SPARK-18752][hive] "isSrcLocal" value should be set fro...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16179
  
The changes LGTM, as we do propagate `isSrcLocal` incorrectly today. It would be
better if we could also fix the inconsistent behavior of `LOAD DATA` between Spark
and Hive, and improve the test coverage, in a follow-up.





[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16243#discussion_r91829403
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala
 ---
@@ -213,7 +214,8 @@ object ColumnStat extends Logging {
 val nullLit = Literal(null, col.dataType)
 struct(
   ndv, nullLit, nullLit, numNulls,
-  Ceil(Average(Length(col))), Cast(Max(Length(col)), LongType))
+  Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)),
--- End diff --

add a line of comment explaining that we set the value to defaultSize if all the
values are null (or if there is no value).
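
For reference, a self-contained sketch (public DataFrame API only, not the internal
`ColumnStat` code) of the behaviour that comment would document: `avg(length(col))`
comes back null when the column holds only nulls, and the `coalesce` falls back to
the type's default size.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

object NullColumnLengthSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // A string column containing only nulls, the case that used to NPE.
    val df = Seq[String](null, null).toDF("c")

    // avg(length(c)) and max(length(c)) are both null here, so fall back to
    // the type's defaultSize, mirroring the Coalesce(...) in the patch.
    val defaultSize = lit(StringType.defaultSize.toLong)
    df.select(
      coalesce(ceil(avg(length($"c"))), defaultSize).as("avgLen"),
      coalesce(max(length($"c")).cast("long"), defaultSize).as("maxLen")
    ).show()

    spark.stop()
  }
}
```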






[GitHub] spark pull request #16179: [SPARK-18752][hive] "isSrcLocal" value should be ...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16179#discussion_r91829355
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala
 ---
@@ -418,4 +431,19 @@ class HiveCommandSuite extends QueryTest with 
SQLTestUtils with TestHiveSingleto
   assert(sql("SHOW PARTITIONS part_datasrc").count() == 3)
 }
   }
+
+  /**
+   * Run a function with a copy of the input file. Use this for tests that 
use "LOAD DATA"
+   * (instead of "LOAD DATA LOCAL") since, according to Hive's semantics, 
files are moved
--- End diff --

The semantic change happened in Hive 2.1, so it looks like we don't need to update
the tests for now?





[GitHub] spark pull request #16222: [SPARK-18797][SparkR]:Update spark.logit in spark...

2016-12-09 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16222#discussion_r91829349
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -768,8 +768,46 @@ newDF <- createDataFrame(data.frame(x = c(1.5, 3.2)))
 head(predict(isoregModel, newDF))
 ```
 
- What's More?
-We also expect Decision Tree, Random Forest, Kolmogorov-Smirnov Test 
coming in the next version 2.1.0.
+### Logistic Regression Model
+
+(Added in 2.1.0)
+
+[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) 
is a widely-used model when the response is categorical. It can be seen as a 
special case of the [Generalized Linear 
Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
+There are two types of logistic regression models, namely binomial 
logistic regression (i.e., response is binary) and multinomial
+logistic regression (i.e., response falls into multiple classes). We 
provide `spark.logit` on top of `spark.glm` to support logistic regression with 
advanced hyper-parameters.
+It supports both binary and multiclass classification, elastic-net 
regularization, and feature standardization, similar to `glmnet`.
+
+
+`spark.logit` fits an Logistic Regression Model against a Spark DataFrame. 
The `family` parameter can be used to select between the
+binomial and multinomial algorithms, or leave it unset and Spark will 
infer the correct variant.
+
+We use a simple example to demonstrate `spark.logit` usage. In general, 
there are three steps of using `spark.logit`:
+1). Create a dataframe from a proper data source; 2). Fit a logistic 
regression model using `spark.logit` with a proper parameter setting;
+and 3). Obtain the coefficient matrix of the fitted model using `summary` 
and use the model for prediction with `predict`.
+
+Binomial logistic regression
+```{r, warning=FALSE}
+df <- createDataFrame(iris)
+# Create a dataframe containing two classes
+training <- df[df$Species %in% c("versicolor", "virginica"), ]
+model <- spark.logit(training, Species ~ ., regParam = 0.5)
--- End diff --

I adapted the test into this example. I didn't check whether `regParam = 0.5`
returns a good model or not. I can do some experiments to check it out.





[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16243#discussion_r91829332
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -133,6 +133,79 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
 }
   }
 
+  test("column stats round trip serialization") {
--- End diff --

why did you move the test cases?





[GitHub] spark pull request #15620: [SPARK-18091] [SQL] Deep if expressions cause Gen...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15620#discussion_r91829129
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
 ---
@@ -97,6 +97,27 @@ class CodeGenerationSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 assert(actual(0) == cases)
   }
 
+  test("SPARK-18091: split large if expressions into blocks due to JVM 
code size limit") {
+val inStr = "StringForTesting"
+val row = create_row(inStr)
+val inputStrAttr = 'a.string.at(0)
+
+var strExpr: Expression = inputStrAttr
+for (_ <- 1 to 13) {
+  strExpr = If(EqualTo(Decode(Encode(strExpr, "utf-8"), "utf-8"), 
inputStrAttr),
--- End diff --

cc @srowen I think this is the root cause. This test is overkill: although this PR
fixed the code size limitation problem, the test may still hit the constant pool
size limitation, which is a known limitation and has not been fixed yet. It seems
that Maven and SBT use different JVM settings when running tests, so the problem
only shows up on the Maven side.

I'm going to submit a PR to simplify this test a bit.





[GitHub] spark issue #16230: [SPARK-13747][Core]Fix potential ThreadLocal leaks in RP...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16230
  
**[Test build #3486 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3486/consoleFull)**
 for PR 16230 at commit 
[`3fade95`](https://github.com/apache/spark/commit/3fade9526fda131535a5e37894474dc8e6ac09af).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15717: [SPARK-17910][SQL] Allow users to update the comment of ...

2016-12-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15717
  
OK, then the parser rule looks good to me. My only concern is the new APIs in
`ExternalCatalog`; I don't think they are necessary. @jiangxb1987, what was your
motivation for adding them?





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-09 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91828427
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
   tableIdent.table.toLowerCase)
   }
 
+  /** ReadWriteLock for each tables, protect the read and write cached */
+  private val tableLockStripes = Striped.lazyWeakLock(10)
--- End diff --

nit: `tableCreationLocks`.





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-09 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91828417
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
   tableIdent.table.toLowerCase)
   }
 
+  /** ReadWriteLock for each tables, protect the read and write cached */
+  private val tableLockStripes = Striped.lazyWeakLock(10)
--- End diff --

Hm, may as well make it 100 if they're lazy weak locks.





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-09 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91828385
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
   tableIdent.table.toLowerCase)
   }
 
+  /** ReadWriteLock for each tables, protect the read and write cached */
--- End diff --

Could you update this comment to say that the reason we lock is to prevent 
concurrent table instantiation?





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-09 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91828390
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
   tableIdent.table.toLowerCase)
   }
 
+  /** ReadWriteLock for each tables, protect the read and write cached */
+  private val tableLockStripes = Striped.lazyWeakLock(10)
+
+  /** Acquires a lock on the table cache for the duration of `f`. */
+  private def cacheLock[A](tableName: QualifiedTableName, f: => A): A = {
--- End diff --

withTableCreationLock
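
A minimal sketch of what that could look like (assuming Guava's `Striped` locks as
in the quoted diff, and the 100 stripes suggested above):

```
import java.util.concurrent.locks.Lock
import com.google.common.util.concurrent.Striped

object TableCreationLockSketch {
  // One lock per stripe; table names hash onto stripes, so unrelated tables rarely contend.
  private val tableCreationLocks: Striped[Lock] = Striped.lazyWeakLock(100)

  /** Acquires the per-table lock for the duration of `f`, so a cached relation is only instantiated once. */
  def withTableCreationLock[A](tableName: String)(f: => A): A = {
    val lock = tableCreationLocks.get(tableName)
    lock.lock()
    try f finally lock.unlock()
  }

  def main(args: Array[String]): Unit = {
    val v = withTableCreationLock("db.tbl") { 40 + 2 }
    println(v) // 42
  }
}
```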





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-09 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91828378
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -33,6 +35,7 @@ import org.apache.spark.sql.hive.orc.OrcFileFormat
 import org.apache.spark.sql.types._
 
 
+
--- End diff --

nit: extra newline





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-09 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91828410
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
 ---
@@ -352,4 +353,28 @@ class PartitionedTablePerfStatsSuite
   }
 }
   }
+
+  test("SPARK-18700: add lock for each table's realation in cache") {
--- End diff --

"table loaded only once even when resolved concurrently"





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-09 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91828396
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
 ---
@@ -352,4 +353,28 @@ class PartitionedTablePerfStatsSuite
   }
 }
   }
+
+  test("SPARK-18700: add lock for each table's realation in cache") {
+withTable("test") {
+  withTempDir { dir =>
+HiveCatalogMetrics.reset()
+setupPartitionedHiveTable("test", dir)
+// select the table in multi-threads
+val executorPool = Executors.newFixedThreadPool(10)
+(1 to 10).map(threadId => {
+  val runnable = new Runnable {
+override def run(): Unit = {
+  spark.sql("select * from test where partCol1 = 999").count()
+}
+  }
+  executorPool.execute(runnable)
+  None
+})
+executorPool.shutdown()
+executorPool.awaitTermination(30, TimeUnit.SECONDS)
+// check the cache hit, the cache only load once
+
assert(HiveCatalogMetrics.METRIC_DATASOUCE_TABLE_CACHE_HITS.getCount() == 9)
--- End diff --

Does this test fail without the lock?





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-09 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91828375
  
--- Diff: 
core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala ---
@@ -97,6 +97,12 @@ object HiveCatalogMetrics extends Source {
 MetricRegistry.name("parallelListingJobCount"))
 
   /**
+   * Tracks the total number of cachedDataSourceTables hits.
+   */
+  val METRIC_DATASOUCE_TABLE_CACHE_HITS = metricRegistry.counter(
+MetricRegistry.name("dataSourceTableCacheHits"))
--- End diff --

Could we use one of the other metrics, rather than add a new one?





[GitHub] spark issue #16204: [SPARK-18775][SQL] Limit the max number of records writt...

2016-12-09 Thread ericl
Github user ericl commented on the issue:

https://github.com/apache/spark/pull/16204
  
LGTM





[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16243
  
**[Test build #69953 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69953/consoleFull)**
 for PR 16243 at commit 
[`674cd7c`](https://github.com/apache/spark/commit/674cd7cac3d542f1b1379fef416e3fce2bbc8cbc).





[GitHub] spark issue #16238: [SPARK-18811] StreamSource resolution should happen in s...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16238
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16238: [SPARK-18811] StreamSource resolution should happen in s...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16238
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69952/
Test PASSed.





[GitHub] spark issue #16238: [SPARK-18811] StreamSource resolution should happen in s...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16238
  
**[Test build #69952 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69952/consoleFull)**
 for PR 16238 at commit 
[`e6d1e25`](https://github.com/apache/spark/commit/e6d1e251097cc5a71ef67dec995f5cf86f977891).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class BlockingSource extends StreamSourceProvider with 
StreamSinkProvider `





[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...

2016-12-09 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/16243
  
cc @cloud-fan @rxin





[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...

2016-12-09 Thread wzhfy
GitHub user wzhfy opened a pull request:

https://github.com/apache/spark/pull/16243

[SPARK-18815] [SQL] Fix NPE when collecting column stats for string/binary 
column having only null values

## What changes were proposed in this pull request?

During column stats collection, the average and max length will be null if a column
of string/binary type has only null values. To fix this, I use the type's default
size when the avg/max length is null.

## How was this patch tested?

Add a test for handling null columns

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wzhfy/spark nullStats

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16243.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16243


commit 674cd7cac3d542f1b1379fef416e3fce2bbc8cbc
Author: wangzhenhua 
Date:   2016-12-10T03:19:19Z

use default size for null columns







[GitHub] spark issue #16220: [SPARK-18796][SS]StreamingQueryManager should not block ...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16220
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16220: [SPARK-18796][SS]StreamingQueryManager should not block ...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16220
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69951/
Test PASSed.





[GitHub] spark issue #16220: [SPARK-18796][SS]StreamingQueryManager should not block ...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16220
  
**[Test build #69951 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69951/consoleFull)**
 for PR 16220 at commit 
[`2db339e`](https://github.com/apache/spark/commit/2db339ed0e14cb3732a8359a9e88fcf379965a1a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16237: [SPARK-18807][SPARKR] Should suppress output prin...

2016-12-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16237





[GitHub] spark issue #16237: [SPARK-18807][SPARKR] Should suppress output print for c...

2016-12-09 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/16237
  
Merging into master, branch-2.1





[GitHub] spark issue #15936: [SPARK-18504][SQL] Scalar subquery with extra group by c...

2016-12-09 Thread ericl
Github user ericl commented on the issue:

https://github.com/apache/spark/pull/15936
  
This seems to be causing https://issues.apache.org/jira/browse/SPARK-18814





[GitHub] spark pull request #16228: [WIP] [SPARK-17076] [SQL] Cardinality estimation ...

2016-12-09 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16228#discussion_r91826623
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/JoinEstimation.scala
 ---
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.estimation
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
Expression}
+import org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Join, 
Statistics}
+import org.apache.spark.sql.types.DataType
+
+
+object JoinEstimation {
+  import EstimationUtils._
+
+  // scalastyle:off
+  /**
+   * Estimate output size and number of rows after a join operator, and 
propagate updated column
+   * statistics.
+   * The number of rows of A inner join B on A.k1 = B.k1 is estimated by 
this basic formula:
+   * T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), where V is the 
number of distinct values of
+   * that column. The underlying assumption for this formula is: each 
value of the smaller domain
+   * is included in the larger domain.
+   * Generally, inner join with multiple join keys can also be estimated 
based on the above
+   * formula:
+   * T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), 
V(B.k2)) * ... * max(V(A.kn), V(B.kn)))
+   * However, the denominator can become very large and excessively reduce 
the result, so we use a
+   * conservative strategy to take only the largest max(V(A.ki), V(B.ki)) 
as the denominator.
+   *
+   * @return Return the updated statistics after join. Return `None` if 
the join type is not
+   * supported, or we don't have enough statistics for estimation.
+   */
+  // scalastyle:on
+  def estimate(join: Join): Option[Statistics] = join match {
+case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, 
left, right)
+if supportsJoinType(joinType) && hasRowCountStat(left, right) =>
+
+  // 1. Compute the denominator
+  var ndvDenom: BigInt = -1
+  val keyPairs = extractJoinKeys(leftKeys, rightKeys)
+  val leftStats = left.statistics
+  val rightStats = right.statistics
+  val intersectedStats = new mutable.HashMap[String, ColumnStat]()
+  var i = 0
+  while (i < keyPairs.length && ndvDenom != 0) {
+val (leftKey, rightKey) = keyPairs(i)
+// Do estimation if we have enough statistics
+if (hasColumnStat((left, leftKey), (right, rightKey))) {
+  val leftKeyStats = leftStats.colStats(leftKey.name)
+  val rightKeyStats = rightStats.colStats(rightKey.name)
+
+  // Check if the two sides are disjoint
+  val lRange = Range(leftKeyStats.min, leftKeyStats.max, 
leftKey.dataType)
+  val rRange = Range(rightKeyStats.min, rightKeyStats.max, 
rightKey.dataType)
+  if (Range.isIntersected(lRange, rRange)) {
+// Get the largest ndv among pairs of join keys
+val maxNdv = 
leftKeyStats.distinctCount.max(rightKeyStats.distinctCount)
+if (maxNdv > ndvDenom) ndvDenom = maxNdv
+
+// Update intersected column stats
+val minNdv = 
leftKeyStats.distinctCount.min(rightKeyStats.distinctCount)
+val (newMin1, newMax1, newMin2, newMax2) =
+  Range.intersect(lRange, rRange, leftKey.dataType, 
rightKey.dataType)
+intersectedStats.put(leftKey.name, 
intersectedColumnStat(leftKeyStats, minNdv,
+  newMin1, newMax1))
+intersectedStats.put(rightKey.name, 
intersectedColumnStat(rightKeyStats, minNdv,
+  newMin2, newMax2))
+  } else {
+// Set ndvDenom to zero to indicate that this join should have 
no output
+ndvDenom = 0
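
For intuition on the disjoint-range branch above, a minimal sketch (assuming purely numeric key ranges; `rangesIntersect` is a hypothetical stand-in for illustration, not the PR's `Range.isIntersected`):

```scala
// If the [min, max] ranges of the two join keys do not overlap, an equi-join
// on that key can produce no rows, which is why ndvDenom is set to 0 above.
def rangesIntersect(minA: BigInt, maxA: BigInt, minB: BigInt, maxB: BigInt): Boolean =
  maxA >= minB && maxB >= minA

rangesIntersect(1, 10, 20, 30)  // false => estimated join output is 0 rows
rangesIntersect(1, 10, 5, 30)   // true  => fall through to the NDV-based formula
```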
  

[GitHub] spark pull request #16228: [WIP] [SPARK-17076] [SQL] Cardinality estimation ...

2016-12-09 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/16228#discussion_r91826564
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/JoinEstimation.scala
 ---
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.estimation
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
Expression}
+import org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Join, 
Statistics}
+import org.apache.spark.sql.types.DataType
+
+
+object JoinEstimation {
+  import EstimationUtils._
+
+  // scalastyle:off
+  /**
+   * Estimate output size and number of rows after a join operator, and 
propagate updated column
+   * statistics.
+   * The number of rows of A inner join B on A.k1 = B.k1 is estimated by 
this basic formula:
+   * T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), where V is the 
number of distinct values of
+   * that column. The underlying assumption for this formula is: each 
value of the smaller domain
+   * is included in the larger domain.
+   * Generally, inner join with multiple join keys can also be estimated 
based on the above
+   * formula:
+   * T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), 
V(B.k2)) * ... * max(V(A.kn), V(B.kn)))
+   * However, the denominator can become very large and excessively reduce 
the result, so we use a
+   * conservative strategy to take only the largest max(V(A.ki), V(B.ki)) 
as the denominator.
+   *
+   * @return Return the updated statistics after join. Return `None` if 
the join type is not
+   * supported, or we don't have enough statistics for estimation.
+   */
+  // scalastyle:on
+  def estimate(join: Join): Option[Statistics] = join match {
+case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, 
left, right)
+if supportsJoinType(joinType) && hasRowCountStat(left, right) =>
+
+  // 1. Compute the denominator
+  var ndvDenom: BigInt = -1
+  val keyPairs = extractJoinKeys(leftKeys, rightKeys)
+  val leftStats = left.statistics
+  val rightStats = right.statistics
+  val intersectedStats = new mutable.HashMap[String, ColumnStat]()
+  var i = 0
+  while (i < keyPairs.length && ndvDenom != 0) {
+val (leftKey, rightKey) = keyPairs(i)
+// Do estimation if we have enough statistics
+if (hasColumnStat((left, leftKey), (right, rightKey))) {
+  val leftKeyStats = leftStats.colStats(leftKey.name)
+  val rightKeyStats = rightStats.colStats(rightKey.name)
+
+  // Check if the two sides are disjoint
+  val lRange = Range(leftKeyStats.min, leftKeyStats.max, 
leftKey.dataType)
+  val rRange = Range(rightKeyStats.min, rightKeyStats.max, 
rightKey.dataType)
+  if (Range.isIntersected(lRange, rRange)) {
+// Get the largest ndv among pairs of join keys
+val maxNdv = 
leftKeyStats.distinctCount.max(rightKeyStats.distinctCount)
+if (maxNdv > ndvDenom) ndvDenom = maxNdv
+
+// Update intersected column stats
+val minNdv = 
leftKeyStats.distinctCount.min(rightKeyStats.distinctCount)
+val (newMin1, newMax1, newMin2, newMax2) =
+  Range.intersect(lRange, rRange, leftKey.dataType, 
rightKey.dataType)
+intersectedStats.put(leftKey.name, 
intersectedColumnStat(leftKeyStats, minNdv,
+  newMin1, newMax1))
+intersectedStats.put(rightKey.name, 
intersectedColumnStat(rightKeyStats, minNdv,
+  newMin2, newMax2))
+  } else {
+// Set ndvDenom to zero to indicate that this join should have 
no output
+ndvDenom = 0
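
For a concrete feel of the basic formula in the comment above, a minimal standalone sketch (illustrative only, not the PR's implementation; the helper name and the numbers are made up):

```scala
// Sketch of T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), generalized to
// multiple join keys with the conservative strategy from the comment: only the
// single largest max(V(A.ki), V(B.ki)) is used as the denominator.
def estimateInnerJoinRows(
    rowsA: BigInt,
    rowsB: BigInt,
    ndvPerKeyPair: Seq[(BigInt, BigInt)]): BigInt = {
  if (ndvPerKeyPair.isEmpty) {
    rowsA * rowsB  // no usable key stats: fall back to the cartesian upper bound
  } else {
    // the disjoint-range (zero-output) case is assumed to be handled separately
    val denom = ndvPerKeyPair.map { case (l, r) => l.max(r) }.max
    rowsA * rowsB / denom
  }
}

// T(A) = 1000, T(B) = 500, V(A.k1) = 100, V(B.k1) = 40
// => 1000 * 500 / max(100, 40) = 5000 estimated output rows
estimateInnerJoinRows(1000, 500, Seq((BigInt(100), BigInt(40))))
```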
  

[GitHub] spark issue #16228: [WIP] [SPARK-17076] [SQL] Cardinality estimation for joi...

2016-12-09 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/16228
  
@Tagar Thanks for sharing this information. Yes, it would be better to use 
PK/FK, but that won't be done in this PR; PK/FK constraints would need to be 
implemented in Spark first. 

> the assumption is if two tables being joined by columns with the same 
name, join columns have the same stats / set of values?

It is true for inner join, but not true for outer joins, right?
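
A tiny Spark shell sketch of that point (illustrative only; assumes a SparkSession with `spark.implicits._` in scope):

```scala
val a = Seq(1, 2, 3).toDF("k")   // V(a.k)  = {1, 2, 3}
val b = Seq(2, 3, 4).toDF("k2")  // V(b.k2) = {2, 3, 4}

// Inner join: both key columns in the output carry the same intersected
// value set {2, 3}, so assuming identical stats on both sides is reasonable.
a.join(b, a("k") === b("k2"), "inner").show()

// Left outer join: the output keeps a.k = 1 with a NULL on the b side, so the
// two key columns no longer share the same set of values or the same stats.
a.join(b, a("k") === b("k2"), "left_outer").show()
```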


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16230: [SPARK-13747][Core]Fix potential ThreadLocal leaks in RP...

2016-12-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16230
  
**[Test build #3486 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3486/consoleFull)**
 for PR 16230 at commit 
[`3fade95`](https://github.com/apache/spark/commit/3fade9526fda131535a5e37894474dc8e6ac09af).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16230: [SPARK-13747][Core]Fix potential ThreadLocal leaks in RP...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16230
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16230: [SPARK-13747][Core]Fix potential ThreadLocal leaks in RP...

2016-12-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16230
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69950/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


