[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16243#discussion_r91831031

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -180,35 +214,28 @@ abstract class StatisticsCollectionTestBase extends QueryTest with SQLTestUtils
     "ctimestamp" -> ColumnStat(2, Some(t1), Some(t2), 1, 8, 8)
   )

-  test("column stats round trip serialization") {
-    // Make sure we serialize and then deserialize and we will get the result data
-    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
-    stats.zip(df.schema).foreach { case ((k, v), field) =>
-      withClue(s"column $k with type ${field.dataType}") {
-        val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
-        assert(roundtrip == Some(v))
-      }
-    }
-  }
-
-  test("analyze column command - result verification") {
-    val tableName = "column_stats_test2"
-    // (data.head.productArity - 1) because the last column does not support stats collection.
-    assert(stats.size == data.head.productArity - 1)
-    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+  private val randomName = new Random(31)
+
+  /**
+   * Compute column stats for the given DataFrame and compare it with colStats.
+   */
+  def checkColStats(
+      df: DataFrame,
+      colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = {
+    val tableName = "column_stats_test_" + randomName.nextInt(1000)
--- End diff --

Why not just use a fixed name like `stats_test_tbl`? It will be used in `withTable`, which means we will drop this table every time.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13909

Ah, now I see my misunderstanding. In this discussion, "intermediate array" means the `new int[]`. Yes, let me do it.
[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16193

**[Test build #69961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69961/consoleFull)** for PR 16193 at commit [`2c8e593`](https://github.com/apache/spark/commit/2c8e593a2a705f536a284581f33c469574695015).
[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16244

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69955/
[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16244

Merged build finished. Test PASSed.
[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16244

**[Test build #69955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69955/consoleFull)** for PR 16244 at commit [`2983832`](https://github.com/apache/spark/commit/2983832c39d1cd0fcb71ddd76d3feaa7a810cd11).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16243

**[Test build #69960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69960/consoleFull)** for PR 16243 at commit [`618c998`](https://github.com/apache/spark/commit/618c998b44b5ead0df193fc052e8aa59cf221e90).
[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/16243#discussion_r91830908

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -180,35 +214,28 @@ abstract class StatisticsCollectionTestBase extends QueryTest with SQLTestUtils
     "ctimestamp" -> ColumnStat(2, Some(t1), Some(t2), 1, 8, 8)
   )

-  test("column stats round trip serialization") {
-    // Make sure we serialize and then deserialize and we will get the result data
-    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
-    stats.zip(df.schema).foreach { case ((k, v), field) =>
-      withClue(s"column $k with type ${field.dataType}") {
-        val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
-        assert(roundtrip == Some(v))
-      }
-    }
-  }
-
-  test("analyze column command - result verification") {
-    val tableName = "column_stats_test2"
-    // (data.head.productArity - 1) because the last column does not support stats collection.
-    assert(stats.size == data.head.productArity - 1)
-    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+  private val randomName = new Random(31)
+
+  /**
+   * Compute column stats for the given DataFrame and compare it with colStats.
+   */
+  def checkColStats(
+      df: DataFrame,
+      colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = {
+    val tableName = "column_stats_test_" + randomName.nextInt(10)
--- End diff --

ok, done.
[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16243#discussion_r91830853

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -180,35 +214,28 @@ abstract class StatisticsCollectionTestBase extends QueryTest with SQLTestUtils
     "ctimestamp" -> ColumnStat(2, Some(t1), Some(t2), 1, 8, 8)
   )

-  test("column stats round trip serialization") {
-    // Make sure we serialize and then deserialize and we will get the result data
-    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
-    stats.zip(df.schema).foreach { case ((k, v), field) =>
-      withClue(s"column $k with type ${field.dataType}") {
-        val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
-        assert(roundtrip == Some(v))
-      }
-    }
-  }
-
-  test("analyze column command - result verification") {
-    val tableName = "column_stats_test2"
-    // (data.head.productArity - 1) because the last column does not support stats collection.
-    assert(stats.size == data.head.productArity - 1)
-    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+  private val randomName = new Random(31)
+
+  /**
+   * Compute column stats for the given DataFrame and compare it with colStats.
+   */
+  def checkColStats(
+      df: DataFrame,
+      colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = {
+    val tableName = "column_stats_test_" + randomName.nextInt(10)
--- End diff --

The chance of collision here is not low; just use a broader key space for `nextInt`.
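To illustrate the collision concern above: with only 10 possible suffixes, independently generated table names collide quickly (the birthday problem). A small Scala sketch (illustrative only, not code from the PR; the 5-name scenario is a hypothetical):

```scala
// Probability that n names drawn uniformly from k possible suffixes
// are all distinct: k/k * (k-1)/k * ... * (k-n+1)/k.
def allDistinctProb(n: Int, k: Int): Double =
  (0 until n).map(i => (k - i).toDouble / k).product

// With nextInt(10), just 5 generated names already collide more often
// than not; a wider key space makes collisions negligible.
val pSmall = 1.0 - allDistinctProb(5, 10)      // ≈ 0.6976
val pWide  = 1.0 - allDistinctProb(5, 1000000) // ≈ 0.00001
println(f"k=10: $pSmall%.4f, k=1000000: $pWide%.6f")
```

This is why a broader `nextInt` range (or a fixed name combined with `withTable`, as suggested in the other comment) is the safer choice.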
[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16243

**[Test build #69959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69959/consoleFull)** for PR 16243 at commit [`153e382`](https://github.com/apache/spark/commit/153e382efa54b1cc6da9eb697f04fe734d038d7d).
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13909

I think it has nothing to do with physical operators; we are talking about `CreateArray`, right? Avoiding the creation of the intermediate primitive array should generally be faster.
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13909

I understand what you want to do, and I agree with it in this case, because the child of `Project` is an array creation.

== Physical Plan ==
*Project [array((value#2 + 1.1), (value#2 + 2.2)) AS array((value + 1.1), (value + 2.2))#10]
+- *SerializeFromObject [input[0, double, false] AS value#2]
   +- Scan ExternalRDDScan[obj#1]

I still have two questions:

1. Is this optimization general? If there are additional operations between `Project` and the array creation, what physical plan is generated?
2. How can we detect this case at the array creation? A physical plan tree is evaluated bottom-up.

Let me clarify these questions.
[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/16243#discussion_r91830643

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -133,6 +133,79 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
     }
   }

+  test("column stats round trip serialization") {
+    // Make sure we serialize and then deserialize and we will get the result data
+    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+    stats.zip(df.schema).foreach { case ((k, v), field) =>
+      withClue(s"column $k with type ${field.dataType}") {
+        val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap)
+        assert(roundtrip == Some(v))
+      }
+    }
+  }
+
+  test("analyze column command - result verification") {
+    val tableName = "column_stats_test2"
+    // (data.head.productArity - 1) because the last column does not support stats collection.
+    assert(stats.size == data.head.productArity - 1)
+    val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
+    checkColStats(df, tableName, stats)
+  }
+
+  private def checkColStats(
+      df: DataFrame,
+      tableName: String,
+      colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = {
+    withTable(tableName) {
+      df.write.saveAsTable(tableName)
+
+      // Collect statistics
+      sql(s"analyze table $tableName compute STATISTICS FOR COLUMNS " +
+        colStats.keys.mkString(", "))
+
+      // Validate statistics
+      val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))
+      assert(table.stats.isDefined)
+      assert(table.stats.get.colStats.size == colStats.size)
+
+      colStats.foreach { case (k, v) =>
+        withClue(s"column $k") {
+          assert(table.stats.get.colStats(k) == v)
+        }
+      }
+    }
+  }
+
+  test("column stats collection for null columns") {
+    def nullColumnStat(dataType: DataType): ColumnStat = {
+      ColumnStat(0, None, None, 1, dataType.defaultSize.toLong, dataType.defaultSize.toLong)
+    }
+
+    val tableName = "column_stats_test3"
--- End diff --

That's much better, thanks!
[GitHub] spark issue #16220: [SPARK-18796][SS]StreamingQueryManager should not block ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16220

**[Test build #69958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69958/consoleFull)** for PR 16220 at commit [`4be4149`](https://github.com/apache/spark/commit/4be4149d81d9860445ce4b53ae5951c1467632f4).
[GitHub] spark pull request #16238: [SPARK-18811] StreamSource resolution should happ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/16238#discussion_r91830440

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -214,6 +228,10 @@ class StreamExecution(
     // While active, repeatedly attempt to run batches.
     SparkSession.setActiveSession(sparkSession)

+    updateStatusMessage("Initializing sources")
--- End diff --

NVM. My suggestion is not right since it will block `StreamingQuery.start`.
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13909

Yeah, we can avoid the intermediate primitive array. Maybe we can also benchmark against it.
[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16244

Actually I can't reproduce this issue locally, but from looking at the logs I'm 90% sure this is the cause. The only way to verify it may be merging and checking the Jenkins Maven status again.
[GitHub] spark issue #15071: [SPARK-17517][SQL]Improve generated Code for BroadcastHa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15071

Can one of the admins verify this patch?
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13909

In `CreateArray` we can just evaluate each expression and write the results to the unsafe array with the writer; no intermediate data needs to be created.
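The idea in the comment above can be sketched as follows. This is a hypothetical illustration, not Spark's actual codegen or its `UnsafeArrayWriter` API: instead of first materializing an intermediate primitive array from the child expressions and then bulk-copying it, each child result is written straight into the target buffer.

```scala
// Hypothetical stand-in for an unsafe array writer (assumption: the real
// Spark writer differs; this only illustrates the "write directly" idea).
final class SketchArrayWriter(n: Int) {
  private val buf = new Array[Double](n) // stands in for the unsafe buffer
  def write(ordinal: Int, value: Double): Unit = buf(ordinal) = value
  def result: Array[Double] = buf
}

// Evaluate each child expression and write its result directly, avoiding
// an intermediate `new double[n]` that would later be copied.
def writeArray(children: IndexedSeq[() => Double]): Array[Double] = {
  val writer = new SketchArrayWriter(children.length)
  var i = 0
  while (i < children.length) {
    writer.write(i, children(i)())
    i += 1
  }
  writer.result
}
```

The saving is the elided allocation and bulk copy of the intermediate array, which is what the thread's benchmark discussion is about.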
[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...
Github user davies commented on the issue: https://github.com/apache/spark/pull/16193

Pushing predicates down into data sources also happens during optimization in the planner; I think this one is not the first to do optimization outside the Optimizer.
[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...
Github user davies commented on the issue: https://github.com/apache/spark/pull/16193

The reason we moved the PythonUDFEvaluator from the logical plan into the physical plan is that this one-off node broke many things; many rules needed to treat it specially.
[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16213

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69956/
[GitHub] spark pull request #16238: [SPARK-18811] StreamSource resolution should happ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16238
[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16213

Merged build finished. Test PASSed.
[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16213

**[Test build #69956 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69956/consoleFull)** for PR 16213 at commit [`adf4dd6`](https://github.com/apache/spark/commit/adf4dd6688f373bb3de26055face210f8265c0fb).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16238: [SPARK-18811] StreamSource resolution should happen in s...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/16238

Merging to master and 2.1.
[GitHub] spark pull request #16238: [SPARK-18811] StreamSource resolution should happ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/16238#discussion_r91830357

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -214,6 +228,10 @@ class StreamExecution(
     // While active, repeatedly attempt to run batches.
     SparkSession.setActiveSession(sparkSession)

+    updateStatusMessage("Initializing sources")
--- End diff --

Okay. I'm going to merge this PR since it passed tests. I will fix this in my PR.
[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16193

If we add a logical node for the Python evaluator, we could push the Filter down through it, so the optimizer rule won't combine the two Filters into one again?
[GitHub] spark pull request #16238: [SPARK-18811] StreamSource resolution should happ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/16238#discussion_r91830324

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -214,6 +228,10 @@ class StreamExecution(
     // While active, repeatedly attempt to run batches.
     SparkSession.setActiveSession(sparkSession)

+    updateStatusMessage("Initializing sources")
--- End diff --

Could you move `SparkSession.setActiveSession(sparkSession)`, `updateStatusMessage("Initializing sources")` and `logicalPlan` above `postEvent(new QueryStartedEvent(id, runId, name))`? Then we can initialize the logical plan before returning `StreamingQuery` to the user and ensure `logicalPlan` is always initialized in the stream thread. You can also add `assert(Thread.currentThread().isInstanceOf[StreamExecutionThread])` when creating `logicalPlan`, and remove the unnecessary test you added, since this assertion will be exercised by all current stream tests.
[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...
Github user davies commented on the issue: https://github.com/apache/spark/pull/16193

@cloud-fan It's not trivial to do this in the optimizer; for example, we would have to split one Filter into two, which conflicts with another optimizer rule that combines two Filters into one.
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13909

@cloud-fan Yeah, keeping a primitive array and using `UnsafeArrayWriter` can avoid the extra bulk copy. To achieve this, I think we need to create a `GenericArrayData` (after enabling https://github.com/apache/spark/pull/13758) instead of an `UnsafeArrayData`, since with https://github.com/apache/spark/pull/13758 a `GenericArrayData` can keep a primitive array inside it. Is that what you want to do?
[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16244

How do we make sure the fixed test passes Maven-based tests?
[GitHub] spark pull request #16134: [SPARK-18703] [SQL] Drop Staging Directories and ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16134#discussion_r91829975 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala --- @@ -166,6 +166,30 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef sql("DROP TABLE tmp_table") } + test("Delete the temporary staging directory and files after each insert") { +withTempDir { tmpDir => + withTable("tab") { +sql( + s""" + |CREATE TABLE tab(c1 string) + |location '${tmpDir.toURI.toString}' + """.stripMargin) + +(1 to 3).map { i => + sql(s"INSERT OVERWRITE TABLE tab SELECT '$i'") +} +def listFiles(path: File): List[String] = { + val dir = path.listFiles() + val folders = dir.filter(_.isDirectory).toList + val filePaths = dir.map(_.getName).toList + filePaths ::: folders.flatMap(listFiles) --- End diff -- what's `:::`? streaming collection?
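To answer the reviewer's question above: `:::` is not a streaming collection, it is Scala's standard `List` concatenation operator. A minimal, self-contained sketch (file names are made up for illustration):

```scala
object ListConcatDemo extends App {
  // `:::` is List's concatenation operator: xs ::: ys returns a new List
  // containing all elements of xs followed by all elements of ys.
  val filePaths   = List("a.txt", "b.txt")
  val nestedFiles = List("sub/c.txt")

  // Same shape as the PR's `filePaths ::: folders.flatMap(listFiles)`.
  val all = filePaths ::: nestedFiles
  println(all) // List(a.txt, b.txt, sub/c.txt)

  // Like `::`, it is right-associative and specific to List;
  // `++` is the general concatenation defined on any Seq.
  assert(all == (filePaths ++ nestedFiles))
}
```

In the test being reviewed it simply appends the recursively collected file names of subdirectories to the file names of the current directory.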
[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16243 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69953/ Test PASSed.
[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16243 Merged build finished. Test PASSed.
[GitHub] spark pull request #16134: [SPARK-18703] [SQL] Drop Staging Directories and ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16134#discussion_r91829943 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -85,6 +87,7 @@ case class InsertIntoHiveTable( def output: Seq[Attribute] = Seq.empty val hadoopConf = sessionState.newHadoopConf() + val createdTempDir = new scala.collection.mutable.ArrayBuffer[Path] --- End diff -- use `Option`? It looks to me that we will only create one temp dir during one insertion.
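The reviewer's `Option` suggestion can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the helper names (`getStagingDir`, `deleteTempDir`) are invented, and `String` stands in for Hadoop's `Path` so the sketch stays self-contained.

```scala
object TempDirTracking extends App {
  // Since one insertion creates at most one staging directory, an Option
  // models that invariant more precisely than an ArrayBuffer.
  var createdTempDir: Option[String] = None

  // Hypothetical: record the single staging dir when it is created.
  def getStagingDir(tablePath: String): String = {
    val dir = tablePath + "/.hive-staging"
    createdTempDir = Some(dir)
    dir
  }

  // Hypothetical cleanup hook: Option.foreach does nothing when no
  // staging dir was ever created, so no null/empty checks are needed.
  def deleteTempDir(): Unit =
    createdTempDir.foreach(dir => println(s"deleting $dir"))

  getStagingDir("/warehouse/tab")
  deleteTempDir() // prints: deleting /warehouse/tab/.hive-staging
}
```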
[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16243 **[Test build #69953 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69953/consoleFull)** for PR 16243 at commit [`674cd7c`](https://github.com/apache/spark/commit/674cd7cac3d542f1b1379fef416e3fce2bbc8cbc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16244: [SQL][minor] simplify a test to fix the maven tes...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16244#discussion_r91829910 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala --- @@ -98,20 +98,15 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { } test("SPARK-18091: split large if expressions into blocks due to JVM code size limit") { --- End diff -- yea of course, I reverted SPARK-18091 and ran this test locally, it failed.
[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16222 Merged build finished. Test PASSed.
[GitHub] spark issue #16193: [SPARK-18766] [SQL] Push Down Filter Through BatchEvalPy...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16193 It's a little hacky to me that we do optimization in the planner. How hard would it be to introduce a logical node for the Python evaluator? We could define an interface in catalyst, e.g. `ExternalUDFEvaluator`, so that R (or other languages in the future) UDFs can also benefit from it.
[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16222 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69954/ Test PASSed.
[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16222 **[Test build #69954 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69954/consoleFull)** for PR 16222 at commit [`5fe125f`](https://github.com/apache/spark/commit/5fe125f11f04d481507cae246c33bc4969c43e2e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16193: [SPARK-18766] [SQL] Push Down Filter Through Batc...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16193#discussion_r91829876 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExecSuite.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.python + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.api.python.PythonFunction +import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, In} +import org.apache.spark.sql.execution.{FilterExec, SparkPlanTest} +import org.apache.spark.sql.test.SharedSQLContext +import org.apache.spark.sql.types.BooleanType + +class BatchEvalPythonExecSuite extends SparkPlanTest with SharedSQLContext { + import testImplicits.newProductEncoder + import testImplicits.localSeqToDatasetHolder + + override def beforeAll(): Unit = { +super.beforeAll() +spark.udf.registerPython("dummyPythonUDF", new MyDummyPythonUDF) + } + + override def afterAll(): Unit = { +spark.sessionState.functionRegistry.dropFunction("dummyPythonUDF") +super.afterAll() + } + + test("Python UDF: push down deterministic FilterExec predicates") { +val df = Seq(("Hello", 4)).toDF("a", "b") + .where("dummyPythonUDF(b) and dummyPythonUDF(a) and a in (3, 4)") +val qualifiedPlanNodes = df.queryExecution.executedPlan.collect { + case f @ FilterExec(And(_: AttributeReference, _: AttributeReference), _) => f + case b: BatchEvalPythonExec => b + case f @ FilterExec(_: In, _) => f +} +assert(qualifiedPlanNodes.size == 3) + } + + test("Nested Python UDF: push down deterministic FilterExec predicates") { +val df = Seq(("Hello", 4)).toDF("a", "b") + .where("dummyPythonUDF(a, dummyPythonUDF(a, b)) and a in (3, 4)") +val qualifiedPlanNodes = df.queryExecution.executedPlan.collect { + case f @ FilterExec(_: AttributeReference, _) => f + case b: BatchEvalPythonExec => b + case f @ FilterExec(_: In, _) => f +} +assert(qualifiedPlanNodes.size == 4) + } + + test("Python UDF: no push down on non-deterministic") { +val df = Seq(("Hello", 4)).toDF("a", "b") + .where("b > 4 and dummyPythonUDF(a) and rand() > 3") +val qualifiedPlanNodes = df.queryExecution.executedPlan.collect { + case f: FilterExec => f + case 
b: BatchEvalPythonExec => b +} +assert(qualifiedPlanNodes.size == 3) --- End diff -- it's really hard to tell the correctness by checking the number of plan nodes...
[GitHub] spark issue #16175: [SPARK-17460][SQL]Make sure sizeInBytes in Statistics wi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16175 **[Test build #69957 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69957/consoleFull)** for PR 16175 at commit [`f516613`](https://github.com/apache/spark/commit/f516613f21581dd583d6deb34579fe51adf984c8).
[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/16213 Thanks for these comments! Yeah, I do not like this approach either. But since those who reshard streams always hit this issue, and resharding is important for load balancing in Kinesis streams (recently, a new API [`UpdateShardCount`](http://docs.aws.amazon.com/kinesis/latest/APIReference/API_UpdateShardCount.html) was added for resharding), I'd like to fix this as soon as possible, but I couldn't find a better approach than this one.
[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16213 **[Test build #69956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69956/consoleFull)** for PR 16213 at commit [`adf4dd6`](https://github.com/apache/spark/commit/adf4dd6688f373bb3de26055face210f8265c0fb).
[GitHub] spark pull request #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHAR...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/16213#discussion_r91829798 --- Diff: external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala --- @@ -225,6 +225,74 @@ abstract class KinesisStreamTests(aggregateTestData: Boolean) extends KinesisFun ssc.stop(stopSparkContext = false) } + testIfEnabled("split and merge shards in a stream") { +// Since this test tries to split and merge shards in a stream, we create another +// temporary stream and then remove it when finished. +val localAppName = s"KinesisStreamSuite-${math.abs(Random.nextLong())}" +val localTestUtils = new KPLBasedKinesisTestUtils(1) +localTestUtils.createStream() +try { + val awsCredentials = KinesisTestUtils.getAWSCredentials() + val stream = KinesisUtils.createStream(ssc, localAppName, localTestUtils.streamName, +localTestUtils.endpointUrl, localTestUtils.regionName, InitialPositionInStream.LATEST, +Seconds(10), StorageLevel.MEMORY_ONLY, +awsCredentials.getAWSAccessKeyId, awsCredentials.getAWSSecretKey) + + val collected = new mutable.HashSet[Int] + stream.map { bytes => new String(bytes).toInt }.foreachRDD { rdd => +collected.synchronized { + collected ++= rdd.collect() + logInfo("Collected = " + collected.mkString(", ")) +} + } + ssc.start() + + val (testData1, testData2, testData3) = (1 to 10, 11 to 20, 21 to 30) + + eventually(timeout(60 seconds), interval(10 second)) { +localTestUtils.pushData(testData1, aggregateTestData) +assert(collected.synchronized { collected === testData1.toSet }, + "\nData received does not match data sent") + } + + val shardToSplit = localTestUtils.getShards().head + localTestUtils.splitShard(shardToSplit.getShardId) + val (splitOpenShards, splitCloseShards) = localTestUtils.getShards().partition { shard => +shard.getSequenceNumberRange.getEndingSequenceNumber == null + } + + // We should have one closed shard and two open shards + assert(splitCloseShards.size == 1) + 
assert(splitOpenShards.size == 2) + + eventually(timeout(60 seconds), interval(10 second)) { +localTestUtils.pushData(testData2, aggregateTestData) +assert(collected.synchronized { collected === (testData1 ++ testData2).toSet }, + "\nData received does not match data sent after splitting a shard") + } + + val (shardToMerge, adjShared) = splitOpenShards match { case Seq(e1, e2) => (e1, e2) } --- End diff -- Yeah, I tried to use `adjacentShard`, but it returned `null`. I don't know why it behaved like this, though. Do you know why?
[GitHub] spark issue #16175: [SPARK-17460][SQL]Make sure sizeInBytes in Statistics wi...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16175 retest this please
[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13909 Creating a primitive array and setting element values is also an element-wise copy, right? And we need an extra bulk copy to write the primitive array to the unsafe array. By using the unsafe array writer, we can save that extra bulk copy.
[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/16222 @mengxr I used the following R code and glmnet to check whether `regParam = 0.5` fits a good model. > iris2 <- iris[iris$Species %in% c("versicolor", "virginica"), ] > iris.x = as.matrix(iris2[, 1:4]) > iris.y = as.factor(as.character(iris2[, 5])) > cvfit = cv.glmnet(iris.x, iris.y, family = "binomial", type.measure = "class") > cvfit$lambda.min [1] 0.000423808 > cvfit = cv.glmnet(iris.x, iris.y, family = "multinomial", type.measure = "class") > cvfit$lambda.min [1] 0.01324703 If I understand correctly, `regParam = 0.5` doesn't fit a good model in either the binomial or the multinomial case, as the minimal lambda is < 0.1.
[GitHub] spark pull request #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHAR...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/16213#discussion_r91829740 --- Diff: external/kinesis-asl/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/CheckpointerShim.java --- @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package com.amazonaws.services.kinesis.clientlibrary.lib.worker; + +import com.amazonaws.services.kinesis.clientlibrary.exceptions.InvalidStateException; +import com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException; +import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer; +import com.amazonaws.services.kinesis.clientlibrary.types.ExtendedSequenceNumber; + + +/** + * This is a class to fix an issue in SPARK-18020. When resharding Kinesis streams, + * the KCL throws an exception because Spark does not checkpoint `SHARD_END` to finish reading + * closed shards in `KinesisRecordProcessor#shutdown`. This bug finally leads to stopping + * subscribing new split (or merged) shards. However, trying checkpoints with `SHARD_END` throws + * `IllegalArgumentException` exceptions with messages like "Sequence number must be numeric, + * but was SHARD_END". 
This fix is a workaround and the class will be removed in future + * if we find other better solutions. + */ +public class CheckpointerShim { + + public static void shutdown(IRecordProcessorCheckpointer checkpointer) --- End diff -- Yea, I tried by [this version](https://github.com/maropu/spark-kinesis-sql-asl/blob/SPARK-18020-tests/spark/spark-2.0/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisCheckpointer.scala#L78), but I got the same exception; ``` 16/12/10 05:49:00 ERROR ShutdownTask: Application exception. java.lang.IllegalArgumentException: Sequence number must be numeric, but was SHARD_END at com.amazonaws.services.kinesis.clientlibrary.lib.worker.SequenceNumberValidator.validateSequenceNumber(SequenceNumberValidator.java:75) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:120) at org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:77) at org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:258) ```
[GitHub] spark pull request #16244: [SQL][minor] simplify a test to fix the maven tes...
Github user kapilsingh5050 commented on a diff in the pull request: https://github.com/apache/spark/pull/16244#discussion_r91829731 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala --- @@ -98,20 +98,15 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { } test("SPARK-18091: split large if expressions into blocks due to JVM code size limit") { --- End diff -- This test simulates the scenario where large code for the components of an If expression causes the JVM's method code size limit to be hit. So the point of having this test is that it should fail before the fix for SPARK-18091 was committed and pass afterwards.
[GitHub] spark pull request #16228: [WIP] [SPARK-17076] [SQL] Cardinality estimation ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16228#discussion_r91829681 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/JoinEstimation.scala --- @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans.logical.estimation + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys +import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Join, Statistics} +import org.apache.spark.sql.types.DataType + + +object JoinEstimation { + import EstimationUtils._ + + // scalastyle:off + /** + * Estimate output size and number of rows after a join operator, and propogate updated column + * statsitics. + * The number of rows of A inner join B on A.k1 = B.k1 is estimated by this basic formula: + * T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), where V is the number of distinct values of + * that column. The underlying assumption for this formula is: each value of the smaller domain + * is included in the larger domain. 
+ * Generally, inner join with multiple join keys can also be estimated based on the above + * formula: + * T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), V(B.k2)) * ... * max(V(A.kn), V(B.kn))) + * However, the denominator can become very large and excessively reduce the result, so we use a + * conservative strategy to take only the largest max(V(A.ki), V(B.ki)) as the denominator. + * + * @return Return the updated statistics after join. Return `None` if the join type is not + * supported, or we don't have enough statistics for estimation. + */ + // scalastyle:on + def estimate(join: Join): Option[Statistics] = join match { +case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right) +if supportsJoinType(joinType) && hasRowCountStat(left, right) => + + // 1. Compute the denominator + var ndvDenom: BigInt = -1 + val keyPairs = extractJoinKeys(leftKeys, rightKeys) + val leftStats = left.statistics + val rightStats = right.statistics + val intersectedStats = new mutable.HashMap[String, ColumnStat]() + var i = 0 + while(i < keyPairs.length && ndvDenom != 0) { +val (leftKey, rightKey) = keyPairs(i) +// Do estimation if we have enough statistics +if (hasColumnStat((left, leftKey), (right, rightKey))) { + val leftKeyStats = leftStats.colStats(leftKey.name) + val rightKeyStats = rightStats.colStats(rightKey.name) + + // Check if the two sides are disjoint + val lRange = Range(leftKeyStats.min, leftKeyStats.max, leftKey.dataType) + val rRange = Range(rightKeyStats.min, rightKeyStats.max, rightKey.dataType) + if (Range.isIntersected(lRange, rRange)) { +// Get the largest ndv among pairs of join keys +val maxNdv = leftKeyStats.distinctCount.max(rightKeyStats.distinctCount) +if (maxNdv > ndvDenom) ndvDenom = maxNdv + +// Update intersected column stats +val minNdv = leftKeyStats.distinctCount.min(rightKeyStats.distinctCount) +val (newMin1, newMax1, newMin2, newMax2) = + Range.intersect(lRange, rRange, leftKey.dataType, rightKey.dataType) 
+intersectedStats.put(leftKey.name, intersectedColumnStat(leftKeyStats, minNdv, + newMin1, newMax1)) +intersectedStats.put(rightKey.name, intersectedColumnStat(rightKeyStats, minNdv, + newMin2, newMax2)) + } else { +// Set ndvDenom to zero to indicate that this join should have no output +ndvDenom =
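The basic inner-join cardinality formula quoted in the doc comment above, T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), can be exercised with plain numbers. A hedged arithmetic sketch, not the PR's implementation (the function name is invented):

```scala
object JoinCardinalitySketch extends App {
  // T(A IJ B) = T(A) * T(B) / max(V(A.k), V(B.k)), under the assumption
  // that every value of the smaller key domain appears in the larger one.
  // With multiple join keys, the PR's conservative strategy keeps only the
  // single largest max(V(A.ki), V(B.ki)) as the denominator.
  def estimateInnerJoinRows(rowsA: BigInt, rowsB: BigInt,
                            ndvA: BigInt, ndvB: BigInt): BigInt =
    rowsA * rowsB / ndvA.max(ndvB)

  // 1,000 x 10,000 rows joined on a key with 100 vs 500 distinct values:
  // denominator is max(100, 500) = 500, so 10,000,000 / 500 = 20,000.
  val est = estimateInnerJoinRows(1000, 10000, 100, 500)
  println(est) // 20000
}
```

If the two key ranges are disjoint, the estimation code above sets the denominator signal to zero and the join is estimated to produce no output at all.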
[GitHub] spark pull request #16244: [SQL][minor] simplify a test to fix the maven tes...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16244#discussion_r91829627 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala --- @@ -98,20 +98,15 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { } test("SPARK-18091: split large if expressions into blocks due to JVM code size limit") { --- End diff -- what do you mean? this test was added by SPARK-18091
[GitHub] spark pull request #16244: [SQL][minor] simplify a test to fix the maven tes...
Github user kapilsingh5050 commented on a diff in the pull request: https://github.com/apache/spark/pull/16244#discussion_r91829600 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala --- @@ -98,20 +98,15 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { } test("SPARK-18091: split large if expressions into blocks due to JVM code size limit") { --- End diff -- @cloud-fan Just confirming, have you made sure that this test fails without the fix for SPARK-18091?
[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/16243#discussion_r91829574 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala --- @@ -213,7 +214,8 @@ object ColumnStat extends Logging { val nullLit = Literal(null, col.dataType) struct( ndv, nullLit, nullLit, numNulls, - Ceil(Average(Length(col))), Cast(Max(Length(col)), LongType)) + Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)), --- End diff -- ok
[GitHub] spark pull request #16179: [SPARK-18752][hive] "isSrcLocal" value should be ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16179#discussion_r91829517 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala --- @@ -418,4 +431,19 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto assert(sql("SHOW PARTITIONS part_datasrc").count() == 3) } } + + /** + * Run a function with a copy of the input file. Use this for tests that use "LOAD DATA" + * (instead of "LOAD DATA LOCAL") since, according to Hive's semantics, files are moved --- End diff -- then can we test `LOAD DATA` and `LOAD DATA LOCAL` separately? We can add comments to explain the semantic difference between them and why we need to copy the file
[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16243#discussion_r91829511 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala --- @@ -133,6 +133,79 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared } } + test("column stats round trip serialization") { +// Make sure we serialize and then deserialize and we will get the result data +val df = data.toDF(stats.keys.toSeq :+ "carray" : _*) +stats.zip(df.schema).foreach { case ((k, v), field) => + withClue(s"column $k with type ${field.dataType}") { +val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap) +assert(roundtrip == Some(v)) + } +} + } + + test("analyze column command - result verification") { +val tableName = "column_stats_test2" +// (data.head.productArity - 1) because the last column does not support stats collection. +assert(stats.size == data.head.productArity - 1) +val df = data.toDF(stats.keys.toSeq :+ "carray" : _*) +checkColStats(df, tableName, stats) + } + + private def checkColStats( + df: DataFrame, + tableName: String, + colStats: mutable.LinkedHashMap[String, ColumnStat]): Unit = { +withTable(tableName) { + df.write.saveAsTable(tableName) + + // Collect statistics + sql(s"analyze table $tableName compute STATISTICS FOR COLUMNS " + +colStats.keys.mkString(", ")) + + // Validate statistics + val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName)) + assert(table.stats.isDefined) + assert(table.stats.get.colStats.size == colStats.size) + + colStats.foreach { case (k, v) => +withClue(s"column $k") { + assert(table.stats.get.colStats(k) == v) +} + } +} + } + + test("column stats collection for null columns") { +def nullColumnStat(dataType: DataType): ColumnStat = { + ColumnStat(0, None, None, 1, dataType.defaultSize.toLong, dataType.defaultSize.toLong) +} + +val tableName = "column_stats_test3" --- End diff -- You can simplify this a lot by doing it 
programmatically, e.g. ``` val dataTypes = Seq(BooleanType, ByteType, ...) val df = sql("select " + dataTypes.map(tpe => s"cast(null as ${tpe.sql})").mkString(", ")) val expectedStats = dataTypes.map { tpe => ColumnStat(0, None, None, 1, tpe.defaultSize.toLong, tpe.defaultSize.toLong) } ... ```
[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16244 **[Test build #69955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69955/consoleFull)** for PR 16244 at commit [`2983832`](https://github.com/apache/spark/commit/2983832c39d1cd0fcb71ddd76d3feaa7a810cd11).
[GitHub] spark issue #16244: [SQL][minor] simplify a test to fix the maven tests
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16244 cc @srowen @kapilsingh5050 @ueshin @viirya
[GitHub] spark pull request #16244: [SQL][minor] simplify a test to fix the maven tes...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/16244 [SQL][minor] simplify a test to fix the maven tests ## What changes were proposed in this pull request? After https://github.com/apache/spark/pull/15620 , all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in https://github.com/apache/spark/pull/15620#discussion_r91829129 , it seems that the regression test is an overkill and may hit the constants pool size limitation, which is a known issue and hasn't been fixed yet. Since #15620 only fixes the code size limitation problem, we can simplify the test to avoid hitting the constants pool size limitation. ## How was this patch tested? test only change You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark minor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16244.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16244 commit 2983832c39d1cd0fcb71ddd76d3feaa7a810cd11 Author: Wenchen Fan Date: 2016-12-10T05:14:00Z simplify a test
[GitHub] spark issue #16222: [SPARK-18797][SparkR]:Update spark.logit in sparkr-vigne...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16222 **[Test build #69954 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69954/consoleFull)** for PR 16222 at commit [`5fe125f`](https://github.com/apache/spark/commit/5fe125f11f04d481507cae246c33bc4969c43e2e).
[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16243#discussion_r91829428 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala --- @@ -133,6 +133,79 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared } } + test("column stats round trip serialization") { +// Make sure we serialize and then deserialize and we will get the result data +val df = data.toDF(stats.keys.toSeq :+ "carray" : _*) +stats.zip(df.schema).foreach { case ((k, v), field) => + withClue(s"column $k with type ${field.dataType}") { +val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap) +assert(roundtrip == Some(v)) + } +} + } + + test("analyze column command - result verification") { +val tableName = "column_stats_test2" +// (data.head.productArity - 1) because the last column does not support stats collection. +assert(stats.size == data.head.productArity - 1) +val df = data.toDF(stats.keys.toSeq :+ "carray" : _*) +checkColStats(df, tableName, stats) + } + + private def checkColStats( --- End diff -- Need to add a line of comment documenting this function and change the signature to ``` /** * Compute stats for the given DataFrame and compare it with colStats. */ private def checkStats(df: DataFrame, colStats: LinkedHashMap[String, ColumnStat]): Unit = { } ``` Just generate a random tableName.
[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16243#discussion_r91829432 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala --- @@ -133,6 +133,79 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared } } + test("column stats round trip serialization") { +// Make sure we serialize and then deserialize and we will get the result data +val df = data.toDF(stats.keys.toSeq :+ "carray" : _*) +stats.zip(df.schema).foreach { case ((k, v), field) => + withClue(s"column $k with type ${field.dataType}") { +val roundtrip = ColumnStat.fromMap("table_is_foo", field, v.toMap) +assert(roundtrip == Some(v)) + } +} + } + + test("analyze column command - result verification") { +val tableName = "column_stats_test2" +// (data.head.productArity - 1) because the last column does not support stats collection. +assert(stats.size == data.head.productArity - 1) +val df = data.toDF(stats.keys.toSeq :+ "carray" : _*) +checkColStats(df, tableName, stats) + } + + private def checkColStats( --- End diff -- Also don't put this in the middle of the function. Move it into the abstract class StatisticsCollectionTestBase
[GitHub] spark pull request #16179: [SPARK-18752][hive] "isSrcLocal" value should be ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/16179#discussion_r91829413 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala --- @@ -418,4 +431,19 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto assert(sql("SHOW PARTITIONS part_datasrc").count() == 3) } } + + /** + * Run a function with a copy of the input file. Use this for tests that use "LOAD DATA" + * (instead of "LOAD DATA LOCAL") since, according to Hive's semantics, files are moved --- End diff -- Ah, the tests need to be updated because now `loadTable` is being called with "isSrcLocal = false". That makes the source file be moved instead of copied, and that makes subsequent unit tests fail. (That's the cause of the initial test failures in this PR.)
[GitHub] spark issue #16179: [SPARK-18752][hive] "isSrcLocal" value should be set fro...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16179 The changes LGTM, as we do propagate the `isSrcLocal` incorrectly. It would be better if we can also fix the inconsistent behavior of `LOAD DATA` between spark and hive, and improve the test coverage, in a follow-up
[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16243#discussion_r91829403 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala --- @@ -213,7 +214,8 @@ object ColumnStat extends Logging { val nullLit = Literal(null, col.dataType) struct( ndv, nullLit, nullLit, numNulls, - Ceil(Average(Length(col))), Cast(Max(Length(col)), LongType)) + Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)), --- End diff -- add a line of comment explaining we set the value to defaultSize if all the values are null (or if there is no value).
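The `Coalesce` fix under review exists because `Average(Length(col))` evaluates to null when the column has no non-null values, which was the source of the NPE; the stats collector then falls back to the type's default size. A standalone sketch of that fallback logic (the helper name and `Option`-based encoding are made up for illustration, no Spark dependency):

```scala
// Null-safe average length: if every value is null, the average is undefined,
// so fall back to the data type's default size -- mirroring
// Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)).
def avgLenOrDefault(lengths: Seq[Option[Int]], defaultSize: Long): Long = {
  val present = lengths.flatten          // drop the nulls
  if (present.isEmpty) defaultSize       // all-null column: use default size
  else math.ceil(present.sum.toDouble / present.size).toLong
}
```

So an all-null string column with `defaultSize = 8` reports an average length of 8 rather than failing.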
[GitHub] spark pull request #16179: [SPARK-18752][hive] "isSrcLocal" value should be ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16179#discussion_r91829355 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala --- @@ -418,4 +431,19 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto assert(sql("SHOW PARTITIONS part_datasrc").count() == 3) } } + + /** + * Run a function with a copy of the input file. Use this for tests that use "LOAD DATA" + * (instead of "LOAD DATA LOCAL") since, according to Hive's semantics, files are moved --- End diff -- The semantic change happened in Hive 2.1, so it looks like we don't need to update the tests for now?
[GitHub] spark pull request #16222: [SPARK-18797][SparkR]:Update spark.logit in spark...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16222#discussion_r91829349 --- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd --- @@ -768,8 +768,46 @@ newDF <- createDataFrame(data.frame(x = c(1.5, 3.2))) head(predict(isoregModel, newDF)) ``` - What's More? -We also expect Decision Tree, Random Forest, Kolmogorov-Smirnov Test coming in the next version 2.1.0. +### Logistic Regression Model + +(Added in 2.1.0) + +[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a widely-used model when the response is categorical. It can be seen as a special case of the [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model). +There are two types of logistic regression models, namely binomial logistic regression (i.e., response is binary) and multinomial +logistic regression (i.e., response falls into multiple classes). We provide `spark.logit` on top of `spark.glm` to support logistic regression with advanced hyper-parameters. +It supports both binary and multiclass classification, elastic-net regularization, and feature standardization, similar to `glmnet`. + + +`spark.logit` fits an Logistic Regression Model against a Spark DataFrame. The `family` parameter can be used to select between the +binomial and multinomial algorithms, or leave it unset and Spark will infer the correct variant. + +We use a simple example to demonstrate `spark.logit` usage. In general, there are three steps of using `spark.logit`: +1). Create a dataframe from a proper data source; 2). Fit a logistic regression model using `spark.logit` with a proper parameter setting; +and 3). Obtain the coefficient matrix of the fitted model using `summary` and use the model for prediction with `predict`. 
+ +Binomial logistic regression +```{r, warning=FALSE} +df <- createDataFrame(iris) +# Create a dataframe containing two classes +training <- df[df$Species %in% c("versicolor", "virginica"), ] +model <- spark.logit(training, Species ~ ., regParam = 0.5) --- End diff -- I changed the test as an example. I didn't check whether `regParam = 0.5` returns a good model or not. I can do some experiments to check it out.
[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/16243#discussion_r91829332 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala --- @@ -133,6 +133,79 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared } } + test("column stats round trip serialization") { --- End diff -- why did you move the test cases?
[GitHub] spark pull request #15620: [SPARK-18091] [SQL] Deep if expressions cause Gen...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15620#discussion_r91829129 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala --- @@ -97,6 +97,27 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { assert(actual(0) == cases) } + test("SPARK-18091: split large if expressions into blocks due to JVM code size limit") { +val inStr = "StringForTesting" +val row = create_row(inStr) +val inputStrAttr = 'a.string.at(0) + +var strExpr: Expression = inputStrAttr +for (_ <- 1 to 13) { + strExpr = If(EqualTo(Decode(Encode(strExpr, "utf-8"), "utf-8"), inputStrAttr), --- End diff -- cc @srowen I think this is the root cause. This test is an overkill: although this PR fixed the code size limitation problem, this test may still hit the constants pool size limitation, which is a known limitation and has not been fixed yet. It seems that maven and sbt have different JVM settings when running tests, so the problem only exists on the maven side. I'm going to submit a PR to simplify this test a bit.
[GitHub] spark issue #16230: [SPARK-13747][Core]Fix potential ThreadLocal leaks in RP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16230 **[Test build #3486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3486/consoleFull)** for PR 16230 at commit [`3fade95`](https://github.com/apache/spark/commit/3fade9526fda131535a5e37894474dc8e6ac09af). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15717: [SPARK-17910][SQL] Allow users to update the comment of ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15717 ok then the parser rule looks good to me, my only concern is the new APIs in `ExternalCatalog`, I don't think they are necessary, @jiangxb1987 what's the motivation you added them?
[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16135#discussion_r91828427 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log tableIdent.table.toLowerCase) } + /** ReadWriteLock for each tables, protect the read and write cached */ + private val tableLockStripes = Striped.lazyWeakLock(10) --- End diff -- nit: `tableCreationLocks`.
[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16135#discussion_r91828417 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log tableIdent.table.toLowerCase) } + /** ReadWriteLock for each tables, protect the read and write cached */ + private val tableLockStripes = Striped.lazyWeakLock(10) --- End diff -- Hm, may as well make it 100 if it's lazy weak locks.
[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16135#discussion_r91828385 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log tableIdent.table.toLowerCase) } + /** ReadWriteLock for each tables, protect the read and write cached */ --- End diff -- Could you update this comment to say that the reason we lock is to prevent concurrent table instantiation?
[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16135#discussion_r91828390 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log tableIdent.table.toLowerCase) } + /** ReadWriteLock for each tables, protect the read and write cached */ + private val tableLockStripes = Striped.lazyWeakLock(10) + + /** Acquires a lock on the table cache for the duration of `f`. */ + private def cacheLock[A](tableName: QualifiedTableName, f: => A): A = { --- End diff -- withTableCreationLock
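A per-table creation lock of the kind being reviewed here can be sketched with only the JDK, swapping Guava's `Striped.lazyWeakLock` for a plain lock-per-key map (note this does not bound the number of locks the way striping does; the object and method names are hypothetical):

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantLock

object TableCreationLocks {
  // One lock per (lowercased) table name, so that concurrent resolution of
  // the same table instantiates its cached relation only once.
  private val locks = new ConcurrentHashMap[String, ReentrantLock]()

  def withTableCreationLock[A](tableName: String)(f: => A): A = {
    val lock = locks.computeIfAbsent(tableName.toLowerCase, _ => new ReentrantLock())
    lock.lock()
    try f finally lock.unlock()
  }
}
```

Guava's striped variant trades exactness for a fixed memory ceiling: a bounded pool of weakly-referenced locks is shared across keys, so unrelated tables may occasionally contend on the same stripe, which is harmless here.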
[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16135#discussion_r91828378 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -33,6 +35,7 @@ import org.apache.spark.sql.hive.orc.OrcFileFormat import org.apache.spark.sql.types._ + --- End diff -- nit: extra newline
[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16135#discussion_r91828410 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala --- @@ -352,4 +353,28 @@ class PartitionedTablePerfStatsSuite } } } + + test("SPARK-18700: add lock for each table's realation in cache") { --- End diff -- "table loaded only once even when resolved concurrently"
[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16135#discussion_r91828396 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala --- @@ -352,4 +353,28 @@ class PartitionedTablePerfStatsSuite } } } + + test("SPARK-18700: add lock for each table's realation in cache") { +withTable("test") { + withTempDir { dir => +HiveCatalogMetrics.reset() +setupPartitionedHiveTable("test", dir) +// select the table in multi-threads +val executorPool = Executors.newFixedThreadPool(10) +(1 to 10).map(threadId => { + val runnable = new Runnable { +override def run(): Unit = { + spark.sql("select * from test where partCol1 = 999").count() +} + } + executorPool.execute(runnable) + None +}) +executorPool.shutdown() +executorPool.awaitTermination(30, TimeUnit.SECONDS) +// check the cache hit, the cache only load once + assert(HiveCatalogMetrics.METRIC_DATASOUCE_TABLE_CACHE_HITS.getCount() == 9) --- End diff -- Does this test fail without the lock?
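The reviewer's question hinges on the invariant the test is meant to pin down: with the lock, ten concurrent resolutions of the same table should trigger exactly one load (and nine cache hits); without it, the loader can run more than once. The invariant can be sketched outside of Spark with `ConcurrentHashMap.computeIfAbsent` standing in for the guarded relation cache (all names here are hypothetical):

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

val loadCount = new AtomicInteger(0)
val cache = new ConcurrentHashMap[String, String]()

// computeIfAbsent runs the loader atomically per key, so even under
// concurrent calls the expensive load happens at most once per table.
def resolve(table: String): String =
  cache.computeIfAbsent(table, _ => {
    loadCount.incrementAndGet()
    s"relation($table)"
  })

val pool = Executors.newFixedThreadPool(10)
val start = new CountDownLatch(1)
(1 to 10).foreach { _ =>
  pool.execute(() => { start.await(); resolve("test") })  // all threads race at once
}
start.countDown()
pool.shutdown()
pool.awaitTermination(30, TimeUnit.SECONDS)
```

After the pool drains, `loadCount` should be 1 (one load, nine hits), which is the 9-hit assertion in the quoted test restated from the loader's side.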
[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16135#discussion_r91828375 --- Diff: core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala --- @@ -97,6 +97,12 @@ object HiveCatalogMetrics extends Source { MetricRegistry.name("parallelListingJobCount")) /** + * Tracks the total number of cachedDataSourceTables hits. + */ + val METRIC_DATASOUCE_TABLE_CACHE_HITS = metricRegistry.counter( +MetricRegistry.name("dataSourceTableCacheHits")) --- End diff -- Could we use one of the other metrics, rather than add a new one?
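The counter pattern in the diff above (a registry-backed counter that tests reset, increment, and read) can be sketched with a plain `AtomicLong`. This is an illustrative stand-in for the Dropwizard counter, assuming only the reset/increment/read surface that the quoted test exercises; the object and method names are hypothetical:

```scala
import java.util.concurrent.atomic.AtomicLong

// Stand-in for a metrics-registry counter: thread-safe increment and read,
// plus a reset hook so tests can start from a known state.
object CacheMetrics {
  val dataSourceTableCacheHits = new AtomicLong(0)

  def incrementCacheHits(n: Long = 1): Unit =
    dataSourceTableCacheHits.addAndGet(n)

  def reset(): Unit = dataSourceTableCacheHits.set(0)
}
```

The cache-lookup path would call `CacheMetrics.incrementCacheHits()` on every hit, mirroring how the PR's `METRIC_DATASOUCE_TABLE_CACHE_HITS` is bumped and then read back via `getCount()` in the test.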
[GitHub] spark issue #16204: [SPARK-18775][SQL] Limit the max number of records writt...
Github user ericl commented on the issue: https://github.com/apache/spark/pull/16204 LGTM
[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16243 **[Test build #69953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69953/consoleFull)** for PR 16243 at commit [`674cd7c`](https://github.com/apache/spark/commit/674cd7cac3d542f1b1379fef416e3fce2bbc8cbc).
[GitHub] spark issue #16238: [SPARK-18811] StreamSource resolution should happen in s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16238 Merged build finished. Test PASSed.
[GitHub] spark issue #16238: [SPARK-18811] StreamSource resolution should happen in s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16238 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69952/ Test PASSed.
[GitHub] spark issue #16238: [SPARK-18811] StreamSource resolution should happen in s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16238 **[Test build #69952 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69952/consoleFull)** for PR 16238 at commit [`e6d1e25`](https://github.com/apache/spark/commit/e6d1e251097cc5a71ef67dec995f5cf86f977891). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class BlockingSource extends StreamSourceProvider with StreamSinkProvider `
[GitHub] spark issue #16243: [SPARK-18815] [SQL] Fix NPE when collecting column stats...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/16243 cc @cloud-fan @rxin
[GitHub] spark pull request #16243: [SPARK-18815] [SQL] Fix NPE when collecting colum...
GitHub user wzhfy opened a pull request: https://github.com/apache/spark/pull/16243 [SPARK-18815] [SQL] Fix NPE when collecting column stats for string/binary column having only null values ## What changes were proposed in this pull request? During column stats collection, average and max length will be null if a column of string/binary type has only null values. To fix this, I use default size when avg/max length is null. ## How was this patch tested? Add a test for handling null columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/wzhfy/spark nullStats Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16243.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16243 commit 674cd7cac3d542f1b1379fef416e3fce2bbc8cbc Author: wangzhenhua Date: 2016-12-10T03:19:19Z use default size for null columns
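The fix the PR describes — fall back to the type's default size when the aggregated avg/max length comes back null for an all-null string/binary column — can be sketched as a tiny helper. This is an illustrative sketch, not the PR's actual code; `lengthOrDefault` is a hypothetical name:

```scala
// When every value in a string/binary column is null, length aggregates
// (avg/max) return null. Instead of dereferencing that null (the NPE the
// PR fixes), fall back to the data type's default size.
def lengthOrDefault(aggregatedLength: Option[Long], defaultSize: Long): Long =
  aggregatedLength.getOrElse(defaultSize)
```

For example, an all-null `StringType` column whose default size is 20 would get avg/max length 20, while a column with real values keeps its computed length.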
[GitHub] spark issue #16220: [SPARK-18796][SS]StreamingQueryManager should not block ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16220 Merged build finished. Test PASSed.
[GitHub] spark issue #16220: [SPARK-18796][SS]StreamingQueryManager should not block ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16220 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69951/ Test PASSed.
[GitHub] spark issue #16220: [SPARK-18796][SS]StreamingQueryManager should not block ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16220 **[Test build #69951 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69951/consoleFull)** for PR 16220 at commit [`2db339e`](https://github.com/apache/spark/commit/2db339ed0e14cb3732a8359a9e88fcf379965a1a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16237: [SPARK-18807][SPARKR] Should suppress output prin...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16237
[GitHub] spark issue #16237: [SPARK-18807][SPARKR] Should suppress output print for c...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16237 Merging into master, branch-2.1
[GitHub] spark issue #15936: [SPARK-18504][SQL] Scalar subquery with extra group by c...
Github user ericl commented on the issue: https://github.com/apache/spark/pull/15936 This seems to be causing https://issues.apache.org/jira/browse/SPARK-18814
[GitHub] spark pull request #16228: [WIP] [SPARK-17076] [SQL] Cardinality estimation ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/16228#discussion_r91826623 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/JoinEstimation.scala --- @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans.logical.estimation + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys +import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Join, Statistics} +import org.apache.spark.sql.types.DataType + + +object JoinEstimation { + import EstimationUtils._ + + // scalastyle:off + /** + * Estimate output size and number of rows after a join operator, and propogate updated column + * statsitics. + * The number of rows of A inner join B on A.k1 = B.k1 is estimated by this basic formula: + * T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), where V is the number of distinct values of + * that column. The underlying assumption for this formula is: each value of the smaller domain + * is included in the larger domain. 
+ * Generally, inner join with multiple join keys can also be estimated based on the above + * formula: + * T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), V(B.k2)) * ... * max(V(A.kn), V(B.kn))) + * However, the denominator can become very large and excessively reduce the result, so we use a + * conservative strategy to take only the largest max(V(A.ki), V(B.ki)) as the denominator. + * + * @return Return the updated statistics after join. Return `None` if the join type is not + * supported, or we don't have enough statistics for estimation. + */ + // scalastyle:on + def estimate(join: Join): Option[Statistics] = join match { +case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right) +if supportsJoinType(joinType) && hasRowCountStat(left, right) => + + // 1. Compute the denominator + var ndvDenom: BigInt = -1 + val keyPairs = extractJoinKeys(leftKeys, rightKeys) + val leftStats = left.statistics + val rightStats = right.statistics + val intersectedStats = new mutable.HashMap[String, ColumnStat]() + var i = 0 + while(i < keyPairs.length && ndvDenom != 0) { +val (leftKey, rightKey) = keyPairs(i) +// Do estimation if we have enough statistics +if (hasColumnStat((left, leftKey), (right, rightKey))) { + val leftKeyStats = leftStats.colStats(leftKey.name) + val rightKeyStats = rightStats.colStats(rightKey.name) + + // Check if the two sides are disjoint + val lRange = Range(leftKeyStats.min, leftKeyStats.max, leftKey.dataType) + val rRange = Range(rightKeyStats.min, rightKeyStats.max, rightKey.dataType) + if (Range.isIntersected(lRange, rRange)) { +// Get the largest ndv among pairs of join keys +val maxNdv = leftKeyStats.distinctCount.max(rightKeyStats.distinctCount) +if (maxNdv > ndvDenom) ndvDenom = maxNdv + +// Update intersected column stats +val minNdv = leftKeyStats.distinctCount.min(rightKeyStats.distinctCount) +val (newMin1, newMax1, newMin2, newMax2) = + Range.intersect(lRange, rRange, leftKey.dataType, rightKey.dataType) 
+intersectedStats.put(leftKey.name, intersectedColumnStat(leftKeyStats, minNdv, + newMin1, newMax1)) +intersectedStats.put(rightKey.name, intersectedColumnStat(rightKeyStats, minNdv, + newMin2, newMax2)) + } else { +// Set ndvDenom to zero to indicate that this join should have no output +ndvDenom = 0
[GitHub] spark pull request #16228: [WIP] [SPARK-17076] [SQL] Cardinality estimation ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/16228#discussion_r91826564 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/JoinEstimation.scala --- @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans.logical.estimation + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys +import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Join, Statistics} +import org.apache.spark.sql.types.DataType + + +object JoinEstimation { + import EstimationUtils._ + + // scalastyle:off + /** + * Estimate output size and number of rows after a join operator, and propogate updated column + * statsitics. + * The number of rows of A inner join B on A.k1 = B.k1 is estimated by this basic formula: + * T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), where V is the number of distinct values of + * that column. The underlying assumption for this formula is: each value of the smaller domain + * is included in the larger domain. 
+ * Generally, inner join with multiple join keys can also be estimated based on the above + * formula: + * T(A IJ B) = T(A) * T(B) / (max(V(A.k1), V(B.k1)) * max(V(A.k2), V(B.k2)) * ... * max(V(A.kn), V(B.kn))) + * However, the denominator can become very large and excessively reduce the result, so we use a + * conservative strategy to take only the largest max(V(A.ki), V(B.ki)) as the denominator. + * + * @return Return the updated statistics after join. Return `None` if the join type is not + * supported, or we don't have enough statistics for estimation. + */ + // scalastyle:on + def estimate(join: Join): Option[Statistics] = join match { +case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right) +if supportsJoinType(joinType) && hasRowCountStat(left, right) => + + // 1. Compute the denominator + var ndvDenom: BigInt = -1 + val keyPairs = extractJoinKeys(leftKeys, rightKeys) + val leftStats = left.statistics + val rightStats = right.statistics + val intersectedStats = new mutable.HashMap[String, ColumnStat]() + var i = 0 + while(i < keyPairs.length && ndvDenom != 0) { +val (leftKey, rightKey) = keyPairs(i) +// Do estimation if we have enough statistics +if (hasColumnStat((left, leftKey), (right, rightKey))) { + val leftKeyStats = leftStats.colStats(leftKey.name) + val rightKeyStats = rightStats.colStats(rightKey.name) + + // Check if the two sides are disjoint + val lRange = Range(leftKeyStats.min, leftKeyStats.max, leftKey.dataType) + val rRange = Range(rightKeyStats.min, rightKeyStats.max, rightKey.dataType) + if (Range.isIntersected(lRange, rRange)) { +// Get the largest ndv among pairs of join keys +val maxNdv = leftKeyStats.distinctCount.max(rightKeyStats.distinctCount) +if (maxNdv > ndvDenom) ndvDenom = maxNdv + +// Update intersected column stats +val minNdv = leftKeyStats.distinctCount.min(rightKeyStats.distinctCount) +val (newMin1, newMax1, newMin2, newMax2) = + Range.intersect(lRange, rRange, leftKey.dataType, rightKey.dataType) 
+intersectedStats.put(leftKey.name, intersectedColumnStat(leftKeyStats, minNdv, + newMin1, newMax1)) +intersectedStats.put(rightKey.name, intersectedColumnStat(rightKeyStats, minNdv, + newMin2, newMax2)) + } else { +// Set ndvDenom to zero to indicate that this join should have no output +ndvDenom = 0
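The inner-join cardinality formula quoted in the diff above — T(A IJ B) = T(A) * T(B) / max(V(A.k), V(B.k)), with only the largest max(V(A.ki), V(B.ki)) across key pairs kept as the conservative denominator — can be checked with a small worked sketch (hypothetical helper, not the PR's `JoinEstimation` code):

```scala
// Estimate inner-join output rows from table row counts and the distinct
// value counts (NDV) of each pair of join keys. Per the conservative
// strategy above, only the single largest max(ndvA, ndvB) divides the product.
def estimateInnerJoinRows(
    rowsA: BigInt,
    rowsB: BigInt,
    ndvPairs: Seq[(BigInt, BigInt)]): BigInt = {
  val denom = ndvPairs.map { case (ndvA, ndvB) => ndvA.max(ndvB) }.max
  rowsA * rowsB / denom
}
```

For example, with T(A) = 1000, T(B) = 500, and one key pair with V(A.k1) = 100 and V(B.k1) = 50, the estimate is 1000 * 500 / 100 = 5000; adding a second key pair with NDVs (20, 400) replaces the denominator with 400 rather than multiplying 100 by 400, which is the over-reduction the conservative strategy avoids.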
[GitHub] spark issue #16228: [WIP] [SPARK-17076] [SQL] Cardinality estimation for joi...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/16228 @Tagar Thanks for sharing this information. Yes, it would be better to use PK/FK, but it won't be done in this pr, and we need to implement PK/FK constraints in Spark first. > the assumption is if two tables being joined by columns with the same name, join columns have the same stats / set of values? It is true for inner join, but not true for outer joins, right?
[GitHub] spark issue #16230: [SPARK-13747][Core]Fix potential ThreadLocal leaks in RP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16230 **[Test build #3486 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3486/consoleFull)** for PR 16230 at commit [`3fade95`](https://github.com/apache/spark/commit/3fade9526fda131535a5e37894474dc8e6ac09af).
[GitHub] spark issue #16230: [SPARK-13747][Core]Fix potential ThreadLocal leaks in RP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16230 Merged build finished. Test FAILed.
[GitHub] spark issue #16230: [SPARK-13747][Core]Fix potential ThreadLocal leaks in RP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16230 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69950/ Test FAILed.