spark git commit: [SPARK-9104][CORE] Expose Netty memory metrics in Spark
Repository: spark Updated Branches: refs/heads/master 6a2325448 -> 445f1790a [SPARK-9104][CORE] Expose Netty memory metrics in Spark ## What changes were proposed in this pull request? This PR exposes Netty memory usage for Spark's `TransportClientFactory` and `TransportServer`, including the details of each direct arena and heap arena metrics, as well as aggregated metrics. The purpose of adding the Netty metrics is to better understand the memory usage of Netty in Spark shuffle, RPC, and other network communication, and to guide better configuration of executor memory sizes. This PR doesn't expose these metrics to any sink; to leverage this feature, they still need to be connected to a MetricsSystem or collected back to the driver for display. ## How was this patch tested? Added a unit test to verify it, and also manually verified it in a real cluster. Author: jerryshao Closes #18935 from jerryshao/SPARK-9104. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/445f1790 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/445f1790 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/445f1790 Branch: refs/heads/master Commit: 445f1790ade1c53cf7eee1f282395648e4d0992c Parents: 6a23254 Author: jerryshao Authored: Tue Sep 5 21:28:54 2017 -0700 Committer: Shixiong Zhu Committed: Tue Sep 5 21:28:54 2017 -0700 -- common/network-common/pom.xml | 5 + .../network/client/TransportClientFactory.java | 13 +- .../spark/network/server/TransportServer.java | 14 +- .../spark/network/util/NettyMemoryMetrics.java | 145 .../spark/network/util/TransportConf.java | 10 ++ .../network/util/NettyMemoryMetricsSuite.java | 171 +++ dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- pom.xml | 2 +- 9 files changed, 353 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/445f1790/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index ccd8504..18cbdad 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -61,6 +61,11 @@ jackson-annotations + + io.dropwizard.metrics + metrics-core + + org.slf4j http://git-wip-us.apache.org/repos/asf/spark/blob/445f1790/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java -- diff --git a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java index 8add4e1..16d242d 100644 --- a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java +++ b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java @@ -26,6 +26,7 @@ import java.util.Random; import java.util.concurrent.ConcurrentHashMap; import java.util.concurrent.atomic.AtomicReference; +import com.codahale.metrics.MetricSet; import com.google.common.base.Preconditions; import com.google.common.base.Throwables; import com.google.common.collect.Lists; @@ -42,10 +43,7 @@ import org.slf4j.LoggerFactory; import org.apache.spark.network.TransportContext; import org.apache.spark.network.server.TransportChannelHandler; -import org.apache.spark.network.util.IOMode; -import org.apache.spark.network.util.JavaUtils; -import org.apache.spark.network.util.NettyUtils; -import org.apache.spark.network.util.TransportConf; +import org.apache.spark.network.util.*; /** * Factory for creating {@link
TransportClient}s by using createClient. @@ -87,6 +85,7 @@ public class TransportClientFactory implements Closeable { private final Class socketChannelClass; private EventLoopGroup workerGroup; private PooledByteBufAllocator pooledAllocator; + private final NettyMemoryMetrics metrics; public TransportClientFactory( TransportContext context, @@ -106,6 +105,12 @@ public class TransportClientFactory implements Closeable { conf.getModuleName() + "-client"); this.pooledAllocator = NettyUtils.createPooledByteBufAllocator( conf.preferDirectBufs(), false /* allowCache */, conf.clientThreads()); +this.metrics = new NettyMemoryMetrics( + this.pooledAllocator, conf.getModuleName() + "-client", conf); + } + + public MetricSet getAllMetrics() { +return metrics; } /** http://git-wip-us.apache.org/repos/asf/spark/blob/445f1790/common/network-common/src/main/jav
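Since the patch does not wire the new metrics to any sink, the following is a minimal sketch (not part of the commit) of how a caller could register the returned `MetricSet` with a Dropwizard registry; how `clientFactory` (a `TransportClientFactory`) is obtained is assumed, and only `getAllMetrics()` comes from this change.

```scala
// Minimal sketch, not part of the commit: publish the new Netty allocator metrics.
// `clientFactory` is an existing TransportClientFactory instance (assumed here);
// only getAllMetrics() is introduced by this change.
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{ConsoleReporter, MetricRegistry}

val registry = new MetricRegistry()
registry.registerAll(clientFactory.getAllMetrics())

// Periodically dump the per-arena and aggregated direct/heap memory gauges.
val reporter = ConsoleReporter.forRegistry(registry).build()
reporter.start(30, TimeUnit.SECONDS)
```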
spark git commit: [SPARK-18061][THRIFTSERVER] Add spnego auth support for ThriftServer thrift/http protocol
Repository: spark Updated Branches: refs/heads/master 9e451bcf3 -> 6a2325448 [SPARK-18061][THRIFTSERVER] Add spnego auth support for ThriftServer thrift/http protocol Spark ThriftServer doesn't support SPNEGO auth for the thrift/http protocol, which is mainly needed for the Knox + ThriftServer scenario. HiveServer2's CLIService already has code to support this, so it is copied here into Spark ThriftServer. Related Hive JIRA: HIVE-6697. Manual verification. Author: jerryshao Closes #18628 from jerryshao/SPARK-21407. Change-Id: I61ef0c09f6972bba982475084a6b0ae3a74e385e Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a232544 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a232544 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a232544 Branch: refs/heads/master Commit: 6a2325448000ba431ba3b982d181c017559abfe3 Parents: 9e451bc Author: jerryshao Authored: Wed Sep 6 09:39:39 2017 +0800 Committer: jerryshao Committed: Wed Sep 6 09:39:39 2017 +0800 -- .../sql/hive/thriftserver/SparkSQLCLIService.scala | 16 1 file changed, 16 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6a232544/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIService.scala -- diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIService.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIService.scala index 1b17a9a..ad1f5eb 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIService.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIService.scala @@ -25,6 +25,7 @@ import scala.collection.JavaConverters._ import org.apache.commons.logging.Log import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.conf.HiveConf.ConfVars import org.apache.hadoop.hive.shims.Utils import org.apache.hadoop.security.UserGroupInformation import org.apache.hive.service.{AbstractService, Service, ServiceException} @@ -47,6 +48,7 @@ private[hive] class SparkSQLCLIService(hiveServer: HiveServer2, sqlContext: SQLC setSuperField(this, "sessionManager", sparkSqlSessionManager) addService(sparkSqlSessionManager) var sparkServiceUGI: UserGroupInformation = null +var httpUGI: UserGroupInformation = null if (UserGroupInformation.isSecurityEnabled) { try { @@ -57,6 +59,20 @@ private[hive] class SparkSQLCLIService(hiveServer: HiveServer2, sqlContext: SQLC case e @ (_: IOException | _: LoginException) => throw new ServiceException("Unable to login to kerberos with given principal/keytab", e) } + + // Try creating spnego UGI if it is configured. + val principal = hiveConf.getVar(ConfVars.HIVE_SERVER2_SPNEGO_PRINCIPAL).trim + val keyTabFile = hiveConf.getVar(ConfVars.HIVE_SERVER2_SPNEGO_KEYTAB).trim + if (principal.nonEmpty && keyTabFile.nonEmpty) { +try { + httpUGI = HiveAuthFactory.loginFromSpnegoKeytabAndReturnUGI(hiveConf) + setSuperField(this, "httpUGI", httpUGI) +} catch { + case e: IOException => +throw new ServiceException("Unable to login to spnego with given principal " + + s"$principal and keytab $keyTabFile: $e", e) +} + } } initCompositeService(hiveConf) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
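The new code path in the diff above only runs when the SPNEGO principal and keytab are configured. A hedged sketch of the relevant settings follows; the principal and keytab values are placeholders, and enabling the HTTP transport mode is shown only as an assumption about when SPNEGO applies.

```scala
// Sketch only: the SPNEGO login above is attempted only when both values are non-empty.
// The principal and keytab path below are placeholders for illustration.
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.conf.HiveConf.ConfVars

val hiveConf = new HiveConf()
hiveConf.setVar(ConfVars.HIVE_SERVER2_SPNEGO_PRINCIPAL, "HTTP/_HOST@EXAMPLE.COM")
hiveConf.setVar(ConfVars.HIVE_SERVER2_SPNEGO_KEYTAB, "/etc/security/keytabs/spnego.service.keytab")
// SPNEGO is relevant for the thrift/http transport, so HTTP mode is assumed here.
hiveConf.setVar(ConfVars.HIVE_SERVER2_TRANSPORT_MODE, "http")
```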
spark git commit: [MINOR][DOC] Update `Partition Discovery` section to enumerate all available file sources
Repository: spark Updated Branches: refs/heads/branch-2.2 1f7c4869b -> 7da8fbf08 [MINOR][DOC] Update `Partition Discovery` section to enumerate all available file sources ## What changes were proposed in this pull request? All built-in data sources support `Partition Discovery`. The documentation should be updated so that users can clearly see this benefit. **AFTER** https://user-images.githubusercontent.com/9700541/30083628-14278908-9244-11e7-98dc-9ad45fe233a9.png ## How was this patch tested? ``` SKIP_API=1 jekyll serve --watch ``` Author: Dongjoon Hyun Closes #19139 from dongjoon-hyun/partitiondiscovery. (cherry picked from commit 9e451bcf36151bf401f72dcd66001b9ceb079738) Signed-off-by: gatorsmile Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7da8fbf0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7da8fbf0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7da8fbf0 Branch: refs/heads/branch-2.2 Commit: 7da8fbf08b492ae899bef5ea5a08e2bcf4c6db93 Parents: 1f7c486 Author: Dongjoon Hyun Authored: Tue Sep 5 14:35:09 2017 -0700 Committer: gatorsmile Committed: Tue Sep 5 14:35:27 2017 -0700 -- docs/sql-programming-guide.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7da8fbf0/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index b5eca76..9a54adc 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -733,8 +733,9 @@ SELECT * FROM parquetTable Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in -the path of each partition directory. The Parquet data source is now able to discover and infer -partitioning information automatically. For example, we can store all our previously used +the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) +are able to discover and infer partitioning information automatically. +For example, we can store all our previously used population data into a partitioned table using the following directory structure, with two extra columns, `gender` and `country` as partitioning columns: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [MINOR][DOC] Update `Partition Discovery` section to enumerate all available file sources
Repository: spark Updated Branches: refs/heads/master fd60d4fa6 -> 9e451bcf3 [MINOR][DOC] Update `Partition Discovery` section to enumerate all available file sources ## What changes were proposed in this pull request? All built-in data sources support `Partition Discovery`. The documentation should be updated so that users can clearly see this benefit. **AFTER** https://user-images.githubusercontent.com/9700541/30083628-14278908-9244-11e7-98dc-9ad45fe233a9.png ## How was this patch tested? ``` SKIP_API=1 jekyll serve --watch ``` Author: Dongjoon Hyun Closes #19139 from dongjoon-hyun/partitiondiscovery. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9e451bcf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9e451bcf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9e451bcf Branch: refs/heads/master Commit: 9e451bcf36151bf401f72dcd66001b9ceb079738 Parents: fd60d4f Author: Dongjoon Hyun Authored: Tue Sep 5 14:35:09 2017 -0700 Committer: gatorsmile Committed: Tue Sep 5 14:35:09 2017 -0700 -- docs/sql-programming-guide.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9e451bcf/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index ee231a9..032073b 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -733,8 +733,9 @@ SELECT * FROM parquetTable Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in -the path of each partition directory. The Parquet data source is now able to discover and infer -partitioning information automatically. For example, we can store all our previously used +the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) +are able to discover and infer partitioning information automatically. +For example, we can store all our previously used population data into a partitioned table using the following directory structure, with two extra columns, `gender` and `country` as partitioning columns: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
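As a concrete illustration of the updated paragraph, the short sketch below (paths and column names are made up) reads a Hive-style partitioned directory with the CSV source and relies on partition discovery:

```scala
// Illustrative sketch with made-up paths: any built-in file source, not only Parquet,
// infers partition columns from a directory layout such as
//   /data/population/gender=male/country=US/part-00000.csv
val df = spark.read
  .option("header", "true")
  .csv("/data/population")

df.printSchema()  // includes `gender` and `country` as inferred partition columns
```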
spark git commit: [SPARK-21652][SQL] Fix rule conflict between InferFiltersFromConstraints and ConstantPropagation
Repository: spark Updated Branches: refs/heads/master 8c954d2cd -> fd60d4fa6 [SPARK-21652][SQL] Fix rule conflict between InferFiltersFromConstraints and ConstantPropagation ## What changes were proposed in this pull request? For the example below, the predicate added by `InferFiltersFromConstraints` is later folded away by `ConstantPropagation`, which leads to an unconverged optimizer iteration: ``` Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1") Seq(1, 2).toDF("col").createOrReplaceTempView("t2") sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col") ``` We can fix this by adjusting the order of the optimizer rules. ## How was this patch tested? Added a test case to `SQLQuerySuite` that would have failed before this change. Author: Xingbo Jiang Closes #19099 from jiangxb1987/unconverge-optimization. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fd60d4fa Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fd60d4fa Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fd60d4fa Branch: refs/heads/master Commit: fd60d4fa6c516496a60d6979edd1b4630bf721bd Parents: 8c954d2 Author: Xingbo Jiang Authored: Tue Sep 5 13:12:39 2017 -0700 Committer: gatorsmile Committed: Tue Sep 5 13:12:39 2017 -0700 -- .../spark/sql/catalyst/optimizer/Optimizer.scala | 3 ++- .../scala/org/apache/spark/sql/SQLQuerySuite.scala| 14 ++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fd60d4fa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index b73f70a..d7e5906 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -79,11 +79,12 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) PushProjectionThroughUnion, ReorderJoin, EliminateOuterJoin, + InferFiltersFromConstraints, + BooleanSimplification, PushPredicateThroughJoin, PushDownPredicate, LimitPushDown, ColumnPruning, - InferFiltersFromConstraints, // Operator combine CollapseRepartition, CollapseProject, http://git-wip-us.apache.org/repos/asf/spark/blob/fd60d4fa/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala index 923c6d8..93a 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala @@ -2663,4 +2663,18 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { // In unit test, Spark will fail the query if memory leak detected.
spark.range(100).groupBy("id").count().limit(1).collect() } + + test("SPARK-21652: rule confliction of InferFiltersFromConstraints and ConstantPropagation") { +withTempView("t1", "t2") { + Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1") + Seq(1, 2).toDF("col").createOrReplaceTempView("t2") + val df = sql( +""" + |SELECT * + |FROM t1, t2 + |WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col +""".stripMargin) + checkAnswer(df, Row(1, 1, 1)) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
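For anyone who wants to see the effect of the new rule order, here is a small sketch (reusing the reproduction from the description above) that prints the optimized plan so the inferred and propagated predicates are visible; an active `SparkSession` named `spark` is assumed.

```scala
// Sketch: with the adjusted rule order the optimizer converges, and the resulting
// predicates can be inspected on the optimized plan. Assumes a SparkSession `spark`.
import spark.implicits._

Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1")
Seq(1, 2).toDF("col").createOrReplaceTempView("t2")

val df = spark.sql(
  "SELECT * FROM t1, t2 " +
  "WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col")
println(df.queryExecution.optimizedPlan.numberedTreeString)
```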
spark git commit: [SPARK-21925] Update trigger interval documentation in docs with behavior change in Spark 2.2
Repository: spark Updated Branches: refs/heads/branch-2.2 fb1b5f08a -> 1f7c4869b [SPARK-21925] Update trigger interval documentation in docs with behavior change in Spark 2.2 Forgot to update docs with behavior change. Author: Burak Yavuz Closes #19138 from brkyvz/trigger-doc-fix. (cherry picked from commit 8c954d2cd10a2cf729d2971fbeb19b2dd751a178) Signed-off-by: Tathagata Das Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1f7c4869 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1f7c4869 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1f7c4869 Branch: refs/heads/branch-2.2 Commit: 1f7c4869b811f9a05cd1fb54e168e739cde7933f Parents: fb1b5f0 Author: Burak Yavuz Authored: Tue Sep 5 13:10:32 2017 -0700 Committer: Tathagata Das Committed: Tue Sep 5 13:10:47 2017 -0700 -- docs/structured-streaming-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1f7c4869/docs/structured-streaming-programming-guide.md -- diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 8367f5a..13a6a82 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -1168,7 +1168,7 @@ returned through `Dataset.writeStream()`. You will have to specify one or more o - *Query name:* Optionally, specify a unique name of the query for identification. -- *Trigger interval:* Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has completed. If a trigger time is missed because the previous processing has not completed, then the system will attempt to trigger at the next trigger point, not immediately after the processing has completed. +- *Trigger interval:* Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has completed. If a trigger time is missed because the previous processing has not completed, then the system will trigger processing immediately. - *Checkpoint location:* For some output sinks where the end-to-end fault-tolerance can be guaranteed, specify the location where the system will write all the checkpoint information. This should be a directory in an HDFS-compatible fault-tolerant file system. The semantics of checkpointing is discussed in more detail in the next section. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-21925] Update trigger interval documentation in docs with behavior change in Spark 2.2
Repository: spark Updated Branches: refs/heads/master 2974406d1 -> 8c954d2cd [SPARK-21925] Update trigger interval documentation in docs with behavior change in Spark 2.2 Forgot to update docs with behavior change. Author: Burak Yavuz Closes #19138 from brkyvz/trigger-doc-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8c954d2c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8c954d2c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8c954d2c Branch: refs/heads/master Commit: 8c954d2cd10a2cf729d2971fbeb19b2dd751a178 Parents: 2974406 Author: Burak Yavuz Authored: Tue Sep 5 13:10:32 2017 -0700 Committer: Tathagata Das Committed: Tue Sep 5 13:10:32 2017 -0700 -- docs/structured-streaming-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8c954d2c/docs/structured-streaming-programming-guide.md -- diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 8367f5a..13a6a82 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -1168,7 +1168,7 @@ returned through `Dataset.writeStream()`. You will have to specify one or more o - *Query name:* Optionally, specify a unique name of the query for identification. -- *Trigger interval:* Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has completed. If a trigger time is missed because the previous processing has not completed, then the system will attempt to trigger at the next trigger point, not immediately after the processing has completed. +- *Trigger interval:* Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has completed. If a trigger time is missed because the previous processing has not completed, then the system will trigger processing immediately. - *Checkpoint location:* For some output sinks where the end-to-end fault-tolerance can be guaranteed, specify the location where the system will write all the checkpoint information. This should be a directory in an HDFS-compatible fault-tolerant file system. The semantics of checkpointing is discussed in more detail in the next section. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
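For reference, a short sketch of how the trigger interval mentioned in the updated paragraph is specified; the streaming DataFrame `streamingDF` is an assumption, not part of the commit.

```scala
// Sketch: explicit trigger interval vs. the default "as soon as the previous batch
// finishes" behavior. `streamingDF` is an assumed, already-defined streaming DataFrame.
import org.apache.spark.sql.streaming.Trigger

val query = streamingDF.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))  // omit to process as soon as data is available
  .start()
```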
spark git commit: [SPARK-21845][SQL][TEST-MAVEN] Make codegen fallback of expressions configurable
Repository: spark Updated Branches: refs/heads/master 02a4386ae -> 2974406d1 [SPARK-21845][SQL][TEST-MAVEN] Make codegen fallback of expressions configurable ## What changes were proposed in this pull request? We should make codegen fallback of expressions configurable. So far, it is always on. We might hide it when our codegen have compilation bugs. Thus, we should also disable the codegen fallback when running test cases. ## How was this patch tested? Added test cases Author: gatorsmile Closes #19119 from gatorsmile/fallbackCodegen. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2974406d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2974406d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2974406d Branch: refs/heads/master Commit: 2974406d17a3831c1897b8d99261419592f8042f Parents: 02a4386 Author: gatorsmile Authored: Tue Sep 5 09:04:03 2017 -0700 Committer: gatorsmile Committed: Tue Sep 5 09:04:03 2017 -0700 -- .../scala/org/apache/spark/sql/internal/SQLConf.scala | 6 +++--- .../org/apache/spark/sql/execution/SparkPlan.scala | 11 ++- .../spark/sql/execution/WholeStageCodegenExec.scala | 2 +- .../org/apache/spark/sql/DataFrameFunctionsSuite.scala | 2 +- .../scala/org/apache/spark/sql/DataFrameSuite.scala | 12 +++- .../org/apache/spark/sql/test/SharedSQLContext.scala| 2 ++ .../scala/org/apache/spark/sql/hive/test/TestHive.scala | 1 + 7 files changed, 25 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2974406d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index c407874..db5d65c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -559,9 +559,9 @@ object SQLConf { .intConf .createWithDefault(100) - val WHOLESTAGE_FALLBACK = buildConf("spark.sql.codegen.fallback") + val CODEGEN_FALLBACK = buildConf("spark.sql.codegen.fallback") .internal() -.doc("When true, whole stage codegen could be temporary disabled for the part of query that" + +.doc("When true, (whole stage) codegen could be temporary disabled for the part of query that" + " fail to compile generated code") .booleanConf .createWithDefault(true) @@ -1051,7 +1051,7 @@ class SQLConf extends Serializable with Logging { def wholeStageMaxNumFields: Int = getConf(WHOLESTAGE_MAX_NUM_FIELDS) - def wholeStageFallback: Boolean = getConf(WHOLESTAGE_FALLBACK) + def codegenFallback: Boolean = getConf(CODEGEN_FALLBACK) def maxCaseBranchesForCodegen: Int = getConf(MAX_CASES_BRANCHES) http://git-wip-us.apache.org/repos/asf/spark/blob/2974406d/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala index c7277c2..b263f10 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala @@ -56,15 +56,17 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ protected def sparkContext = sqlContext.sparkContext - // sqlContext will be null when we are being deserialized on the slaves. 
In this instance - // the value of subexpressionEliminationEnabled will be set by the deserializer after the - // constructor has run. + // sqlContext will be null when SparkPlan nodes are created without the active sessions. + // So far, this only happens in the test cases. val subexpressionEliminationEnabled: Boolean = if (sqlContext != null) { sqlContext.conf.subexpressionEliminationEnabled } else { false } + // whether we should fallback when hitting compilation errors caused by codegen + private val codeGenFallBack = (sqlContext == null) || sqlContext.conf.codegenFallback + /** Overridden make copy also propagates sqlContext to copied plan. */ override def makeCopy(newArgs: Array[AnyRef]): SparkPlan = { SparkSession.setActiveSession(sqlContext.sparkSession) @@ -370,8 +372,7 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ try { GeneratePredicate.generate(expression, inputSchema) } catch { - case e @ (_: JaninoRuntimeException | _: CompileException) - if sqlCon
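A hedged sketch of how the renamed flag can be toggled at runtime; the configuration key keeps its old name, `spark.sql.codegen.fallback`, as shown in the diff above.

```scala
// Sketch: disabling codegen fallback surfaces generated-code compilation errors instead of
// silently falling back to the non-codegen path; re-enable it to restore the default.
spark.conf.set("spark.sql.codegen.fallback", "false")
// ... run the query under investigation ...
spark.conf.set("spark.sql.codegen.fallback", "true")
```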
spark git commit: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Repository: spark Updated Branches: refs/heads/master 7f3c6ff4f -> 02a4386ae [SPARK-20978][SQL] Bump up Univocity version to 2.5.4 ## What changes were proposed in this pull request? There was a bug in Univocity Parser that causes the issue in SPARK-20978. This was fixed as below: ```scala val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS()) df.show() ``` **Before** ``` java.lang.NullPointerException at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89) at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) ... ``` **After** ``` +---+++ | a| b|unparsed| +---+++ | a|null| a| +---+++ ``` It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this. ## How was this patch tested? Unit test added in `CSVSuite.scala`. Author: hyukjinkwon Closes #19113 from HyukjinKwon/bump-up-univocity. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02a4386a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02a4386a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02a4386a Branch: refs/heads/master Commit: 02a4386aec5f83f41ca1abc5f56e223b6fae015c Parents: 7f3c6ff Author: hyukjinkwon Authored: Tue Sep 5 23:21:43 2017 +0800 Committer: Wenchen Fan Committed: Tue Sep 5 23:21:43 2017 +0800 -- dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- sql/core/pom.xml | 2 +- .../spark/sql/execution/datasources/csv/CSVSuite.scala | 8 4 files changed, 11 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/02a4386a/dev/deps/spark-deps-hadoop-2.6 -- diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index 1535103..e3b9ce0 100644 --- a/dev/deps/spark-deps-hadoop-2.6 +++ b/dev/deps/spark-deps-hadoop-2.6 @@ -182,7 +182,7 @@ stax-api-1.0.1.jar stream-2.7.0.jar stringtemplate-3.2.1.jar super-csv-2.2.0.jar -univocity-parsers-2.2.1.jar +univocity-parsers-2.5.4.jar validation-api-1.1.0.Final.jar xbean-asm5-shaded-4.4.jar xercesImpl-2.9.1.jar http://git-wip-us.apache.org/repos/asf/spark/blob/02a4386a/dev/deps/spark-deps-hadoop-2.7 -- diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index deaa288..a3f3f32 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -183,7 +183,7 @@ stax-api-1.0.1.jar stream-2.7.0.jar stringtemplate-3.2.1.jar super-csv-2.2.0.jar -univocity-parsers-2.2.1.jar +univocity-parsers-2.5.4.jar validation-api-1.1.0.Final.jar xbean-asm5-shaded-4.4.jar xercesImpl-2.9.1.jar http://git-wip-us.apache.org/repos/asf/spark/blob/02a4386a/sql/core/pom.xml -- diff --git a/sql/core/pom.xml b/sql/core/pom.xml index 9a3cacb..7ee002e 100644 --- a/sql/core/pom.xml +++ b/sql/core/pom.xml @@ -38,7 +38,7 @@ com.univocity univocity-parsers - 2.2.1 + 2.5.4 jar 
http://git-wip-us.apache.org/repos/asf/spark/blob/02a4386a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index 243a55c..be89141 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -1195,4 +1195,12 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { .csv(Seq("10u12").toDS()) checkAnswer(results, Row(null)) } + + test("SPARK-20978: Fill the malformed column when the number of tokens is less than schem
spark git commit: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Repository: spark Updated Branches: refs/heads/master 4e7a29efd -> 7f3c6ff4f [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0. ## What changes were proposed in this pull request? 1.0.0 fixes an issue with import order, explicit type for public methods, line length limitation and comment validation: ``` [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16: Are you sure you want to println? If yes, wrap the code block with [error] // scalastyle:off println [error] println(...) [error] // scalastyle:on println [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49: File line length exceeds 100 characters [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21: Are you sure you want to println? If yes, wrap the code block with [error] // scalastyle:off println [error] println(...) [error] // scalastyle:on println [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2: Insert a space after the start of the comment [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43: JavaDStream should come before JavaDStreamLike. ``` This PR also fixes the workaround added in SPARK-16877 for `org.scalastyle.scalariform.OverrideJavaChecker` feature, added from 0.9.0. ## How was this patch tested? Manually tested. Author: hyukjinkwon Closes #19116 from HyukjinKwon/scalastyle-1.0.0. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f3c6ff4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7f3c6ff4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7f3c6ff4 Branch: refs/heads/master Commit: 7f3c6ff4ff0a501cc7f1fb53a90ea7b5787f68e1 Parents: 4e7a29e Author: hyukjinkwon Authored: Tue Sep 5 19:40:05 2017 +0900 Committer: hyukjinkwon Committed: Tue Sep 5 19:40:05 2017 +0900 -- project/SparkBuild.scala| 5 +++-- project/plugins.sbt | 3 +-- .../src/main/scala/org/apache/spark/repl/Main.scala | 2 ++ .../main/scala/org/apache/spark/repl/SparkILoop.scala | 5 - scalastyle-config.xml | 5 + .../java/org/apache/spark/streaming/JavaTestUtils.scala | 12 ++-- 6 files changed, 17 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7f3c6ff4/project/SparkBuild.scala -- diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index 9d903ed..20848f0 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -163,14 +163,15 @@ object SparkBuild extends PomBuild { val configUrlV = scalastyleConfigUrl.in(config).value val streamsV = streams.in(config).value val failOnErrorV = true +val failOnWarningV = false val scalastyleTargetV = scalastyleTarget.in(config).value val configRefreshHoursV = scalastyleConfigRefreshHours.in(config).value val targetV = target.in(config).value val configCacheFileV = scalastyleConfigUrlCacheFile.in(config).value logger.info(s"Running scalastyle on ${name.value} in ${config.name}") -Tasks.doScalastyle(args, configV, configUrlV, failOnErrorV, scalaSourceV, scalastyleTargetV, - streamsV, configRefreshHoursV, targetV, configCacheFileV) +Tasks.doScalastyle(args, configV, configUrlV, failOnErrorV, failOnWarningV, scalaSourceV, + scalastyleTargetV, streamsV, configRefreshHoursV, targetV, configCacheFileV) Set.empty } http://git-wip-us.apache.org/repos/asf/spark/blob/7f3c6ff4/project/plugins.sbt -- diff --git a/project/plugins.sbt b/project/plugins.sbt index f67e0a1..3c5442b 100644 --- a/project/plugins.sbt +++ b/project/plugins.sbt @@ -7,8 +7,7 @@ addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.1.0") // sbt 1.0.0 support: https://github.com/jrudolph/sbt-dependency-graph/issues/134 addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2") -// need to
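The `println` errors quoted above refer to the usual suppression pattern, shown here for completeness:

```scala
// Suppression pattern scalastyle expects for intentional console output.
// scalastyle:off println
println("intentional console output")
// scalastyle:on println
```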
spark git commit: [SPARK-21913][SQL][TEST] `withDatabase` should drop database with CASCADE
Repository: spark Updated Branches: refs/heads/master ca59445ad -> 4e7a29efd [SPARK-21913][SQL][TEST] `withDatabase` should drop database with CASCADE ## What changes were proposed in this pull request? Currently, `withDatabase` fails if the database is not empty. It would be better to drop the database cleanly with CASCADE. ## How was this patch tested? This is a change to a test utility; it passes the existing Jenkins tests. Author: Dongjoon Hyun Closes #19125 from dongjoon-hyun/SPARK-21913. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4e7a29ef Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4e7a29ef Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4e7a29ef Branch: refs/heads/master Commit: 4e7a29efdba6972a4713a62dfccb495504a25ab9 Parents: ca59445 Author: Dongjoon Hyun Authored: Tue Sep 5 00:20:16 2017 -0700 Committer: gatorsmile Committed: Tue Sep 5 00:20:16 2017 -0700 -- .../src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4e7a29ef/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala index e68db3b..a14a144 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala @@ -247,7 +247,7 @@ private[sql] trait SQLTestUtils protected def withDatabase(dbNames: String*)(f: => Unit): Unit = { try f finally { dbNames.foreach { name => -spark.sql(s"DROP DATABASE IF EXISTS $name") +spark.sql(s"DROP DATABASE IF EXISTS $name CASCADE") } spark.sql(s"USE $DEFAULT_DATABASE") } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
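A small sketch (database and table names are made up) of the failure mode that the added CASCADE avoids:

```scala
// Sketch: dropping a non-empty database fails without CASCADE.
spark.sql("CREATE DATABASE IF NOT EXISTS db1")
spark.sql("CREATE TABLE db1.t1(id INT) USING parquet")
// spark.sql("DROP DATABASE db1")        // would fail because db1 still contains a table
spark.sql("DROP DATABASE db1 CASCADE")   // drops db1 together with its tables
```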