spark git commit: [SPARK-19816][SQL][TESTS] Fix an issue that DataFrameCallbackSuite doesn't recover the log level
Repository: spark Updated Branches: refs/heads/branch-2.1 da04d45c2 -> 664c9795c [SPARK-19816][SQL][TESTS] Fix an issue that DataFrameCallbackSuite doesn't recover the log level ## What changes were proposed in this pull request? "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs. This PR uses `testQuietly` instead to avoid changing the log level. ## How was this patch tested? Jenkins Author: Shixiong Zhu Closes #17156 from zsxwing/SPARK-19816. (cherry picked from commit fbc4058037cf5b0be9f14a7dd28105f7f8151bed) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/664c9795 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/664c9795 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/664c9795 Branch: refs/heads/branch-2.1 Commit: 664c9795c94d3536ff9fe54af06e0fb6c0012862 Parents: da04d45 Author: Shixiong Zhu Authored: Fri Mar 3 19:00:35 2017 -0800 Committer: Yin Huai Committed: Fri Mar 3 19:09:38 2017 -0800 -- .../scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/664c9795/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala index 3ae5ce6..f372e94 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala @@ -58,7 +58,7 @@ class DataFrameCallbackSuite extends QueryTest with SharedSQLContext { spark.listenerManager.unregister(listener) } - test("execute callback functions when a DataFrame action failed") { + testQuietly("execute callback functions when a DataFrame action failed") { val metrics = ArrayBuffer.empty[(String, QueryExecution, Exception)] val listener = new QueryExecutionListener { override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = { @@ -73,8 +73,6 @@ class DataFrameCallbackSuite extends QueryTest with SharedSQLContext { val errorUdf = udf[Int, Int] { _ => throw new RuntimeException("udf error") } val df = sparkContext.makeRDD(Seq(1 -> "a")).toDF("i", "j") -// Ignore the log when we are expecting an exception. -sparkContext.setLogLevel("FATAL") val e = intercept[SparkException](df.select(errorUdf($"i")).collect()) assert(metrics.length == 1) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-19816][SQL][TESTS] Fix an issue that DataFrameCallbackSuite doesn't recover the log level
Repository: spark Updated Branches: refs/heads/master 9e5b4ce72 -> fbc405803 [SPARK-19816][SQL][TESTS] Fix an issue that DataFrameCallbackSuite doesn't recover the log level ## What changes were proposed in this pull request? "DataFrameCallbackSuite.execute callback functions when a DataFrame action failed" sets the log level to "fatal" but doesn't recover it. Hence, tests running after it won't output any logs except fatal logs. This PR uses `testQuietly` instead to avoid changing the log level. ## How was this patch tested? Jenkins Author: Shixiong Zhu Closes #17156 from zsxwing/SPARK-19816. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fbc40580 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fbc40580 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fbc40580 Branch: refs/heads/master Commit: fbc4058037cf5b0be9f14a7dd28105f7f8151bed Parents: 9e5b4ce Author: Shixiong Zhu Authored: Fri Mar 3 19:00:35 2017 -0800 Committer: Xiao Li Committed: Fri Mar 3 19:00:35 2017 -0800 -- .../scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fbc40580/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala index 9f27d06..7c9ea7d 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala @@ -60,7 +60,7 @@ class DataFrameCallbackSuite extends QueryTest with SharedSQLContext { spark.listenerManager.unregister(listener) } - test("execute callback functions when a DataFrame action failed") { + testQuietly("execute callback functions when a DataFrame action failed") { val metrics = ArrayBuffer.empty[(String, QueryExecution, Exception)] val listener = new QueryExecutionListener { override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = { @@ -75,8 +75,6 @@ class DataFrameCallbackSuite extends QueryTest with SharedSQLContext { val errorUdf = udf[Int, Int] { _ => throw new RuntimeException("udf error") } val df = sparkContext.makeRDD(Seq(1 -> "a")).toDF("i", "j") -// Ignore the log when we are expecting an exception. -sparkContext.setLogLevel("FATAL") val e = intercept[SparkException](df.select(errorUdf($"i")).collect()) assert(metrics.length == 1) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
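For readers unfamiliar with the `testQuietly` helper used in the two commits above, the sketch below shows the general shape of such a helper: it silences logging only for the duration of the test body and always restores the previous level in a `finally` block. This is an illustrative ScalaTest/log4j sketch under those assumptions, not necessarily how Spark's own `testQuietly` is implemented.

```scala
import org.apache.log4j.{Level, LogManager}
import org.scalatest.FunSuite

abstract class QuietTestSuite extends FunSuite {
  /** Run a test with logging silenced, restoring the original level afterwards. */
  protected def testQuietly(name: String)(body: => Unit): Unit = test(name) {
    val rootLogger = LogManager.getRootLogger
    val originalLevel = rootLogger.getLevel
    rootLogger.setLevel(Level.FATAL)      // hide the noise from the expected failure
    try {
      body
    } finally {
      rootLogger.setLevel(originalLevel)  // unlike a bare setLogLevel("FATAL"), this always recovers
    }
  }
}
```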
spark git commit: [SPARK-19084][SQL] Ensure context class loader is set when initializing Hive.
Repository: spark Updated Branches: refs/heads/master a6a7a95e2 -> 9e5b4ce72 [SPARK-19084][SQL] Ensure context class loader is set when initializing Hive. A change in Hive 2.2 (most probably HIVE-13149) causes this code path to fail, since the call to "state.getConf.setClassLoader" does not actually change the context's class loader. Spark doesn't yet officially support Hive 2.2, but some distribution-specific metastore client libraries may have that change (as certain versions of CDH already do), and this also makes it easier to support 2.2 when it comes out. Tested with existing unit tests; we've also used this patch extensively with Hive metastore client jars containing the offending patch. Author: Marcelo Vanzin Closes #17154 from vanzin/SPARK-19804. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9e5b4ce7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9e5b4ce7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9e5b4ce7 Branch: refs/heads/master Commit: 9e5b4ce727cf262a14a411efded85ee1e50a88ed Parents: a6a7a95 Author: Marcelo Vanzin Authored: Fri Mar 3 18:44:31 2017 -0800 Committer: Xiao Li Committed: Fri Mar 3 18:44:31 2017 -0800 -- .../apache/spark/sql/hive/client/HiveClientImpl.scala| 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9e5b4ce7/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala index 8f98c8f..7acaa9a 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala @@ -269,16 +269,21 @@ private[hive] class HiveClientImpl( */ def withHiveState[A](f: => A): A = retryLocked { val original = Thread.currentThread().getContextClassLoader -// Set the thread local metastore client to the client associated with this HiveClientImpl. -Hive.set(client) +val originalConfLoader = state.getConf.getClassLoader // The classloader in clientLoader could be changed after addJar, always use the latest -// classloader +// classloader. We explicitly set the context class loader since "conf.setClassLoader" does +// not do that, and the Hive client libraries may need to load classes defined by the client's +// class loader. +Thread.currentThread().setContextClassLoader(clientLoader.classLoader) state.getConf.setClassLoader(clientLoader.classLoader) +// Set the thread local metastore client to the client associated with this HiveClientImpl. +Hive.set(client) // setCurrentSessionState will use the classLoader associated // with the HiveConf in `state` to override the context class loader of the current // thread. shim.setCurrentSessionState(state) val ret = try f finally { + state.getConf.setClassLoader(originalConfLoader) Thread.currentThread().setContextClassLoader(original) HiveCatalogMetrics.incrementHiveClientCalls(1) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
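The fix above follows a common "swap and restore the thread context class loader" pattern. The hedged sketch below shows that pattern in isolation; `withContextClassLoader` and `isolatedLoader` are illustrative names, not Spark APIs.

```scala
// Generic sketch of the pattern used in withHiveState: set the context class loader
// for the duration of a block and restore it even if the block throws.
def withContextClassLoader[A](isolatedLoader: ClassLoader)(body: => A): A = {
  val original = Thread.currentThread().getContextClassLoader
  Thread.currentThread().setContextClassLoader(isolatedLoader)
  try {
    body  // reflective class loads inside the block now see `isolatedLoader`
  } finally {
    Thread.currentThread().setContextClassLoader(original)
  }
}
```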
spark git commit: [SPARK-19718][SS] Handle more interrupt cases properly for Hadoop
Repository: spark Updated Branches: refs/heads/master f5fdbe043 -> a6a7a95e2 [SPARK-19718][SS] Handle more interrupt cases properly for Hadoop ## What changes were proposed in this pull request? [SPARK-19617](https://issues.apache.org/jira/browse/SPARK-19617) changed `HDFSMetadataLog` to enable interrupts when using the local file system. However, now we hit [HADOOP-12074](https://issues.apache.org/jira/browse/HADOOP-12074): `Shell.runCommand` converts `InterruptedException` to `new IOException(ie.toString())` before Hadoop 2.8. This is the Hadoop patch to fix HADOOP-1207: https://github.com/apache/hadoop/commit/95c73d49b1bb459b626a9ac52acadb8f5fa724de This PR adds new logic to handle the following cases related to `InterruptedException`. - Check if the message of IOException starts with `java.lang.InterruptedException`. If so, treat it as `InterruptedException`. This is for pre-Hadoop 2.8. - Treat `InterruptedIOException` as `InterruptedException`. This is for Hadoop 2.8+ and other places that may throw `InterruptedIOException` when the thread is interrupted. ## How was this patch tested? The new unit test. Author: Shixiong Zhu Closes #17044 from zsxwing/SPARK-19718. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a6a7a95e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a6a7a95e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a6a7a95e Branch: refs/heads/master Commit: a6a7a95e2f3482d84fcd744713e43f80ea90e33a Parents: f5fdbe0 Author: Shixiong Zhu Authored: Fri Mar 3 17:10:11 2017 -0800 Committer: Shixiong Zhu Committed: Fri Mar 3 17:10:11 2017 -0800 -- .../execution/streaming/StreamExecution.scala | 20 +++- .../spark/sql/streaming/StreamSuite.scala | 109 ++- 2 files changed, 119 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a6a7a95e/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala index 6e77f35..70912d1 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala @@ -17,6 +17,7 @@ package org.apache.spark.sql.execution.streaming +import java.io.{InterruptedIOException, IOException} import java.util.UUID import java.util.concurrent.{CountDownLatch, TimeUnit} import java.util.concurrent.atomic.AtomicReference @@ -37,6 +38,12 @@ import org.apache.spark.sql.execution.command.StreamingExplainCommand import org.apache.spark.sql.streaming._ import org.apache.spark.util.{Clock, UninterruptibleThread, Utils} +/** States for [[StreamExecution]]'s lifecycle. */ +trait State +case object INITIALIZING extends State +case object ACTIVE extends State +case object TERMINATED extends State + /** * Manages the execution of a streaming Spark SQL query that is occurring in a separate thread. * Unlike a standard query, a streaming query executes repeatedly each time new data arrives at any @@ -298,7 +305,14 @@ class StreamExecution( // `stop()` is already called. Let `finally` finish the cleanup. 
} } catch { - case _: InterruptedException if state.get == TERMINATED => // interrupted by stop() + case _: InterruptedException | _: InterruptedIOException if state.get == TERMINATED => +// interrupted by stop() +updateStatusMessage("Stopped") + case e: IOException if e.getMessage != null +&& e.getMessage.startsWith(classOf[InterruptedException].getName) +&& state.get == TERMINATED => +// This is a workaround for HADOOP-12074: `Shell.runCommand` converts `InterruptedException` +// to `new IOException(ie.toString())` before Hadoop 2.8. updateStatusMessage("Stopped") case e: Throwable => streamDeathCause = new StreamingQueryException( @@ -721,10 +735,6 @@ class StreamExecution( } } - trait State - case object INITIALIZING extends State - case object ACTIVE extends State - case object TERMINATED extends State } http://git-wip-us.apache.org/repos/asf/spark/blob/a6a7a95e/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala index f44cfad..6dfcd8b 100644 --- a/sql/core/src/test/scala/org/apache/spark/
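The two new cases in the `catch` block above boil down to a single question: was this exception really an interruption? The following sketch captures that check as a standalone helper; it is an illustration of the rule described in the commit message, not an API that exists in Spark.

```scala
import java.io.{IOException, InterruptedIOException}

def isInterruption(e: Throwable): Boolean = e match {
  case _: InterruptedException   => true
  case _: InterruptedIOException => true   // Hadoop 2.8+ and other IO-layer interrupts
  case ioe: IOException =>
    // Workaround for HADOOP-12074: before Hadoop 2.8, Shell.runCommand rewraps the
    // interrupt as `new IOException(ie.toString())`, so the message starts with the
    // InterruptedException class name.
    ioe.getMessage != null && ioe.getMessage.startsWith(classOf[InterruptedException].getName)
  case _ => false
}
```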
spark git commit: [SPARK-13446][SQL] Support reading data from Hive 2.0.1 metastore
Repository: spark Updated Branches: refs/heads/master 44281ca81 -> f5fdbe043 [SPARK-13446][SQL] Support reading data from Hive 2.0.1 metastore ### What changes were proposed in this pull request? This PR is to make Spark work with Hive 2.0's metastores. Compared with Hive 1.2, Hive 2.0's metastore has an API update due to removal of `HOLD_DDLTIME` in https://issues.apache.org/jira/browse/HIVE-12224. Based on the following Hive JIRA description, `HOLD_DDLTIME` should be removed from our internal API too. (https://github.com/apache/spark/pull/17063 was submitted for it): > This arcane feature was introduced long ago via HIVE-1394 It was broken as > soon as it landed, HIVE-1442 and is thus useless. Fact that no one has fixed > it since informs that its not really used by anyone. Better is to remove it > so no one hits the bug of HIVE-1442 In the next PR, we will support 2.1.0 metastore, whose APIs were changed due to https://issues.apache.org/jira/browse/HIVE-12730. However, before that, we need a code cleanup for stats collection and setting. ### How was this patch tested? Added test cases to VersionsSuite.scala Author: Xiao Li Closes #17061 from gatorsmile/Hive2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5fdbe04 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5fdbe04 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5fdbe04 Branch: refs/heads/master Commit: f5fdbe04369d2c04972b7d9686d0c6c046f3d3dc Parents: 44281ca Author: Xiao Li Authored: Fri Mar 3 16:59:52 2017 -0800 Committer: Wenchen Fan Committed: Fri Mar 3 16:59:52 2017 -0800 -- .../spark/sql/hive/client/HiveClientImpl.scala | 1 + .../apache/spark/sql/hive/client/HiveShim.scala | 74 .../sql/hive/client/IsolatedClientLoader.scala | 1 + .../apache/spark/sql/hive/client/package.scala | 8 ++- .../hive/execution/InsertIntoHiveTable.scala| 11 ++- .../sql/hive/client/HiveClientBuilder.scala | 12 ++-- .../spark/sql/hive/client/VersionsSuite.scala | 9 ++- 7 files changed, 107 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f5fdbe04/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala index c326ac4..8f98c8f 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala @@ -96,6 +96,7 @@ private[hive] class HiveClientImpl( case hive.v1_0 => new Shim_v1_0() case hive.v1_1 => new Shim_v1_1() case hive.v1_2 => new Shim_v1_2() +case hive.v2_0 => new Shim_v2_0() } // Create an internal session state for this HiveClientImpl. 
http://git-wip-us.apache.org/repos/asf/spark/blob/f5fdbe04/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala index 9fe1c76..7280748 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala @@ -833,3 +833,77 @@ private[client] class Shim_v1_2 extends Shim_v1_1 { } } + +private[client] class Shim_v2_0 extends Shim_v1_2 { + private lazy val loadPartitionMethod = +findMethod( + classOf[Hive], + "loadPartition", + classOf[Path], + classOf[String], + classOf[JMap[String, String]], + JBoolean.TYPE, + JBoolean.TYPE, + JBoolean.TYPE, + JBoolean.TYPE, + JBoolean.TYPE) + private lazy val loadTableMethod = +findMethod( + classOf[Hive], + "loadTable", + classOf[Path], + classOf[String], + JBoolean.TYPE, + JBoolean.TYPE, + JBoolean.TYPE, + JBoolean.TYPE) + private lazy val loadDynamicPartitionsMethod = +findMethod( + classOf[Hive], + "loadDynamicPartitions", + classOf[Path], + classOf[String], + classOf[JMap[String, String]], + JBoolean.TYPE, + JInteger.TYPE, + JBoolean.TYPE, + JBoolean.TYPE, + JLong.TYPE) + + override def loadPartition( + hive: Hive, + loadPath: Path, + tableName: String, + partSpec: JMap[String, String], + replace: Boolean, + inheritTableSpecs: Boolean, + isSkewedStoreAsSubdir: Boolean, + isSrcLocal: Boolean): Unit = { +loadPartitionMethod.invoke(hive,
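The `Shim_v2_0` class added above follows the shim pattern Spark uses to talk to several Hive versions: resolve each version-specific method once via reflection, then invoke it with boxed arguments. The sketch below illustrates that pattern with placeholder class, method, and argument names (Hive 2.0 dropped the `HOLD_DDLTIME` argument, hence the shorter signature); it is not the actual `Shim` hierarchy.

```scala
import java.lang.reflect.Method

class ReflectiveShim(target: AnyRef) {
  private def findMethod(name: String, argTypes: Class[_]*): Method =
    target.getClass.getMethod(name, argTypes: _*)

  // Bind once to the version-specific signature (argument types here are placeholders).
  private lazy val loadTableMethod: Method =
    findMethod("loadTable", classOf[String], java.lang.Boolean.TYPE)

  def loadTable(tableName: String, replace: Boolean): Unit =
    loadTableMethod.invoke(target, tableName, java.lang.Boolean.valueOf(replace))
}
```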
spark git commit: [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe
Repository: spark Updated Branches: refs/heads/master 2a7921a81 -> 44281ca81 [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe ## What changes were proposed in this pull request? The `keyword_only` decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`. If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten. See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code. This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize. ## How was this patch tested? Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances. Author: Bryan Cutler Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/44281ca8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/44281ca8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/44281ca8 Branch: refs/heads/master Commit: 44281ca81d4eda02b627ba21841108438b7d1c27 Parents: 2a7921a Author: Bryan Cutler Authored: Fri Mar 3 16:43:45 2017 -0800 Committer: Joseph K. Bradley Committed: Fri Mar 3 16:43:45 2017 -0800 -- python/pyspark/__init__.py | 10 +-- python/pyspark/ml/classification.py | 32 - python/pyspark/ml/clustering.py | 16 ++--- python/pyspark/ml/evaluation.py | 12 ++-- python/pyspark/ml/feature.py| 120 +++ python/pyspark/ml/pipeline.py | 4 +- python/pyspark/ml/recommendation.py | 4 +- python/pyspark/ml/regression.py | 28 python/pyspark/ml/tests.py | 8 +-- python/pyspark/ml/tuning.py | 8 +-- python/pyspark/tests.py | 39 ++ 11 files changed, 161 insertions(+), 120 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/44281ca8/python/pyspark/__init__.py -- diff --git a/python/pyspark/__init__.py b/python/pyspark/__init__.py index 9331e74..14c51a3 100644 --- a/python/pyspark/__init__.py +++ b/python/pyspark/__init__.py @@ -93,13 +93,15 @@ def keyword_only(func): """ A decorator that forces keyword arguments in the wrapped method and saves actual input keyword arguments in `_input_kwargs`. + +.. note:: Should only be used to wrap a method where first arg is `self` """ @wraps(func) -def wrapper(*args, **kwargs): -if len(args) > 1: +def wrapper(self, *args, **kwargs): +if len(args) > 0: raise TypeError("Method %s forces keyword arguments." 
% func.__name__) -wrapper._input_kwargs = kwargs -return func(*args, **kwargs) +self._input_kwargs = kwargs +return func(self, **kwargs) return wrapper http://git-wip-us.apache.org/repos/asf/spark/blob/44281ca8/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index ac40fce..b4fc357 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -124,7 +124,7 @@ class LinearSVC(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, Ha "org.apache.spark.ml.classification.LinearSVC", self.uid) self._setDefault(maxIter=100, regParam=0.0, tol=1e-6, fitIntercept=True, standardization=True, threshold=0.0, aggregationDepth=2) -kwargs = self.__init__._input_kwargs +kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @@ -140,7 +140,7 @@ class LinearSVC(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, Ha aggregationDepth=2): Sets params for Linear SVM Classifier. """ -kwargs = self.setParams._input_kwargs +kwargs = self._input_kwargs return self._set(**kwargs) def _create_model(self, java_model): @@ -266,7 +266,7 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti self._java_obj = self._new_java_obj( "org.apache.spark.ml.classification.LogisticRegression", self.uid) self._setDefault(maxIter=100, regParam=0.0, tol=1E-6, t
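The core of the fix is visible in the diff above: the saved kwargs move from the function object to the instance. The self-contained sketch below (the class name is hypothetical, not a PySpark API) shows how a class uses the thread-safe decorator, so two threads constructing separate instances no longer race on a shared `_input_kwargs`.

```python
from functools import wraps

def keyword_only(func):
    """Force keyword arguments and save them on the instance (safe across instances)."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if len(args) > 0:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        self._input_kwargs = kwargs        # per-instance state, not a shared class attribute
        return func(self, **kwargs)
    return wrapper

class ExampleEstimator(object):            # hypothetical example class
    @keyword_only
    def __init__(self, maxIter=100, regParam=0.0):
        kwargs = self._input_kwargs        # safe even if another thread builds another instance
        self.maxIter = kwargs.get("maxIter", maxIter)
        self.regParam = kwargs.get("regParam", regParam)
```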
spark git commit: [SPARK-18939][SQL] Timezone support in partition values.
Repository: spark Updated Branches: refs/heads/master ba186a841 -> 2a7921a81 [SPARK-18939][SQL] Timezone support in partition values. ## What changes were proposed in this pull request? This is a follow-up pr of #16308 and #16750. This pr enables timezone support in partition values. We should use `timeZone` option introduced at #16750 to parse/format partition values of the `TimestampType`. For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT` which will be used for partition values, the values written by the default timezone option, which is `"GMT"` because the session local timezone is `"GMT"` here, are: ```scala scala> spark.conf.set("spark.sql.session.timeZone", "GMT") scala> val df = Seq((1, new java.sql.Timestamp(145160640L))).toDF("i", "ts") df: org.apache.spark.sql.DataFrame = [i: int, ts: timestamp] scala> df.show() +---+---+ | i| ts| +---+---+ | 1|2016-01-01 00:00:00| +---+---+ scala> df.write.partitionBy("ts").save("/path/to/gmtpartition") ``` ```sh $ ls /path/to/gmtpartition/ _SUCCESSts=2016-01-01 00%3A00%3A00 ``` whereas setting the option to `"PST"`, they are: ```scala scala> df.write.option("timeZone", "PST").partitionBy("ts").save("/path/to/pstpartition") ``` ```sh $ ls /path/to/pstpartition/ _SUCCESSts=2015-12-31 16%3A00%3A00 ``` We can properly read the partition values if the session local timezone and the timezone of the partition values are the same: ```scala scala> spark.read.load("/path/to/gmtpartition").show() +---+---+ | i| ts| +---+---+ | 1|2016-01-01 00:00:00| +---+---+ ``` And even if the timezones are different, we can properly read the values with setting corrent timezone option: ```scala // wrong result scala> spark.read.load("/path/to/pstpartition").show() +---+---+ | i| ts| +---+---+ | 1|2015-12-31 16:00:00| +---+---+ // correct result scala> spark.read.option("timeZone", "PST").load("/path/to/pstpartition").show() +---+---+ | i| ts| +---+---+ | 1|2016-01-01 00:00:00| +---+---+ ``` ## How was this patch tested? Existing tests and added some tests. Author: Takuya UESHIN Closes #17053 from ueshin/issues/SPARK-18939. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2a7921a8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2a7921a8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2a7921a8 Branch: refs/heads/master Commit: 2a7921a813ecd847fd933ffef10edc64684e9df7 Parents: ba186a8 Author: Takuya UESHIN Authored: Fri Mar 3 16:35:54 2017 -0800 Committer: Wenchen Fan Committed: Fri Mar 3 16:35:54 2017 -0800 -- .../sql/catalyst/catalog/ExternalCatalog.scala | 4 +- .../sql/catalyst/catalog/InMemoryCatalog.scala | 3 +- .../sql/catalyst/catalog/SessionCatalog.scala | 2 +- .../spark/sql/catalyst/catalog/interface.scala | 10 ++-- .../execution/OptimizeMetadataOnlyQuery.scala | 10 ++-- .../datasources/CatalogFileIndex.scala | 3 +- .../datasources/FileFormatWriter.scala | 18 +++--- .../PartitioningAwareFileIndex.scala| 16 +++-- .../datasources/PartitioningUtils.scala | 42 + .../execution/datasources/csv/CSVSuite.scala| 15 +++-- .../ParquetPartitionDiscoverySuite.scala| 62 +++- .../sql/sources/PartitionedWriteSuite.scala | 35 +++ .../spark/sql/hive/HiveExternalCatalog.scala| 9 ++- .../sql/hive/execution/HiveTableScanExec.scala | 3 +- .../sql/hive/HiveExternalCatalogSuite.scala | 2 +- 15 files changed, 175 insertions(+), 59 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2a7921a8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala index a3a4ab3..31eded4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala @@ -244,11 +244,13 @@ abstract class ExternalCatalog { * @param db database name * @param table table name * @param predicates partition-pruning predicates + * @param defaultTimeZoneId default timezone id to parse partition values of TimestampType */ def listPartitionsByFilter( db: String, tabl
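The difference between the two partition directory names in the commit message above comes down to formatting the same instant with two different time zones. The snippet below reproduces that effect with plain Java time APIs, outside Spark; the epoch value `1451606400000L` is an assumed stand-in for 2016-01-01 00:00:00 GMT.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

val ts = new java.sql.Timestamp(1451606400000L)   // assumed: 2016-01-01 00:00:00 in GMT

def render(tz: String): String = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  fmt.setTimeZone(TimeZone.getTimeZone(tz))
  fmt.format(ts)
}

println(render("GMT"))  // 2016-01-01 00:00:00  -> directory "ts=2016-01-01 00%3A00%3A00"
println(render("PST"))  // 2015-12-31 16:00:00  -> directory "ts=2015-12-31 16%3A00%3A00"
```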
spark git commit: [MINOR][DOC] Fix doc for web UI https configuration
Repository: spark Updated Branches: refs/heads/master 9314c0837 -> ba186a841 [MINOR][DOC] Fix doc for web UI https configuration ## What changes were proposed in this pull request? Doc about enabling web UI https is not correct, "spark.ui.https.enabled" is not existed, actually enabling SSL is enough for https. ## How was this patch tested? N/A Author: jerryshao Closes #17147 from jerryshao/fix-doc-ssl. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ba186a84 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ba186a84 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ba186a84 Branch: refs/heads/master Commit: ba186a841fcfcd73a1530ca2418cc08bb0df92e1 Parents: 9314c08 Author: jerryshao Authored: Fri Mar 3 14:23:31 2017 -0800 Committer: Marcelo Vanzin Committed: Fri Mar 3 14:23:31 2017 -0800 -- docs/security.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ba186a84/docs/security.md -- diff --git a/docs/security.md b/docs/security.md index a479676..9eda428 100644 --- a/docs/security.md +++ b/docs/security.md @@ -12,7 +12,7 @@ Spark currently supports authentication via a shared secret. Authentication can ## Web UI The Spark UI can be secured by using [javax servlet filters](http://docs.oracle.com/javaee/6/api/javax/servlet/Filter.html) via the `spark.ui.filters` setting -and by using [https/SSL](http://en.wikipedia.org/wiki/HTTPS) via the `spark.ui.https.enabled` setting. +and by using [https/SSL](http://en.wikipedia.org/wiki/HTTPS) via [SSL settings](security.html#ssl-configuration). ### Authentication - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-19774] StreamExecution should call stop() on sources when a stream fails
Repository: spark Updated Branches: refs/heads/branch-2.1 accbed7c2 -> da04d45c2 [SPARK-19774] StreamExecution should call stop() on sources when a stream fails ## What changes were proposed in this pull request? We call stop() on a Structured Streaming Source only when the stream is shutdown when a user calls streamingQuery.stop(). We should actually stop all sources when the stream fails as well, otherwise we may leak resources, e.g. connections to Kafka. ## How was this patch tested? Unit tests in `StreamingQuerySuite`. Author: Burak Yavuz Closes #17107 from brkyvz/close-source. (cherry picked from commit 9314c08377cc8da88f4e31d1a9d41376e96a81b3) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/da04d45c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/da04d45c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/da04d45c Branch: refs/heads/branch-2.1 Commit: da04d45c2c3c9830c57cee90be78cf2093d0 Parents: accbed7 Author: Burak Yavuz Authored: Fri Mar 3 10:35:15 2017 -0800 Committer: Shixiong Zhu Committed: Fri Mar 3 10:35:24 2017 -0800 -- .../execution/streaming/StreamExecution.scala | 14 +++- .../sql/streaming/StreamingQuerySuite.scala | 75 +- .../sql/streaming/util/MockSourceProvider.scala | 83 3 files changed, 169 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/da04d45c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala index 93face4..dd80a28 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala @@ -322,6 +322,7 @@ class StreamExecution( initializationLatch.countDown() try { +stopSources() state.set(TERMINATED) currentStatus = status.copy(isTriggerActive = false, isDataAvailable = false) @@ -559,6 +560,18 @@ class StreamExecution( sparkSession.streams.postListenerEvent(event) } + /** Stops all streaming sources safely. */ + private def stopSources(): Unit = { +uniqueSources.foreach { source => + try { +source.stop() + } catch { +case NonFatal(e) => + logWarning(s"Failed to stop streaming source: $source. Resources may have leaked.", e) + } +} + } + /** * Signals to the thread executing micro-batches that it should stop running after the next * batch. This method blocks until the thread stops running. 
@@ -571,7 +584,6 @@ class StreamExecution( microBatchThread.interrupt() microBatchThread.join() } -uniqueSources.foreach(_.stop()) logInfo(s"Query $prettyIdString was stopped") } http://git-wip-us.apache.org/repos/asf/spark/blob/da04d45c/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala index 1525ad5..a0a2b2b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala @@ -20,10 +20,12 @@ package org.apache.spark.sql.streaming import java.util.concurrent.CountDownLatch import org.apache.commons.lang3.RandomStringUtils +import org.mockito.Mockito._ import org.scalactic.TolerantNumerics import org.scalatest.concurrent.Eventually._ import org.scalatest.BeforeAndAfter import org.scalatest.concurrent.PatienceConfiguration.Timeout +import org.scalatest.mock.MockitoSugar import org.apache.spark.internal.Logging import org.apache.spark.sql.{DataFrame, Dataset} @@ -32,11 +34,11 @@ import org.apache.spark.SparkException import org.apache.spark.sql.execution.streaming._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.internal.SQLConf -import org.apache.spark.sql.streaming.util.BlockingSource +import org.apache.spark.sql.streaming.util.{BlockingSource, MockSourceProvider} import org.apache.spark.util.ManualClock -class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging { +class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging with MockitoSugar { import AwaitTerminationTester._ import testImplicits._ @@
spark git commit: [SPARK-19774] StreamExecution should call stop() on sources when a stream fails
Repository: spark Updated Branches: refs/heads/master 37a1c0e46 -> 9314c0837 [SPARK-19774] StreamExecution should call stop() on sources when a stream fails ## What changes were proposed in this pull request? We call stop() on a Structured Streaming Source only when the stream is shutdown when a user calls streamingQuery.stop(). We should actually stop all sources when the stream fails as well, otherwise we may leak resources, e.g. connections to Kafka. ## How was this patch tested? Unit tests in `StreamingQuerySuite`. Author: Burak Yavuz Closes #17107 from brkyvz/close-source. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9314c083 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9314c083 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9314c083 Branch: refs/heads/master Commit: 9314c08377cc8da88f4e31d1a9d41376e96a81b3 Parents: 37a1c0e Author: Burak Yavuz Authored: Fri Mar 3 10:35:15 2017 -0800 Committer: Shixiong Zhu Committed: Fri Mar 3 10:35:15 2017 -0800 -- .../execution/streaming/StreamExecution.scala | 14 +++- .../sql/streaming/StreamingQuerySuite.scala | 75 +- .../sql/streaming/util/MockSourceProvider.scala | 83 3 files changed, 169 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9314c083/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala index 4bd6431..6e77f35 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala @@ -321,6 +321,7 @@ class StreamExecution( initializationLatch.countDown() try { +stopSources() state.set(TERMINATED) currentStatus = status.copy(isTriggerActive = false, isDataAvailable = false) @@ -558,6 +559,18 @@ class StreamExecution( sparkSession.streams.postListenerEvent(event) } + /** Stops all streaming sources safely. */ + private def stopSources(): Unit = { +uniqueSources.foreach { source => + try { +source.stop() + } catch { +case NonFatal(e) => + logWarning(s"Failed to stop streaming source: $source. Resources may have leaked.", e) + } +} + } + /** * Signals to the thread executing micro-batches that it should stop running after the next * batch. This method blocks until the thread stops running. 
@@ -570,7 +583,6 @@ class StreamExecution( microBatchThread.interrupt() microBatchThread.join() } -uniqueSources.foreach(_.stop()) logInfo(s"Query $prettyIdString was stopped") } http://git-wip-us.apache.org/repos/asf/spark/blob/9314c083/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala index 1525ad5..a0a2b2b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala @@ -20,10 +20,12 @@ package org.apache.spark.sql.streaming import java.util.concurrent.CountDownLatch import org.apache.commons.lang3.RandomStringUtils +import org.mockito.Mockito._ import org.scalactic.TolerantNumerics import org.scalatest.concurrent.Eventually._ import org.scalatest.BeforeAndAfter import org.scalatest.concurrent.PatienceConfiguration.Timeout +import org.scalatest.mock.MockitoSugar import org.apache.spark.internal.Logging import org.apache.spark.sql.{DataFrame, Dataset} @@ -32,11 +34,11 @@ import org.apache.spark.SparkException import org.apache.spark.sql.execution.streaming._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.internal.SQLConf -import org.apache.spark.sql.streaming.util.BlockingSource +import org.apache.spark.sql.streaming.util.{BlockingSource, MockSourceProvider} import org.apache.spark.util.ManualClock -class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging { +class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging with MockitoSugar { import AwaitTerminationTester._ import testImplicits._ @@ -481,6 +483,75 @@ class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging { }
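The new `stopSources()` method in the two commits above is an instance of a general "best-effort cleanup" pattern: try to stop every resource, and log rather than rethrow failures so that one bad source cannot prevent the others from being stopped. A minimal standalone sketch of that pattern, with illustrative names only:

```scala
import scala.util.control.NonFatal

def stopAll[T](resources: Seq[T])(stop: T => Unit)(onFailure: (T, Throwable) => Unit): Unit =
  resources.foreach { r =>
    try stop(r)
    catch { case NonFatal(e) => onFailure(r, e) }  // keep going; don't let one failure leak the rest
  }

// Hypothetical usage mirroring stopSources():
// stopAll(uniqueSources)(_.stop()) { (s, e) => logWarning(s"Failed to stop streaming source: $s", e) }
```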
spark git commit: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query results
Repository: spark Updated Branches: refs/heads/master 98bcc188f -> 37a1c0e46 [SPARK-19710][SQL][TESTS] Fix ordering of rows in query results ## What changes were proposed in this pull request? Changes to SQLQueryTests to make the order of the results constant. Where possible ORDER BY has been added to match the existing expected output ## How was this patch tested? Test runs on x86, zLinux (big endian), ppc (big endian) Author: Pete Robbins Closes #17039 from robbinspg/SPARK-19710. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/37a1c0e4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/37a1c0e4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/37a1c0e4 Branch: refs/heads/master Commit: 37a1c0e461737d4a4bbb03d397b651ec5ba00e96 Parents: 98bcc18 Author: Pete Robbins Authored: Fri Mar 3 07:53:46 2017 -0800 Committer: Herman van Hovell Committed: Fri Mar 3 07:53:46 2017 -0800 -- .../sql-tests/inputs/subquery/in-subquery/in-joins.sql| 2 +- .../inputs/subquery/in-subquery/in-set-operations.sql | 6 +++--- .../inputs/subquery/in-subquery/not-in-joins.sql | 2 +- .../results/subquery/in-subquery/in-joins.sql.out | 2 +- .../subquery/in-subquery/in-set-operations.sql.out| 10 +- .../results/subquery/in-subquery/not-in-joins.sql.out | 4 ++-- 6 files changed, 13 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/37a1c0e4/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-joins.sql -- diff --git a/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-joins.sql b/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-joins.sql index b10c419..880175f 100644 --- a/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-joins.sql +++ b/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-joins.sql @@ -79,7 +79,7 @@ GROUP BY t1a, t3a, t3b, t3c -ORDER BY t1a DESC; +ORDER BY t1a DESC, t3b DESC; -- TC 01.03 SELECT Count(DISTINCT(t1a)) http://git-wip-us.apache.org/repos/asf/spark/blob/37a1c0e4/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-set-operations.sql -- diff --git a/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-set-operations.sql b/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-set-operations.sql index 6b9e8bf..5c371d2 100644 --- a/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-set-operations.sql +++ b/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-set-operations.sql @@ -287,7 +287,7 @@ WHERE t1a IN (SELECT t3a WHERE t1b > 6) AS t5) GROUP BY t1a, t1b, t1c, t1d HAVING t1c IS NOT NULL AND t1b IS NOT NULL -ORDER BY t1c DESC; +ORDER BY t1c DESC, t1a DESC; -- TC 01.08 SELECT t1a, @@ -351,7 +351,7 @@ WHERE t1b IN FROM t1 WHERE t1b > 6) AS t4 WHERE t2b = t1b) -ORDER BY t1c DESC NULLS last; +ORDER BY t1c DESC NULLS last, t1a DESC; -- TC 01.11 SELECT * @@ -468,5 +468,5 @@ HAVING t1b NOT IN EXCEPT SELECT t3b FROM t3) -ORDER BY t1c DESC NULLS LAST; +ORDER BY t1c DESC NULLS LAST, t1i; http://git-wip-us.apache.org/repos/asf/spark/blob/37a1c0e4/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-joins.sql -- diff --git a/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-joins.sql b/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-joins.sql index 505366b..e09b91f 100644 --- a/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-joins.sql 
+++ b/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-joins.sql @@ -85,7 +85,7 @@ AND t1b != t3b AND t1d = t2d GROUP BYt1a, t1b, t1c, t3a, t3b, t3c HAVING count(distinct(t3a)) >= 1 -ORDER BYt1a; +ORDER BYt1a, t3b; -- TC 01.03 SELECT t1a, http://git-wip-us.apache.org/repos/asf/spark/blob/37a1c0e4/sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-joins.sql.out -- diff --git a/sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-joins.sql.out b/sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-joins.sql.out index 7258bcf..ab6a11a 100644 --- a/sql/core/src/test/resources/sql-t
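The reasoning behind the change above is that a SQL result set has no defined row order unless the query sorts it, so golden-file tests that compare rows positionally can pass on one platform and fail on another (for example on big-endian machines). A tiny illustration, assuming a `SparkSession` named `spark`:

```scala
// Without an ORDER BY, the order in which grouped rows come back is not guaranteed
// and may differ across partitionings, JVMs, or architectures.
val counts = spark.range(0, 100)
  .selectExpr("id % 3 AS k")
  .groupBy("k")
  .count()

counts.collect()              // row order unspecified: brittle for expected-output files
counts.orderBy("k").collect() // deterministic order: safe to compare line by line
```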
spark git commit: [SPARK-19758][SQL] Resolving timezone aware expressions with time zone when resolving inline table
Repository: spark Updated Branches: refs/heads/master 776fac398 -> 98bcc188f [SPARK-19758][SQL] Resolving timezone aware expressions with time zone when resolving inline table ## What changes were proposed in this pull request? When we resolve inline tables in analyzer, we will evaluate the expressions of inline tables. When it evaluates a `TimeZoneAwareExpression` expression, an error will happen because the `TimeZoneAwareExpression` is not associated with timezone yet. So we need to resolve these `TimeZoneAwareExpression`s with time zone when resolving inline tables. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh Closes #17114 from viirya/resolve-timeawareexpr-inline-table. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/98bcc188 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/98bcc188 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/98bcc188 Branch: refs/heads/master Commit: 98bcc188f98e44c1675d8b3a28f44f4f900abc43 Parents: 776fac3 Author: Liang-Chi Hsieh Authored: Fri Mar 3 07:14:37 2017 -0800 Committer: Herman van Hovell Committed: Fri Mar 3 07:14:37 2017 -0800 -- .../spark/sql/catalyst/analysis/Analyzer.scala | 2 +- .../catalyst/analysis/ResolveInlineTables.scala | 16 +--- .../analysis/ResolveInlineTablesSuite.scala | 40 .../resources/sql-tests/inputs/inline-table.sql | 3 ++ .../sql-tests/results/inline-table.sql.out | 10 - 5 files changed, 48 insertions(+), 23 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/98bcc188/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index c477cb4..6d569b6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -146,7 +146,7 @@ class Analyzer( GlobalAggregates :: ResolveAggregateFunctions :: TimeWindowing :: - ResolveInlineTables :: + ResolveInlineTables(conf) :: TypeCoercion.typeCoercionRules ++ extendedResolutionRules : _*), Batch("Post-Hoc Resolution", Once, postHocResolutionRules: _*), http://git-wip-us.apache.org/repos/asf/spark/blob/98bcc188/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala index 7323197..d5b3ea8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala @@ -19,8 +19,8 @@ package org.apache.spark.sql.catalyst.analysis import scala.util.control.NonFatal -import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.expressions.Cast +import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow} +import org.apache.spark.sql.catalyst.expressions.{Cast, TimeZoneAwareExpression} import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan} import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.types.{StructField, StructType} @@ -28,7 
+28,7 @@ import org.apache.spark.sql.types.{StructField, StructType} /** * An analyzer rule that replaces [[UnresolvedInlineTable]] with [[LocalRelation]]. */ -object ResolveInlineTables extends Rule[LogicalPlan] { +case class ResolveInlineTables(conf: CatalystConf) extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { case table: UnresolvedInlineTable if table.expressionsResolved => validateInputDimension(table) @@ -95,11 +95,15 @@ object ResolveInlineTables extends Rule[LogicalPlan] { InternalRow.fromSeq(row.zipWithIndex.map { case (e, ci) => val targetType = fields(ci).dataType try { - if (e.dataType.sameType(targetType)) { -e.eval() + val castedExpr = if (e.dataType.sameType(targetType)) { +e } else { -Cast(e, targetType).eval() +Cast(e, targetType) } + castedExpr.transform { +case e: TimeZoneAwareExpression if e.ti
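A concrete way to see why this rule needs a time zone: evaluating an inline table whose values require a cast to `TimestampType` involves a `Cast`, which is a `TimeZoneAwareExpression`. The query below is a hypothetical example of that situation (it assumes a `SparkSession` named `spark` and is not the exact test added by this PR).

```scala
// The string literal must be cast to TIMESTAMP while the inline table is being
// resolved and eagerly evaluated, so the Cast needs a session time zone attached.
spark.sql(
  "SELECT * FROM VALUES (1, CAST('2017-03-03 00:00:00' AS TIMESTAMP)) AS t(id, ts)"
).show()
```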
spark-website git commit: Update committer list

Repository: spark-website Updated Branches: refs/heads/asf-site 470b7ed51 -> c1b9ad3cb Update commiter list Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/c1b9ad3c Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/c1b9ad3c Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/c1b9ad3c Branch: refs/heads/asf-site Commit: c1b9ad3cbe413b10f872c6a3363f1028c31b1a16 Parents: 470b7ed Author: Holden Karau Authored: Wed Mar 1 22:15:10 2017 -0800 Committer: Sean Owen Committed: Fri Mar 3 12:31:03 2017 +0100 -- committers.md| 4 site/committers.html | 15 ++- 2 files changed, 18 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/c1b9ad3c/committers.md -- diff --git a/committers.md b/committers.md index 03defa6..a97bb72 100644 --- a/committers.md +++ b/committers.md @@ -28,6 +28,7 @@ navigation: |Herman van Hovell|QuestTec B.V.| |Yin Huai|Databricks| |Shane Huang|Intel| +|Holden Karau|IBM| |Andy Konwinski|Databricks| |Ryan LeCompte|Quantifind| |Haoyuan Li|Alluxio, UC Berkeley| @@ -50,11 +51,13 @@ navigation: |Prashant Sharma|IBM| |Ram Sriharsha|Databricks| |DB Tsai|Netflix| +|Takuya Ueshin|| |Marcelo Vanzin|Cloudera| |Shivaram Venkataraman|UC Berkeley| |Patrick Wendell|Databricks| |Andrew Xia|Alibaba| |Reynold Xin|Databricks| +|Burak Yavuz|Databricks| |Matei Zaharia|Databricks, Stanford| |Shixiong Zhu|Databricks| @@ -117,6 +120,7 @@ You can verify the result is one change with `git log`. Then resume the script i Also, please remember to set Assignee on JIRAs where applicable when they are resolved. The script can't do this automatically. +Once a PR is merged please leave a comment on the PR stating which branch(es) it has been merged with.
spark git commit: [SPARK-19801][BUILD] Remove JDK7 from Travis CI
Repository: spark Updated Branches: refs/heads/master 0bac3e4cd -> 776fac398 [SPARK-19801][BUILD] Remove JDK7 from Travis CI ## What changes were proposed in this pull request? Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR verification (JDK7/JDK8 maven compilation and Java Linter) and contributors can see the additional result via their Travis CI dashboard (or PC). This PR aims to make `.travis.yml` up-to-date by removing JDK7 which was removed via SPARK-19550. ## How was this patch tested? See the result via Travis CI. - https://travis-ci.org/dongjoon-hyun/spark/builds/207111713 Author: Dongjoon Hyun Closes #17143 from dongjoon-hyun/SPARK-19801. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/776fac39 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/776fac39 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/776fac39 Branch: refs/heads/master Commit: 776fac3988271a1e4128cb31f21e5f7f3b7bcf0e Parents: 0bac3e4 Author: Dongjoon Hyun Authored: Fri Mar 3 12:00:54 2017 +0100 Committer: Sean Owen Committed: Fri Mar 3 12:00:54 2017 +0100 -- .travis.yml | 1 - 1 file changed, 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/776fac39/.travis.yml -- diff --git a/.travis.yml b/.travis.yml index d94872d..d7e9f8c 100644 --- a/.travis.yml +++ b/.travis.yml @@ -28,7 +28,6 @@ dist: trusty # 2. Choose language and target JDKs for parallel builds. language: java jdk: - - oraclejdk7 - oraclejdk8 # 3. Setup cache directory for SBT and Maven. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-19797][DOC] ML pipeline document correction
Repository: spark Updated Branches: refs/heads/master fa50143cd -> 0bac3e4cd [SPARK-19797][DOC] ML pipeline document correction ## What changes were proposed in this pull request? Description about pipeline in this paragraph is incorrect https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works > If the Pipeline had more **stages**, it would call the > LogisticRegressionModelâs transform() method on the DataFrame before > passing the DataFrame to the next stage. Reason: Transformer could also be a stage. But only another Estimator will invoke an transform call and pass the data to next stage. The description in the document misleads ML pipeline users. ## How was this patch tested? This is a tiny modification of **docs/ml-pipelines.md**. I jekyll build the modification and check the compiled document. Author: Zhe Sun Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0bac3e4c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0bac3e4c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0bac3e4c Branch: refs/heads/master Commit: 0bac3e4cde75678beac02e67b8873fe779e9ad34 Parents: fa50143 Author: Zhe Sun Authored: Fri Mar 3 11:55:57 2017 +0100 Committer: Sean Owen Committed: Fri Mar 3 11:55:57 2017 +0100 -- docs/ml-pipeline.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0bac3e4c/docs/ml-pipeline.md -- diff --git a/docs/ml-pipeline.md b/docs/ml-pipeline.md index 7cbb146..aa92c0a 100644 --- a/docs/ml-pipeline.md +++ b/docs/ml-pipeline.md @@ -132,7 +132,7 @@ The `Pipeline.fit()` method is called on the original `DataFrame`, which has raw The `Tokenizer.transform()` method splits the raw text documents into words, adding a new column with words to the `DataFrame`. The `HashingTF.transform()` method converts the words column into feature vectors, adding a new column with those vectors to the `DataFrame`. Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls `LogisticRegression.fit()` to produce a `LogisticRegressionModel`. -If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s `transform()` +If the `Pipeline` had more `Estimator`s, it would call the `LogisticRegressionModel`'s `transform()` method on the `DataFrame` before passing the `DataFrame` to the next stage. A `Pipeline` is an `Estimator`. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-19797][DOC] ML pipeline document correction
Repository: spark Updated Branches: refs/heads/branch-2.1 1237aaea2 -> accbed7c2 [SPARK-19797][DOC] ML pipeline document correction ## What changes were proposed in this pull request? Description about pipeline in this paragraph is incorrect https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works > If the Pipeline had more **stages**, it would call the > LogisticRegressionModelâs transform() method on the DataFrame before > passing the DataFrame to the next stage. Reason: Transformer could also be a stage. But only another Estimator will invoke an transform call and pass the data to next stage. The description in the document misleads ML pipeline users. ## How was this patch tested? This is a tiny modification of **docs/ml-pipelines.md**. I jekyll build the modification and check the compiled document. Author: Zhe Sun Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction. (cherry picked from commit 0bac3e4cde75678beac02e67b8873fe779e9ad34) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/accbed7c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/accbed7c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/accbed7c Branch: refs/heads/branch-2.1 Commit: accbed7c2cfbe46fa6f55e97241b617c6ad4431f Parents: 1237aae Author: Zhe Sun Authored: Fri Mar 3 11:55:57 2017 +0100 Committer: Sean Owen Committed: Fri Mar 3 11:56:07 2017 +0100 -- docs/ml-pipeline.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/accbed7c/docs/ml-pipeline.md -- diff --git a/docs/ml-pipeline.md b/docs/ml-pipeline.md index 7cbb146..aa92c0a 100644 --- a/docs/ml-pipeline.md +++ b/docs/ml-pipeline.md @@ -132,7 +132,7 @@ The `Pipeline.fit()` method is called on the original `DataFrame`, which has raw The `Tokenizer.transform()` method splits the raw text documents into words, adding a new column with words to the `DataFrame`. The `HashingTF.transform()` method converts the words column into feature vectors, adding a new column with those vectors to the `DataFrame`. Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls `LogisticRegression.fit()` to produce a `LogisticRegressionModel`. -If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s `transform()` +If the `Pipeline` had more `Estimator`s, it would call the `LogisticRegressionModel`'s `transform()` method on the `DataFrame` before passing the `DataFrame` to the next stage. A `Pipeline` is an `Estimator`. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
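To make the corrected sentence concrete, here is a brief sketch of the pipeline the documentation page describes: `Tokenizer` and `HashingTF` are `Transformer`s, while `LogisticRegression` is the `Estimator` whose `fit()` the pipeline calls. It assumes a `training` DataFrame with `text` and `label` columns.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Pipeline.fit() runs tokenizer.transform() and hashingTF.transform(), then calls
// lr.fit() on the result; the fitted LogisticRegressionModel's transform() would only
// be invoked during fitting if a later Estimator stage needed the transformed data.
val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
```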
spark git commit: [SPARK-19739][CORE] propagate S3 session token to cluster
Repository: spark Updated Branches: refs/heads/master d556b3170 -> fa50143cd [SPARK-19739][CORE] propagate S3 session token to cluser ## What changes were proposed in this pull request? propagate S3 session token to cluser ## How was this patch tested? existing ut Author: uncleGen Closes #17080 from uncleGen/SPARK-19739. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fa50143c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fa50143c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fa50143c Branch: refs/heads/master Commit: fa50143cd33586f4658892f434c9f6c23346e1bf Parents: d556b31 Author: uncleGen Authored: Fri Mar 3 11:49:00 2017 +0100 Committer: Sean Owen Committed: Fri Mar 3 11:49:00 2017 +0100 -- .../org/apache/spark/deploy/SparkHadoopUtil.scala | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fa50143c/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala index 941e2d1..f475ce8 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala @@ -82,17 +82,20 @@ class SparkHadoopUtil extends Logging { // the behavior of the old implementation of this code, for backwards compatibility. if (conf != null) { // Explicitly check for S3 environment variables - if (System.getenv("AWS_ACCESS_KEY_ID") != null && - System.getenv("AWS_SECRET_ACCESS_KEY") != null) { -val keyId = System.getenv("AWS_ACCESS_KEY_ID") -val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY") - + val keyId = System.getenv("AWS_ACCESS_KEY_ID") + val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY") + if (keyId != null && accessKey != null) { hadoopConf.set("fs.s3.awsAccessKeyId", keyId) hadoopConf.set("fs.s3n.awsAccessKeyId", keyId) hadoopConf.set("fs.s3a.access.key", keyId) hadoopConf.set("fs.s3.awsSecretAccessKey", accessKey) hadoopConf.set("fs.s3n.awsSecretAccessKey", accessKey) hadoopConf.set("fs.s3a.secret.key", accessKey) + +val sessionToken = System.getenv("AWS_SESSION_TOKEN") +if (sessionToken != null) { + hadoopConf.set("fs.s3a.session.token", sessionToken) +} } // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar" conf.getAll.foreach { case (key, value) => - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor cleanup
Repository: spark Updated Branches: refs/heads/master 982f3223b -> d556b3170 [SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor cleanup ## What changes were proposed in this pull request? This PR suggests adding some comments in `UnivocityParser` logics to explain what happens. Also, it proposes, IMHO, a little bit cleaner (at least easy for me to explain). ## How was this patch tested? Unit tests in `CSVSuite`. Author: hyukjinkwon Closes #17142 from HyukjinKwon/SPARK-18699. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d556b317 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d556b317 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d556b317 Branch: refs/heads/master Commit: d556b317038455dc25e193f3add723fccdc54958 Parents: 982f322 Author: hyukjinkwon Authored: Fri Mar 3 00:50:58 2017 -0800 Committer: Wenchen Fan Committed: Fri Mar 3 00:50:58 2017 -0800 -- .../datasources/csv/UnivocityParser.scala | 97 ++-- 1 file changed, 68 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d556b317/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala index 804031a..3b3b87e 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala @@ -54,39 +54,77 @@ private[csv] class UnivocityParser( private val dataSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord)) - private val valueConverters = -dataSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray - private val tokenizer = new CsvParser(options.asParserSettings) private var numMalformedRecords = 0 private val row = new GenericInternalRow(requiredSchema.length) - // This gets the raw input that is parsed lately. + // In `PERMISSIVE` parse mode, we should be able to put the raw malformed row into the field + // specified in `columnNameOfCorruptRecord`. The raw input is retrieved by this method. private def getCurrentInput(): String = tokenizer.getContext.currentParsedContent().stripLineEnd - // This parser loads an `indexArr._1`-th position value in input tokens, - // then put the value in `row(indexArr._2)`. - private val indexArr: Array[(Int, Int)] = { -val fields = if (options.dropMalformed) { - // If `dropMalformed` is enabled, then it needs to parse all the values - // so that we can decide which row is malformed. - requiredSchema ++ schema.filterNot(requiredSchema.contains(_)) -} else { - requiredSchema -} -// TODO: Revisit this; we need to clean up code here for readability. -// See an URL below for related discussions: -// https://github.com/apache/spark/pull/16928#discussion_r102636720 -val fieldsWithIndexes = fields.zipWithIndex -corruptFieldIndex.map { case corrFieldIndex => - fieldsWithIndexes.filter { case (_, i) => i != corrFieldIndex } -}.getOrElse { - fieldsWithIndexes -}.map { case (f, i) => - (dataSchema.indexOf(f), i) -}.toArray + // This parser loads an `tokenIndexArr`-th position value in input tokens, + // then put the value in `row(rowIndexArr)`. 
+ // + // For example, let's say there is CSV data as below: + // + // a,b,c + // 1,2,A + // + // Also, let's say `columnNameOfCorruptRecord` is set to "_unparsed", `header` is `true` + // by user and the user selects "c", "b", "_unparsed" and "a" fields. In this case, we need + // to map those values below: + // + // required schema - ["c", "b", "_unparsed", "a"] + // CSV data schema - ["a", "b", "c"] + // required CSV data schema - ["c", "b", "a"] + // + // with the input tokens, + // + // input tokens - [1, 2, "A"] + // + // Each input token is placed in each output row's position by mapping these. In this case, + // + // output row - ["A", 2, null, 1] + // + // In more details, + // - `valueConverters`, input tokens - CSV data schema + // `valueConverters` keeps the positions of input token indices (by its index) to each + // value's converter (by its value) in an order of CSV data schema. In this case, + // [string->int, string->int, string->string]. + // + // - `tokenIndexArr`, input tokens - required CSV data schema + // `tokenIndexArr` keeps the positions of input token indices (by its index) to reordered