[GitHub] felixcheung commented on issue #175: CVE-2018-11760
felixcheung commented on issue #175: CVE-2018-11760 URL: https://github.com/apache/spark-website/pull/175#issuecomment-458441893 sometimes it's the Latest News.
[spark] branch master updated: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 16990f9 [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0 16990f9 is described below commit 16990f929921b3f784a85f3afbe1a22fbe77d895 Author: Bryan Cutler AuthorDate: Tue Jan 29 14:18:45 2019 +0800 [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0 ## What changes were proposed in this pull request? Upgrade Apache Arrow to version 0.12.0. This includes the Java artifacts and fixes to enable usage with pyarrow 0.12.0 Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users: * Safe cast fails from numpy float64 array with nans to integer, ARROW-4258 * Java, Reduce heap usage for variable width vectors, ARROW-4147 * Binary identity cast not implemented, ARROW-4101 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 * conversion to date object no longer needed, ARROW-3910 * Error reading IPC file with no record batches, ARROW-3894 * Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790 * from_pandas gives incorrect results when converting floating point to bool, ARROW-3428 * Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048 * Java update to official Flatbuffers version 1.9.0, ARROW-3175 complete list [here](https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0) PySpark requires the following fixes to work with PyArrow 0.12.0 * Encrypted pyspark worker fails due to ChunkedStream missing closed property * pyarrow now converts dates as objects by default, which causes error because type is assumed datetime64 * ArrowTests fails due to difference in raised error message * pyarrow.open_stream deprecated * tests fail because groupby adds index column with duplicate name ## How was this patch tested? Ran unit tests with pyarrow versions 0.8.0, 0.10.0, 0.11.1, 0.12.0 Closes #23657 from BryanCutler/arrow-upgrade-012. 
Authored-by: Bryan Cutler Signed-off-by: Hyukjin Kwon --- dev/deps/spark-deps-hadoop-2.7 | 8 dev/deps/spark-deps-hadoop-3.1 | 8 pom.xml | 2 +- python/pyspark/serializers.py | 13 +++-- python/pyspark/sql/tests/test_arrow.py | 2 +- python/pyspark/sql/tests/test_pandas_udf_grouped_map.py | 10 +- python/pyspark/sql/types.py | 13 ++--- 7 files changed, 36 insertions(+), 20 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index 1af29fc..0154fd2 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -14,9 +14,9 @@ apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar arpack_combined_all-0.1.jar -arrow-format-0.10.0.jar -arrow-memory-0.10.0.jar -arrow-vector-0.10.0.jar +arrow-format-0.12.0.jar +arrow-memory-0.12.0.jar +arrow-vector-0.12.0.jar automaton-1.11-8.jar avro-1.8.2.jar avro-ipc-1.8.2.jar @@ -58,7 +58,7 @@ datanucleus-core-3.2.10.jar datanucleus-rdbms-3.2.9.jar derby-10.12.1.1.jar eigenbase-properties-1.1.5.jar -flatbuffers-1.2.0-3f79e055.jar +flatbuffers-java-1.9.0.jar generex-1.0.1.jar gson-2.2.4.jar guava-14.0.1.jar diff --git a/dev/deps/spark-deps-hadoop-3.1 b/dev/deps/spark-deps-hadoop-3.1 index 05f180b..7d5325c 100644 --- a/dev/deps/spark-deps-hadoop-3.1 +++ b/dev/deps/spark-deps-hadoop-3.1 @@ -12,9 +12,9 @@ aopalliance-1.0.jar aopalliance-repackaged-2.4.0-b34.jar apache-log4j-extras-1.2.17.jar arpack_combined_all-0.1.jar -arrow-format-0.10.0.jar -arrow-memory-0.10.0.jar -arrow-vector-0.10.0.jar +arrow-format-0.12.0.jar +arrow-memory-0.12.0.jar +arrow-vector-0.12.0.jar automaton-1.11-8.jar avro-1.8.2.jar avro-ipc-1.8.2.jar @@ -57,7 +57,7 @@ derby-10.12.1.1.jar dnsjava-2.1.7.jar ehcache-3.3.1.jar eigenbase-properties-1.1.5.jar -flatbuffers-1.2.0-3f79e055.jar +flatbuffers-java-1.9.0.jar generex-1.0.1.jar geronimo-jcache_1.0_spec-1.0-alpha-1.jar gson-2.2.4.jar diff --git a/pom.xml b/pom.xml index 29a281a..6676c5d 100644 --- a/pom.xml +++ b/pom.xml @@ -193,7 +193,7 @@ If you are changing Arrow version specification, please check ./python/pyspark/sql/utils.py, ./python/run-tests.py and ./python/setup.py too. --> -0.10.0 +0.12.0 ${java.home} diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py index 741dfb2..1d17053 100644 ---
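Below is a minimal PySpark sketch (not part of the commit) of the Arrow-backed conversion path that this pyarrow 0.12.0 upgrade affects. It assumes a local SparkSession with pandas and pyarrow >= 0.12.0 installed; the column names, values, and the config flag used are illustrative assumptions, not taken from the PR.

```python
from pyspark.sql import SparkSession

# Enable the Arrow-based path for toPandas()/createDataFrame(); this flag name
# is the Spark 2.x/3.0-era setting and is an assumption of this sketch.
spark = (SparkSession.builder
         .appName("arrow-0.12-check")
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

df = spark.createDataFrame([(1, 1.5), (2, float("nan"))], ["id", "value"])

# With the flag set, toPandas() converts through pyarrow; the commit message
# notes that pyarrow 0.12.0 returns dates as objects by default and changes
# NaN handling in safe casts (ARROW-4258), which is why PySpark needed fixes.
pdf = df.toPandas()
print(pdf.dtypes)

spark.stop()
```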
[GitHub] dongjoon-hyun commented on issue #177: Update versioning policy with 2.2.x
dongjoon-hyun commented on issue #177: Update versioning policy with 2.2.x URL: https://github.com/apache/spark-website/pull/177#issuecomment-458377606 Thanks, @maropu and @HyukjinKwon .
[GitHub] maropu commented on issue #177: Update versioning policy with 2.2.x
maropu commented on issue #177: Update versioning policy with 2.2.x URL: https://github.com/apache/spark-website/pull/177#issuecomment-458359112 Really cool! Thanks, @dongjoon-hyun!
[GitHub] zsxwing closed pull request #176: Add Jose Torres to committers list
zsxwing closed pull request #176: Add Jose Torres to committers list URL: https://github.com/apache/spark-website/pull/176
[spark-website] branch asf-site updated: Add Jose Torres to committers list
This is an automated email from the ASF dual-hosted git repository. zsxwing pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new fb1a7b4 Add Jose Torres to committers list fb1a7b4 is described below commit fb1a7b407e149e133e35bb506d48cfe034a4d351 Author: Jose Torres AuthorDate: Mon Jan 28 15:59:37 2019 -0800 Add Jose Torres to committers list Author: Jose Torres Closes #176 from jose-torres/addjose. --- committers.md| 1 + site/committers.html | 4 2 files changed, 5 insertions(+) diff --git a/committers.md b/committers.md index c3daf10..8049106 100644 --- a/committers.md +++ b/committers.md @@ -65,6 +65,7 @@ navigation: |Saisai Shao|Tencent| |Prashant Sharma|IBM| |Ram Sriharsha|Databricks| +|Jose Torres|Databricks| |DB Tsai|Apple| |Takuya Ueshin|Databricks| |Marcelo Vanzin|Cloudera| diff --git a/site/committers.html b/site/committers.html index ec5814b..3066b5d 100644 --- a/site/committers.html +++ b/site/committers.html @@ -431,6 +431,10 @@ Databricks + Jose Torres + Databricks + + DB Tsai Apple - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark-website] branch asf-site updated: Update versioning policy with 2.2.x (#177)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new 51aa3b7 Update versioning policy with 2.2.x (#177) 51aa3b7 is described below commit 51aa3b7e4ed52be7806326f8d55ddd061aa6b97e Author: Dongjoon Hyun AuthorDate: Mon Jan 28 15:23:24 2019 -0800 Update versioning policy with 2.2.x (#177) This PR updates EOL documents with release 2.2.x explicitly. --- site/versioning-policy.html | 4 ++-- versioning-policy.md| 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/site/versioning-policy.html b/site/versioning-policy.html index e2665f0..1ddf507 100644 --- a/site/versioning-policy.html +++ b/site/versioning-policy.html @@ -283,8 +283,8 @@ in between feature releases. Major releases do not happen according to a fixed s Maintenance Releases and EOL Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months. -For example, branch 2.1.x is no longer considered maintained as of July 2018, 18 months after the release -of 2.1.0 in December 2016. No more 2.1.x releases should be expected after that point, even for bug fixes. +For example, branch 2.2.x is no longer considered maintained as of January 2019, 18 months after the release +of 2.2.0 in July 2017. No more 2.2.x releases should be expected after that point, even for bug fixes. diff --git a/versioning-policy.md b/versioning-policy.md index de6f1e6..7c104a1 100644 --- a/versioning-policy.md +++ b/versioning-policy.md @@ -67,5 +67,5 @@ in between feature releases. Major releases do not happen according to a fixed s Maintenance Releases and EOL Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months. -For example, branch 2.1.x is no longer considered maintained as of July 2018, 18 months after the release -of 2.1.0 in December 2016. No more 2.1.x releases should be expected after that point, even for bug fixes. +For example, branch 2.2.x is no longer considered maintained as of January 2019, 18 months after the release +of 2.2.0 in July 2017. No more 2.2.x releases should be expected after that point, even for bug fixes. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] dongjoon-hyun merged pull request #177: Update versioning policy with 2.2.x
dongjoon-hyun merged pull request #177: Update versioning policy with 2.2.x URL: https://github.com/apache/spark-website/pull/177
[GitHub] dongjoon-hyun commented on issue #177: Update versioning policy with 2.2.x
dongjoon-hyun commented on issue #177: Update versioning policy with 2.2.x URL: https://github.com/apache/spark-website/pull/177#issuecomment-458342770 Thanks, @srowen !
[GitHub] dongjoon-hyun commented on issue #177: Update versioning policy with 2.2.x
dongjoon-hyun commented on issue #177: Update versioning policy with 2.2.x URL: https://github.com/apache/spark-website/pull/177#issuecomment-458340854 Hi, @maropu . This might be a better answer for the question you received. cc @srowen and @HyukjinKwon
[GitHub] dongjoon-hyun opened a new pull request #177: Update versioning policy with 2.2.x
dongjoon-hyun opened a new pull request #177: Update versioning policy with 2.2.x URL: https://github.com/apache/spark-website/pull/177 This PR updates EOL documents with release 2.2.x explicitly.
[GitHub] vanzin merged pull request #175: CVE-2018-11760
vanzin merged pull request #175: CVE-2018-11760 URL: https://github.com/apache/spark-website/pull/175
[spark-website] branch asf-site updated: CVE-2018-11760 (#175)
This is an automated email from the ASF dual-hosted git repository. vanzin pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new d73a92d CVE-2018-11760 (#175) d73a92d is described below commit d73a92d911b3f9fa6cfab0acfd3558a5609dce72 Author: Imran Rashid AuthorDate: Mon Jan 28 17:03:42 2019 -0600 CVE-2018-11760 (#175) --- security.md| 30 ++ site/security.html | 35 +++ 2 files changed, 65 insertions(+) diff --git a/security.md b/security.md index c75e47b..340622b 100644 --- a/security.md +++ b/security.md @@ -18,6 +18,36 @@ non-public list that will reach the Apache Security team, as well as the Spark P Known Security Issues +CVE-2018-11760: Apache Spark local privilege escalation vulnerability + +Severity: Important + +Vendor: The Apache Software Foundation + +Versions affected: + +- All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions +- Spark 2.2.0 to 2.2.2 +- Spark 2.3.0 to 2.3.1 + +Description: + +When using PySpark, it's possible for a different local user +to connect to the Spark application and impersonate the user running +the Spark application. This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1. + +Mitigation: + +- 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer +- 2.3.x users should upgrade to 2.3.2 or newer +- Otherwise, affected users should avoid using PySpark in +multi-user environments. + +Credit: + +- Luca Canali and Jose Carlos Luna Duran, CERN + + CVE-2018-17190: Unsecured Apache Spark standalone executes user code diff --git a/site/security.html b/site/security.html index 0eca9a3..cabfb47 100644 --- a/site/security.html +++ b/site/security.html @@ -211,6 +211,41 @@ non-public list that will reach the Apache Security team, as well as the Spark P Known Security Issues +CVE-2018-11760: Apache Spark local privilege escalation vulnerability + +Severity: Important + +Vendor: The Apache Software Foundation + +Versions affected: + + + All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions + Spark 2.2.0 to 2.2.2 + Spark 2.3.0 to 2.3.1 + + +Description: + +When using PySpark, its possible for a different local user +to connect to the Spark application and impersonate the user running +the Spark application. This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1. + +Mitigation: + + + 1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer + 2.3.x users should upgrade to 2.3.2 or newer + Otherwise, affected users should avoid using PySpark in +multi-user environments. + + +Credit: + + + Luca Canali and Jose Carlos Luna Duran, CERN + + CVE-2018-17190: Unsecured Apache Spark standalone executes user code Severity: Low - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] vanzin commented on issue #175: CVE-2018-11760
vanzin commented on issue #175: CVE-2018-11760 URL: https://github.com/apache/spark-website/pull/175#issuecomment-458337805 (Merge script seems broken, I'll just use the UI.)
[GitHub] vanzin commented on issue #175: CVE-2018-11760
vanzin commented on issue #175: CVE-2018-11760 URL: https://github.com/apache/spark-website/pull/175#issuecomment-458337464 Cool, I'll merge this then.
[GitHub] squito commented on issue #175: CVE-2018-11760
squito commented on issue #175: CVE-2018-11760 URL: https://github.com/apache/spark-website/pull/175#issuecomment-458327164 so it turns out I just had an old version of jekyll installed. After upgrading, those other changes disappear.
[spark] branch master updated: [SPARK-26747][SQL] Makes GetMapValue nullability more precise
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 92706e6 [SPARK-26747][SQL] Makes GetMapValue nullability more precise 92706e6 is described below commit 92706e657629b36c9b3dc1478b2b80e351562135 Author: Takeshi Yamamuro AuthorDate: Mon Jan 28 13:39:50 2019 -0800 [SPARK-26747][SQL] Makes GetMapValue nullability more precise ## What changes were proposed in this pull request? In master, `GetMapValue` nullable is always true; https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L371 But, If input key is foldable, we could make its nullability more precise. This fix is the same with SPARK-26637(#23566). ## How was this patch tested? Added tests in `ComplexTypeSuite`. Closes #23669 from maropu/SPARK-26747. Authored-by: Takeshi Yamamuro Signed-off-by: Dongjoon Hyun --- .../catalyst/expressions/complexTypeExtractors.scala | 16 +++- .../sql/catalyst/expressions/ComplexTypeSuite.scala | 19 +++ 2 files changed, 34 insertions(+), 1 deletion(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala index 104ad98..55ed617 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala @@ -381,7 +381,21 @@ case class GetMapValue(child: Expression, key: Expression) override def right: Expression = key /** `Null` is returned for invalid ordinals. 
*/ - override def nullable: Boolean = true + override def nullable: Boolean = if (key.foldable && !key.nullable) { +val keyObj = key.eval() +child match { + case m: CreateMap if m.resolved => +m.keys.zip(m.values).filter { case (k, _) => k.foldable && !k.nullable }.find { + case (k, _) if k.eval() == keyObj => true + case _ => false +}.map(_._2.nullable).getOrElse(true) + case _ => +true +} + } else { +true + } + override def dataType: DataType = child.dataType.asInstanceOf[MapType].valueType diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala index d8d6571..d65b49f 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala @@ -110,6 +110,25 @@ class ComplexTypeSuite extends SparkFunSuite with ExpressionEvalHelper { checkEvaluation(GetMapValue(nestedMap, Literal("a")), Map("b" -> "c")) } + test("SPARK-26747 handles GetMapValue nullability correctly when input key is foldable") { +// String key test +val k1 = Literal("k1") +val v1 = AttributeReference("v1", StringType, nullable = true)() +val k2 = Literal("k2") +val v2 = AttributeReference("v2", StringType, nullable = false)() +val map1 = CreateMap(k1 :: v1 :: k2 :: v2 :: Nil) +assert(GetMapValue(map1, Literal("k1")).nullable) +assert(!GetMapValue(map1, Literal("k2")).nullable) +assert(GetMapValue(map1, Literal("non-existent-key")).nullable) + +// Complex type key test +val k3 = Literal.create((1, "a")) +val k4 = Literal.create((2, "b")) +val map2 = CreateMap(k3 :: v1 :: k4 :: v2 :: Nil) +assert(GetMapValue(map2, Literal.create((1, "a"))).nullable) +assert(!GetMapValue(map2, Literal.create((2, "b"))).nullable) + } + test("GetStructField") { val typeS = StructType(StructField("a", IntegerType) :: Nil) val struct = Literal.create(create_row(1), typeS) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
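A hedged PySpark illustration (not from the patch) of the user-visible effect: when the map is built inline with `create_map` and the lookup key is a foldable literal, the analyzer can now report a more precise nullability instead of always `nullable = true`. The expected schema output is an assumption mirroring the test added to `ComplexTypeSuite`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-26747-demo").getOrCreate()

# Both map values are non-nullable: a cast of range()'s non-nullable "id" column and a literal.
m = F.create_map(F.lit("k1"), F.col("id").cast("string"),
                 F.lit("k2"), F.lit("constant"))

df = spark.range(3).select(
    m.getItem("k2").alias("v2"),           # foldable key that matches a non-nullable value
    m.getItem("missing").alias("v_miss"))  # foldable key with no matching entry

# Expectation under this change (an assumption): v2 is reported as
# nullable = false, while v_miss stays nullable = true.
df.printSchema()

spark.stop()
```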
[spark] branch master updated: [SPARK-26595][CORE] Allow credential renewal based on kerberos ticket cache.
This is an automated email from the ASF dual-hosted git repository. vanzin pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2a67dbf [SPARK-26595][CORE] Allow credential renewal based on kerberos ticket cache. 2a67dbf is described below commit 2a67dbfbd341af166b1c85904875f26a6dea5ba8 Author: Marcelo Vanzin AuthorDate: Mon Jan 28 13:32:34 2019 -0800 [SPARK-26595][CORE] Allow credential renewal based on kerberos ticket cache. This change addes a new mode for credential renewal that does not require a keytab; it uses the local ticket cache instead, so it works while the user keeps the cache valid. This can be useful for, e.g., people running long spark-shell sessions where their kerberos login is kept up-to-date. The main change to enable this behavior is in HadoopDelegationTokenManager, with a small change in the HDFS token provider. The other changes are to avoid creating duplicate tokens when submitting the application to YARN; they allow the tokens from the scheduler to be sent to the YARN AM, reducing the round trips to HDFS. For that, the scheduler initialization code was changed a little bit so that the tokens are available when the YARN client is initialized. That basically takes care of a long-standing TODO that was in the code to clean up configuration propagation to the driver's RPC endpoint (in CoarseGrainedSchedulerBackend). Tested with an app designed to stress this functionality, with both keytab and cache-based logins. Some basic kerberos tests on k8s also. Closes #23525 from vanzin/SPARK-26595. Authored-by: Marcelo Vanzin Signed-off-by: Marcelo Vanzin --- .../security/HadoopDelegationTokenManager.scala| 38 +++- .../security/HadoopFSDelegationTokenProvider.scala | 34 +- .../org/apache/spark/internal/config/package.scala | 10 ++ .../cluster/CoarseGrainedSchedulerBackend.scala| 42 -- docs/security.md | 32 - .../k8s/KubernetesClusterSchedulerBackend.scala| 12 +++ .../mesos/MesosCoarseGrainedSchedulerBackend.scala | 5 ++- .../org/apache/spark/deploy/yarn/Client.scala | 25 ++--- .../cluster/YarnClientSchedulerBackend.scala | 8 ++--- .../cluster/YarnClusterSchedulerBackend.scala | 2 ++ .../scheduler/cluster/YarnSchedulerBackend.scala | 15 11 files changed, 123 insertions(+), 100 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala b/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala index 2763a46..487291e 100644 --- a/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala +++ b/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala @@ -41,7 +41,7 @@ import org.apache.spark.util.{ThreadUtils, Utils} /** * Manager for delegation tokens in a Spark application. * - * When configured with a principal and a keytab, this manager will make sure long-running apps can + * When delegation token renewal is enabled, this manager will make sure long-running apps can * run without interruption while accessing secured services. It periodically logs in to the KDC * with user-provided credentials, and contacts all the configured secure services to obtain * delegation tokens to be distributed to the rest of the application. @@ -50,6 +50,11 @@ import org.apache.spark.util.{ThreadUtils, Utils} * elapsed. The new tokens are sent to the Spark driver endpoint. 
The driver is tasked with * distributing the tokens to other processes that might need them. * + * Renewal can be enabled in two different ways: by providing a principal and keytab to Spark, or by + * enabling renewal based on the local credential cache. The latter has the drawback that Spark + * can't create new TGTs by itself, so the user has to manually update the Kerberos ticket cache + * externally. + * * This class can also be used just to create delegation tokens, by calling the * `obtainDelegationTokens` method. This option does not require calling the `start` method nor * providing a driver reference, but leaves it up to the caller to distribute the tokens that were @@ -81,7 +86,11 @@ private[spark] class HadoopDelegationTokenManager( private var renewalExecutor: ScheduledExecutorService = _ /** @return Whether delegation token renewal is enabled. */ - def renewalEnabled: Boolean = principal != null + def renewalEnabled: Boolean = sparkConf.get(KERBEROS_RENEWAL_CREDENTIALS) match { +case "keytab" => principal != null +case "ccache" => UserGroupInformation.getCurrentUser().hasKerberosCredentials() +case _ => false + }
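As a hedged sketch of how the new mode might be switched on from PySpark: the diff above shows a `KERBEROS_RENEWAL_CREDENTIALS` setting with `"keytab"` and `"ccache"` modes, but the user-facing config key is not visible in the truncated excerpt, so the key name used below (`spark.kerberos.renewal.credentials`) is an assumption.

```python
from pyspark.sql import SparkSession

# Assumption: "spark.kerberos.renewal.credentials" is the key backing
# KERBEROS_RENEWAL_CREDENTIALS; "ccache" asks Spark to renew delegation tokens
# from the local Kerberos ticket cache instead of a keytab. The user must keep
# the TGT valid themselves (e.g. by re-running kinit), as the commit notes.
spark = (SparkSession.builder
         .appName("ccache-renewal-demo")
         .config("spark.kerberos.renewal.credentials", "ccache")
         .getOrCreate())

# ... long-running interactive work against secured HDFS/Hive/HBase ...

spark.stop()
```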
[GitHub] zsxwing edited a comment on issue #176: Add Jose Torres to committers list
zsxwing edited a comment on issue #176: Add Jose Torres to committers list URL: https://github.com/apache/spark-website/pull/176#issuecomment-458307738 LGTM. Welcome!
[GitHub] zsxwing commented on issue #176: Add Jose Torres to committers list
zsxwing commented on issue #176: Add Jose Torres to committers list URL: https://github.com/apache/spark-website/pull/176#issuecomment-458307738 LGTM
[GitHub] jose-torres opened a new pull request #176: Add Jose Torres to committers list
jose-torres opened a new pull request #176: Add Jose Torres to committers list URL: https://github.com/apache/spark-website/pull/176
[GitHub] vanzin commented on issue #175: CVE-2018-11760
vanzin commented on issue #175: CVE-2018-11760 URL: https://github.com/apache/spark-website/pull/175#issuecomment-458300023 > jekyll build updated a few more pages under site That's normal. There are indices and other stuff to update. Also maybe the person who made the previous change forgot to build the site, it happens sometimes. Should probably add those back.
[GitHub] abellina commented on issue #175: CVE-2018-11760
abellina commented on issue #175: CVE-2018-11760 URL: https://github.com/apache/spark-website/pull/175#issuecomment-458292634 oh ok, sounds good. +1 non binding
[GitHub] squito commented on issue #175: CVE-2018-11760
squito commented on issue #175: CVE-2018-11760 URL: https://github.com/apache/spark-website/pull/175#issuecomment-458286718 oops, I didn't realize I should also commit the generated page, `site/security.html` as well, just updated that. Oddly, `jekyll build` updated a few more pages under `site`, but I'm not including those changes here.
[GitHub] abellina commented on a change in pull request #175: CVE-2018-11760
abellina commented on a change in pull request #175: CVE-2018-11760 URL: https://github.com/apache/spark-website/pull/175#discussion_r251576280 ## File path: security.md ## @@ -18,6 +18,36 @@ non-public list that will reach the Apache Security team, as well as the Spark P Known Security Issues +CVE-2018-11760: Apache Spark local privilege escalation vulnerability + +Severity: Important + +Vendor: The Apache Software Foundation + +Versions affected: + +- All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions +- Spark 2.2.0 to 2.2.2 +- Spark 2.3.0 to 2.3.1 + +Description: + +When using PySpark , it's possible for a different local user Review comment: minor nit, extra space after PySpark
[spark] branch branch-2.3 updated: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.3 by this push: new a89f601 [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp a89f601 is described below commit a89f601a788ab6f1c89cefb1b4097444fb9847a4 Author: Jungtaek Lim (HeartSaVioR) AuthorDate: Mon Jan 28 12:13:51 2019 -0800 [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp ## What changes were proposed in this pull request? Spark replaces `CurrentTimestamp` with `CurrentBatchTimestamp`. However, `CurrentBatchTimestamp` is `TimeZoneAwareExpression` while `CurrentTimestamp` isn't. Without TimeZoneId, `CurrentBatchTimestamp` becomes unresolved and raises `UnresolvedException`. Since `CurrentDate` is `TimeZoneAwareExpression`, there is no problem with `CurrentDate`. ## How was this patch tested? Pass the Jenkins with the updated test cases. Closes #23656 from HeartSaVioR/SPARK-26379-branch-2.3. Lead-authored-by: Jungtaek Lim (HeartSaVioR) Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../execution/streaming/MicroBatchExecution.scala | 5 ++- .../apache/spark/sql/streaming/StreamSuite.scala | 42 ++ 2 files changed, 46 insertions(+), 1 deletion(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala index 6a264ad..dbbb7e7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala @@ -430,8 +430,11 @@ class MicroBatchExecution( // Rewire the plan to use the new attributes that were returned by the source. val newAttributePlan = newBatchesPlan transformAllExpressions { case ct: CurrentTimestamp => +// CurrentTimestamp is not TimeZoneAwareExpression while CurrentBatchTimestamp is. +// Without TimeZoneId, CurrentBatchTimestamp is unresolved. Here, we use an explicit +// dummy string to prevent UnresolvedException and to prevent to be used in the future. 
CurrentBatchTimestamp(offsetSeqMetadata.batchTimestampMs, - ct.dataType) + ct.dataType, Some("Dummy TimeZoneId")) case cd: CurrentDate => CurrentBatchTimestamp(offsetSeqMetadata.batchTimestampMs, cd.dataType, cd.timeZoneId) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala index c65e5d3..92fdde8 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala @@ -33,6 +33,7 @@ import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart} import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.plans.logical.Range import org.apache.spark.sql.catalyst.streaming.InternalOutputModes +import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.execution.command.ExplainCommand import org.apache.spark.sql.execution.streaming._ import org.apache.spark.sql.execution.streaming.state.{StateStore, StateStoreConf, StateStoreId, StateStoreProvider} @@ -825,6 +826,47 @@ class StreamSuite extends StreamTest { assert(query.exception.isEmpty) } } + + test("SPARK-26379 Structured Streaming - Exception on adding current_timestamp " + +" to Dataset - use v2 sink") { +testCurrentTimestampOnStreamingQuery(useV2Sink = true) + } + + test("SPARK-26379 Structured Streaming - Exception on adding current_timestamp " + +" to Dataset - use v1 sink") { +testCurrentTimestampOnStreamingQuery(useV2Sink = false) + } + + private def testCurrentTimestampOnStreamingQuery(useV2Sink: Boolean): Unit = { +val input = MemoryStream[Int] +val df = input.toDS().withColumn("cur_timestamp", lit(current_timestamp())) + +def assertBatchOutputAndUpdateLastTimestamp( +rows: Seq[Row], +curTimestamp: Long, +curDate: Int, +expectedValue: Int): Long = { + assert(rows.size === 1) + val row = rows.head + assert(row.getInt(0) === expectedValue) + assert(row.getTimestamp(1).getTime >= curTimestamp) + row.getTimestamp(1).getTime +} + +var lastTimestamp = System.currentTimeMillis() +val currentDate = DateTimeUtils.millisToDays(lastTimestamp) +testStream(df, useV2Sink = useV2Sink) ( + AddData(input, 1), + CheckLastBatch { rows: Seq[Row] => +lastTimestamp =
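For context, here is a minimal PySpark sketch (not part of the commit) of the user-facing pattern this backport fixes: adding `current_timestamp()` to a streaming Dataset, which previously raised `UnresolvedException` on branch-2.3. It assumes a local SparkSession and something writing to a socket on localhost:9999 (e.g. `nc -lk 9999`).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-26379-demo").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# This projection is what used to fail before the fix, because
# CurrentBatchTimestamp was left unresolved without a time zone id.
stamped = lines.withColumn("cur_timestamp", F.current_timestamp())

query = (stamped.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination(30)  # run briefly for demonstration purposes
query.stop()
spark.stop()
```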
[GitHub] squito opened a new pull request #175: CVE-2018-11760
squito opened a new pull request #175: CVE-2018-11760 URL: https://github.com/apache/spark-website/pull/175 ran `jekyll build` and `jekyll serve`, page looked OK locally.
[spark] branch master updated: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8baf3ba [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary 8baf3ba is described below commit 8baf3ba35bc13d72f8ab911284d7d75ce897cb5a Author: Sean Owen AuthorDate: Mon Jan 28 13:47:32 2019 -0600 [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary ## What changes were proposed in this pull request? The warning introduced in https://github.com/apache/spark/pull/23580 has a bug: https://github.com/apache/spark/pull/23580#issuecomment-458000380 This just fixes the logic. ## How was this patch tested? N/A Closes #23668 from srowen/SPARK-26660.2. Authored-by: Sean Owen Signed-off-by: Sean Owen --- core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala | 2 +- .../main/scala/org/apache/spark/scheduler/TaskSetManager.scala| 8 .../scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala| 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala index ecb8ac0..75eb37c 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala @@ -1162,7 +1162,7 @@ private[spark] class DAGScheduler( partitions = stage.rdd.partitions } - if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) { + if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KIB * 1024) { logWarning(s"Broadcasting large task binary with size " + s"${Utils.bytesToString(taskBinaryBytes.length)}") } diff --git a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala b/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala index 6f3f77c..b7bf069 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala @@ -494,12 +494,12 @@ private[spark] class TaskSetManager( abort(s"$msg Exception during serialization: $e") throw new TaskNotSerializableException(e) } -if (serializedTask.limit() > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 && +if (serializedTask.limit() > TaskSetManager.TASK_SIZE_TO_WARN_KIB * 1024 && !emittedTaskSizeWarning) { emittedTaskSizeWarning = true logWarning(s"Stage ${task.stageId} contains a task of very large size " + -s"(${serializedTask.limit() / 1024} KB). The maximum recommended task size is " + -s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.") +s"(${serializedTask.limit() / 1024} KiB). The maximum recommended task size is " + +s"${TaskSetManager.TASK_SIZE_TO_WARN_KIB} KiB.") } addRunningTask(taskId) @@ -1101,5 +1101,5 @@ private[spark] class TaskSetManager( private[spark] object TaskSetManager { // The user will be warned if any stages contain a task that has a serialized size greater than // this. 
- val TASK_SIZE_TO_WARN_KB = 100 + val TASK_SIZE_TO_WARN_KIB = 100 } diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala index d0f98b5..60acd3e 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala @@ -153,7 +153,7 @@ class FakeTaskScheduler(sc: SparkContext, liveExecutors: (String, String)* /* ex */ class LargeTask(stageId: Int) extends Task[Array[Byte]](stageId, 0, 0) { - val randomBuffer = new Array[Byte](TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024) + val randomBuffer = new Array[Byte](TaskSetManager.TASK_SIZE_TO_WARN_KIB * 1024) val random = new Random(0) random.nextBytes(randomBuffer) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
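The arithmetic behind the fix, as a small illustrative snippet (the numbers are made up): the original check multiplied the byte length by 1000 before comparing it to a threshold expressed in KiB, so it warned for essentially every task binary; the corrected check compares bytes against the threshold times 1024.

```python
# Illustrative only; mirrors the before/after comparison in DAGScheduler.
TASK_SIZE_TO_WARN_KIB = 100
task_binary_len = 4 * 1024  # a tiny 4 KiB task binary

buggy_warns = task_binary_len * 1000 > TASK_SIZE_TO_WARN_KIB   # True: fires for almost anything
fixed_warns = task_binary_len > TASK_SIZE_TO_WARN_KIB * 1024   # False: only above 100 KiB
print(buggy_warns, fixed_warns)
```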
[spark] branch master updated: [SPARK-26432][CORE] Obtain HBase delegation token operation compatible with HBase 2.x.x version API
This is an automated email from the ASF dual-hosted git repository. vanzin pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dfed439 [SPARK-26432][CORE] Obtain HBase delegation token operation compatible with HBase 2.x.x version API dfed439 is described below commit dfed439e33b7bf224dd412b0960402068d961c7b Author: s71955 AuthorDate: Mon Jan 28 10:08:23 2019 -0800 [SPARK-26432][CORE] Obtain HBase delegation token operation compatible with HBase 2.x.x version API ## What changes were proposed in this pull request? While obtaining token from hbase service , spark uses deprecated API of hbase , ```public static Token obtainToken(Configuration conf)``` This deprecated API is already been removed from hbase 2.x version as part of the hbase 2.x major release. https://issues.apache.org/jira/browse/HBASE-14713_ there is one more stable API in ```public static Token obtainToken(Connection conn)``` in TokenUtil class spark shall use this stable api for getting the delegation token. To invoke this api first connection object has to be retrieved from ConnectionFactory and the same connection can be passed to obtainToken(Connection conn) for getting token. eg: Call ```public static Connection createConnection(Configuration conf)``` , then call ```public static Token obtainToken( Connection conn)```. ## How was this patch tested? Manual testing is been done. Manual test result: Before fix: ![hbase-dep-obtaintok 1](https://user-images.githubusercontent.com/12999161/50699264-64cac200-106d-11e9-81b4-e50ae8097f27.png) After fix: 1. Create 2 tables in hbase shell >Launch hbase shell >Enter commands to create tables and load data create 'table1','cf' put 'table1','row1','cf:cid','20' create 'table2','cf' put 'table2','row1','cf:cid','30' >Show values command get 'table1','row1','cf:cid' will diplay value as 20 get 'table2','row1','cf:cid' will diplay value as 30 2.Run SparkHbasetoHbase class in testSpark.jar using spark-submit spark-submit --master yarn-cluster --class com.mrs.example.spark.SparkHbasetoHbase --conf "spark.yarn.security.credentials.hbase.enabled"="true" --conf "spark.security.credentials.hbase.enabled"="true" --keytab /opt/client/user.keytab --principal sen testSpark.jar The SparkHbasetoHbase test class will update the value of table2 with sum of values of table1 & table2. table2 = table1+table2 As we can see in the snapshot the spark job has been successfully able to interact with hbase service and able to update the row count. ![obtaintok_success 1](https://user-images.githubusercontent.com/12999161/50699393-bd9a5a80-106d-11e9-96c6-6c250d561efa.png) Closes #23429 from sujith71955/master_hbase_service. 
Authored-by: s71955 Signed-off-by: Marcelo Vanzin --- .../security/HBaseDelegationTokenProvider.scala| 56 -- 1 file changed, 52 insertions(+), 4 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/security/HBaseDelegationTokenProvider.scala b/core/src/main/scala/org/apache/spark/deploy/security/HBaseDelegationTokenProvider.scala index 3bf8c14..e345b0b 100644 --- a/core/src/main/scala/org/apache/spark/deploy/security/HBaseDelegationTokenProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/security/HBaseDelegationTokenProvider.scala @@ -17,6 +17,8 @@ package org.apache.spark.deploy.security +import java.io.Closeable + import scala.reflect.runtime.universe import scala.util.control.NonFatal @@ -42,8 +44,8 @@ private[security] class HBaseDelegationTokenProvider try { val mirror = universe.runtimeMirror(Utils.getContextOrSparkClassLoader) val obtainToken = mirror.classLoader. -loadClass("org.apache.hadoop.hbase.security.token.TokenUtil"). -getMethod("obtainToken", classOf[Configuration]) +loadClass("org.apache.hadoop.hbase.security.token.TokenUtil") +.getMethod("obtainToken", classOf[Configuration]) logDebug("Attempting to fetch HBase security token.") val token = obtainToken.invoke(null, hbaseConf(hadoopConf)) @@ -52,12 +54,58 @@ private[security] class HBaseDelegationTokenProvider creds.addToken(token.getService, token) } catch { case NonFatal(e) => -logWarning(s"Failed to get token from service $serviceName", e) +logWarning(s"Failed to get token from service $serviceName due to " + e + + s" Retrying to fetch HBase security token with hbase connection parameter.") +// Seems to be spark is trying to get the token from HBase 2.x.x version or above where the +// obtainToken(Configuration conf) API has been removed. Lets try
[spark] branch master updated: [SPARK-26713][CORE] Interrupt pipe IO threads in PipedRDD when task is finished
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1280bfd [SPARK-26713][CORE] Interrupt pipe IO threads in PipedRDD when task is finished 1280bfd is described below commit 1280bfd7564ca6d201a6e4a54ecf93a20f142f3a Author: Xianjin YE AuthorDate: Mon Jan 28 10:54:18 2019 -0600 [SPARK-26713][CORE] Interrupt pipe IO threads in PipedRDD when task is finished ## What changes were proposed in this pull request? Manually release stdin writer and stderr reader thread when task is finished. This commit also marks ShuffleBlockFetchIterator as fully consumed if isZombie is set. ## How was this patch tested? Added new test Closes #23638 from advancedxy/SPARK-26713. Authored-by: Xianjin YE Signed-off-by: Sean Owen --- .../main/scala/org/apache/spark/rdd/PipedRDD.scala | 33 ++-- .../storage/ShuffleBlockFetcherIterator.scala | 18 +-- .../scala/org/apache/spark/rdd/PipedRDDSuite.scala | 24 + .../storage/ShuffleBlockFetcherIteratorSuite.scala | 59 ++ 4 files changed, 126 insertions(+), 8 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala b/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala index 02b28b7..f1daf62 100644 --- a/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala +++ b/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala @@ -113,7 +113,7 @@ private[spark] class PipedRDD[T: ClassTag]( val childThreadException = new AtomicReference[Throwable](null) // Start a thread to print the process's stderr to ours -new Thread(s"stderr reader for $command") { +val stderrReaderThread = new Thread(s"${PipedRDD.STDERR_READER_THREAD_PREFIX} $command") { override def run(): Unit = { val err = proc.getErrorStream try { @@ -128,10 +128,11 @@ private[spark] class PipedRDD[T: ClassTag]( err.close() } } -}.start() +} +stderrReaderThread.start() // Start a thread to feed the process input from our parent's iterator -new Thread(s"stdin writer for $command") { +val stdinWriterThread = new Thread(s"${PipedRDD.STDIN_WRITER_THREAD_PREFIX} $command") { override def run(): Unit = { TaskContext.setTaskContext(context) val out = new PrintWriter(new BufferedWriter( @@ -156,7 +157,28 @@ private[spark] class PipedRDD[T: ClassTag]( out.close() } } -}.start() +} +stdinWriterThread.start() + +// interrupts stdin writer and stderr reader threads when the corresponding task is finished. +// Otherwise, these threads could outlive the task's lifetime. For example: +// val pipeRDD = sc.range(1, 100).pipe(Seq("cat")) +// val abnormalRDD = pipeRDD.mapPartitions(_ => Iterator.empty) +// the iterator generated by PipedRDD is never involved. If the parent RDD's iterator takes a +// long time to generate(ShuffledRDD's shuffle operation for example), the stdin writer thread +// may consume significant memory and CPU time even if task is already finished. 
+context.addTaskCompletionListener[Unit] { _ => + if (proc.isAlive) { +proc.destroy() + } + + if (stdinWriterThread.isAlive) { +stdinWriterThread.interrupt() + } + if (stderrReaderThread.isAlive) { +stderrReaderThread.interrupt() + } +} // Return an iterator that read lines from the process's stdout val lines = Source.fromInputStream(proc.getInputStream)(encoding).getLines @@ -219,4 +241,7 @@ private object PipedRDD { } buf } + + val STDIN_WRITER_THREAD_PREFIX = "stdin writer for" + val STDERR_READER_THREAD_PREFIX = "stderr reader for" } diff --git a/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala b/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala index 28decf0..f73c21b 100644 --- a/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala +++ b/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala @@ -141,7 +141,14 @@ final class ShuffleBlockFetcherIterator( /** * Whether the iterator is still active. If isZombie is true, the callback interface will no - * longer place fetched blocks into [[results]]. + * longer place fetched blocks into [[results]] and the iterator is marked as fully consumed. + * + * When the iterator is inactive, [[hasNext]] and [[next]] calls will honor that as there are + * cases the iterator is still being consumed. For example, ShuffledRDD + PipedRDD if the + * subprocess command is failed. The task will be marked as failed, then the iterator will be + * cleaned up at task completion, the [[next]] call (called
[spark] branch master updated: [SPARK-26719][SQL] Get rid of java.util.Calendar in DateTimeUtils
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 58e42cf [SPARK-26719][SQL] Get rid of java.util.Calendar in DateTimeUtils 58e42cf is described below commit 58e42cf50639d5b9d6c877b902ac559e132d041b Author: Maxim Gekk AuthorDate: Mon Jan 28 10:52:17 2019 -0600 [SPARK-26719][SQL] Get rid of java.util.Calendar in DateTimeUtils ## What changes were proposed in this pull request? - Replacing `java.util.Calendar` in `DateTimeUtils. truncTimestamp` and in `DateTimeUtils.getOffsetFromLocalMillis ` by equivalent code using Java 8 API for timestamp manipulations. The reason is `java.util.Calendar` is based on the hybrid calendar (Julian+Gregorian) but *java.time* classes use Proleptic Gregorian calendar which assumes by SQL standard. - Replacing `Calendar.getInstance()` in `DateTimeUtilsSuite` by similar code in `DateTimeTestUtils` using *java.time* classes ## How was this patch tested? The changes were tested by existing suites: `DateExpressionsSuite`, `DateFunctionsSuite` and `DateTimeUtilsSuite`. Closes #23641 from MaxGekk/cleanup-date-time-utils. Authored-by: Maxim Gekk Signed-off-by: Sean Owen --- docs/sql-migration-guide-upgrade.md| 2 + .../spark/sql/catalyst/util/DateTimeUtils.scala| 52 +-- .../sql/catalyst/util/DateTimeTestUtils.scala | 50 +++ .../sql/catalyst/util/DateTimeUtilsSuite.scala | 433 - 4 files changed, 231 insertions(+), 306 deletions(-) diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 98fc9fa..41f27a3 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -95,6 +95,8 @@ displayTitle: Spark SQL Upgrading Guide - Since Spark 3.0, the JDBC options `lowerBound` and `upperBound` are converted to TimestampType/DateType values in the same way as casting strings to TimestampType/DateType values. The conversion is based on Proleptic Gregorian calendar, and time zone defined by the SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and earlier, the conversion is based on the hybrid calendar (Julian + Gregorian) and on default system time zone. + - Since Spark 3.0, the `date_trunc`, `from_utc_timestamp`, `to_utc_timestamp`, and `unix_timestamp` functions use java.time API based on Proleptic Gregorian calendar. In Spark version 2.4 and earlier, the hybrid calendar (Julian + Gregorian) is used for the same purpose. Results of the functions returned by Spark 3.0 and previous versions can be different for dates before October 15, 1582 (Gregorian). + ## Upgrading From Spark SQL 2.3 to 2.4 - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below. 
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala index 911750e..a97c1f7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala @@ -18,10 +18,10 @@ package org.apache.spark.sql.catalyst.util import java.sql.{Date, Timestamp} -import java.time.{Instant, LocalDate, LocalDateTime, LocalTime, ZonedDateTime} +import java.time.{Instant, LocalDate, LocalDateTime, LocalTime, Month, ZonedDateTime} import java.time.Year.isLeap import java.time.temporal.IsoFields -import java.util.{Calendar, Locale, TimeZone} +import java.util.{Locale, TimeZone} import java.util.concurrent.{ConcurrentHashMap, TimeUnit} import java.util.function.{Function => JFunction} @@ -744,18 +744,17 @@ object DateTimeUtils { daysToMillis(prevMonday, timeZone) case TRUNC_TO_QUARTER => val dDays = millisToDays(millis, timeZone) -millis = daysToMillis(truncDate(dDays, TRUNC_TO_MONTH), timeZone) -val cal = Calendar.getInstance() -cal.setTimeInMillis(millis) -val quarter = getQuarter(dDays) -val month = quarter match { - case 1 => Calendar.JANUARY - case 2 => Calendar.APRIL - case 3 => Calendar.JULY - case 4 => Calendar.OCTOBER +val month = getQuarter(dDays) match { + case 1 => Month.JANUARY + case 2 => Month.APRIL + case 3 => Month.JULY + case 4 => Month.OCTOBER } -
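A hedged PySpark sketch (not from the patch) of the behavior called out in the migration-guide note above: `date_trunc` and related functions now run on java.time and the Proleptic Gregorian calendar, so results for dates before 1582-10-15 may differ from Spark 2.4. It assumes a local SparkSession; the literals are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-26719-demo").getOrCreate()

# Quarter truncation now goes through java.time rather than java.util.Calendar;
# the pre-Gregorian timestamp is where the two calendars can disagree.
spark.sql("""
    SELECT date_trunc('QUARTER', timestamp '2019-01-28 12:13:51') AS q_modern,
           date_trunc('QUARTER', timestamp '1582-10-01 00:00:00') AS q_pre_gregorian
""").show(truncate=False)

spark.stop()
```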
[spark] branch master updated: [SPARK-26700][CORE] enable fetch-big-block-to-disk by default
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ed71a82 [SPARK-26700][CORE] enable fetch-big-block-to-disk by default ed71a82 is described below commit ed71a825c56920327533ebb741707871848ccd6d Author: Wenchen Fan AuthorDate: Mon Jan 28 23:41:55 2019 +0800 [SPARK-26700][CORE] enable fetch-big-block-to-disk by default ## What changes were proposed in this pull request? This is a followup of #16989. The fetch-big-block-to-disk feature has been disabled by default because it is not compatible with external shuffle services prior to Spark 2.2: the client sends a stream request to fetch block chunks, and older shuffle services cannot handle it. After two years, Spark 2.2 has reached EOL, so it is now safe to turn this feature on by default. ## How was this patch tested? Existing tests. Closes #23625 from cloud-fan/minor. Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan --- .../org/apache/spark/internal/config/package.scala | 14 +++-- docs/configuration.md | 24 ++ 2 files changed, 19 insertions(+), 19 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index 71b0df4..32559ae 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -699,17 +699,19 @@ package object config { private[spark] val MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM = ConfigBuilder("spark.maxRemoteBlockSizeFetchToMem") .doc("Remote block will be fetched to disk when size of the block is above this threshold " + -"in bytes. This is to avoid a giant request takes too much memory. We can enable this " + -"config by setting a specific value(e.g. 200m). Note this configuration will affect " + -"both shuffle fetch and block manager remote block fetch. For users who enabled " + -"external shuffle service, this feature can only be worked when external shuffle" + -"service is newer than Spark 2.2.") +"in bytes. This is to avoid a giant request takes too much memory. Note this " + +"configuration will affect both shuffle fetch and block manager remote block fetch. " + +"For users who enabled external shuffle service, this feature can only work when " + +"external shuffle service is at least 2.3.0.") .bytesConf(ByteUnit.BYTE) // fetch-to-mem is guaranteed to fail if the message is bigger than 2 GB, so we might // as well use fetch-to-disk in that case. The message includes some metadata in addition // to the block data itself (in particular UploadBlock has a lot of metadata), so we leave // extra room. - .createWithDefault(Int.MaxValue - 512) + .checkValue( +_ <= Int.MaxValue - 512, +"maxRemoteBlockSizeFetchToMem cannot be larger than (Int.MaxValue - 512) bytes.") + .createWithDefaultString("200m") private[spark] val TASK_METRICS_TRACK_UPDATED_BLOCK_STATUSES = ConfigBuilder("spark.taskMetrics.trackUpdatedBlockStatuses") diff --git a/docs/configuration.md b/docs/configuration.md index 7d3bbf9..806e16a 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -627,19 +627,6 @@ Apart from these, the following properties are also available, and may be useful - spark.maxRemoteBlockSizeFetchToMem - Int.MaxValue - 512 - -The remote block will be fetched to disk when size of the block is above this threshold in bytes. 
-This is to avoid a giant request that takes too much memory. By default, this is only enabled -for blocks > 2GB, as those cannot be fetched directly into memory, no matter what resources are -available. But it can be turned down to a much lower value (eg. 200m) to avoid using too much -memory on smaller blocks as well. Note this configuration will affect both shuffle fetch -and block manager remote block fetch. For users who enabled external shuffle service, -this feature can only be used when external shuffle service is newer than Spark 2.2. - - - spark.shuffle.compress true @@ -1519,6 +1506,17 @@ Apart from these, the following properties are also available, and may be useful you can set larger value. + + spark.maxRemoteBlockSizeFetchToMem + 200m + +Remote block will be fetched to disk when size of the block is above this threshold +in bytes. This is to avoid a giant request takes too much memory. Note this +configuration will affect both shuffle fetch and block manager remote block fetch. +For users who enabled external shuffle service, this
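For users tracking this default change, here is a minimal sketch of pinning the threshold explicitly instead of relying on the new 200m default. The application name and the 1g value are arbitrary assumptions for the example, not recommendations; with an external shuffle service the setting only takes effect when that service is new enough, as the updated documentation above notes.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: set spark.maxRemoteBlockSizeFetchToMem explicitly so that remote
// blocks larger than this threshold are fetched to disk rather than memory.
val spark = SparkSession.builder()
  .appName("fetch-to-disk-threshold-example")
  .config("spark.maxRemoteBlockSizeFetchToMem", "1g")
  .getOrCreate()
```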
[spark] branch master updated: [SPARK-26656][SQL] Benchmarks for date and timestamp functions
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bd027f6 [SPARK-26656][SQL] Benchmarks for date and timestamp functions bd027f6 is described below commit bd027f6e0e64b7e12ba9d3c77e0797a00c61e72d Author: Maxim Gekk AuthorDate: Mon Jan 28 14:21:21 2019 +0100 [SPARK-26656][SQL] Benchmarks for date and timestamp functions ## What changes were proposed in this pull request? Added the following benchmarks: - Extract components from timestamp like year, month, day and etc. - Current date and time - Date arithmetic like date_add, date_sub - Format dates and timestamps - Convert timestamps from/to UTC Closes #23661 from MaxGekk/datetime-benchmark. Authored-by: Maxim Gekk Signed-off-by: Herman van Hovell --- sql/core/benchmarks/DateTimeBenchmark-results.txt | 416 + .../execution/benchmark/DateTimeBenchmark.scala| 123 ++ 2 files changed, 539 insertions(+) diff --git a/sql/core/benchmarks/DateTimeBenchmark-results.txt b/sql/core/benchmarks/DateTimeBenchmark-results.txt new file mode 100644 index 000..8bbe310 --- /dev/null +++ b/sql/core/benchmarks/DateTimeBenchmark-results.txt @@ -0,0 +1,416 @@ + +Extract components + + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-ea-b03 on Mac OS X 10.14.2 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +cast to timestamp: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +cast to timestamp wholestage off 276 / 290 36.2 27.6 1.0X +cast to timestamp wholestage on254 / 267 39.4 25.4 1.1X + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-ea-b03 on Mac OS X 10.14.2 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +year of timestamp: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +year of timestamp wholestage off 699 / 700 14.3 69.9 1.0X +year of timestamp wholestage on680 / 689 14.7 68.0 1.0X + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-ea-b03 on Mac OS X 10.14.2 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +quarter of timestamp:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +quarter of timestamp wholestage off848 / 864 11.8 84.8 1.0X +quarter of timestamp wholestage on 784 / 797 12.8 78.4 1.1X + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-ea-b03 on Mac OS X 10.14.2 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +month of timestamp: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +month of timestamp wholestage off 652 / 653 15.3 65.2 1.0X +month of timestamp wholestage on 671 / 677 14.9 67.1 1.0X + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-ea-b03 on Mac OS X 10.14.2 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +weekofyear of timestamp: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +weekofyear of timestamp wholestage off1233 / 1233 8.1 123.3 1.0X +weekofyear of timestamp wholestage on 1236 / 1240 8.1 123.6 1.0X + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-ea-b03 on Mac OS X 10.14.2 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +day of timestamp:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +day of timestamp wholestage off649 / 655 15.4 64.9 1.0X +day of timestamp wholestage on 670 / 678 14.9 67.0 1.0X + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-ea-b03 on Mac OS X 10.14.2 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +dayofyear of timestamp: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative
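As a rough illustration of the kind of query these benchmarks time (not the benchmark's actual code; the row count, local master, and use of `max(...)` to force evaluation are assumptions for this sketch):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: generate n timestamps from a range and time the extraction of a
// single component, similar in spirit to the "year of timestamp" case above.
val spark = SparkSession.builder().master("local[*]").appName("datetime-sketch").getOrCreate()
val n = 10L * 1000 * 1000
val start = System.nanoTime()
spark.range(n)
  .selectExpr("cast(id as timestamp) AS ts")
  .selectExpr("max(year(ts))")
  .collect()
println(s"year(ts) over $n rows took ${(System.nanoTime() - start) / 1e6} ms")
```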