[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22192 @vanzin @NiharS I am uncomfortable with that change as well - which is why I wanted the initialization to be pushed into a separate thread (and then join) - if we really need to set the context classloader. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
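The idea above — scoping a context-classloader change by performing plugin initialization on a separate thread and then joining it — can be sketched with Python's thread-local state. This is an illustrative analogy, not Spark's actual API; the names `context` and `init_plugin` are invented for the sketch.

```python
import threading

# Stand-in for the per-thread context classloader (illustrative only).
context = threading.local()

def init_plugin():
    # Mutating thread-local state on a dedicated thread keeps the change
    # scoped to that thread instead of leaking into the caller's thread.
    context.loader = "plugin-classloader"

t = threading.Thread(target=init_plugin)
t.start()
t.join()  # initialization is pushed to a separate thread, then joined

# The main thread's view of the thread-local is untouched.
assert not hasattr(context, "loader")
```

The join makes initialization synchronous from the caller's perspective while still isolating any thread-scoped mutation, which is the trade-off being discussed.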
[GitHub] spark pull request #22318: [SPARK-25150][SQL] Rewrite condition when dedupli...
Github user peter-toth commented on a diff in the pull request: https://github.com/apache/spark/pull/22318#discussion_r215499599 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala --- @@ -23,12 +23,14 @@ package org.apache.spark.sql.catalyst.expressions * of the name, or the expected nullability). */ object AttributeMap { + val empty = new AttributeMap(Map.empty) --- End diff -- @gatorsmile this was the initial version. Then @mgaido91 and @maropu suggested using a val (instead of a def), either here or in ResolveReferences, to spare some empty() calls.
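The val-versus-def allocation point can be sketched in Python: a Scala `val` is evaluated once, like a module-level singleton, while a `def` allocates a fresh object on every call. `AttributeMapLike` is a toy stand-in, not Catalyst's real class.

```python
class AttributeMapLike:
    """Toy stand-in for Catalyst's AttributeMap; not the real class."""
    def __init__(self, data):
        self._data = dict(data)

# Scala's `val empty` is evaluated once; the closest Python analogue is a
# module-level singleton shared by every caller.
EMPTY = AttributeMapLike({})

def empty():
    # A `def` re-evaluates its body, allocating a fresh map on every call.
    return AttributeMapLike({})

assert EMPTY is EMPTY          # val: one shared instance
assert empty() is not empty()  # def: a new allocation each time
```

Sharing one immutable empty instance is safe precisely because it can never be mutated, which is what makes the `val` optimization attractive.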
[GitHub] spark issue #22337: [SPARK-25338][Test][kafka][kinesis][flume] Ensure to cal...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22337 Sure, I just focused on files under `external`. Let me address other files, too.
[GitHub] spark issue #22344: [SPARK-25352][SQL] Perform ordered global limit when lim...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22344 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2887/ Test PASSed.
[GitHub] spark issue #22344: [SPARK-25352][SQL] Perform ordered global limit when lim...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22344 Merged build finished. Test PASSed.
[GitHub] spark issue #22344: [SPARK-25352][SQL] Perform ordered global limit when lim...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22344 **[Test build #95737 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95737/testReport)** for PR 22344 at commit [`8d49c1a`](https://github.com/apache/spark/commit/8d49c1afdbd6c0219d6cc182e53311201f73489f).
[GitHub] spark pull request #22344: [SPARK-25352][SQL] Perform ordered global limit w...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/22344 [SPARK-25352][SQL] Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold ## What changes were proposed in this pull request? We have an optimization on global limit that evenly distributes limit rows across all partitions. This optimization doesn't work for ordered results. A query ending with sort + limit is, in most cases, performed by `TakeOrderedAndProjectExec`. But if the limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, a global limit is used instead, and in that case we need to perform an ordered global limit. ## How was this patch tested? Unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 SPARK-25352 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22344.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22344 commit 8d49c1afdbd6c0219d6cc182e53311201f73489f Author: Liang-Chi Hsieh Date: 2018-09-06T04:43:15Z Do ordered global limit when limit number is bigger than topKSortFallbackThreshold.
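The two execution paths described in the PR can be sketched as follows: small limits take a top-K selection (the `TakeOrderedAndProjectExec` path), while limits above the threshold fall back to a full sort followed by a global limit, which must still respect the ordering. The threshold value and function names here are illustrative, not Spark's implementation.

```python
import heapq

# Illustrative threshold standing in for SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD.
TOP_K_SORT_FALLBACK_THRESHOLD = 3

def ordered_limit(rows, k):
    """Return the k smallest rows, in sorted order.

    Small limits take a top-K selection; limits above the threshold fall
    back to a full sort followed by a global limit, which must still
    respect the ordering (the bug this PR addresses).
    """
    if k <= TOP_K_SORT_FALLBACK_THRESHOLD:
        return heapq.nsmallest(k, rows)  # top-K path
    return sorted(rows)[:k]              # ordered global limit path

print(ordered_limit([5, 1, 4, 2, 3], 2))  # [1, 2]
print(ordered_limit([5, 1, 4, 2, 3], 4))  # [1, 2, 3, 4]
```

Either path returns the same ordered prefix; the point of the fix is that the fallback path must not distribute the limit across partitions without regard to the sort order.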
[GitHub] spark issue #18142: [SPARK-20918] [SQL] Use FunctionIdentifier as function i...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18142 BTW, this changed in Spark 2.3.0. How did we handle this before?
[GitHub] spark pull request #22226: [SPARK-25252][SQL] Support arrays of any types by...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22226
[GitHub] spark issue #22226: [SPARK-25252][SQL] Support arrays of any types by to_jso...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22226 Merged to master.
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22340 It's great to have this in Spark 2.4!
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22340 Thank you for merging, @srowen .
[GitHub] spark issue #22337: [SPARK-25338][Test][kafka][kinesis][flume] Ensure to cal...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22337 Could you check the other suites like `ExternalAppendOnlyUnsafeRowArraySuite` and `TakeOrderedAndProjectSuite`, too?
[GitHub] spark pull request #22337: [SPARK-25338][Test][kafka][kinesis][flume] Ensure...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22337#discussion_r215491592 --- Diff: external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala --- @@ -83,6 +83,7 @@ abstract class KinesisStreamTests(aggregateTestData: Boolean) extends KinesisFun testUtils.deleteStream() testUtils.deleteDynamoDBTable(appName) } --- End diff -- ditto.
[GitHub] spark pull request #22337: [SPARK-25338][Test][kafka][kinesis][flume] Ensure...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22337#discussion_r215491496 --- Diff: external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala --- @@ -41,6 +41,7 @@ class KinesisInputDStreamBuilderSuite extends TestSuiteBase with BeforeAndAfterE override def afterAll(): Unit = { ssc.stop() --- End diff -- Here, too. Let's use `try` to make sure `super.afterAll()` is invoked.
[GitHub] spark pull request #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22340
[GitHub] spark pull request #22337: [SPARK-25338][Test][kafka][kinesis][flume] Ensure...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22337#discussion_r215490599 --- Diff: external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala --- @@ -58,6 +58,7 @@ class FlumePollingStreamSuite extends SparkFunSuite with BeforeAndAfterAll with _sc.stop() _sc = null } --- End diff -- It seems that we need `try` for `_sc.stop()`.
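The cleanup pattern the reviewer is asking for — wrap the suite's own teardown in try/finally so `super.afterAll()` runs even when `stop()` throws — can be sketched in Python with the same shape (the class and method names here are illustrative, not the actual ScalaTest suites):

```python
calls = []

class BaseSuite:
    def after_all(self):
        calls.append("super.afterAll")

class FlakySuite(BaseSuite):
    def stop(self):
        calls.append("stop")
        raise RuntimeError("stop failed")

    def after_all(self):
        # try/finally guarantees the parent's cleanup still runs even
        # when the suite's own stop() throws.
        try:
            self.stop()
        finally:
            super().after_all()

try:
    FlakySuite().after_all()
except RuntimeError:
    pass

print(calls)  # ['stop', 'super.afterAll']
```

Without the try/finally, a failing `stop()` would skip the parent's `afterAll` entirely, leaking resources across suites — which is the bug SPARK-25338 targets.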
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22340 Merged to master (which, it looks like, is still 2.4)
[GitHub] spark issue #22337: [SPARK-25338][Test][kafka][kinesis][flume] Ensure to cal...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22337 Hi, @kiszk . Oh, did you check them all? Are these all of them?
[GitHub] spark issue #22339: SPARK-17159 Significant speed up for running spark strea...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22339 Hi, @ScrapCodes . Could you do the following?
- Update the title to `[SPARK-17159][SS]...`
- Remove `Please review http://spark.apache.org/contributing.html ` from the PR description
- Share the numbers, since the PR title says `Significant speed up`
[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22302 **[Test build #95736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95736/testReport)** for PR 22302 at commit [`c3c5af2`](https://github.com/apache/spark/commit/c3c5af2b9f27e4effcc8f44f1b98f24a6d1fafa7).
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18853 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95733/ Test FAILed.
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18853 Build finished. Test FAILed.
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18853 **[Test build #95733 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95733/testReport)** for PR 18853 at commit [`d0a2089`](https://github.com/apache/spark/commit/d0a2089b8c52a3be3977f0bfc5538758ef7d2b55). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22340 Thank you, @HyukjinKwon !
[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22302 Retest this please.
[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...
Github user fjh100456 commented on the issue: https://github.com/apache/spark/pull/22302 PR [#20120](https://github.com/apache/spark/pull/20120) was closed; [SPARK-23355](https://issues.apache.org/jira/browse/SPARK-23355) resolved it instead. Does [#21269](https://github.com/apache/spark/pull/21269/files) mean the same thing?
[GitHub] spark pull request #22270: [SPARK-25267][SQL][TEST] Disable ConvertToLocalRe...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22270#discussion_r215485269 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -85,12 +85,12 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { } val df5 = Seq((Seq("a", null), Seq(1, 2))).toDF("k", "v") -intercept[RuntimeException] { +intercept[Exception] { --- End diff -- @HyukjinKwon We get a SparkException here which in turn wraps a RuntimeException. When we have ConvertToLocalRelation active, we get a RuntimeException from the driver. But when we disable it, the error is raised from the executor with a SparkException as the top-level exception. That's the reason I changed it to intercept `Exception`, so that this test can run both when the rule is active and when it's not.
[GitHub] spark issue #22270: [SPARK-25267][SQL][TEST] Disable ConvertToLocalRelation ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22270 Merged build finished. Test PASSed.
[GitHub] spark issue #22270: [SPARK-25267][SQL][TEST] Disable ConvertToLocalRelation ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22270 **[Test build #95735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95735/testReport)** for PR 22270 at commit [`507f89c`](https://github.com/apache/spark/commit/507f89c4b7c2ee7ae4ad352cd66d4c2bed4adba5).
[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22329 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95734/ Test PASSed.
[GitHub] spark issue #22270: [SPARK-25267][SQL][TEST] Disable ConvertToLocalRelation ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22270 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2886/ Test PASSed.
[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22329 **[Test build #95734 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95734/testReport)** for PR 22329 at commit [`1f342aa`](https://github.com/apache/spark/commit/1f342aa7158bc2440f504b7cb47b692fcdcce41d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22329 Merged build finished. Test PASSed.
[GitHub] spark pull request #22270: [SPARK-25267][SQL][TEST] Disable ConvertToLocalRe...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22270#discussion_r215485064 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -85,12 +85,12 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { } val df5 = Seq((Seq("a", null), Seq(1, 2))).toDF("k", "v") -intercept[RuntimeException] { +intercept[Exception] { --- End diff -- @gatorsmile Thanks.. I am checking the message now.
[GitHub] spark pull request #22270: [SPARK-25267][SQL][TEST] Disable ConvertToLocalRe...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22270#discussion_r215484920 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -85,12 +85,12 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { } val df5 = Seq((Seq("a", null), Seq(1, 2))).toDF("k", "v") -intercept[RuntimeException] { +intercept[Exception] { df5.select(map_from_arrays($"k", $"v")).collect --- End diff -- @maropu We get a SparkException here which in turn wraps a RuntimeException. When we have ConvertToLocalRelation active, we get a RuntimeException from the driver. But when we disable it, the error is raised from the executor with a SparkException as the top-level exception.
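The two failure shapes being discussed — a direct RuntimeException from the driver versus the same error wrapped by the executor — can be sketched in Python; intercepting the common base class covers both, which is the rationale for `intercept[Exception]`. `SparkExceptionLike` and both functions are illustrative stand-ins, not Spark code.

```python
class SparkExceptionLike(Exception):
    """Illustrative stand-in for Spark's SparkException."""

def driver_side():
    # With ConvertToLocalRelation active, the failure surfaces directly.
    raise RuntimeError("null element in map key")

def executor_side():
    # With the rule disabled, the same failure comes back wrapped, with
    # the RuntimeException preserved as the cause.
    try:
        driver_side()
    except RuntimeError as e:
        raise SparkExceptionLike("task failed") from e

def intercepted(action):
    # Catching the common base class covers both shapes.
    try:
        action()
    except Exception as e:
        return type(e).__name__

print(intercepted(driver_side))    # RuntimeError
print(intercepted(executor_side))  # SparkExceptionLike
```

The trade-off, as the review thread notes, is that a broad intercept should be paired with a message check so the test still verifies it caught the intended error.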
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22140 @BryanCutler, for https://github.com/apache/spark/pull/22140#issuecomment-414802978, yeah, it makes less sense to me actually, but it seems at least to work for now:

```python
from pyspark.sql import Row

rowClass = Row("c1", "c2")
spark.createDataFrame([rowClass(1)]).show()
```

```
+---+
| c1|
+---+
|  1|
+---+
```

I think we should consider disallowing it in 3.0.0 given the test above.
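The validation the PR proposes — rejecting a custom Row whose value count does not match its field count — can be sketched without PySpark. `make_row_class` is a toy factory invented for this illustration; it is not PySpark's `Row` implementation.

```python
def make_row_class(*fields):
    """Toy Row factory (illustrative; not PySpark's implementation) that
    rejects calls whose value count does not match the field count."""
    def row(*values):
        if len(values) != len(fields):
            raise ValueError(
                "expected %d values for fields %s, got %d"
                % (len(fields), fields, len(values)))
        return dict(zip(fields, values))
    return row

row_class = make_row_class("c1", "c2")
print(row_class(1, 2))  # {'c1': 1, 'c2': 2}
try:
    row_class(1)        # an under-filled row, like the example above
except ValueError as e:
    print("rejected:", e)
```

Under this stricter scheme, both extra and missing values fail fast at construction time instead of producing a partially populated row downstream.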
[GitHub] spark issue #22271: [SPARK-25268][GraphX]run Parallel Personalized PageRank ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22271 **[Test build #4332 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4332/testReport)** for PR 22271 at commit [`7651652`](https://github.com/apache/spark/commit/76516529e31a5507afc0a1de31b30104a4d7c6ad). * This patch passes all tests. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22140#discussion_r215483530 --- Diff: python/pyspark/sql/tests.py --- @@ -269,6 +269,10 @@ def test_struct_field_type_name(self): struct_field = StructField("a", IntegerType()) self.assertRaises(TypeError, struct_field.typeName) +def test_invalid_create_row(slef): +rowClass = Row("c1", "c2") --- End diff -- nit: `rowClass` -> `row_class`
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22340 Seems okay to me too
[GitHub] spark pull request #22287: [SPARK-25135][SQL] FileFormatWriter should respec...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/22287
[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22320 @gengliangwang We need to backport this PR to branch-2.3.
[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22329 Merged build finished. Test PASSed.
[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22329 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2885/ Test PASSed.
[GitHub] spark issue #22307: [SPARK-25301][SQL] When a view uses an UDF from a non de...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22307 @vinod, see the discussion made in https://github.com/apache/spark/pull/18142. Shall we close this? cc @cloud-fan as well.
[GitHub] spark issue #18142: [SPARK-20918] [SQL] Use FunctionIdentifier as function i...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18142 I agree with this change too for clarification.
[GitHub] spark issue #22329: [SPARK-25328][PYTHON] Add an example for having two colu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22329 **[Test build #95734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95734/testReport)** for PR 22329 at commit [`1f342aa`](https://github.com/apache/spark/commit/1f342aa7158bc2440f504b7cb47b692fcdcce41d).
[GitHub] spark issue #18142: [SPARK-20918] [SQL] Use FunctionIdentifier as function i...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18142 @HyukjinKwon Thanks for the note! I think this behavior is better, I'm adding a `release_note` tag to the JIRA ticket, so that we don't forget to mention it in release notes.
[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22320
[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22320 thanks, merging to master!
[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22320#discussion_r215479502 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala --- @@ -82,7 +83,7 @@ case class CreateHiveTableAsSelectCommand( query, overwrite = true, ifPartitionNotExists = false, - outputColumns = outputColumns).run(sparkSession, child) + outputColumnNames = outputColumnNames).run(sparkSession, child) --- End diff -- I feel it's better to specify parameters by name if the previous parameter is already specified by name, e.g. `ifPartitionNotExists = false`
[GitHub] spark issue #21968: [SPARK-24999][SQL]Reduce unnecessary 'new' memory operat...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/21968 cc @cloud-fan @hvanhovell @maropu
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/21860 cc @cloud-fan @hvanhovell @maropu
[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22302 btw, have the behaviour differences between `TABLEPROPERTIES` and `OPTIONS` already been documented somewhere? https://github.com/apache/spark/pull/20120#issuecomment-354704862
[GitHub] spark issue #18142: [SPARK-20918] [SQL] Use FunctionIdentifier as function i...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18142 I am leaving a note (at least to myself) since it looks like this caused a behaviour change anyhow. With these statements on the Hive side:

```sql
CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
CREATE DATABASE d100;
CREATE FUNCTION d100.udf100 AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper';
```

**Hive**

```
hive> SELECT d100.udf100(`emp`.`name`) FROM `emp`;
USER
hive> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
USER
```

**Spark**

Before:

```
scala> spark.sql("SELECT d100.udf100(`emp`.`name`) FROM `emp`").show
+-----------------+
|d100.udf100(name)|
+-----------------+
|             USER|
+-----------------+

scala> spark.sql("SELECT `d100.udf100`(`emp`.`name`) FROM `emp`").show
+-----------------+
|d100.udf100(name)|
+-----------------+
|             USER|
+-----------------+
```

After:

```
scala> spark.sql("SELECT d100.udf100(`emp`.`name`) FROM `emp`").show
+-----------------+
|d100.udf100(name)|
+-----------------+
|             USER|
+-----------------+

scala> spark.sql("SELECT `d100.udf100`(`emp`.`name`) FROM `emp`").show
org.apache.spark.sql.AnalysisException: Undefined function: 'd100.udf100'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
```

**MySQL**

This change causes an inconsistency with Hive, although it looks consistent compared to MySQL.

```
mysql> SELECT `d100.udf100`(`emp`.`name`) FROM `emp`;
ERROR 1305 (42000): FUNCTION hkwon.d100.udf100 does not exist
mysql> SELECT d100.udf100(`emp`.`name`) FROM `emp`;
+---------------------------+
| d100.udf100(`emp`.`name`) |
+---------------------------+
| Hello, user!              |
+---------------------------+
1 row in set (0.01 sec)
```
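The resolution rule behind the behaviour change above can be sketched as a tiny parser: a backquoted reference is one literal identifier, while an unquoted dotted reference splits into database and function name. This is an illustrative sketch, not Catalyst's actual resolution logic.

```python
def parse_function_ref(ref):
    """Split a SQL function reference into (database, function).

    Backquoted identifiers are a single name, so `d100.udf100` is an
    unqualified function literally named "d100.udf100", while unquoted
    d100.udf100 means function udf100 in database d100. (Illustrative
    parser, not Catalyst's implementation.)
    """
    if ref.startswith("`") and ref.endswith("`"):
        return (None, ref[1:-1])
    db, sep, name = ref.partition(".")
    return (db, name) if sep else (None, ref)

print(parse_function_ref("d100.udf100"))    # ('d100', 'udf100')
print(parse_function_ref("`d100.udf100`")) # (None, 'd100.udf100')
```

Under this reading, the backquoted form fails after the change because no function named "d100.udf100" exists in the current database, matching the MySQL behaviour shown above rather than Hive's.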
[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22302 nit: it seems that you removed the `##` headers. You may want to simplify the PR description; this only re-enables the test cases of your previous PR.

```
## What changes were proposed in this pull request?
...

## How was this patch tested?
...
```
[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...
Github user fjh100456 commented on the issue: https://github.com/apache/spark/pull/22302 @dongjoon-hyun @maropu I'm so sorry; I have changed the description. Is it ok now?
[GitHub] spark issue #22327: [SPARK-25330][BUILD] Revert Hadoop 2.7 to 2.7.3
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22327 I think that is reasonable personally
[GitHub] spark pull request #21669: [SPARK-23257][K8S] Kerberos Support for Spark on ...
Github user ifilonenko commented on a diff in the pull request: https://github.com/apache/spark/pull/21669#discussion_r215471255 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala --- @@ -60,7 +62,23 @@ private[spark] case class KubernetesConf[T <: KubernetesRoleSpecificConf]( roleSecretEnvNamesToKeyRefs: Map[String, String], roleEnvs: Map[String, String], roleVolumes: Iterable[KubernetesVolumeSpec[_ <: KubernetesVolumeSpecificConf]], -sparkFiles: Seq[String]) { +sparkFiles: Seq[String], +hadoopConfDir: Option[String]) { + + def getHadoopConfigMapName: String = s"$appResourceNamePrefix-hadoop-config" + + def getKRBConfigMapName: String = s"$appResourceNamePrefix-krb5-file" + + def getHadoopStepsOrchestrator : Option[HadoopStepsOrchestrator] = hadoopConfDir.map { +hConf => new HadoopStepsOrchestrator( + sparkConf, + appResourceNamePrefix, + hConf, + getHadoopConfigMapName, + getTokenManager.isSecurityEnabled)} + + def getTokenManager : KubernetesHadoopDelegationTokenManager = --- End diff -- The `KubernetesHadoopDelegationTokenManager` is crucial for doing token retrieval, and that needs to happen in the feature steps. As such, it must be retrieved via the `KubernetesConf`. Sadly, I don't see another way given the design of the feature steps.
[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22282 Merged build finished. Test FAILed.
[GitHub] spark pull request #21669: [SPARK-23257][K8S] Kerberos Support for Spark on ...
Github user ifilonenko commented on a diff in the pull request: https://github.com/apache/spark/pull/21669#discussion_r215471079 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -225,6 +225,60 @@ private[spark] object Config extends Logging { "Ensure that major Python version is either Python2 or Python3") .createWithDefault("2") + val KUBERNETES_KERBEROS_PROXY_USER = --- End diff -- Ah, this was in reference to @skonto who was interested in being able to handle the case of using a proxy user with kerberos. I decided to move this feature addition into a separate PR but left this config. I'll remove this for the sake of proper separation between PR features.
[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22282 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95731/ Test FAILed.
[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22282 **[Test build #95731 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95731/testReport)** for PR 22282 at commit [`b0f64e9`](https://github.com/apache/spark/commit/b0f64e91cb4f6306a7c0c60d4a17f1a0aacb3a51). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21669: [SPARK-23257][K8S] Kerberos Support for Spark on ...
Github user ifilonenko commented on a diff in the pull request: https://github.com/apache/spark/pull/21669#discussion_r215470727 --- Diff: docs/security.md --- @@ -722,6 +722,62 @@ with encryption, at least. The Kerberos login will be periodically renewed using the provided credentials, and new delegation tokens for supported will be created. +## Secure Interaction with Kubernetes + +When talking to Hadoop-based services behind Kerberos, it was noted that Spark needs to obtain delegation tokens +so that non-local processes can authenticate. These delegation tokens in Kubernetes are stored in Secrets that are +shared by the Driver and its Executors. As such, there are three ways of submitting a kerberos job: + +1. Submitting with a $kinit that stores a TGT in the Local Ticket Cache: +```bash +/usr/bin/kinit -kt / +/opt/spark/bin/spark-submit \ +--deploy-mode cluster \ +--class org.apache.spark.examples.HdfsTest \ +--master k8s:// \ +--conf spark.executor.instances=1 \ +--conf spark.app.name=spark-hdfs \ +--conf spark.kubernetes.container.image=spark:latest \ +--conf spark.kubernetes.kerberos.krb5location=/etc/krb5.conf \ + local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar \ + +``` +3. Submitting with a local keytab and principle +```bash +/opt/spark/bin/spark-submit \ +--deploy-mode cluster \ +--class org.apache.spark.examples.HdfsTest \ +--master k8s:// \ +--conf spark.executor.instances=1 \ +--conf spark.app.name=spark-hdfs \ +--conf spark.kubernetes.container.image=spark:latest \ +--conf spark.kubernetes.kerberos.keytab= \ +--conf spark.kubernetes.kerberos.principal= \ +--conf spark.kubernetes.kerberos.krb5location=/etc/krb5.conf \ + local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar \ + +``` + +3. Submitting with pre-populated secrets already existing within the namespace --- End diff -- Agreed, I should be more clear. 
This is a secret containing the Delegation Token that will be mounted onto the driver and executors, and pointed to via the env var `HADOOP_TOKEN_FILE_LOCATION`
[GitHub] spark pull request #21669: [SPARK-23257][K8S] Kerberos Support for Spark on ...
Github user ifilonenko commented on a diff in the pull request: https://github.com/apache/spark/pull/21669#discussion_r215470379 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/hadoopsteps/HadoopStepsOrchestrator.scala --- @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.spark.deploy.k8s.features.hadoopsteps + +import java.io.File + +import org.apache.spark.SparkConf +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.deploy.k8s.features.OptionRequirements +import org.apache.spark.deploy.k8s.security.KubernetesHadoopDelegationTokenManager +import org.apache.spark.internal.Logging + +private[spark] class HadoopStepsOrchestrator( + conf: SparkConf, + kubernetesResourceNamePrefix: String, + hadoopConfDir: String, + hadoopConfigMapName: String, + isKerberosEnabled: Boolean) extends Logging { + + private val maybePrincipal = conf.get(KUBERNETES_KERBEROS_PRINCIPAL) + private val maybeKeytab = conf.get(KUBERNETES_KERBEROS_KEYTAB) +.map(k => new File(k)) + private val maybeExistingSecretName = conf.get(KUBERNETES_KERBEROS_DT_SECRET_NAME) + private val maybeExistingSecretItemKey = +conf.get(KUBERNETES_KERBEROS_DT_SECRET_ITEM_KEY) + private val maybeRenewerPrincipal = +conf.get(KUBERNETES_KERBEROS_RENEWER_PRINCIPAL) + + require(maybeKeytab.forall( _ => isKerberosEnabled ), +"You must enable Kerberos support if you are specifying a Kerberos Keytab") + + require(maybeExistingSecretName.forall( _ => isKerberosEnabled ), +"You must enable Kerberos support if you are specifying a Kerberos Secret") + + OptionRequirements.requireBothOrNeitherDefined( +maybeKeytab, +maybePrincipal, +"If a Kerberos keytab is specified you must also specify a Kerberos principal", +"If a Kerberos principal is specified you must also specify a Kerberos keytab") + + OptionRequirements.requireBothOrNeitherDefined( +maybeExistingSecretName, +maybeExistingSecretItemKey, +"If a secret storing a Kerberos Delegation Token is specified you must also" + + " specify the label where the data is stored", +"If a secret data item-key where the data of the Kerberos Delegation Token is specified" + + " you must also specify the name of the secret") + + def getHadoopSteps(kubeTokenManager: KubernetesHadoopDelegationTokenManager): --- End diff 
-- I thought that the interface was clearer and easier to understand than using specific submodules, which could bloat a single step and make it trickier, imo, to unit test.
[GitHub] spark issue #22327: [SPARK-25330][BUILD] Revert Hadoop 2.7 to 2.7.3
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22327 How about reverting it in branch-2.3 too, since we are going to release 2.3.2? We have time to fix it before releasing 2.4.0.
[GitHub] spark pull request #21669: [SPARK-23257][K8S] Kerberos Support for Spark on ...
Github user ifilonenko commented on a diff in the pull request: https://github.com/apache/spark/pull/21669#discussion_r215470115 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/hadoopsteps/HadoopKerberosSecretResolverStep.scala --- @@ -0,0 +1,41 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.features.hadoopsteps + +import org.apache.spark.deploy.k8s.security.KubernetesHadoopDelegationTokenManager +import org.apache.spark.internal.Logging + + /** + * This step assumes that you have already done all the heavy lifting in retrieving a --- End diff -- This specific "step" is used when the delegation token has already been stored in a pre-existing secret (that is not created on-the-fly by the Submission Client). This is a use-case we have seen desired by those running on Kubernetes Clusters where they do not wish to provide certain clients with keytabs and merely wish to point to pre-populated secrets that the user has access to (access is restricted via RBAC). I thought that secret creation logic and non-creation logic should be separated, but I can combine them into the same step. I just thought it would be more clear. 
(Also easier for unit testing.)
[GitHub] spark pull request #21669: [SPARK-23257][K8S] Kerberos Support for Spark on ...
Github user ifilonenko commented on a diff in the pull request: https://github.com/apache/spark/pull/21669#discussion_r215469488 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/hadoopsteps/HadoopKerberosKeytabResolverStep.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.features.hadoopsteps + +import java.io._ +import java.security.PrivilegedExceptionAction + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model.SecretBuilder +import org.apache.commons.codec.binary.Base64 + +import org.apache.spark.{SparkConf, SparkException} +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.security.KubernetesHadoopDelegationTokenManager +import org.apache.spark.deploy.security.HadoopDelegationTokenManager +import org.apache.spark.internal.Logging + + /** + * This step does all the heavy lifting for Delegation Token logic. This step + * assumes that the job user has either specified a principal and keytab or ran + * $kinit before running spark-submit. 
With a TGT stored locally, by running + * UGI.getCurrentUser you are able to obtain the current user, alternatively + * you can run UGI.loginUserFromKeytabAndReturnUGI and by running .doAs run + * as the logged into user instead of the current user. With the Job User principal + * you then retrieve the delegation token from the NameNode and store values in + * DelegationToken. Lastly, the class puts the data into a secret. All this is + * appended to the current HadoopSpec which in turn will append to the current + * DriverSpec. + */ +private[spark] class HadoopKerberosKeytabResolverStep( +submissionSparkConf: SparkConf, +kubernetesResourceNamePrefix : String, +maybePrincipal: Option[String], +maybeKeytab: Option[File], +maybeRenewerPrincipal: Option[String], +tokenManager: KubernetesHadoopDelegationTokenManager) + extends HadoopConfigurationStep with Logging { + +override def configureHadoopSpec(hSpec: HadoopConfigSpec): HadoopConfigSpec = { + val hadoopConf = SparkHadoopUtil.get.newConfiguration(submissionSparkConf) + if (!tokenManager.isSecurityEnabled) { +throw new SparkException("Hadoop not configured with Kerberos") + } + val maybeJobUserUGI = --- End diff -- Shouldn't the `yarn` config change exist in a different PR or should I include it in here? I agree that it would be super helpful. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22192 > It seems like the line Thread.currentThread().setContextClassLoader(replClassLoader) is causing the pyspark failures What if you restore the original class loader after initializing the plugins? I was a little worried about this call, but was waiting for tests... so if it's causing problems, better to not change the way things work there.
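The restore-afterwards idea vanzin raises can be sketched roughly like this. This is a minimal sketch with a hypothetical helper name, not the executor's actual code:

```scala
// Hypothetical helper: run `body` with `loader` as the context class loader,
// then put the original loader back, even if initialization throws.
def withContextClassLoader[T](loader: ClassLoader)(body: => T): T = {
  val thread = Thread.currentThread()
  val original = thread.getContextClassLoader
  thread.setContextClassLoader(loader)
  try body finally thread.setContextClassLoader(original)
}
```

Plugin initialization would then run as something like `withContextClassLoader(replClassLoader) { initPlugins() }` (both names hypothetical), leaving the thread's original loader untouched for whatever runs afterwards.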
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/22192 > I'll change the config documentation to specify YARN only, hopefully it's not a huge issue. I don't think you're really understanding what I'm saying above. I'm not saying this feature only works with YARN. I'm saying that distributing the plugins with `--jars` only works with YARN. And that if there was user-documentation for this feature - which there isn't - then it would have to explain how the jar is expected to be made available to the executors.
[GitHub] spark issue #22271: [SPARK-25268][GraphX]run Parallel Personalized PageRank ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22271 LGTM. I tested this locally and confirmed it fixes the serialization issue. Thank you, @shahidki31! Merging with master after fresh tests finish
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user NiharS commented on the issue: https://github.com/apache/spark/pull/22192 I'll change the config documentation to specify YARN only, hopefully it's not a huge issue. It seems like the line `Thread.currentThread().setContextClassLoader(replClassLoader)` is causing the pyspark failures; they pass when I remove it. I'm looking at the test cases but I really don't see how this is affecting them... it seems that in both test cases, the DStreams monitor a directory but don't pick up the changes they're supposed to, and just time out. I checked that I can bypass this issue by changing back to having the plugins loaded on a separate thread (and setting that thread's contextClassLoader instead of the current thread) and it passes tests and continues to work. That said, this issue does seem to be indicative of some problem in pyspark streaming
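The separate-thread workaround NiharS describes (and that mridulm suggested earlier in the thread) amounts to something like the following sketch; the helper and names are illustrative assumptions, not the actual PR code:

```scala
// Hypothetical sketch: initialize plugins on a short-lived dedicated thread
// whose context class loader is set to the plugin loader, then join. The
// calling thread's context class loader is never modified.
def initOnDedicatedThread(pluginLoader: ClassLoader)(init: => Unit): Unit = {
  val t = new Thread(new Runnable { override def run(): Unit = init }, "plugin-init")
  t.setContextClassLoader(pluginLoader)
  t.start()
  t.join() // block until initialization finishes, as in mridulm's suggestion
}
```

The `join()` keeps initialization synchronous from the executor's point of view, while isolating the class-loader change to the throwaway thread.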
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22262 Thank you, @seancxmao. I'll review again tonight.
[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22302 @fjh100456 please do not break the PR format, and could you clean it up, too?
[GitHub] spark issue #20999: [WIP][SPARK-14922][SPARK-23866][SQL] Support partition f...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20999 Still WIP? BTW, it seems credit can now go to multiple developers: https://github.com/apache/spark/pull/22324#issuecomment-418227335
[GitHub] spark issue #22302: [SPARK-21786][SQL][FOLLOWUP] Add compressionCodec test f...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22302 @fjh100456. I'm @dongjoon-hyun. I think you put the wrong person in the PR description. :) BTW, you don't need to mention the person. It would be enough to mention the SPARK JIRA IDs.
[GitHub] spark pull request #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFile...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22324#discussion_r215461695 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -473,6 +476,27 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo } } } + + test("SPARK-25237 compute correct input metrics in FileScanRDD") { +withTempPath { p => + val path = p.getAbsolutePath + spark.range(1000).repartition(1).write.csv(path) + val bytesReads = new mutable.ArrayBuffer[Long]() + val bytesReadListener = new SparkListener() { +override def onTaskEnd(taskEnd: SparkListenerTaskEnd) { + bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead +} + } + sparkContext.addSparkListener(bytesReadListener) + try { +spark.read.csv(path).limit(1).collect() +sparkContext.listenerBus.waitUntilEmpty(1000L) +assert(bytesReads.sum === 7860) --- End diff -- Yeah, actually the file size is `3890`, but the hadoop API (`FileSystem.getAllStatistics`) reports that number (`3930`). I didn't look into the Hadoop code yet, so I don't get why. I'll dig into it later.
[GitHub] spark pull request #21525: [SPARK-24513][ML] Attribute support in UnaryTrans...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21525#discussion_r215461652 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Transformer.scala --- @@ -116,10 +116,17 @@ abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]] StructType(outputFields) } + /** + * Returns [[Metadata]] to be attached to the output column. + */ + protected def outputMetadata(outputSchema: StructType, dataset: Dataset[_]): Metadata = +Metadata.empty + override def transform(dataset: Dataset[_]): DataFrame = { -transformSchema(dataset.schema, logging = true) +val outputSchema = transformSchema(dataset.schema, logging = true) val transformUDF = udf(this.createTransformFunc, outputDataType) -dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol +val metadata = outputMetadata(outputSchema, dataset) --- End diff -- `HashingTF` is an example that the metadata is created in `transformSchema` and attached to `outputSchema`. So my question is, do we need an extra API `outputMetadata` to do this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18853 **[Test build #95733 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95733/testReport)** for PR 18853 at commit [`d0a2089`](https://github.com/apache/spark/commit/d0a2089b8c52a3be3977f0bfc5538758ef7d2b55).
[GitHub] spark issue #22340: [SPARK-25337][SQL][TEST] `runSparkSubmit` should provide...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22340 Could you review this Scala-2.12 PR again, @cloud-fan and @gatorsmile?
[GitHub] spark issue #20887: [SPARK-23774][SQL] `Cast` to CHAR/VARCHAR should truncat...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20887 Based on the review opinion, I'll close this PR and JIRA issue for now.
[GitHub] spark pull request #20887: [SPARK-23774][SQL] `Cast` to CHAR/VARCHAR should ...
Github user dongjoon-hyun closed the pull request at: https://github.com/apache/spark/pull/20887
[GitHub] spark issue #22270: [SPARK-25267][SQL][TEST] Disable ConvertToLocalRelation ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22270 Thank you!
[GitHub] spark pull request #22323: [SPARK-25262][K8S] Allow SPARK_LOCAL_DIRS to be t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22323#discussion_r215456606 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala --- @@ -45,6 +47,10 @@ private[spark] class LocalDirsFeatureStep( new VolumeBuilder() .withName(s"spark-local-dir-${index + 1}") .withNewEmptyDir() +.withMedium(useLocalDirTmpFs match { + case true => "Memory" // Use tmpfs --- End diff -- I think we shouldn't use `case true... false`; we can instead do this: ``` .withNewEmptyDir().withMedium(if (useLocalDirTmpFs) "Memory" else null)... ```
[GitHub] spark pull request #22323: [SPARK-25262][K8S] Allow SPARK_LOCAL_DIRS to be t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22323#discussion_r215456338 --- Diff: docs/running-on-kubernetes.md --- @@ -215,6 +215,19 @@ spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.options.clai The configuration properties for mounting volumes into the executor pods use prefix `spark.kubernetes.executor.` instead of `spark.kubernetes.driver.`. For a complete list of available options for each supported type of volumes, please refer to the [Spark Properties](#spark-properties) section below. +## Local Storage + +Spark uses temporary scratch space to spill data to disk during shuffles and other operations. When using Kubernetes as the resource manager the pods will be created with an [emptyDir](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir) volume mounted for each directory listed in `SPARK_LOCAL_DIRS`. If no directories are explicitly specified then a default directory is created and configured appropriately. + +`emptyDir` volumes use the ephemeral storage feature of Kubernetes and do not persist beyond the life of the pod. + +### Using RAM for local storage + +As `emptyDir` volumes use the nodes backing storage for ephemeral storage this default behaviour may not be appropriate for some compute environments. For example if you have diskless nodes with remote storage mounted over a network having lots of executors doing IO to this remote storage may actually degrade performance. + +In this case it may be desirable to set `spark.kubernetes.local.dirs.tmpfs=true` in your configuration which will cause the `emptyDir` volumes to be configured as `tmpfs` i.e. RAM backed volumes. When configured like this Sparks local storage usage will count towards your pods memory usage therefore you may wish to increase your memory requests via the normal `spark.driver.memory` and `spark.executor.memory` configuration properties. 
--- End diff -- You can't allocate space for tmpfs via `spark.{driver, executor}.memory` because that will strictly be allocated to the heap. The Java command is basically this: `/bin/java -Xmx${spark.driver.memory}`, hence strictly requiring memory overhead to get this space to be dedicated to tmpfs.
[GitHub] spark pull request #22323: [SPARK-25262][K8S] Allow SPARK_LOCAL_DIRS to be t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22323#discussion_r215456006 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala --- @@ -22,6 +22,7 @@ import java.util.UUID import io.fabric8.kubernetes.api.model.{ContainerBuilder, HasMetadata, PodBuilder, VolumeBuilder, VolumeMountBuilder} import org.apache.spark.deploy.k8s.{KubernetesConf, KubernetesDriverSpecificConf, KubernetesRoleSpecificConf, SparkPod} +import org.apache.spark.deploy.k8s.Config._ --- End diff -- Actually I think wildcard for configuration fields is probably ok. But we haven't been consistent.
[GitHub] spark pull request #21669: [SPARK-23257][K8S] Kerberos Support for Spark on ...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21669#discussion_r215455515 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala --- @@ -60,7 +62,23 @@ private[spark] case class KubernetesConf[T <: KubernetesRoleSpecificConf]( roleSecretEnvNamesToKeyRefs: Map[String, String], roleEnvs: Map[String, String], roleVolumes: Iterable[KubernetesVolumeSpec[_ <: KubernetesVolumeSpecificConf]], -sparkFiles: Seq[String]) { +sparkFiles: Seq[String], +hadoopConfDir: Option[String]) { + + def getHadoopConfigMapName: String = s"$appResourceNamePrefix-hadoop-config" + + def getKRBConfigMapName: String = s"$appResourceNamePrefix-krb5-file" + + def getHadoopStepsOrchestrator : Option[HadoopStepsOrchestrator] = hadoopConfDir.map { +hConf => new HadoopStepsOrchestrator( + sparkConf, + appResourceNamePrefix, + hConf, + getHadoopConfigMapName, + getTokenManager.isSecurityEnabled)} + + def getTokenManager : KubernetesHadoopDelegationTokenManager = --- End diff -- Strange to have the `KubernetesConf` object return a unit that does work - most of these are properties. This thing should basically behave like a struct. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
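A hypothetical, heavily simplified illustration of the struct-like shape mccheah is asking for: derived names become plain vals on the conf, computed once, while anything that does real work (such as a token manager) would be constructed by the caller instead. The class and field names here are illustrative, not the PR's actual `KubernetesConf`:

```scala
// Sketch of a conf that behaves like a struct: properties only, no factories.
case class SimpleKubernetesConf(appResourceNamePrefix: String) {
  // Derived names are cheap vals rather than defs that rebuild strings
  // (or, worse, defs that construct collaborator objects) on every call.
  val hadoopConfigMapName: String = s"$appResourceNamePrefix-hadoop-config"
  val krbConfigMapName: String = s"$appResourceNamePrefix-krb5-file"
}
```

Under this shape, a `KubernetesHadoopDelegationTokenManager` would be built once by whatever owns the conf and passed into the feature steps, rather than being produced by a `getTokenManager` method on the conf itself.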
[GitHub] spark issue #21525: [SPARK-24513][ML] Attribute support in UnaryTransformer
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21525 Merged build finished. Test PASSed.
[GitHub] spark issue #21525: [SPARK-24513][ML] Attribute support in UnaryTransformer
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21525 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95732/ Test PASSed.
[GitHub] spark issue #21525: [SPARK-24513][ML] Attribute support in UnaryTransformer
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21525 **[Test build #95732 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95732/testReport)** for PR 21525 at commit [`cfeae4d`](https://github.com/apache/spark/commit/cfeae4d5282be98d619da97f350c11339057e92f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22335: [SPARK-25091][SQL] reduce the storage memory in E...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22335#discussion_r215452975 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -646,8 +646,17 @@ private[spark] class AppStatusListener( } override def onUnpersistRDD(event: SparkListenerUnpersistRDD): Unit = { -liveRDDs.remove(event.rddId) -kvstore.delete(classOf[RDDStorageInfoWrapper], event.rddId) +while (true) { --- End diff -- Thank you for letting me know about a single thread. I agree with your point. As you pointed out, when `rdd.isEmpty()` continues to return `false`, it will get into an infinite loop. Thus, I imagine that this is only meant to work in a multithreaded context.
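One hedged alternative to the unbounded `while (true)` under discussion is to bound the retries so the loop cannot spin forever if the condition never flips. This is an illustrative sketch only, not the actual `AppStatusListener` change:

```scala
// Retry a check-then-act step at most `maxAttempts` times; `step` returns
// true once its work is done. Returns whether the step ever succeeded,
// instead of risking an infinite loop when the condition never becomes true.
def boundedRetry(maxAttempts: Int)(step: () => Boolean): Boolean = {
  var attempts = 0
  var done = false
  while (!done && attempts < maxAttempts) {
    done = step()
    attempts += 1
  }
  done
}
```

A caller could then log or surface the failure case explicitly rather than hanging, which is easier to reason about when another thread is racing the removal.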
[GitHub] spark issue #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22146 I think we still need some consensus on the driver container API. Personally I think it's clearer to use a configuration option to specifically select the container that is the driver, but this is an opinion I hold loosely. Would like specific weigh-in from @erikerlandson and @liyinan926. Aside from that, this looks good on my end apart from the tests we need to add and should be ready to merge pretty soon (depending also on Spark 2.4 timeline)
[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22146#discussion_r215452489 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesDriverBuilder.scala --- @@ -81,9 +95,12 @@ private[spark] class KubernetesDriverBuilder( .getOrElse(provideJavaStep(kubernetesConf)) val allFeatures = (baseFeatures :+ bindingsStep) ++ - secretFeature ++ envSecretFeature ++ volumesFeature + secretFeature ++ envSecretFeature ++ volumesFeature ++ podTemplateFeature -var spec = KubernetesDriverSpec.initialSpec(kubernetesConf.sparkConf.getAll.toMap) +var spec = KubernetesDriverSpec( + provideInitialPod(), + Seq.empty, --- End diff -- Ping on this
[GitHub] spark issue #22270: [SPARK-25267][SQL][TEST] Disable ConvertToLocalRelation ...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22270 @gatorsmile Yeah, I will push the changes tonight for you to take a look.
[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22146#discussion_r215451113 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala --- @@ -59,5 +66,28 @@ private[spark] object KubernetesUtils { } } + def loadPodFromTemplate( + kubernetesClient: KubernetesClient, + templateFile: File): SparkPod = { +try { + val pod = kubernetesClient.pods().load(templateFile).get() + pod.getSpec.getContainers.asScala.toList match { --- End diff -- Ah, actually, because we're extracting both the first element and the rest, this implementation is probably cleaner.
[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22146#discussion_r215450800 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala --- @@ -74,8 +74,16 @@ private[spark] object Constants { val ENV_R_PRIMARY = "R_PRIMARY" val ENV_R_ARGS = "R_APP_ARGS" + // Pod spec templates + val EXECUTOR_POD_SPEC_TEMPLATE_FILE_NAME = "pod-spec-template.yml" + val EXECUTOR_POD_SPEC_TEMPLATE_MOUNTHPATH = "/opt/spark/pod-template" --- End diff -- Spelling: I think we want `EXECUTOR_POD_SPEC_TEMPLATE_MOUNTPATH`?
[GitHub] spark issue #18080: [Spark-20771][SQL] Make weekofyear more intuitive
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18080 Can one of the admins verify this patch?
[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22146#discussion_r215450600 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala --- @@ -59,5 +66,28 @@ private[spark] object KubernetesUtils { } } + def loadPodFromTemplate( + kubernetesClient: KubernetesClient, + templateFile: File): SparkPod = { +try { + val pod = kubernetesClient.pods().load(templateFile).get() + pod.getSpec.getContainers.asScala.toList match { --- End diff -- I think you can use `headOption` here to get an `Option` object and use `map`, `getOrElse`.
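The two styles being compared, pattern matching on the container list versus `headOption`, can be sketched side by side. This is an illustrative sketch only: `PodSketch`, `SparkPod`, `fromMatch`, and `fromHeadOption` are made-up names rather than the real Kubernetes model classes, and the actual PR handles the empty-container case differently.

```scala
object PodSketch {
  // Stand-in for the real SparkPod model.
  case class SparkPod(pod: String, container: Option[String])

  // Pattern-match style: binds both the head and the tail, which is why it
  // reads cleaner when "first and rest" are both needed.
  def fromMatch(pod: String, containers: List[String]): SparkPod =
    containers match {
      case first :: rest => SparkPod(pod, Some(first)) // `rest` stays in scope here
      case Nil           => SparkPod(pod, None)
    }

  // headOption style: shorter when only the first element matters.
  def fromHeadOption(pod: String, containers: List[String]): SparkPod =
    containers.headOption
      .map(first => SparkPod(pod, Some(first)))
      .getOrElse(SparkPod(pod, None))
}
```

Both produce the same result; the choice is purely about readability, which is the trade-off being discussed in the thread.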
[GitHub] spark issue #22333: [SPARK-25335][BUILD] Skip Zinc downloading if it's insta...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22333 Thank you, @srowen.
[GitHub] spark pull request #22179: [SPARK-25258][SPARK-23131][SPARK-25176][BUILD] Up...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22179