[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r169556960 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -107,17 +106,24 @@ case class DataSourceV2Relation( } /** - * A specialization of DataSourceV2Relation with the streaming bit set to true. Otherwise identical - * to the non-streaming relation. + * A specialization of [[DataSourceV2Relation]] with the streaming bit set to true. + * + * Note that, this plan has a mutable reader, so Spark won't apply operator push-down for this plan, + * to avoid making the plan mutable. We should consolidate this plan and [[DataSourceV2Relation]] + * after we figure out how to apply operator push-down for streaming data sources. --- End diff -- Currently the streaming execution creates the reader once. The reader is mutable and contains 2 kinds of states: 1. operator push-down states, e.g. the filters being pushed down. 2. streaming related states, like offsets, kafka connection, etc. For continues mode, it's fine. We create the reader, set offsets, construct the plan, get the physical plan, and process. We mutate the reader states at the beginning and never mutate it again. For micro-batch mode, we have a problem. We create the reader at the beginning, set reader offset, construct the plan and get the physical plan for every batch. This means we apply operator push-down to this reader many times, and data source v2 doesn't define what the behavior should be for this case. Thus we can't apply operator push-down for streaming data sources. @marmbrus @tdas @zsxwing @jose-torres I have 2 proposals to support operator push down for streaming relation: 1. Introduce a `reset` API to `DataSourceReader` to clear out the operator push-down states. Then we can call `reset` for every micro-batch and safely apply operator pushdown. 2. Do plan analyzing/optimizing/planning only once for micro-batch mode. Theoretically it's not good, as different micro-batch may have different statistics and the optimal physical plan is different, we should rerun the planner for each batch. The benefit is, plan analyzing/optimizing/planning may be costly, doing it once can mitigate the cost. Also adaptive execution can help so it's not that bad to reuse the same physical plan. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20575: [SPARK-23386][DEPLOY] enable direct application links in...
Github user gerashegalov commented on the issue: https://github.com/apache/spark/pull/20575 @vanzin what do you mean by "as part of parsing the logs"? This PR is about avoiding the long wait for eventLogs to be read from a remote filesystem, and being parsed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...
Github user IgorBerman commented on a diff in the pull request: https://github.com/apache/spark/pull/20640#discussion_r169556816 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -648,15 +645,6 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( totalGpusAcquired -= gpus gpusByTaskId -= taskId } -// If it was a failure, mark the slave as failed for blacklisting purposes -if (TaskState.isFailed(state)) { - slave.taskFailures += 1 - - if (slave.taskFailures >= MAX_SLAVE_FAILURES) { -logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " + --- End diff -- @kayousterhout BlacklistTracker has it's own logging that is concerned with blacklisted nodes, won't it be enough? on the other hand, if blacklisting is disabled, which is default, then we will lose this information. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20640 **[Test build #87580 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87580/testReport)** for PR 20640 at commit [`e2ddc1b`](https://github.com/apache/spark/commit/e2ddc1be19e2f978df4fe84073aff3f5b46afe45). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...
Github user IgorBerman commented on a diff in the pull request: https://github.com/apache/spark/pull/20640#discussion_r169554433 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( cpus + totalCoresAcquired <= maxCores && mem <= offerMem && numExecutors < executorLimit && - slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES && + !scheduler.nodeBlacklist().contains(slaveId) && --- End diff -- You are right, thank! here: https://github.com/apache/spark/blob/9e50a1d37a4cf0c34e20a7c1a910ceaff41535a2/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L193 I've managed to track it to https://github.com/apache/spark/blob/e18d6f5326e0d9ea03d31de5ce04cb84d3b8ab37/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L852 I'll change it to offerHostname --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20613: [SPARK-23368][SQL] Avoid unnecessary Exchange or Sort af...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20613 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20613: [SPARK-23368][SQL] Avoid unnecessary Exchange or Sort af...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20613 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87574/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20613: [SPARK-23368][SQL] Avoid unnecessary Exchange or Sort af...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20613 **[Test build #87574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87574/testReport)** for PR 20613 at commit [`27a1af2`](https://github.com/apache/spark/commit/27a1af22e94f49b7801c4f49443ebadb1ff35571). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20603: [SPARK-23418][SQL]: Fail DataSourceV2 reads when ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20603 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20603: [SPARK-23418][SQL]: Fail DataSourceV2 reads when user sc...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20603 thanks, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20647 **[Test build #87579 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87579/testReport)** for PR 20647 at commit [`9dd1986`](https://github.com/apache/spark/commit/9dd1986558013f93a0c77f6e62c277b51140ea39). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/982/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20647 **[Test build #87578 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87578/testReport)** for PR 20647 at commit [`e1f24a4`](https://github.com/apache/spark/commit/e1f24a4be751ef39ea4d45b71a315810dcb9fd74). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87578/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20647 **[Test build #87578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87578/testReport)** for PR 20647 at commit [`e1f24a4`](https://github.com/apache/spark/commit/e1f24a4be751ef39ea4d45b71a315810dcb9fd74). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/981/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87577/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20647 **[Test build #87577 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87577/testReport)** for PR 20647 at commit [`138679e`](https://github.com/apache/spark/commit/138679ee8b713c20ac89d44a2e8e5d82c69a). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/980/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20647 CC @tdas @gatorsmile @rdblue --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20647 **[Test build #87577 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87577/testReport)** for PR 20647 at commit [`138679e`](https://github.com/apache/spark/commit/138679ee8b713c20ac89d44a2e8e5d82c69a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20647 [SPARK-23303][SQL] improve the explain result for data source v2 relations ## What changes were proposed in this pull request? The current explain result for data source v2 relation is unreadable: ``` == Parsed Logical Plan == 'Filter ('i > 6) +- AnalysisBarrier +- Project [j#1] +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940 == Analyzed Logical Plan == j: int Project [j#1] +- Filter (i#0 > 6) +- Project [j#1, i#0] +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940 == Optimized Logical Plan == Project [j#1] +- Filter isnotnull(i#0) +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940 == Physical Plan == *(1) Project [j#1] +- *(1) Filter isnotnull(i#0) +- *(1) DataSourceV2Scan [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader@3b415940 ``` after this PR ``` == Parsed Logical Plan == 'Project [unresolvedalias('j, None)] +- AnalysisBarrier +- Relation AdvancedDataSourceV2[i#0, j#1] == Analyzed Logical Plan == j: int Project [j#1] +- Relation AdvancedDataSourceV2[i#0, j#1] == Optimized Logical Plan == Relation AdvancedDataSourceV2[j#1] == Physical Plan == *(1) Scan AdvancedDataSourceV2[j#1] ``` --- ``` == Analyzed Logical Plan == i: int, j: int Filter (i#88 > 3) +- Relation JavaAdvancedDataSourceV2[i#88, j#89] == Optimized Logical Plan == Filter isnotnull(i#88) +- Relation JavaAdvancedDataSourceV2[i#88, j#89] (PushedFilter: [GreaterThan(i,3)]) == Physical Plan == *(1) Filter isnotnull(i#88) +- *(1) Scan JavaAdvancedDataSourceV2[i#88, j#89] (PushedFilter: [GreaterThan(i,3)]) ``` an example for streaming query ``` == Parsed Logical Plan == Aggregate [value#6], [value#6, count(1) AS count(1)#11L] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6] +- MapElements , class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String +- DeserializeToObject cast(value#25 as string).toString, obj#4: java.lang.String +- Streaming Relation FakeDataSourceV2$[value#25] == Analyzed Logical Plan == value: string, count(1): bigint Aggregate [value#6], [value#6, count(1) AS count(1)#11L] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6] +- MapElements , class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String +- DeserializeToObject cast(value#25 as string).toString, obj#4: java.lang.String +- Streaming Relation FakeDataSourceV2$[value#25] == Optimized Logical Plan == Aggregate [value#6], [value#6, count(1) AS count(1)#11L] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6] +- MapElements , class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String +- DeserializeToObject value#25.toString, obj#4: java.lang.String +- Streaming Relation FakeDataSourceV2$[value#25] == Physical Plan == *(4) HashAggregate(keys=[value#6], functions=[count(1)], output=[value#6, count(1)#11L]) +- StateStoreSave [value#6], state info [ checkpoint = *(redacted)/cloud/dev/spark/target/tmp/temporary-549f264b-2531-4fcb-a52f-433c77347c12/state, runId = f84d9da9-2f8c-45c1-9ea1-70791be684de, opId = 0, ver = 0, numPartitions = 5], Complete, 0 +- *(3) HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#16L]) +- StateStoreRestore [value#6], state info [ checkpoint = *(redacted)/cloud/dev/spark/target/tmp/temporary-549f264b-2531-4fcb-a52f-433c77347c12/state, runId = f84d9da9-2f8c-45c1-9ea1-70791be684de, opId = 0, ver = 0, numPartitions = 5] +- *(2) HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#16L]) +- Exchange hashpartitioning(value#6, 5) +- *(1) HashAggregate(keys=[value#6], functions=[partial_count(1)], output=[value#6, count#16L]) +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20572 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20572 **[Test build #87576 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87576/testReport)** for PR 20572 at commit [`08f3570`](https://github.com/apache/spark/commit/08f3570c0491f96abcaa9a6dc0f4e3030cfea6c0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20572 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87576/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20572 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20572 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/979/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20572 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20572 **[Test build #87576 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87576/testReport)** for PR 20572 at commit [`08f3570`](https://github.com/apache/spark/commit/08f3570c0491f96abcaa9a6dc0f4e3030cfea6c0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r169538019 --- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala --- @@ -53,7 +51,7 @@ class CachedKafkaConsumer[K, V] private( // TODO if the buffer was kept around as a random-access structure, // could possibly optimize re-calculating of an RDD in the same batch - protected var buffer = ju.Collections.emptyList[ConsumerRecord[K, V]]().iterator + protected var buffer = ju.Collections.emptyListIterator[ConsumerRecord[K, V]]() --- End diff -- Agreed, think it should be ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20572 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87575/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20572 **[Test build #87575 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87575/testReport)** for PR 20572 at commit [`2246750`](https://github.com/apache/spark/commit/224675046be2fbc38fea59c59394736e19042eb4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r169537541 --- Diff: external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala --- @@ -83,13 +81,50 @@ class CachedKafkaConsumer[K, V] private( s"Failed to get records for $groupId $topic $partition $offset after polling for $timeout") record = buffer.next() assert(record.offset == offset, -s"Got wrong record for $groupId $topic $partition even after seeking to offset $offset") +s"Got wrong record for $groupId $topic $partition even after seeking to offset $offset " + + s"got offset ${record.offset} instead. If this is a compacted topic, consider enabling " + + "spark.streaming.kafka.allowNonConsecutiveOffsets" + ) } nextOffset = offset + 1 record } + /** + * Start a batch on a compacted topic + */ + def compactedStart(offset: Long, timeout: Long): Unit = { +logDebug(s"compacted start $groupId $topic $partition starting $offset") +// This seek may not be necessary, but it's hard to tell due to gaps in compacted topics +if (offset != nextOffset) { + logInfo(s"Initial fetch for compacted $groupId $topic $partition $offset") + seek(offset) + poll(timeout) +} + } + + /** + * Get the next record in the batch from a compacted topic. + * Assumes compactedStart has been called first, and ignores gaps. + */ + def compactedNext(timeout: Long): ConsumerRecord[K, V] = { +if (!buffer.hasNext()) { poll(timeout) } +assert(buffer.hasNext(), --- End diff -- That's a "shouldn't happen unless the topicpartition or broker is gone" kind of thing. Semantically I could see that being more like require than assert, but don't have a strong opinion. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecu...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/20572#discussion_r169536036 --- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala --- @@ -22,12 +22,17 @@ import java.{ util => ju } import scala.collection.JavaConverters._ import scala.util.Random +import kafka.common.TopicAndPartition --- End diff -- Right, LogCleaner hadn't yet been moved to the new apis, added a comment to that effect. Think we're ok here because it's just being used to mock up a compacted topic, not in the actual dstream api. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20572 **[Test build #87575 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87575/testReport)** for PR 20572 at commit [`2246750`](https://github.com/apache/spark/commit/224675046be2fbc38fea59c59394736e19042eb4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20572 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20572: [SPARK-17147][STREAMING][KAFKA] Allow non-consecutive of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20572 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/978/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14936: [SPARK-7877][MESOS] Allow configuration of framework tim...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14936 Build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14936: [SPARK-7877][MESOS] Allow configuration of framework tim...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14936 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87573/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14936: [SPARK-7877][MESOS] Allow configuration of framework tim...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14936 **[Test build #87573 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87573/testReport)** for PR 14936 at commit [`bd9ace4`](https://github.com/apache/spark/commit/bd9ace4ea815bb9e4395dc2cafd52e6889250b29). * This patch **fails PySpark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20613: [SPARK-23368][SQL] Avoid unnecessary Exchange or Sort af...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20613 **[Test build #87574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87574/testReport)** for PR 20613 at commit [`27a1af2`](https://github.com/apache/spark/commit/27a1af22e94f49b7801c4f49443ebadb1ff35571). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20613: [SPARK-23368][SQL] Avoid unnecessary Exchange or Sort af...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20613 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/977/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20613: [SPARK-23368][SQL] Avoid unnecessary Exchange or Sort af...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20613 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20613: [SPARK-23368][SQL] Avoid unnecessary Exchange or Sort af...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20613 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r169530627 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/TeradataDialect.scala --- @@ -31,4 +31,19 @@ private case object TeradataDialect extends JdbcDialect { case BooleanType => Option(JdbcType("CHAR(1)", java.sql.Types.CHAR)) case _ => None } + + override def isCascadingTruncateTable(): Option[Boolean] = Some(false) + + /** + * The SQL query used to truncate a table. + * @param table The table to truncate. + * @param cascade Whether or not to cascade the truncation. Default value is the + *value of isCascadingTruncateTable(). Ignored for Teradata as it is unsupported + * @return The SQL query to use for truncating a table + */ + override def getTruncateQuery( + table: String, + cascade: Option[Boolean] = isCascadingTruncateTable): String = { +s"TRUNCATE TABLE $table" --- End diff -- Thank you so much for confirming, @klinvill ! @danielvdende . Could you update `TeradataDialect` according to @klinvill 's advice? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20646 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20646 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87572/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20646 **[Test build #87572 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87572/testReport)** for PR 20646 at commit [`1df90e7`](https://github.com/apache/spark/commit/1df90e796e9388d7992bc55f9f87bfd71af2f7f9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20612: [SPARK-23424][SQL]Add codegenStageId in comment
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20612 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20419: [SPARK-23032][SQL][FOLLOW-UP]Add codegenStageId i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20419 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20612: [SPARK-23424][SQL]Add codegenStageId in comment
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20612 thanks, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user klinvill commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r169526767 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/TeradataDialect.scala --- @@ -31,4 +31,19 @@ private case object TeradataDialect extends JdbcDialect { case BooleanType => Option(JdbcType("CHAR(1)", java.sql.Types.CHAR)) case _ => None } + + override def isCascadingTruncateTable(): Option[Boolean] = Some(false) + + /** + * The SQL query used to truncate a table. + * @param table The table to truncate. + * @param cascade Whether or not to cascade the truncation. Default value is the + *value of isCascadingTruncateTable(). Ignored for Teradata as it is unsupported + * @return The SQL query to use for truncating a table + */ + override def getTruncateQuery( + table: String, + cascade: Option[Boolean] = isCascadingTruncateTable): String = { +s"TRUNCATE TABLE $table" --- End diff -- Hi @dongjoon-hyun, I was the original author of the TeradataDialect and @gatorsmile reviewed and committed it. You are correct, Teradata does not support the TRUNCATE statement. Instead Teradata uses a DELETE statement so you should be able to use `DELETE FROM $table ALL` instead of `TRUNCATE TABLE $table` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20646 Actually, I am having second thoughts about this. This is fundamentally changing how the tests work, especially for stress tests. The stress tests actually test these corner cases (by randomly adding successive AddData) about what if data was being added while the previously added data is being picked up. With this change, we will accidentally not test those race-condition-prone cases. Second, we are taking multiple locks here in multiple sources, and the StreamExecution is likely to take the same locks. I am really afraid that we are introducing deadlocks by doing this. I am still thinking what the right approach here. I think it should be - Explicit synchronized adding of data to multiple sources. - Not holding locks in multiple sources. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20635: [SPARK-23053][CORE][BRANCH-2.1] taskBinarySerialization ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20635 thanks @ivoson , merged! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20646: [SPARK-23408][SS] Synchronize successive AddDataM...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20646#discussion_r169522041 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala --- @@ -429,7 +429,25 @@ trait StreamTest extends QueryTest with SharedSQLContext with TimeLimits with Be val defaultCheckpointLocation = Utils.createTempDir(namePrefix = "streaming.metadata").getCanonicalPath try { - startedTest.foreach { action => + val actionIterator = startedTest.iterator.buffered + while (actionIterator.hasNext) { +// Synchronize sequential addDataMemory actions. --- End diff -- // Synchronize --> // Collectively synchronize actions so that the data gets added together in a single batch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20646: [SPARK-23408][SS] Synchronize successive AddDataM...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20646#discussion_r169521344 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala --- @@ -429,7 +429,25 @@ trait StreamTest extends QueryTest with SharedSQLContext with TimeLimits with Be val defaultCheckpointLocation = Utils.createTempDir(namePrefix = "streaming.metadata").getCanonicalPath try { - startedTest.foreach { action => + val actionIterator = startedTest.iterator.buffered + while (actionIterator.hasNext) { +// Synchronize sequential addDataMemory actions. +val addDataMemoryActions = ArrayBuffer[AddDataMemory[_]]() +while (actionIterator.hasNext && actionIterator.head.isInstanceOf[AddDataMemory[_]]) { + addDataMemoryActions.append(actionIterator.next().asInstanceOf[AddDataMemory[_]]) +} +if (addDataMemoryActions.nonEmpty) { + val synchronizeAll = addDataMemoryActions --- End diff -- This is some magic-ish code. Can you add a bit more comments on how this compose thing works? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20631 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20645: SPARK-23472: Add defaultJavaOptions for drivers and exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20645 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87569/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20645: SPARK-23472: Add defaultJavaOptions for drivers and exec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20645 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20645: SPARK-23472: Add defaultJavaOptions for drivers and exec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20645 **[Test build #87569 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87569/testReport)** for PR 20645 at commit [`89ee1f3`](https://github.com/apache/spark/commit/89ee1f330047c6c8a986e62e8201d9b7890cf4c8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20643: [SPARK-23468][core] Stringify auth secret before ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20643 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14936: [SPARK-7877][MESOS] Allow configuration of framework tim...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14936 **[Test build #87573 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87573/testReport)** for PR 14936 at commit [`bd9ace4`](https://github.com/apache/spark/commit/bd9ace4ea815bb9e4395dc2cafd52e6889250b29). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20644: [SPARK-23470][ui] Use first attempt of last stage...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20644 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20644: [SPARK-23470][ui] Use first attempt of last stage to def...
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20644 merging this to master/2.3. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20644: [SPARK-23470][ui] Use first attempt of last stage to def...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20644 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87568/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20646 (https://issues.apache.org/jira/browse/SPARK-23369 was already filed for previous flake) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20644: [SPARK-23470][ui] Use first attempt of last stage to def...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20644 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20644: [SPARK-23470][ui] Use first attempt of last stage to def...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20644 **[Test build #87568 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87568/testReport)** for PR 20644 at commit [`0dc95ad`](https://github.com/apache/spark/commit/0dc95ad83e9d736dedbb5d09d18513508847c0e4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20646 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20646 **[Test build #87572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87572/testReport)** for PR 20646 at commit [`1df90e7`](https://github.com/apache/spark/commit/1df90e796e9388d7992bc55f9f87bfd71af2f7f9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20646 > java.lang.RuntimeException: [unresolved dependency: com.sun.jersey#jersey-core;1.14: configuration not found in com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 compile] Surely unrelated to this change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20622: [SPARK-23441][SS] Remove queryExecutionThread.interrupt(...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20622 @jose-torres I had a long offline chat with @zsxwing, kudos to him for catching a corner case in the current solution. The following sequence of events may occur. - In the query thread, the epoch tracking thread is started - Before the query thread actually starts the Spark job, the epoch tracking thread may detect some sort of reconfiguration and attempt to cancelJob even before the query thread has started spark jobs. - Query thread starts spark job, gets blocked, never terminates. Fundamentally, its not a great setup that one thread is starting the jobs and another thread is canceling them. Because of the async nature, we have no way reasoning which attempt wins, starting or cancelling. Rather let's make sure that we start and cancel in the same thread (then we can do some reasoning). Here is an alternate solution. - The epoch thread ONLY interrupts the query thread. It's not responsible for any Spark state management (other than the enum state). - The query thread cancels jobs and stops sources in the `finally` clause. There is less likely to be race conditions that end up not canceling Spark job as a single thread (the query thread) is responsible for all Spark state management. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20646 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20646 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87571/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20646 **[Test build #87571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87571/testReport)** for PR 20646 at commit [`1df90e7`](https://github.com/apache/spark/commit/1df90e796e9388d7992bc55f9f87bfd71af2f7f9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20644: [SPARK-23470][ui] Use first attempt of last stage to def...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20644 Verified that this patch fixes the issue. LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20604: [SPARK-23365][CORE] Do not adjust num executors when kil...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20604 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87567/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20604: [SPARK-23365][CORE] Do not adjust num executors when kil...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20604 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20604: [SPARK-23365][CORE] Do not adjust num executors when kil...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20604 **[Test build #87567 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87567/testReport)** for PR 20604 at commit [`e69851b`](https://github.com/apache/spark/commit/e69851b9b5a1a6bc1ecc1c14a656d4c08572b9d7). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20643: [SPARK-23468][core] Stringify auth secret before storing...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20643 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87566/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20643: [SPARK-23468][core] Stringify auth secret before storing...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20643 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20643: [SPARK-23468][core] Stringify auth secret before storing...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20643 **[Test build #87566 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87566/testReport)** for PR 20643 at commit [`77d009d`](https://github.com/apache/spark/commit/77d009d33a4ff6ba85453f10836b984d5e5acdf9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20632: [SPARK-3159] added subtree pruning in the translation fr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20632 **[Test build #4103 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4103/testReport)** for PR 20632 at commit [`eeb3cc9`](https://github.com/apache/spark/commit/eeb3cc93ffa39198ccaba75fc7a63032325baf1e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20644: [SPARK-23470][ui] Use first attempt of last stage to def...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20644 Thanks for fixing this so fast. I'm testing it now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20635: [SPARK-23053][CORE][BRANCH-2.1] taskBinarySerialization ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20635 **[Test build #4102 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4102/testReport)** for PR 20635 at commit [`bd88903`](https://github.com/apache/spark/commit/bd88903aca841e6ad55144127e4c11e9844ef6ce). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20616: [SPARK-23434][SQL] Spark should not warn `metadat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20616 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/20640#discussion_r169500415 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( cpus + totalCoresAcquired <= maxCores && mem <= offerMem && numExecutors < executorLimit && - slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES && + !scheduler.nodeBlacklist().contains(slaveId) && --- End diff -- In other places it looks like the hostname is used in the blacklist - why does this check against the slaveId instead of the offerHostname? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20616: [SPARK-23434][SQL] Spark should not warn `metadata direc...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20616 Thank you, @zsxwing and @cloud-fan . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20616: [SPARK-23434][SQL] Spark should not warn `metadata direc...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20616 LGTM. Merging to master. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20646 **[Test build #87571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87571/testReport)** for PR 20646 at commit [`1df90e7`](https://github.com/apache/spark/commit/1df90e796e9388d7992bc55f9f87bfd71af2f7f9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20646 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20616: [SPARK-23434][SQL] Spark should not warn `metadata direc...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20616 Here, it is. It's `AccessControlException`, @zsxwing . ``` 18/02/20 23:46:53 WARN streaming.FileStreamSink: Error while looking for metadata directory. org.apache.hadoop.security.AccessControlException: Permission denied: user=spark, access=EXECUTE, inode="/tmp/people.json/_spark_metadata":ambari-qa:hdfs:-rw-r--r-- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1955) at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:109) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4111) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1137) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:866) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2110) at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305) at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426) at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354) at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.createBaseDataset(JsonDataSource.scala:114) at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.infer(JsonDataSource.scala:95) at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63) at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) at scala.Option.orElse(Option.scala:289) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397) at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:340) at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:24) at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(:29) at $li
[GitHub] spark pull request #20640: [SPARK-19755][Mesos] Blacklist is always active f...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/20640#discussion_r169497847 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -648,15 +645,6 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( totalGpusAcquired -= gpus gpusByTaskId -= taskId } -// If it was a failure, mark the slave as failed for blacklisting purposes -if (TaskState.isFailed(state)) { - slave.taskFailures += 1 - - if (slave.taskFailures >= MAX_SLAVE_FAILURES) { -logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " + --- End diff -- Is it a concern to lose this error message? (I don't know anything about Mesos but it does seem potentially useful?) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20646: [SPARK-23408][SS] Synchronize successive AddDataM...
GitHub user jose-torres opened a pull request: https://github.com/apache/spark/pull/20646 [SPARK-23408][SS] Synchronize successive AddDataMemory actions in StreamTest. ## What changes were proposed in this pull request? The stream-stream join tests add data to multiple sources, and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached. Fortunately, MemoryStream synchronizes batch generation on itself, and StreamExecution won't generate empty batches. So we can resolve this race condition by synchronizing successive AddDataMemory actions against every MemoryStream together. Then we can be sure StreamExecution won't start generating a batch before all the data is present. ## How was this patch tested? existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/jose-torres/spark flaky Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20646.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20646 commit d540be6bb051a33d2f6bd69a49fbe11afe9f0a65 Author: Jose Torres Date: 2018-02-20T23:34:16Z just use synchronization commit d91c55f1a17b03aa2d46682e76c6eb207e71a521 Author: Jose Torres Date: 2018-02-20T23:38:35Z Merge branch 'master' of https://github.com/apache/spark into flaky commit dce075f53c8a1418dac99c9b7b7f9b7e79ed17ff Author: Jose Torres Date: 2018-02-20T23:45:40Z fix merge --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20646: [SPARK-23408][SS] Synchronize successive AddDataMemory a...
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20646 @tdas --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org