[GitHub] spark issue #20690: [SPARK-23532][Standalone]Improve data locality when laun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20690 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20690: [SPARK-23532][Standalone]Improve data locality when laun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20690 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1135/ Test PASSed.
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/20675 > it just means that for very long-running streams task restarts will eventually run out. Ah, I see what you mean. Yeah, if we support task-level retry we should also make the task retry count unlimited. > But if you're worried that the current implementation of task restart will become incorrect as more complex scenarios are supported, I'd definitely lean towards deferring it until continuous processing is more feature-complete. Yep, the "complex scenarios" I mentioned mainly include the shuffle and aggregation scenarios discussed in the comments at https://issues.apache.org/jira/browse/SPARK-20928?focusedCommentId=16245556=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16245556; in those scenarios task-level retry may need to consider epoch alignment, but I think the current implementation of task restart is complete for map-only continuous processing. I agree with you about deferring it, so I'll just leave a comment in SPARK-23033 and close this, or do you think it should be reviewed by others? > Do you want to spin that off into a separate PR? (I can handle it otherwise.) Of course, #20689 added a new interface `ContinuousDataReaderFactory` as per our comments.
[GitHub] spark issue #20690: [SPARK-23532][Standalone]Improve data locality when laun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20690 **[Test build #87760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87760/testReport)** for PR 20690 at commit [`f7efb22`](https://github.com/apache/spark/commit/f7efb22ddea3dc8eeccc833086d5a82cbce7e530).
[GitHub] spark pull request #20690: [SPARK-23532][Standalone]Improve data locality wh...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/20690 [SPARK-23532][Standalone]Improve data locality when launching new executors for dynamic allocation ## What changes were proposed in this pull request? Currently Spark on YARN supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled; refer to _https://issues.apache.org/jira/browse/SPARK-4352_. Mesos also supports data locality; refer to _https://issues.apache.org/jira/browse/SPARK-16944_. It would be better if Standalone also supported this feature. ## How was this patch tested? Added a unit test, and manual testing on HDFS. You can merge this pull request into a Git repository by running: $ git pull https://github.com/10110346/spark executorlocality Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20690.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20690 commit f7efb22ddea3dc8eeccc833086d5a82cbce7e530 Author: liuxian Date: 2018-02-28T07:33:44Z fix
[GitHub] spark issue #20689: [SPARK-23533][SS] Add support for changing ContinuousDat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20689 Merged build finished. Test PASSed.
[GitHub] spark issue #20689: [SPARK-23533][SS] Add support for changing ContinuousDat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20689 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1134/ Test PASSed.
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20667 Merged build finished. Test PASSed.
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20667 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87748/ Test PASSed.
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20667 **[Test build #87748 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87748/testReport)** for PR 20667 at commit [`bf79f4d`](https://github.com/apache/spark/commit/bf79f4d5c83c364c7f1fc05f158753d282409330). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20649: [SPARK-23462][SQL] improve missing field error message i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20649 **[Test build #87759 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87759/testReport)** for PR 20649 at commit [`8cdb1d5`](https://github.com/apache/spark/commit/8cdb1d52117325fcbdd1cefc9e9f0616afdb2baa).
[GitHub] spark issue #20689: [SPARK-23533][SS] Add support for changing ContinuousDat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20689 **[Test build #87758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87758/testReport)** for PR 20689 at commit [`59cef98`](https://github.com/apache/spark/commit/59cef98868586a4f039b04e74c32c94eaff965c0).
[GitHub] spark pull request #20675: [SPARK-23033][SS][Follow Up] Task level retry for...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/20675#discussion_r171161352 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousDataReader.java --- @@ -33,4 +33,16 @@ * as a restart checkpoint. */ PartitionOffset getOffset(); + +/** + * Set the start offset for the current record, only used in task retry. If setOffset keep + * default implementation, it means current ContinuousDataReader can't support task level retry. + * + * @param offset last offset before task retry. + */ +default void setOffset(PartitionOffset offset) { --- End diff -- Cool, that's clearer.
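The pattern in the diff above (a default interface method whose un-overridden implementation signals that the reader cannot support task-level retry) can be sketched outside Spark. The following is an illustrative Python analogue of that design, not Spark's actual API; the class and method names are stand-ins:

```python
class ContinuousDataReader:
    """Illustrative stand-in for the interface under review (not Spark's real class)."""

    def get_offset(self):
        raise NotImplementedError

    def set_offset(self, offset):
        # Default implementation: a reader that keeps this default
        # cannot support task-level retry.
        raise NotImplementedError("this reader does not support task-level retry")


class RetryableReader(ContinuousDataReader):
    """A reader that opts in to task-level retry by overriding set_offset."""

    def __init__(self):
        self.start_offset = None

    def set_offset(self, offset):
        # Remember the last committed offset so a retried task can resume from it.
        self.start_offset = offset
```

The caller can then treat "still raises on set_offset" as "retry unsupported" without needing a separate capability flag.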
[GitHub] spark pull request #20689: [SPARK-23533][SS] Add support for changing Contin...
GitHub user xuanyuanking opened a pull request: https://github.com/apache/spark/pull/20689 [SPARK-23533][SS] Add support for changing ContinuousDataReader's startOffset ## What changes were proposed in this pull request? As discussed in #20675, we need to add a new interface `ContinuousDataReaderFactory` to support setting the start offset in Continuous Processing. ## How was this patch tested? Existing UTs. You can merge this pull request into a Git repository by running: $ git pull https://github.com/xuanyuanking/spark SPARK-23533 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20689.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20689 commit 59cef98868586a4f039b04e74c32c94eaff965c0 Author: Yuanjian Li Date: 2018-02-28T07:29:57Z [SPARK-23533][SS] Add support for changing ContinousDataReader's startOffset
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r171160692 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt --- End diff -- I have seen the note on the _sample_ function: _sample_ does not guarantee to return exactly the requested fraction of the RDD's count. It seems that requiring _numSamples - partNumSamples_ to be non-negative is a more efficient choice than triggering a _count_. The degree of approximation here depends on the approximation made by _sample_, and the splits will certainly be somewhat inaccurate.
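The arithmetic being debated above (count only the non-zero values, then infer the zero count from the expected sample size, clamping at zero because a fraction-based sample is only approximate) can be illustrated with a small sketch. This is a plain-Python illustration of the idea with a made-up helper name, not the actual RandomForest code:

```python
from collections import Counter

def approx_value_counts(non_zero_samples, fraction, num_examples):
    """Count distinct non-zero feature values, then infer how many zeros
    the sample 'should' contain from the expected sample size.

    Because sample(fraction) only approximates the requested fraction, the
    inferred zero count is clamped at zero instead of triggering an exact
    count() over the data.
    """
    counts = Counter(non_zero_samples)          # count for each distinct non-zero value
    part_num_samples = len(non_zero_samples)    # non-zero values actually seen
    num_samples = int(fraction * num_examples)  # expected total sample size
    zero_count = max(0, num_samples - part_num_samples)
    if zero_count > 0:
        counts[0.0] = zero_count
    return counts, num_samples
```

As the reviewer notes, the cost of this shortcut is that the zero count (and therefore the split thresholds) inherits whatever error `sample` introduced.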
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20667 Merged build finished. Test PASSed.
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20667 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87746/ Test PASSed.
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20667 **[Test build #87746 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87746/testReport)** for PR 20667 at commit [`3379899`](https://github.com/apache/spark/commit/337989945b0757dfc6a069315c4e7828afe77d00). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20681 **[Test build #87757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87757/testReport)** for PR 20681 at commit [`5c922ca`](https://github.com/apache/spark/commit/5c922cacc498018bb22bfe7dde7a137776e6fe3f).
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20681 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1133/ Test PASSed.
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20681 Merged build finished. Test PASSed.
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20681 Merged build finished. Test FAILed.
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20681 **[Test build #87747 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87747/testReport)** for PR 20681 at commit [`999f86f`](https://github.com/apache/spark/commit/999f86f89ae05147136de8ace51efeb972bf1538). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20681 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87747/ Test FAILed.
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87750/ Test FAILed.
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Merged build finished. Test FAILed.
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20647 **[Test build #87750 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87750/testReport)** for PR 20647 at commit [`c5af52e`](https://github.com/apache/spark/commit/c5af52ea185e6f94f64096a4937f462db47a4fc5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20678#discussion_r171155800 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1518,7 +1525,9 @@ class SQLConf extends Serializable with Logging { def rangeExchangeSampleSizePerPartition: Int = getConf(RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION) - def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLE) + def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLED) --- End diff -- Actually it seems we don't use `arrowEnable` either.
[GitHub] spark pull request #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallb...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20678#discussion_r171155732 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1518,7 +1525,9 @@ class SQLConf extends Serializable with Logging { def rangeExchangeSampleSizePerPartition: Int = getConf(RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION) - def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLE) + def arrowEnable: Boolean = getConf(ARROW_EXECUTION_ENABLED) + + def arrowFallbackEnable: Boolean = getConf(ARROW_FALLBACK_ENABLED) --- End diff -- nit: Have we used this `arrowFallbackEnable` definition?
[GitHub] spark issue #20449: [SPARK-23040][CORE]: Returns interruptible iterator for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20449 **[Test build #87756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87756/testReport)** for PR 20449 at commit [`8c15c56`](https://github.com/apache/spark/commit/8c15c564c7d2d0adc0cfd725e34dbd359c6a0ab6).
[GitHub] spark pull request #20449: [SPARK-23040][CORE]: Returns interruptible iterat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20449#discussion_r171153715 --- Diff: core/src/test/scala/org/apache/spark/JobCancellationSuite.scala --- @@ -320,6 +321,63 @@ class JobCancellationSuite extends SparkFunSuite with Matchers with BeforeAndAft f2.get() } + test("Interruptible iterator of shuffle reader") { +// In this test case, we create a Spark job of two stages. The second stage is cancelled during +// execution and a counter is used to make sure that the corresponding tasks are indeed +// cancelled. +import JobCancellationSuite._ +val numSlice = 1 --- End diff -- I'm not sure, let's just try it :)
[GitHub] spark issue #20449: [SPARK-23040][CORE]: Returns interruptible iterator for ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20449 retest this please
[GitHub] spark issue #20683: [SPARK-8605] Exclude files in StreamingContext. textFile...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/20683 > a extra boolean expression was added to test if a regex was present. Can you please explain what "if a regex was present" means? The fix seems unnecessary. If you want to filter out some temp files, you can write your own `filter` instead of using Spark Streaming's default one.
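The suggestion above (supply your own filter rather than changing Spark Streaming's default) boils down to a simple filename predicate. Below is a minimal sketch of such a predicate in plain Python; the exclusion pattern is hypothetical, and a real Spark job would pass an equivalent function to the file-stream source rather than call it directly:

```python
import re

# Hypothetical exclusion pattern: skip dot-/underscore-prefixed files
# (e.g. _SUCCESS markers) and in-flight temp files.
EXCLUDE = re.compile(r"(^[._])|(\.tmp$)|(_COPYING_$)")

def keep_file(filename):
    """Return True if the stream should pick up this file."""
    return not EXCLUDE.search(filename)
```

The point of the reviewer's suggestion is that this policy lives in user code, so each job can exclude whatever its upstream writer produces without touching Spark's default filter.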
[GitHub] spark pull request #20449: [SPARK-23040][CORE]: Returns interruptible iterat...
Github user advancedxy commented on a diff in the pull request: https://github.com/apache/spark/pull/20449#discussion_r171152501 --- Diff: core/src/test/scala/org/apache/spark/JobCancellationSuite.scala --- @@ -320,6 +321,63 @@ class JobCancellationSuite extends SparkFunSuite with Matchers with BeforeAndAft f2.get() } + test("Interruptible iterator of shuffle reader") { +// In this test case, we create a Spark job of two stages. The second stage is cancelled during +// execution and a counter is used to make sure that the corresponding tasks are indeed +// cancelled. +import JobCancellationSuite._ +val numSlice = 1 --- End diff -- Will update it later. But it looks like Jenkins has been having trouble these days; is it back to normal?
[GitHub] spark issue #20685: [SPARK-23524] Big local shuffle blocks should not be che...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20685 **[Test build #87755 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87755/testReport)** for PR 20685 at commit [`110c851`](https://github.com/apache/spark/commit/110c8510dcc6c2abaf4ca416b95854daf129b0a5).
[GitHub] spark issue #20685: [SPARK-23524] Big local shuffle blocks should not be che...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20685 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1132/ Test PASSed.
[GitHub] spark issue #20685: [SPARK-23524] Big local shuffle blocks should not be che...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20685 Merged build finished. Test PASSed.
[GitHub] spark issue #20685: [SPARK-23524] Big local shuffle blocks should not be che...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/20685 Jenkins, retest this please
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user sujith71955 commented on the issue: https://github.com/apache/spark/pull/20611 @gatorsmile Is there any issue with this PR? Can you please take another look? Thanks.
[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20043 **[Test build #87754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87754/testReport)** for PR 20043 at commit [`f59bb19`](https://github.com/apache/spark/commit/f59bb19a3fd04b24ea3077a12283777be0af437d).
[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20043 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1131/ Test PASSed.
[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20043 Merged build finished. Test PASSed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87743/ Test FAILed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Merged build finished. Test FAILed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906 **[Test build #87743 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87743/testReport)** for PR 18906 at commit [`e6e6dbf`](https://github.com/apache/spark/commit/e6e6dbf5cd8d8c8e15977fe89f741483eb6138a6). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20679: [SPARK-23514] Use SessionState.newHadoopConf() to propag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20679 **[Test build #87753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87753/testReport)** for PR 20679 at commit [`b37f24f`](https://github.com/apache/spark/commit/b37f24f372bb45ff9b8380222e0eb7e6d8819e58).
[GitHub] spark issue #20679: [SPARK-23514] Use SessionState.newHadoopConf() to propag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20679 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1130/ Test PASSed.
[GitHub] spark issue #20679: [SPARK-23514] Use SessionState.newHadoopConf() to propag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20679 Merged build finished. Test PASSed.
[GitHub] spark issue #20449: [SPARK-23040][CORE]: Returns interruptible iterator for ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20449 LGTM
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20208 Finally, Spark 2.3 passes the vote. Could you review this, @gatorsmile ?
[GitHub] spark issue #20679: [SPARK-23514] Use SessionState.newHadoopConf() to propag...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20679 retest this please
[GitHub] spark issue #20684: [SPARK-23523] [SQL] Fix the incorrect result caused by t...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20684 Hi, @gatorsmile and @cloud-fan . Since the 2.3 vote passed, can we have this in `branch-2.3` for Apache Spark 2.3.1? The conflict on `LocalRelation.scala` is simply due to indentation changes.
[GitHub] spark pull request #20449: [SPARK-23040][CORE]: Returns interruptible iterat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20449#discussion_r171148057 --- Diff: core/src/test/scala/org/apache/spark/JobCancellationSuite.scala --- @@ -320,6 +321,63 @@ class JobCancellationSuite extends SparkFunSuite with Matchers with BeforeAndAft f2.get() } + test("Interruptible iterator of shuffle reader") { +// In this test case, we create a Spark job of two stages. The second stage is cancelled during +// execution and a counter is used to make sure that the corresponding tasks are indeed +// cancelled. +import JobCancellationSuite._ +val numSlice = 1 --- End diff -- Can we hardcode it? Using a variable makes people think they can change its value and the test will still pass; however, that's not true, as `assert(executionOfInterruptibleCounter.get() <= 10)` needs to be updated too.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #87752 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87752/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6).
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1129/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20208 Rebased onto master.
[GitHub] spark issue #20688: [SPARK-23096][SS] Migrate rate source to V2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20688 **[Test build #87751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87751/testReport)** for PR 20688 at commit [`8bfadc3`](https://github.com/apache/spark/commit/8bfadc387393c2a42d09ef11707b1f0d3d27a53a).
[GitHub] spark issue #20688: [SPARK-23096][SS] Migrate rate source to V2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20688 Merged build finished. Test PASSed.
[GitHub] spark issue #20688: [SPARK-23096][SS] Migrate rate source to V2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20688 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1128/ Test PASSed.
[GitHub] spark issue #20685: [SPARK-23524] Big local shuffle blocks should not be che...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20685 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87741/ Test FAILed.
[GitHub] spark issue #20685: [SPARK-23524] Big local shuffle blocks should not be che...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20685 Merged build finished. Test FAILed.
[GitHub] spark issue #20685: [SPARK-23524] Big local shuffle blocks should not be che...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20685 **[Test build #87741 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87741/testReport)** for PR 20685 at commit [`110c851`](https://github.com/apache/spark/commit/110c8510dcc6c2abaf4ca416b95854daf129b0a5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user Ngone51 commented on the issue: https://github.com/apache/spark/pull/20667 Hi @jiangxb1987, thanks for your kind explanation.
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r171143986

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2StringFormat.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.commons.lang3.StringUtils
+
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.DataSourceRegister
+import org.apache.spark.sql.sources.v2.DataSourceV2
+import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.util.Utils
+
+/**
+ * A trait that can be used by data source v2 related query plans (both logical and physical) to
+ * provide a string format of the data source information for explain.
+ */
+trait DataSourceV2StringFormat {
+
+  /**
+   * The instance of this data source implementation. Note that we only consider its class in
+   * equals/hashCode, not the instance itself.
+   */
+  def source: DataSourceV2
+
+  /**
+   * The output of the data source reader, w.r.t. column pruning.
+   */
+  def output: Seq[Attribute]
+
+  /**
+   * The options for this data source reader.
+   */
+  def options: Map[String, String]
+
+  /**
+   * The created data source reader. Here we use it to get the filters that have been pushed down
+   * so far; the reader itself doesn't take part in the equals/hashCode.
+   */
+  def reader: DataSourceReader
+
+  private lazy val filters = reader match {
+    case s: SupportsPushDownCatalystFilters => s.pushedCatalystFilters().toSet
+    case s: SupportsPushDownFilters => s.pushedFilters().toSet
+    case _ => Set.empty
+  }
+
+  private def sourceName: String = source match {
+    case registered: DataSourceRegister => registered.shortName()
+    case _ => source.getClass.getSimpleName.stripSuffix("$")
+  }
+
+  def metadataString: String = {
+    val entries = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
+
+    if (filters.nonEmpty) {
+      entries += "Pushed Filters" -> filters.mkString("[", ", ", "]")
+    }
+
+    // TODO: we should only display some standard options like path, table, etc.
+    entries ++= options
+
+    val outputStr = Utils.truncatedString(output, "[", ", ", "]")
+
+    val entriesStr = if (entries.nonEmpty) {
+      Utils.truncatedString(entries.map {
+        case (key, value) => StringUtils.abbreviate(redact(key + ":" + value), 100)
--- End diff --

Now users can redact passwords by matching `password:.+`.
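The `metadataString` logic in the diff above builds a display string from the pushed filters and the reader options, redacting and abbreviating each entry. A small Python sketch of that shape follows; the 100-character limit matches the diff, while the exact replacement string and abbreviation style are assumptions for illustration:

```python
import re

def abbreviate(s: str, max_len: int) -> str:
    # Commons-lang-style abbreviation: truncate long strings with "..." suffix.
    return s if len(s) <= max_len else s[: max_len - 3] + "..."

def metadata_string(pushed_filters, options,
                    redaction_pattern=r"password:.+", max_len=100):
    entries = []
    if pushed_filters:
        entries.append(("Pushed Filters", "[" + ", ".join(pushed_filters) + "]"))
    # Every reader option is displayed as-is (the diff notes a TODO to limit
    # this to standard options like path/table).
    entries.extend(options.items())

    def redact(s):
        # Entries matching the configured pattern are masked, which is how a
        # user-supplied `password:.+` pattern hides credentials in explain output.
        return re.sub(redaction_pattern, "*********(redacted)", s)

    return ", ".join(abbreviate(redact(k + ":" + v), max_len)
                     for k, v in entries)

s = metadata_string(["a > 1"], {"path": "/tmp/t", "password": "secret"})
```

With the inputs above, the password entry is masked while the filter and path entries pass through untouched.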
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20667 When the same `BlockManagerId` is created multiple times, this cache ensures we always use the first instance that was created, which makes it possible for the remaining `BlockManagerId` instances to be recycled shortly. The downside is that we have to keep all the distinct `BlockManagerId`s ever created. Since the code was added a long time ago, and it's actually hard to measure the performance with/without the cache, we'd like to keep it for now.
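The caching behavior described above is essentially interning: the first instance created for a given identity wins, and later duplicates become immediately collectable garbage. A minimal sketch of that idea (hypothetical names, not Spark's Scala code):

```python
class BlockManagerIdLike:
    """Stand-in for BlockManagerId: identity is (executor_id, host, port)."""

    def __init__(self, executor_id, host, port):
        self.key = (executor_id, host, port)

_cache = {}

def get_cached(bm_id):
    # setdefault stores bm_id only if the key is absent; otherwise it
    # returns the instance that was stored first, so the freshly created
    # duplicate can be recycled right away.
    return _cache.setdefault(bm_id.key, bm_id)

a = get_cached(BlockManagerIdLike("1", "host-a", 7337))
b = get_cached(BlockManagerIdLike("1", "host-a", 7337))
assert a is b  # the second instance is discarded in favor of the first
```

The downside noted in the comment shows up here too: `_cache` grows with every distinct key and never shrinks.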
[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r171143887

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -107,19 +104,36 @@ case class DataSourceV2Relation(
 }

 /**
- * A specialization of DataSourceV2Relation with the streaming bit set to true. Otherwise identical
- * to the non-streaming relation.
+ * A specialization of [[DataSourceV2Relation]] with the streaming bit set to true.
+ *
+ * Note that this plan has a mutable reader, so Spark won't apply operator push-down for this plan,
+ * to avoid making the plan mutable. We should consolidate this plan and [[DataSourceV2Relation]]
+ * after we figure out how to apply operator push-down for streaming data sources.
  */
 case class StreamingDataSourceV2Relation(
     output: Seq[AttributeReference],
+    source: DataSourceV2,
+    options: Map[String, String],
     reader: DataSourceReader)
-  extends LeafNode with DataSourceReaderHolder with MultiInstanceRelation {
+  extends LeafNode with MultiInstanceRelation with DataSourceV2StringFormat {
+
   override def isStreaming: Boolean = true

-  override def canEqual(other: Any): Boolean = other.isInstanceOf[StreamingDataSourceV2Relation]
+  override def simpleString: String = "Streaming RelationV2 " + metadataString

   override def newInstance(): LogicalPlan = copy(output = output.map(_.newInstance()))

+  // TODO: unify the equals/hashCode implementation for all data source v2 query plans.
+  override def equals(other: Any): Boolean = other match {
+    case other: StreamingDataSourceV2Relation =>
+      output == other.output && reader.getClass == other.reader.getClass && options == other.options
--- End diff --

Now it's exactly the same as before. We should clean it up after we figure out how to push down operators to a streaming relation.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13599 **[Test build #87749 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87749/testReport)** for PR 13599 at commit [`86484d6`](https://github.com/apache/spark/commit/86484d67c3f85e2372cd1de69cafb3a4b7bbb691). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)` * ` class DriverEndpoint(override val rpcEnv: RpcEnv)`
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87749/ Test FAILed.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Merged build finished. Test FAILed.
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Merged build finished. Test PASSed.
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20647 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1127/ Test PASSed.
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20647 Hi @rdblue, I've opened https://issues.apache.org/jira/browse/SPARK-23531 to include the type info. I'd like to do it later, as it's a general problem in Spark SQL and many plans need to be updated, e.g. leaf nodes other than data source scans.
[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20647 **[Test build #87750 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87750/testReport)** for PR 20647 at commit [`c5af52e`](https://github.com/apache/spark/commit/c5af52ea185e6f94f64096a4937f462db47a4fc5).
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user Ngone51 commented on the issue: https://github.com/apache/spark/pull/20667 Hi @caneGuy, sorry for my previous comment: I mixed up ```BlockId``` with ```BlockManagerId``` and left some wrong comments. And thanks for your reply. Back to the topic, I have the same question as @cloud-fan:

> Why do we need this cache?

though we have a better caching option (Guava cache) now. My confusions:
- It is weird that we need to create a ```BlockManagerId``` before we can get the same one from the cache.
- On the executor side, when a ```BlockManagerId``` is registered to the master and an updated ```BlockManagerId``` is returned, the new ```BlockManagerId``` is not put into ```blockManagerIdCache```. So it seems the executor side's ```BlockManagerId``` has little relevance to ```blockManagerIdCache```.
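The "better caching option (Guava cache)" mentioned above refers to a size-bounded cache with eviction, as opposed to an unbounded map. A minimal LRU-evicting sketch in Python, assuming Guava-style `maximumSize` semantics (the class and method names here are illustrative):

```python
from collections import OrderedDict

class BoundedInternCache:
    """Intern cache with a maximum size: the least recently used entry is
    evicted once the cap is exceeded, so memory stays bounded."""

    def __init__(self, max_size):
        self._max_size = max_size
        self._data = OrderedDict()

    def intern(self, key, value):
        if key in self._data:
            # Cache hit: refresh recency and return the stored instance.
            self._data.move_to_end(key)
            return self._data[key]
        self._data[key] = value
        if len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict least recently used
        return value

cache = BoundedInternCache(2)
cache.intern("a", 1)
cache.intern("b", 2)
cache.intern("c", 3)  # exceeds the cap, so "a" is evicted
```

Unlike the unbounded `ConcurrentHashMap` approach, old `BlockManagerId`-style entries would eventually be dropped rather than retained forever.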
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13599 **[Test build #87749 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87749/testReport)** for PR 13599 at commit [`86484d6`](https://github.com/apache/spark/commit/86484d67c3f85e2372cd1de69cafb3a4b7bbb691).
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1126/ Test PASSed.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Merged build finished. Test PASSed.
[GitHub] spark issue #20688: [SPARK-23096][SS] Migrate rate source to V2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20688 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87740/ Test FAILed.
[GitHub] spark issue #20688: [SPARK-23096][SS] Migrate rate source to V2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20688 Merged build finished. Test FAILed.
[GitHub] spark issue #20688: [SPARK-23096][SS] Migrate rate source to V2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20688 **[Test build #87740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87740/testReport)** for PR 20688 at commit [`538223e`](https://github.com/apache/spark/commit/538223e52e1d12d82339a22390a9812beaccf8a6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallback in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20678 Will try to clean up soon.
[GitHub] spark pull request #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallb...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20678#discussion_r171139748

--- Diff: docs/sql-programming-guide.md ---
@@ -1689,6 +1689,10 @@ using the call `toPandas()` and when creating a Spark DataFrame from a Pandas Da
 `createDataFrame(pandas_df)`. To use Arrow when executing these calls, users need to first set
 the Spark configuration 'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.

+In addition, optimizations enabled by 'spark.sql.execution.arrow.enabled' will fallback automatically
+to non-optimized implementations if an error occurs. This can be controlled by
--- End diff --

Let me try to rephrase this doc a bit. The point I was trying to make with this fallback (for now) was to only fall back before the actual distributed computation starts within Spark.
[GitHub] spark pull request #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallb...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20678#discussion_r171138898

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1986,55 +1986,89 @@ def toPandas(self):
         timezone = None

         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fallback = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps, to_arrow_schema
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
+
                 require_minimum_pyarrow_version()
-                import pyarrow
                 to_arrow_schema(self.schema)
-                tables = self._collectAsArrow()
-                if tables:
-                    table = pyarrow.concat_tables(tables)
-                    pdf = table.to_pandas()
-                    pdf = _check_dataframe_convert_date(pdf, self.schema)
-                    return _check_dataframe_localize_timestamps(pdf, timezone)
-                else:
-                    return pd.DataFrame.from_records([], columns=self.columns)
             except Exception as e:
-                msg = (
-                    "Note: toPandas attempted Arrow optimization because "
-                    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
-                    "to disable this.")
-                raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
-        else:
-            pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
-            dtype = {}
+                if self.sql_ctx.getConf("spark.sql.execution.arrow.fallback.enabled", "true") \
+                        .lower() == "true":
+                    msg = (
+                        "toPandas attempted Arrow optimization because "
+                        "'spark.sql.execution.arrow.enabled' is set to true; however, "
+                        "failed by the reason below:\n  %s\n"
+                        "Attempts non-optimization as "
+                        "'spark.sql.execution.arrow.fallback.enabled' is set to "
+                        "true." % _exception_message(e))
+                    warnings.warn(msg)
+                    should_fallback = True
+                else:
+                    msg = (
+                        "toPandas attempted Arrow optimization because "
+                        "'spark.sql.execution.arrow.enabled' is set to true; however, "
+                        "failed by the reason below:\n  %s\n"
+                        "For fallback to non-optimization automatically, please set true to "
+                        "'spark.sql.execution.arrow.fallback.enabled'." % _exception_message(e))
+                    raise RuntimeError(msg)
+
+            if not should_fallback:
--- End diff --

Correct, but there's one more case: we fall back if PyArrow is not installed. Will add some comments to make this easier to read.
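The control flow of the diff above, try the optimized Arrow path and then either warn-and-fall-back or raise depending on a config flag, can be reduced to a small sketch. The function and message names here are hypothetical, not PySpark's API:

```python
import warnings

def to_pandas_like(optimized, fallback, fallback_enabled=True):
    """Try the optimized path first; on failure, either warn and run the
    fallback, or re-raise with a hint, depending on the flag."""
    try:
        return optimized()
    except Exception as e:
        if fallback_enabled:
            # Mirror of the fallback branch: emit a warning and continue on
            # the non-optimized implementation.
            warnings.warn(
                "optimized path failed (%s); attempting the non-optimized "
                "implementation instead" % e)
            return fallback()
        # Mirror of the strict branch: surface the failure with a hint
        # about the flag that would have enabled the fallback.
        raise RuntimeError(
            "optimized path failed (%s); enable the fallback flag to fall "
            "back automatically" % e)

def broken_optimized():
    raise ValueError("pyarrow not available")

result = to_pandas_like(broken_optimized, lambda: "plain result")
# result is "plain result", after a warning was emitted
```

The point made in the review thread applies here too: the fallback decision happens before any distributed work starts, so nothing expensive runs twice.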
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20667 **[Test build #87748 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87748/testReport)** for PR 20667 at commit [`bf79f4d`](https://github.com/apache/spark/commit/bf79f4d5c83c364c7f1fc05f158753d282409330).
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20681 **[Test build #87747 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87747/testReport)** for PR 20681 at commit [`999f86f`](https://github.com/apache/spark/commit/999f86f89ae05147136de8ace51efeb972bf1538).
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20681 Merged build finished. Test PASSed.
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20681 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1125/ Test PASSed.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13599 **[Test build #87745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87745/testReport)** for PR 13599 at commit [`3da68c7`](https://github.com/apache/spark/commit/3da68c75552798d841e7adefae1c2ae7cefff0b7). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)` * ` class DriverEndpoint(override val rpcEnv: RpcEnv)`
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Merged build finished. Test FAILed.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87745/ Test FAILed.
[GitHub] spark pull request #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case bl...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20667#discussion_r171136135

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala ---
@@ -132,10 +133,17 @@ private[spark] object BlockManagerId {
     getCachedBlockManagerId(obj)
   }

-  val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()
+  /**
+   * Here we set max cache size as 1. Since the size of a BlockManagerId object
--- End diff --

nit:
```
The max cache size is hardcoded to 1, since the size of a BlockManagerId object
is about 48B, the total memory cost should be below 1MB which is feasible.
```
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20667 **[Test build #87746 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87746/testReport)** for PR 20667 at commit [`3379899`](https://github.com/apache/spark/commit/337989945b0757dfc6a069315c4e7828afe77d00).
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13599 **[Test build #87745 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87745/testReport)** for PR 13599 at commit [`3da68c7`](https://github.com/apache/spark/commit/3da68c75552798d841e7adefae1c2ae7cefff0b7).
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1124/ Test PASSed.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13599 Merged build finished. Test PASSed.
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20667 **[Test build #87744 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87744/testReport)** for PR 20667 at commit [`3379899`](https://github.com/apache/spark/commit/337989945b0757dfc6a069315c4e7828afe77d00).
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20667 add to whitelist
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20667 ok to test
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20667 LGTM