[GitHub] [spark] wangyum commented on a change in pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
wangyum commented on a change in pull request #28642: URL: https://github.com/apache/spark/pull/28642#discussion_r435725267

## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/InferFiltersFromConstraintsSuite.scala

## @@ -316,4 +316,19 @@ class InferFiltersFromConstraintsSuite extends PlanTest {
```
       condition)
     }
   }
+
+  test("Infer IsNotNull for non null-intolerant child of null intolerant join condition") {
+    testConstraintsAfterJoin(
+      testRelation.subquery('left),
+      testRelation.subquery('right),
+      testRelation.where(IsNotNull(Coalesce(Seq('a, 'b)))).subquery('left),
```

Review comment:
```
hive> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.a, t1.b)=t2.a;
OK
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_0:t1
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_0:t1
          TableScan
            alias: t1
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: COALESCE(a,b) is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: a (type: string), b (type: string), c (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 COALESCE(_col0,_col1) (type: string)
                    1 _col0 (type: string)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t2
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: a is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: a (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 COALESCE(_col0,_col1) (type: string)
                    1 _col0 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
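The Hive plan above pushes down exactly the filter Spark would infer. A rough sketch of why the inference is safe, in plain Python rather than Spark code (`coalesce` here is a hypothetical stand-in for the SQL function): an equality join condition is null-intolerant, so `coalesce(t1.a, t1.b) = t2.a` lets the optimizer add `COALESCE(a, b) IS NOT NULL` as a filter without changing the join result.

```python
def coalesce(*vals):
    """Return the first non-None value, or None if all are None."""
    return next((v for v in vals if v is not None), None)

rows_t1 = [
    {"a": None, "b": None},  # coalesce is NULL: can never satisfy the join
    {"a": None, "b": 1},
    {"a": 2, "b": None},
]

# Rows rejected by the inferred `COALESCE(a, b) IS NOT NULL` predicate cannot
# match any row of t2 under an equality condition, so skipping them is safe.
survivors = [r for r in rows_t1 if coalesce(r["a"], r["b"]) is not None]
print(survivors)  # [{'a': None, 'b': 1}, {'a': 2, 'b': None}]
```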
[GitHub] [spark] pquentin commented on pull request #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6 and macOS High Sierra
pquentin commented on pull request #22480: URL: https://github.com/apache/spark/pull/22480#issuecomment-639295379 @HyukjinKwon I believe `os.fork()` is the root cause, and the pickle problem is the symptom. The article above explains that when you call `os.fork()`, only the current thread keeps running, so if a lock is held by another thread, it will never be released. And since pickle isn't thread safe, you need locks. I see a few `threading` imports in the PySpark codebase: in context.py, accumulators.py and broadcast.py. For example, `BroadcastPickleRegistry` would deadlock if the thread holding the lock died.
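A minimal, Unix-only sketch of the failure mode described above (hypothetical names, not PySpark code): a lock held by another thread at `fork()` time is copied into the child as held, even though the holding thread no longer exists there, so any acquire in the child would block forever.

```python
import os
import threading
import time

lock = threading.Lock()

def hold_lock_forever():
    lock.acquire()  # this thread takes the lock and never releases it
    time.sleep(60)

threading.Thread(target=hold_lock_forever, daemon=True).start()
time.sleep(0.2)  # give the background thread time to grab the lock

pid = os.fork()
if pid == 0:
    # Child: only the forking thread survives, but the lock state was copied.
    # Calling lock.acquire() here would deadlock; we just report the state.
    os._exit(0 if lock.locked() else 1)

_, status = os.waitpid(pid, 0)
print("lock copied as held into child:", os.WEXITSTATUS(status) == 0)
```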
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28729: [SPARK-31910][SQL] Enable Java 8 time API in Thrift server
AmplabJenkins removed a comment on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639294132
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
AmplabJenkins removed a comment on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-639294078
[GitHub] [spark] AmplabJenkins commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
AmplabJenkins commented on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-639294078
[GitHub] [spark] AmplabJenkins commented on pull request #28729: [SPARK-31910][SQL] Enable Java 8 time API in Thrift server
AmplabJenkins commented on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639294132
[GitHub] [spark] SparkQA commented on pull request #28642: [SPARK-31809][SQL] Infer IsNotNull for non null intolerant child of null intolerant in join condition
SparkQA commented on pull request #28642: URL: https://github.com/apache/spark/pull/28642#issuecomment-639293643 **[Test build #123552 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123552/testReport)** for PR 28642 at commit [`a5f52a8`](https://github.com/apache/spark/commit/a5f52a8c7449bc65c35deb06ddb2b9f4bd059104).
[GitHub] [spark] pquentin edited a comment on pull request #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6 and macOS High Sierra
pquentin edited a comment on pull request #22480: URL: https://github.com/apache/spark/pull/22480#issuecomment-638786759 @HyukjinKwon This continues to be a problem for us, as we tend to forget to use the workaround in new projects. As noted in https://github.com/mozilla/mozanalysis/issues/40#issuecomment-495807665, I think the root cause is that [PySpark calls os.fork()](https://github.com/apache/spark/blob/3b14088/python/pyspark/daemon.py#L147) in a way that should simply be avoided on macOS: https://www.evanjones.ca/fork-is-dangerous.html
[GitHub] [spark] SparkQA commented on pull request #28729: [SPARK-31910][SQL] Enable Java 8 time API in Thrift server
SparkQA commented on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639293646 **[Test build #123551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123551/testReport)** for PR 28729 at commit [`3c35cf5`](https://github.com/apache/spark/commit/3c35cf5920c6e4216adcefc866bd518dfe635def).
[GitHub] [spark] HyukjinKwon commented on pull request #28729: [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
HyukjinKwon commented on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639291503 retest this please
[GitHub] [spark] HyukjinKwon commented on pull request #28721: [SPARK-28624][SQL][TESTS] Run date.sql via Thrift Server
HyukjinKwon commented on pull request #28721: URL: https://github.com/apache/spark/pull/28721#issuecomment-639290601 I had an offline discussion with @MaxGekk and @cloud-fan. Per https://github.com/apache/spark/pull/28723#issuecomment-639285714, I will revert this too. Again, technically it was fine to merge because the skipped tests passed. I am reverting this for management purposes - the JIRA isn't resolved yet because we still need to discuss whether the result is correct.
[GitHub] [spark] HyukjinKwon commented on pull request #28723: [SPARK-28624][SQL][TESTS][3.0] Run date.sql via Thrift Server
HyukjinKwon commented on pull request #28723: URL: https://github.com/apache/spark/pull/28723#issuecomment-639285714 I discussed with @MaxGekk @cloud-fan. The tests will have to be disabled again at SPARK-30808. Basically, the result `0045-03-15` is itself controversial. Strictly speaking, we _can just merge_ and enable it for Spark 3.0 specifically because SPARK-30808 won't land in `branch-3.0`, but let me just not merge this for simplicity.
[GitHub] [spark] sathyaprakashg commented on pull request #28703: SPARK-29897 Add implicit cast for SubtractTimestamps
sathyaprakashg commented on pull request #28703: URL: https://github.com/apache/spark/pull/28703#issuecomment-639285650 @cloud-fan Need your help in reviewing this PR :)
[GitHub] [spark] HyukjinKwon closed pull request #28723: [SPARK-28624][SQL][TESTS][3.0] Run date.sql via Thrift Server
HyukjinKwon closed pull request #28723: URL: https://github.com/apache/spark/pull/28723
[GitHub] [spark] karuppayya closed pull request #28686: [SPARK-31877][SQL]Avoid stats computation for Hive table
karuppayya closed pull request #28686: URL: https://github.com/apache/spark/pull/28686
[GitHub] [spark] cloud-fan commented on a change in pull request #28723: [SPARK-28624][SQL][TESTS][3.0] Run date.sql via Thrift Server
cloud-fan commented on a change in pull request #28723: URL: https://github.com/apache/spark/pull/28723#discussion_r435695747

## File path: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala

## @@ -68,8 +68,6 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite with SharedThriftServ
```
     // Missing UDF
     "postgreSQL/boolean.sql",
     "postgreSQL/case.sql",
-    // SPARK-28624
-    "date.sql",
```

Review comment: The Thrift server doesn't support negative years; I think we still need to ignore this test.
[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table
karuppayya commented on pull request #28662: URL: https://github.com/apache/spark/pull/28662#issuecomment-639264681 The above condition is already present, but we return a **copy** of the relation (code: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L137) with the updated table stats at the end of the method.
- When the ResolvedAggregateFunction rule runs again (to reach a fixed point), it will not be aware of the updated relation. `executeWithSameContext` will rerun the stats collection as part of the DetermineTableStats rule.
- When the DetermineTableStats rule actually runs as part of the analysis phase, it will not be aware of the updated relation.
@viirya
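A toy illustration of the repeated work described above (plain Python with invented names, not Spark's actual rule machinery): when a rule returns a fresh copy carrying the computed stats, the node that the next pass sees is still the original, so the expensive computation runs once per pass instead of once overall.

```python
class Relation:
    def __init__(self, name, stats=None):
        self.name = name
        self.stats = stats

compute_count = 0  # counts how many times the "expensive" stats computation ran

def determine_stats(rel):
    """Rule sketch: fill in stats, but return a COPY instead of updating `rel`."""
    global compute_count
    if rel.stats is not None:
        return rel
    compute_count += 1                               # expensive stats computation
    return Relation(rel.name, stats={"rows": 100})   # copy never reaches the caller's node

# Two rule-executor passes each see the original, stat-less relation,
# so the computation is repeated.
original = Relation("t1")
determine_stats(original)
determine_stats(original)
print(compute_count)  # 2
```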
[GitHub] [spark] HyukjinKwon commented on pull request #28732: [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
HyukjinKwon commented on pull request #28732: URL: https://github.com/apache/spark/pull/28732#issuecomment-639252018 Merged to master and branch-3.0.
[GitHub] [spark] HyukjinKwon closed pull request #28732: [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
HyukjinKwon closed pull request #28732: URL: https://github.com/apache/spark/pull/28732
[GitHub] [spark] AmplabJenkins commented on pull request #28732: [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
AmplabJenkins commented on pull request #28732: URL: https://github.com/apache/spark/pull/28732#issuecomment-639250427
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28732: [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
AmplabJenkins removed a comment on pull request #28732: URL: https://github.com/apache/spark/pull/28732#issuecomment-639250427
[GitHub] [spark] SparkQA removed a comment on pull request #28732: [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
SparkQA removed a comment on pull request #28732: URL: https://github.com/apache/spark/pull/28732#issuecomment-639243418 **[Test build #123549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123549/testReport)** for PR 28732 at commit [`f27f080`](https://github.com/apache/spark/commit/f27f0802fb12575f8eb7cef4dcaca9e0e01c88d0).
[GitHub] [spark] SparkQA commented on pull request #28732: [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
SparkQA commented on pull request #28732: URL: https://github.com/apache/spark/pull/28732#issuecomment-639250134 **[Test build #123549 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123549/testReport)** for PR 28732 at commit [`f27f080`](https://github.com/apache/spark/commit/f27f0802fb12575f8eb7cef4dcaca9e0e01c88d0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #28645: [SPARK-31826][SQL] Support composed type of case class for typed Scala UDF
SparkQA commented on pull request #28645: URL: https://github.com/apache/spark/pull/28645#issuecomment-639244877 **[Test build #123550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123550/testReport)** for PR 28645 at commit [`86035fa`](https://github.com/apache/spark/commit/86035fa42edbb847419c22bd7b37cf8bd8234b60).
[GitHub] [spark] HyukjinKwon edited a comment on pull request #28730: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
HyukjinKwon edited a comment on pull request #28730: URL: https://github.com/apache/spark/pull/28730#issuecomment-639243858 Merged to master and branch-3.0. I don't mind porting it back if anyone needs it. I didn't do it here just because there's a conflict, and it's just a matter of monitoring. I will leave it to you @ueshin :D.
[GitHub] [spark] HyukjinKwon commented on pull request #28730: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
HyukjinKwon commented on pull request #28730: URL: https://github.com/apache/spark/pull/28730#issuecomment-639243858 Merged to master and branch-3.0. I don't mind porting it back if anyone needs it. I didn't do it here just because there's a conflict, and it's just a matter of monitoring.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28732: [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
AmplabJenkins removed a comment on pull request #28732: URL: https://github.com/apache/spark/pull/28732#issuecomment-639243664
[GitHub] [spark] AmplabJenkins commented on pull request #28732: [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
AmplabJenkins commented on pull request #28732: URL: https://github.com/apache/spark/pull/28732#issuecomment-639243664
[GitHub] [spark] SparkQA commented on pull request #28732: [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
SparkQA commented on pull request #28732: URL: https://github.com/apache/spark/pull/28732#issuecomment-639243418 **[Test build #123549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123549/testReport)** for PR 28732 at commit [`f27f080`](https://github.com/apache/spark/commit/f27f0802fb12575f8eb7cef4dcaca9e0e01c88d0).
[GitHub] [spark] HyukjinKwon closed pull request #28730: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
HyukjinKwon closed pull request #28730: URL: https://github.com/apache/spark/pull/28730
[GitHub] [spark] HyukjinKwon opened a new pull request #28732: [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
HyukjinKwon opened a new pull request #28732: URL: https://github.com/apache/spark/pull/28732

### What changes were proposed in this pull request?

This PR proposes to add one more newline to clearly separate JVM and Python tracebacks:

Before:

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
JVM stacktrace:
org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
  ...
```

After:

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;

JVM stacktrace:
org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
  ...
```

This is kind of a followup of https://github.com/apache/spark/commit/e69466056fb2c121b7bbb6ad082f09deb1c41063 (SPARK-31849).

### Why are the changes needed?

To make it easier to read.

### Does this PR introduce _any_ user-facing change?

It's in the unreleased branches.

### How was this patch tested?

Manually tested.
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
HeartSaVioR edited a comment on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-639239790 My alternative with wrapping the state store is something like below:
```
class RowValidatingStateStore(
    underlying: StateStore,
    keyType: Seq[DataType],
    valueType: Seq[DataType]) extends StateStore {

  private var isValidated = false

  override def get(key: UnsafeRow): UnsafeRow = {
    val value = underlying.get(key)
    if (!isValidated) {
      validateRow(value, valueType)
      isValidated = true
    }
    value
  }

  override def id: StateStoreId = underlying.id
  override def version: Long = underlying.version
  override def put(key: UnsafeRow, value: UnsafeRow): Unit = underlying.put(key, value)
  override def remove(key: UnsafeRow): Unit = underlying.remove(key)
  override def commit(): Long = underlying.commit()
  override def abort(): Unit = underlying.abort()
  override def iterator(): Iterator[UnsafeRowPair] = underlying.iterator()
  override def metrics: StateStoreMetrics = underlying.metrics
  override def hasCommitted: Boolean = underlying.hasCommitted

  private def validateRow(row: UnsafeRow, rowDataType: Seq[DataType]): Unit = {
    // TODO: call util method with row and data type to validate - note that it can
    // only check with the value schema
  }
}

def get(...): StateStore = {
  require(version >= 0)
  val storeProvider = loadedProviders.synchronized {
    ...
  }
  // TODO: add if statement to see whether it should wrap the state store or not
  new RowValidatingStateStore(storeProvider.getStore(version, keySchema, valueSchema))
}
```
The example code only checks the get operation, which is insufficient to validate the "key" row in state. That said, the iterator approach still provides more possibilities for validation, though validating the unsafe row itself doesn't have enough coverage of the various incompatibility issues (we should definitely have other guards as well), so it's sort of OK to only cover the value side.
[GitHub] [spark] HeartSaVioR commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
HeartSaVioR commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-639239790 My alternative with wrapping state store is something like below: ``` class RowValidatingStateStore( underlying: StateStore, keyType: Seq[DataType], valueType: Seq[DataType]) extends StateStore { private var isValidated = false override def get(key: UnsafeRow): UnsafeRow = { val value = underlying.get(key) if (!isValidated) { validateRow(value) isValidated = true } value } override def id: StateStoreId = underlying.id override def version: Long = underlying.version override def put(key: UnsafeRow, value: UnsafeRow): Unit = underlying.put(key, value) override def remove(key: UnsafeRow): Unit = underlying.remove(key) override def commit(): Long = underlying.commit() override def abort(): Unit = underlying.abort() override def iterator(): Iterator[UnsafeRowPair] = underlying.iterator() override def metrics: StateStoreMetrics = underlying.metrics override def hasCommitted: Boolean = underlying.hasCommitted private def validateRow(row: UnsafeRow): Unit = { // TODO: call util method with row and schema to validate } } def get(...): StateStore = { require(version >= 0) val storeProvider = loadedProviders.synchronized { ... } // TODO: add if statement to see whether it should wrap state store or not new RowValidatingStateStore(storeProvider.getStore(version, keySchema, valueSchema)) } ``` The example code only checks in get operation, which is insufficient to check "key" row in state. That said, iterator approach still provides more possibility of validation, though the validation of unsafe row itself doesn't have enough coverage of checking various incompatibility issues (Definitely we should have another guards as well) so that's a sort of OK to only cover value side. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
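As one possible direction for the `validateRow` TODO in the sketch above (purely a hypothetical illustration, not the util method the PR actually adds), the validation could at minimum compare the row's field count against the expected schema:

```scala
// Hypothetical sketch only; the real PR introduces its own util method,
// which this does not attempt to reproduce.
private def validateRow(row: UnsafeRow, schema: Seq[DataType]): Unit = {
  if (row != null) {
    // UnsafeRow carries its field count, so a row written against an older,
    // narrower schema can be caught before any field is read.
    require(row.numFields == schema.length,
      s"Row has ${row.numFields} fields but the state schema expects ${schema.length}.")
  }
}
```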
[GitHub] [spark] HeartSaVioR edited a comment on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
HeartSaVioR edited a comment on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-639200645

> @HeartSaVioR After taking a further look. Instead of dealing with the iterator, how about adding the invalidation for all state store operations in StateStoreProvider? Since we can get the key/value row during load map. WDYT?

It would be nice to see the proposed change as code to avoid misunderstanding, like I proposed in my previous comment. (Anything including a commit in your fork or a text comment is OK.) I'll try out my alternative (wrapping the state store) and show the code change. Thanks!

EDIT: Please deal with the interface whenever possible - there are different implementations of state store providers and we should avoid sticking to a specific implementation.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28730: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
AmplabJenkins removed a comment on pull request #28730: URL: https://github.com/apache/spark/pull/28730#issuecomment-639226292
[GitHub] [spark] AmplabJenkins commented on pull request #28730: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
AmplabJenkins commented on pull request #28730: URL: https://github.com/apache/spark/pull/28730#issuecomment-639226292
[GitHub] [spark] SparkQA removed a comment on pull request #28730: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
SparkQA removed a comment on pull request #28730: URL: https://github.com/apache/spark/pull/28730#issuecomment-639151923 **[Test build #123547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123547/testReport)** for PR 28730 at commit [`5705e15`](https://github.com/apache/spark/commit/5705e1523f108e66afcf266c066615503a98a7cb).
[GitHub] [spark] SparkQA commented on pull request #28730: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
SparkQA commented on pull request #28730: URL: https://github.com/apache/spark/pull/28730#issuecomment-639225813

**[Test build #123547 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123547/testReport)** for PR 28730 at commit [`5705e15`](https://github.com/apache/spark/commit/5705e1523f108e66afcf266c066615503a98a7cb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639222935 The K8s test failure appears unrelated (`- Run in client mode. *** FAILED ***`); we don't do anything with the tokens. I'll investigate more tomorrow.
[GitHub] [spark] AmplabJenkins commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639222847
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins removed a comment on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639222847
[GitHub] [spark] SparkQA commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
SparkQA commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-63971

**[Test build #123548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123548/testReport)** for PR 28708 at commit [`60bec89`](https://github.com/apache/spark/commit/60bec89a67253ec823d4497bc3eef8bbc30b7949).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
SparkQA removed a comment on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639183660 **[Test build #123548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123548/testReport)** for PR 28708 at commit [`60bec89`](https://github.com/apache/spark/commit/60bec89a67253ec823d4497bc3eef8bbc30b7949).
[GitHub] [spark] bmarcott commented on pull request #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on pull request #27096: URL: https://github.com/apache/spark/pull/27096#issuecomment-639221995 @cloud-fan @viirya could you help review or add a suggested reviewer here?
[GitHub] [spark] HyukjinKwon edited a comment on pull request #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6 and macOS High Sierra
HyukjinKwon edited a comment on pull request #22480: URL: https://github.com/apache/spark/pull/22480#issuecomment-639212453 @pquentin, yes, it's kind of difficult to avoid on the PySpark side for now. The problem isn't solely that we use `fork()`; it's bound up with other conditions. I didn't take a very close look at the time, but the error was thrown when a particular instance was pickled in the forked process.
[GitHub] [spark] HyukjinKwon commented on pull request #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests failed on Python 3.6 and macOS High Sierra
HyukjinKwon commented on pull request #22480: URL: https://github.com/apache/spark/pull/22480#issuecomment-639212453 @pquentin, yes, it's kind of difficult to avoid on the PySpark side for now. The problem isn't solely that we use `fork()`; it's bound up with other conditions. I didn't take a very close look at the time, but the error was thrown when a particular instance was pickled.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins removed a comment on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639202750 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/28172/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins removed a comment on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639202742 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
SparkQA commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639202723 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/28172/
[GitHub] [spark] AmplabJenkins commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639202742
[GitHub] [spark] holdenk commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639201319

> > > So @attilapiros looking at the Jenkins console logs we aren't leaking any threads during testing (nor would I expect us to). But I'll add something to more aggressively stop the shuffle migration threads.
> >
> > It will come when the `BlockManager` will be tested in `BlockManagerSuite`:
> > ```
> > = POSSIBLE THREAD LEAK IN SUITE o.a.s.storage.BlockManagerSuite, thread names: rpc-boss-3-1, migrate-shuffle-to-BlockManagerId(exec2, localhost, 50804, None), shuffle-boss-9-1, shuffle-boss-6-1 =
> > ```
>
> Gotcha, I was looking for the explicit decom test. I'll eagerly shut down the migrate-shuffle-to threads then.

I think the latest changes have fixed this (e.g. `grep "THREAD LEAK" consoleFull | grep BlockManager` returns nothing). Worth noting we do leak threads in ~283 tests, so I'm not sure how important this is.
[GitHub] [spark] HeartSaVioR commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
HeartSaVioR commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-639200645

> @HeartSaVioR After taking a further look. Instead of dealing with the iterator, how about adding the invalidation for all state store operations in StateStoreProvider? Since we can get the key/value row during load map. WDYT?

It would be nice to see the proposed change as code to avoid misunderstanding, like I proposed in my previous comment. (Anything including a commit in your fork or a text comment is OK.) I'll try out my alternative (wrapping the state store) and show the code change. Thanks!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28720: [SPARK-31900][SPARK-SUBMIT] Client memory passed unvalidated to the JVM Xmx
AmplabJenkins removed a comment on pull request #28720: URL: https://github.com/apache/spark/pull/28720#issuecomment-639197112
[GitHub] [spark] AmplabJenkins commented on pull request #28720: [SPARK-31900][SPARK-SUBMIT] Client memory passed unvalidated to the JVM Xmx
AmplabJenkins commented on pull request #28720: URL: https://github.com/apache/spark/pull/28720#issuecomment-639197112
[GitHub] [spark] SparkQA removed a comment on pull request #28720: [SPARK-31900][SPARK-SUBMIT] Client memory passed unvalidated to the JVM Xmx
SparkQA removed a comment on pull request #28720: URL: https://github.com/apache/spark/pull/28720#issuecomment-639149691 **[Test build #123546 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123546/testReport)** for PR 28720 at commit [`87e1d67`](https://github.com/apache/spark/commit/87e1d67c5be394ae514e38a50958f88ecc721287).
[GitHub] [spark] SparkQA commented on pull request #28720: [SPARK-31900][SPARK-SUBMIT] Client memory passed unvalidated to the JVM Xmx
SparkQA commented on pull request #28720: URL: https://github.com/apache/spark/pull/28720#issuecomment-639196496

**[Test build #123546 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123546/testReport)** for PR 28720 at commit [`87e1d67`](https://github.com/apache/spark/commit/87e1d67c5be394ae514e38a50958f88ecc721287).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
SparkQA commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639195695 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/28172/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-639187847
[GitHub] [spark] AmplabJenkins commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-639187847
[GitHub] [spark] skambha commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
skambha commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-639187545

> @skambha You can check the integrated tests in #28725. If we delete the validation, we'll get a NPE for [this test](https://github.com/apache/spark/pull/28725/files#diff-492f0d70824a58ef2ea94a54dc6f9707R79), and get an assertion in the unsafe row for [this test](https://github.com/apache/spark/pull/28725/files#diff-492f0d70824a58ef2ea94a54dc6f9707R185). That is to say, we will get random failures during reusing the checkpoint written by the old Spark version.

Thanks for adding the test.
[GitHub] [spark] github-actions[bot] commented on pull request #27172: [WIP] [SPARK-29644][SQL] Fixed ByteType JDBCUtils to map to TinyInt at write read and ShortType on read
github-actions[bot] commented on pull request #27172: URL: https://github.com/apache/spark/pull/27172#issuecomment-639187219 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] SparkQA removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
SparkQA removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-639134384 **[Test build #123545 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123545/testReport)** for PR 28710 at commit [`7b01b63`](https://github.com/apache/spark/commit/7b01b63f9ce6549eaf248296b0d48e98a2dd7a25).
[GitHub] [spark] SparkQA commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
SparkQA commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-639187095

**[Test build #123545 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123545/testReport)** for PR 28710 at commit [`7b01b63`](https://github.com/apache/spark/commit/7b01b63f9ce6549eaf248296b0d48e98a2dd7a25).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
SparkQA commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639183660 **[Test build #123548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123548/testReport)** for PR 28708 at commit [`60bec89`](https://github.com/apache/spark/commit/60bec89a67253ec823d4497bc3eef8bbc30b7949).
[GitHub] [spark] siknezevic commented on pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
siknezevic commented on pull request #27246: URL: https://github.com/apache/spark/pull/27246#issuecomment-639172725 Thank you for the comments. I will address them soon.
[GitHub] [spark] jacobwu123 opened a new pull request #28731: [SPARK-31909][CORE] Add SPARK_SUBMIT_OPTS to Beeline Script
jacobwu123 opened a new pull request #28731: URL: https://github.com/apache/spark/pull/28731

### What changes were proposed in this pull request?
Made the SPARK_SUBMIT_OPTS environment variable available to beeline.

### Why are the changes needed?
Beeline is not able to pick up the krb5.conf variable specified in SPARK_SUBMIT_OPTS, located in spark_env.sh.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
./dev/run-tests
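For context, the kind of change described here is usually a matter of threading an environment variable into the launcher's java command line. The following is an illustrative sketch only, not the actual patch; the jar and class names stand in for whatever `bin/beeline` really launches:

```shell
# Illustrative only: a krb5 setting carried via SPARK_SUBMIT_OPTS, as a
# launcher script might compose it into the JVM invocation. If the script
# omits $SPARK_SUBMIT_OPTS, the JVM never sees the krb5 setting.
SPARK_SUBMIT_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"
CMD="java $SPARK_SUBMIT_OPTS -cp beeline.jar org.apache.hive.beeline.BeeLine"
echo "$CMD"
```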
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28731: [SPARK-31909][CORE] Add SPARK_SUBMIT_OPTS to Beeline Script
AmplabJenkins removed a comment on pull request #28731: URL: https://github.com/apache/spark/pull/28731#issuecomment-639158213 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #28731: [SPARK-31909][CORE] Add SPARK_SUBMIT_OPTS to Beeline Script
AmplabJenkins commented on pull request #28731: URL: https://github.com/apache/spark/pull/28731#issuecomment-639158594 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #28731: [SPARK-31909][CORE] Add SPARK_SUBMIT_OPTS to Beeline Script
AmplabJenkins commented on pull request #28731: URL: https://github.com/apache/spark/pull/28731#issuecomment-639158213 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28730: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
AmplabJenkins removed a comment on pull request #28730: URL: https://github.com/apache/spark/pull/28730#issuecomment-639152306
[GitHub] [spark] AmplabJenkins commented on pull request #28730: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
AmplabJenkins commented on pull request #28730: URL: https://github.com/apache/spark/pull/28730#issuecomment-639152306
[GitHub] [spark] SparkQA commented on pull request #28730: [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
SparkQA commented on pull request #28730: URL: https://github.com/apache/spark/pull/28730#issuecomment-639151923 **[Test build #123547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123547/testReport)** for PR 28730 at commit [`5705e15`](https://github.com/apache/spark/commit/5705e1523f108e66afcf266c066615503a98a7cb).
[GitHub] [spark] ueshin opened a new pull request #28730: [SPARK-31903][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI.
ueshin opened a new pull request #28730: URL: https://github.com/apache/spark/pull/28730

### What changes were proposed in this pull request?

In `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython`, the code block for `serveToStream` is run in a separate thread, so `withAction` finishes as soon as it starts that thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` inside it.

### Why are the changes needed?

When calling `toPandas`, the Query UI usually shows each plan node's metrics and the corresponding Stage ID and Task ID:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-03 at 4 47 07 PM](https://user-images.githubusercontent.com/506656/83815735-bec22380-a675-11ea-8ecc-bf2954731f35.png)

But if Arrow execution is enabled, it shows only plan nodes, and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-03 at 4 47 27 PM](https://user-images.githubusercontent.com/506656/83815804-de594c00-a675-11ea-933a-d0ffc0f534dd.png)

### Does this PR introduce _any_ user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local environment.

![Screen Shot 2020-06-04 at 3 19 41 PM](https://user-images.githubusercontent.com/506656/83816265-d77f0900-a676-11ea-84b8-2a8d80428bc6.png)
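The ordering bug described in this PR can be reduced to a small sketch in plain Python (no Spark involved; `with_action` and `work` are hypothetical stand-ins for `withAction` and the actual query execution): a timing wrapper that only *starts* a worker thread returns almost immediately and records nothing useful, while moving the wrapper inside the serving thread captures the whole action.

```python
import threading
import time

def with_action(body):
    """Toy stand-in for Dataset.withAction: runs body and records its duration."""
    start = time.perf_counter()
    body()
    return time.perf_counter() - start

def work():
    time.sleep(0.2)  # stand-in for the real query execution

# Buggy ordering: the wrapper only starts the thread, so it returns right away
# and the recorded duration misses the actual work.
t = threading.Thread(target=work)
buggy = with_action(t.start)
t.join()

# Fixed ordering: the serving thread calls the wrapper around the real work,
# so the recorded duration covers the whole action.
result = {}
def serve():
    result["duration"] = with_action(work)
t2 = threading.Thread(target=serve)
t2.start()
t2.join()

print(buggy, result["duration"])
```

The same reasoning explains why the fix swaps `serveToStream` and `withAction`: whichever wrapper is outermost only observes the work that actually runs inside it.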
[GitHub] [spark] SparkQA commented on pull request #28720: [SPARK-31900][SPARK-SUBMIT] Client memory passed unvalidated to the JVM Xmx
SparkQA commented on pull request #28720: URL: https://github.com/apache/spark/pull/28720#issuecomment-639149691 **[Test build #123546 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123546/testReport)** for PR 28720 at commit [`87e1d67`](https://github.com/apache/spark/commit/87e1d67c5be394ae514e38a50958f88ecc721287).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28720: [SPARK-31900][SPARK-SUBMIT] Client memory passed unvalidated to the JVM Xmx
AmplabJenkins removed a comment on pull request #28720: URL: https://github.com/apache/spark/pull/28720#issuecomment-639147791
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28720: [SPARK-31900][SPARK-SUBMIT] Client memory passed unvalidated to the JVM Xmx
AmplabJenkins removed a comment on pull request #28720: URL: https://github.com/apache/spark/pull/28720#issuecomment-638456255 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #28720: [SPARK-31900][SPARK-SUBMIT] Client memory passed unvalidated to the JVM Xmx
AmplabJenkins commented on pull request #28720: URL: https://github.com/apache/spark/pull/28720#issuecomment-639147791
[GitHub] [spark] gatorsmile commented on pull request #28720: [SPARK-31900][SPARK-SUBMIT] Client memory passed unvalidated to the JVM Xmx
gatorsmile commented on pull request #28720: URL: https://github.com/apache/spark/pull/28720#issuecomment-639147436 ok to test
[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on a change in pull request #28708: URL: https://github.com/apache/spark/pull/28708#discussion_r435575853

## File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ##
@@ -1790,6 +1822,108 @@ private[spark] class BlockManager(

```scala
  private class ShuffleMigrationRunnable(peer: BlockManagerId) extends Runnable {
    @volatile var running = true
    override def run(): Unit = {
      var migrating: Option[(Int, Long)] = None
      val storageLevel = StorageLevel(
        useDisk = true,
        useMemory = false,
        useOffHeap = false,
        deserialized = false,
        replication = 1)
      logInfo(s"Starting migration thread for ${peer}")
      // Once a block fails to transfer to an executor, stop trying to transfer more blocks.
      try {
        while (running) {
          val migrating = Option(shufflesToMigrate.poll())
          migrating match {
            case None =>
              logInfo("Nothing to migrate")
              // Nothing to do right now, but maybe a transfer will fail or a new block
              // will finish being committed.
              val SLEEP_TIME_SECS = 1
              Thread.sleep(SLEEP_TIME_SECS * 1000L)
            case Some((shuffleId, mapId)) =>
              logInfo(s"Trying to migrate shuffle ${shuffleId},${mapId} to ${peer}")
              val blocks = migratableResolver.getMigrationBlocks(shuffleId, mapId)
              logInfo(s"Got migration sub-blocks ${blocks}")
              blocks.foreach { case (blockId, buffer) =>
                logInfo(s"Migrating sub-block ${blockId}")
                blockTransferService.uploadBlockSync(
                  peer.host,
                  peer.port,
                  peer.executorId,
                  blockId,
                  buffer,
                  storageLevel,
                  null) // class tag, not needed for shuffle
                logInfo(s"Migrated sub-block ${blockId}")
              }
              logInfo(s"Migrated ${shuffleId},${mapId} to ${peer}")
          }
        }
        // This catch is intentionally outside of the while-running block:
        // if we encounter errors migrating to an executor we want to stop.
      } catch {
        case e: Exception =>
          migrating match {
            case Some(shuffleMap) =>
              logError(s"Error ${e} during migration, adding ${shuffleMap} back to migration queue")
              shufflesToMigrate.add(shuffleMap)
            case None =>
              logError(s"Error ${e} while waiting for block to migrate")
          }
      }
    }
  }

  private val migrationPeers = mutable.HashMap[BlockManagerId, ShuffleMigrationRunnable]()

  /**
   * Tries to offload all shuffle blocks that are registered with the shuffle service locally.
   * Note: this does not delete the shuffle files in case there is an in-progress fetch,
   * but rather shadows them.
   * Requires an index-based shuffle resolver.
   */
  def offloadShuffleBlocks(): Unit = {
    // Update the queue of shuffles to be migrated
    logInfo("Offloading shuffle blocks")
    val localShuffles = migratableResolver.getStoredShuffles()
    logInfo(s"My local shuffles are ${localShuffles.toList}")
    val newShufflesToMigrate = localShuffles.&~(migratingShuffles).toSeq
```

Review comment: This is for computing the change needed; readability isn't a big concern.
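As a side note, the `localShuffles.&~(migratingShuffles)` expression under review is Scala's set-difference operator: keep only the shuffles stored locally that are not already queued for migration. A minimal Python sketch of the same computation (the `(shuffleId, mapId)` values are illustrative):

```python
# Hypothetical shuffle identifiers: (shuffleId, mapId) pairs.
local_shuffles = {(0, 1), (0, 2), (1, 1), (1, 2)}
migrating_shuffles = {(0, 1), (1, 1)}

# Scala's `localShuffles.&~(migratingShuffles)` is a set difference:
# only shuffles not already being migrated remain.
new_shuffles_to_migrate = local_shuffles - migrating_shuffles

print(sorted(new_shuffles_to_migrate))  # → [(0, 2), (1, 2)]
```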
[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on a change in pull request #28708: URL: https://github.com/apache/spark/pull/28708#discussion_r435575539

## File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ##
@@ -1790,6 +1822,108 @@ private[spark] class BlockManager(

```scala
  private class ShuffleMigrationRunnable(peer: BlockManagerId) extends Runnable {
    @volatile var running = true
    override def run(): Unit = {
      var migrating: Option[(Int, Long)] = None
      val storageLevel = StorageLevel(
        useDisk = true,
        useMemory = false,
        useOffHeap = false,
        deserialized = false,
        replication = 1)
      logInfo(s"Starting migration thread for ${peer}")
      // Once a block fails to transfer to an executor, stop trying to transfer more blocks.
      try {
        while (running) {
          val migrating = Option(shufflesToMigrate.poll())
          migrating match {
            case None =>
              logInfo("Nothing to migrate")
              // Nothing to do right now, but maybe a transfer will fail or a new block
              // will finish being committed.
              val SLEEP_TIME_SECS = 1
              Thread.sleep(SLEEP_TIME_SECS * 1000L)
            case Some((shuffleId, mapId)) =>
              logInfo(s"Trying to migrate shuffle ${shuffleId},${mapId} to ${peer}")
              val blocks = migratableResolver.getMigrationBlocks(shuffleId, mapId)
              logInfo(s"Got migration sub-blocks ${blocks}")
              blocks.foreach { case (blockId, buffer) =>
                logInfo(s"Migrating sub-block ${blockId}")
                blockTransferService.uploadBlockSync(
                  peer.host,
                  peer.port,
                  peer.executorId,
                  blockId,
                  buffer,
                  storageLevel,
                  null) // class tag, not needed for shuffle
                logInfo(s"Migrated sub-block ${blockId}")
              }
              logInfo(s"Migrated ${shuffleId},${mapId} to ${peer}")
          }
        }
        // This catch is intentionally outside of the while-running block:
        // if we encounter errors migrating to an executor we want to stop.
      } catch {
        case e: Exception =>
          migrating match {
            case Some(shuffleMap) =>
              logError(s"Error ${e} during migration, adding ${shuffleMap} back to migration queue")
              shufflesToMigrate.add(shuffleMap)
            case None =>
              logError(s"Error ${e} while waiting for block to migrate")
          }
      }
    }
  }

  private val migrationPeers = mutable.HashMap[BlockManagerId, ShuffleMigrationRunnable]()

  /**
   * Tries to offload all shuffle blocks that are registered with the shuffle service locally.
   * Note: this does not delete the shuffle files in case there is an in-progress fetch,
   * but rather shadows them.
   * Requires an index-based shuffle resolver.
   */
  def offloadShuffleBlocks(): Unit = {
    // Update the queue of shuffles to be migrated
    logInfo("Offloading shuffle blocks")
    val localShuffles = migratableResolver.getStoredShuffles()
    logInfo(s"My local shuffles are ${localShuffles.toList}")
```

Review comment: Looking at it now, I think I'll just take it out; it was useful while I was doing dev but shouldn't be needed for any operations stuff. Good call on it maybe being too long in production environments.
[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on a change in pull request #28708: URL: https://github.com/apache/spark/pull/28708#discussion_r435575094

## File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ##
@@ -1790,6 +1822,108 @@ private[spark] class BlockManager(

```scala
  private class ShuffleMigrationRunnable(peer: BlockManagerId) extends Runnable {
    @volatile var running = true
    override def run(): Unit = {
      var migrating: Option[(Int, Long)] = None
      val storageLevel = StorageLevel(
        useDisk = true,
        useMemory = false,
        useOffHeap = false,
        deserialized = false,
        replication = 1)
      logInfo(s"Starting migration thread for ${peer}")
      // Once a block fails to transfer to an executor, stop trying to transfer more blocks.
      try {
        while (running) {
          val migrating = Option(shufflesToMigrate.poll())
          migrating match {
            case None =>
              logInfo("Nothing to migrate")
              // Nothing to do right now, but maybe a transfer will fail or a new block
              // will finish being committed.
              val SLEEP_TIME_SECS = 1
              Thread.sleep(SLEEP_TIME_SECS * 1000L)
            case Some((shuffleId, mapId)) =>
              logInfo(s"Trying to migrate shuffle ${shuffleId},${mapId} to ${peer}")
              val blocks = migratableResolver.getMigrationBlocks(shuffleId, mapId)
              logInfo(s"Got migration sub-blocks ${blocks}")
              blocks.foreach { case (blockId, buffer) =>
                logInfo(s"Migrating sub-block ${blockId}")
                blockTransferService.uploadBlockSync(
                  peer.host,
                  peer.port,
                  peer.executorId,
                  blockId,
                  buffer,
                  storageLevel,
                  null) // class tag, not needed for shuffle
                logInfo(s"Migrated sub-block ${blockId}")
              }
              logInfo(s"Migrated ${shuffleId},${mapId} to ${peer}")
          }
        }
        // This catch is intentionally outside of the while-running block:
        // if we encounter errors migrating to an executor we want to stop.
      } catch {
        case e: Exception =>
          migrating match {
            case Some(shuffleMap) =>
              logError(s"Error ${e} during migration, adding ${shuffleMap} back to migration queue")
              shufflesToMigrate.add(shuffleMap)
            case None =>
              logError(s"Error ${e} while waiting for block to migrate")
          }
      }
    }
  }

  private val migrationPeers = mutable.HashMap[BlockManagerId, ShuffleMigrationRunnable]()

  /**
   * Tries to offload all shuffle blocks that are registered with the shuffle service locally.
   * Note: this does not delete the shuffle files in case there is an in-progress fetch,
   * but rather shadows them.
   * Requires an index-based shuffle resolver.
   */
  def offloadShuffleBlocks(): Unit = {
    // Update the queue of shuffles to be migrated
    logInfo("Offloading shuffle blocks")
    val localShuffles = migratableResolver.getStoredShuffles()
```

Review comment: No, if we get a class cast exception we want to bubble it up, because there isn't anything we can do in that situation besides report it.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28729: [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
AmplabJenkins removed a comment on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639137871 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123544/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28729: [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
AmplabJenkins removed a comment on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639137856 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28729: [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
SparkQA removed a comment on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639122853 **[Test build #123544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123544/testReport)** for PR 28729 at commit [`3c35cf5`](https://github.com/apache/spark/commit/3c35cf5920c6e4216adcefc866bd518dfe635def).
[GitHub] [spark] AmplabJenkins commented on pull request #28729: [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
AmplabJenkins commented on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639137856
[GitHub] [spark] SparkQA commented on pull request #28729: [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
SparkQA commented on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639137804 **[Test build #123544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123544/testReport)** for PR 28729 at commit [`3c35cf5`](https://github.com/apache/spark/commit/3c35cf5920c6e4216adcefc866bd518dfe635def).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28728: [SPARK-31879][SQL][test-java11] Make week-based pattern invalid for formatting too
AmplabJenkins removed a comment on pull request #28728: URL: https://github.com/apache/spark/pull/28728#issuecomment-639137018
[GitHub] [spark] AmplabJenkins commented on pull request #28728: [SPARK-31879][SQL][test-java11] Make week-based pattern invalid for formatting too
AmplabJenkins commented on pull request #28728: URL: https://github.com/apache/spark/pull/28728#issuecomment-639137018
[GitHub] [spark] SparkQA removed a comment on pull request #28728: [SPARK-31879][SQL][test-java11] Make week-based pattern invalid for formatting too
SparkQA removed a comment on pull request #28728: URL: https://github.com/apache/spark/pull/28728#issuecomment-638992217 **[Test build #123538 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123538/testReport)** for PR 28728 at commit [`d7fc6d9`](https://github.com/apache/spark/commit/d7fc6d9db1244f681066415b14e798820fc6f61e).
[GitHub] [spark] SparkQA commented on pull request #28728: [SPARK-31879][SQL][test-java11] Make week-based pattern invalid for formatting too
SparkQA commented on pull request #28728: URL: https://github.com/apache/spark/pull/28728#issuecomment-639136125 **[Test build #123538 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123538/testReport)** for PR 28728 at commit [`d7fc6d9`](https://github.com/apache/spark/commit/d7fc6d9db1244f681066415b14e798820fc6f61e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-639134843
[GitHub] [spark] AmplabJenkins commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-639134843
[GitHub] [spark] SparkQA commented on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
SparkQA commented on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-639134384 **[Test build #123545 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123545/testReport)** for PR 28710 at commit [`7b01b63`](https://github.com/apache/spark/commit/7b01b63f9ce6549eaf248296b0d48e98a2dd7a25).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28729: [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
AmplabJenkins removed a comment on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639123368
[GitHub] [spark] AmplabJenkins commented on pull request #28729: [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
AmplabJenkins commented on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639123368
[GitHub] [spark] SparkQA commented on pull request #28729: [SPARK-30808][SQL] Enable Java 8 time API in Thrift server
SparkQA commented on pull request #28729: URL: https://github.com/apache/spark/pull/28729#issuecomment-639122853 **[Test build #123544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123544/testReport)** for PR 28729 at commit [`3c35cf5`](https://github.com/apache/spark/commit/3c35cf5920c6e4216adcefc866bd518dfe635def).
[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on a change in pull request #28708: URL: https://github.com/apache/spark/pull/28708#discussion_r43809

## File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ##
@@ -1790,6 +1822,108 @@ private[spark] class BlockManager(

```scala
  private class ShuffleMigrationRunnable(peer: BlockManagerId) extends Runnable {
    @volatile var running = true
    override def run(): Unit = {
      var migrating: Option[(Int, Long)] = None
      val storageLevel = StorageLevel(
        useDisk = true,
        useMemory = false,
        useOffHeap = false,
        deserialized = false,
        replication = 1)
      logInfo(s"Starting migration thread for ${peer}")
      // Once a block fails to transfer to an executor, stop trying to transfer more blocks.
      try {
        while (running) {
          val migrating = Option(shufflesToMigrate.poll())
          migrating match {
            case None =>
              logInfo("Nothing to migrate")
              // Nothing to do right now, but maybe a transfer will fail or a new block
              // will finish being committed.
              val SLEEP_TIME_SECS = 1
              Thread.sleep(SLEEP_TIME_SECS * 1000L)
            case Some((shuffleId, mapId)) =>
              logInfo(s"Trying to migrate shuffle ${shuffleId},${mapId} to ${peer}")
              val blocks = migratableResolver.getMigrationBlocks(shuffleId, mapId)
              logInfo(s"Got migration sub-blocks ${blocks}")
              blocks.foreach { case (blockId, buffer) =>
                logInfo(s"Migrating sub-block ${blockId}")
                blockTransferService.uploadBlockSync(
                  peer.host,
                  peer.port,
                  peer.executorId,
                  blockId,
                  buffer,
                  storageLevel,
                  null) // class tag, not needed for shuffle
                logInfo(s"Migrated sub-block ${blockId}")
              }
              logInfo(s"Migrated ${shuffleId},${mapId} to ${peer}")
```

Review comment: We don't delete the file from the current host right away. Once the BlockUpdate message is processed on the master, it will go to the peer it has been migrated to.
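The migration thread discussed in this review polls a shared queue and, on a failed transfer, re-enqueues the shuffle so it can be retried (that is what the quoted catch block does). A minimal Python sketch of that poll/upload/re-enqueue loop, with a hypothetical `upload` that fails once to exercise the retry path:

```python
import queue

shuffles_to_migrate = queue.Queue()
for item in [(0, 1), (0, 2)]:          # hypothetical (shuffleId, mapId) pairs
    shuffles_to_migrate.put(item)

failed_once = set()

def upload(shuffle):
    # Hypothetical transfer that fails the first time it sees (0, 2).
    if shuffle == (0, 2) and shuffle not in failed_once:
        failed_once.add(shuffle)
        raise IOError("peer unreachable")

migrated = []
while not shuffles_to_migrate.empty():
    shuffle = shuffles_to_migrate.get()
    try:
        upload(shuffle)
        migrated.append(shuffle)
    except IOError:
        # Mirror the catch block in the diff: put the failed shuffle back so a
        # later attempt (or another peer's thread) can retry it.
        shuffles_to_migrate.put(shuffle)

print(migrated)  # → [(0, 1), (0, 2)]
```

In the real code the queue is shared between one such runnable per peer, so a shuffle re-enqueued after a failure may be picked up by a different, healthier peer's thread.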
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins removed a comment on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639118857
[GitHub] [spark] AmplabJenkins commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-639118857
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28710: [SPARK-31893][ML] Add a generic ClassificationSummary trait
AmplabJenkins removed a comment on pull request #28710: URL: https://github.com/apache/spark/pull/28710#issuecomment-639117314 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123543/ Test FAILed.