[GitHub] [spark] viirya commented on a change in pull request #32513: [SPARK-35378][SQL] Eagerly execute non-root Command
viirya commented on a change in pull request #32513:
URL: https://github.com/apache/spark/pull/32513#discussion_r637365124

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala

@@ -74,12 +75,26 @@ class QueryExecution(
     sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
   }

+  lazy val nonRootCommandExecuted: LogicalPlan = analyzed mapChildren { child =>
+    child transformDown {
+      // SPARK-35378: Eagerly execute non-root Command

Review comment:
   Instead of just a JIRA number — as this is not a TODO item — can you add a few comments here explaining why you need to eagerly execute `Command` under the top plan node?
[GitHub] [spark] viirya commented on a change in pull request #32513: [SPARK-35378][SQL] Eagerly execute non-root Command
viirya commented on a change in pull request #32513:
URL: https://github.com/apache/spark/pull/32513#discussion_r637364885

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala

@@ -74,12 +75,26 @@ class QueryExecution(
     sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
   }

+  lazy val nonRootCommandExecuted: LogicalPlan = analyzed mapChildren { child =>
+    child transformDown {
+      // SPARK-35378: Eagerly execute non-root Command
+      case c: Command =>
+        val qe = sparkSession.sessionState.executePlan(c)
+        CommandResult(
+          qe.nonRootCommandExecuted.output,
+          qe.nonRootCommandExecuted,

Review comment:
   Can a `Command` contain another `Command`? Otherwise, can we just use `qe.analyzed.output` here?
[GitHub] [spark] viirya commented on a change in pull request #32513: [SPARK-35378][SQL] Eagerly execute non-root Command
viirya commented on a change in pull request #32513:
URL: https://github.com/apache/spark/pull/32513#discussion_r637364733

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala

@@ -74,12 +75,26 @@ class QueryExecution(
     sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
   }

+  lazy val nonRootCommandExecuted: LogicalPlan = analyzed mapChildren { child =>

Review comment:
   Oh, I see. You mean a `Command` that is not the top-level plan node. My first thought was a command not run as root...
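To make the naming concrete, here is a minimal sketch of what the new field does, following the diff above. Note that the diff is truncated after the second `CommandResult` argument; the last two arguments here are assumptions about how it continues, not text from the PR:

```scala
// Sketch: mapChildren leaves the root node alone and only walks its children,
// so a Command that IS the root keeps its existing eager-execution path.
// Any Command found below the root is executed immediately and replaced by a
// CommandResult leaf carrying the already-computed rows, so downstream
// operators see plain data instead of an unexecuted command.
lazy val nonRootCommandExecuted: LogicalPlan = analyzed mapChildren { child =>
  child transformDown {
    case c: Command =>
      val qe = sparkSession.sessionState.executePlan(c)
      CommandResult(
        qe.nonRootCommandExecuted.output,   // schema of the executed command
        qe.nonRootCommandExecuted,          // the (recursively executed) logical command
        qe.executedPlan,                    // assumed: physical plan, kept for EXPLAIN
        qe.executedPlan.executeCollect())   // assumed: rows, computed eagerly
  }
}
```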
[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation
SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846363539

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43344/
[GitHub] [spark] viirya commented on a change in pull request #32513: [SPARK-35378][SQL] Eagerly execute non-root Command
viirya commented on a change in pull request #32513:
URL: https://github.com/apache/spark/pull/32513#discussion_r637363738

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala

@@ -74,12 +75,26 @@ class QueryExecution(
     sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
   }

+  lazy val nonRootCommandExecuted: LogicalPlan = analyzed mapChildren { child =>

Review comment:
   I'm a bit confused. Can you define what a non-root command is?
[GitHub] [spark] sarutak edited a comment on pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.
sarutak edited a comment on pull request #32631:
URL: https://github.com/apache/spark/pull/32631#issuecomment-846361308

cc: @maropu
As I commented in JIRA, I've been working on this issue. This seems different from your solution. What do you think?
[GitHub] [spark] viirya commented on a change in pull request #32513: [SPARK-35378][SQL] Eagerly execute non-root Command
viirya commented on a change in pull request #32513:
URL: https://github.com/apache/spark/pull/32513#discussion_r637363690

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CommandResultExec.scala

@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Attribute, UnsafeProjection}
+import org.apache.spark.sql.catalyst.plans.QueryPlan
+import org.apache.spark.sql.execution.metric.SQLMetrics
+
+/**
+ * Physical plan node for holding data from a command.
+ *
+ * `rows` may not be serializable and ideally we should not send `rows` and `unsafeRows`

Review comment:
   For `rows` and `unsafeRows`, if you mean the classes `Row` and `UnsafeRow`, please use the correct class names.
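The doc comment under review is worth making concrete. A minimal sketch of the pattern it describes, assuming the same approach `LocalTableScanExec` already uses (the field names are illustrative):

```scala
// Convert the driver-side rows to UnsafeRow lazily, and mark the field
// @transient so that neither `rows` nor the converted copies are serialized
// to executors; executors only receive the RDD partitions built from them.
@transient private lazy val unsafeRows: Array[InternalRow] = {
  if (rows.isEmpty) {
    Array.empty
  } else {
    val proj = UnsafeProjection.create(output, output)
    rows.map(r => proj(r).copy()).toArray
  }
}
```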
[GitHub] [spark] sarutak commented on pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.
sarutak commented on pull request #32631:
URL: https://github.com/apache/spark/pull/32631#issuecomment-846361308

cc: @maropu
[GitHub] [spark] sarutak opened a new pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.
sarutak opened a new pull request #32631:
URL: https://github.com/apache/spark/pull/32631

### What changes were proposed in this pull request?
This PR proposes to add `docker-integration-tests` to `run-tests.py` and GA. `docker-integration-tests` can't run if docker is not installed, so it runs only if `docker-integration-tests` is specified with `--module`.

### Why are the changes needed?
CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?
GA.

### How was this patch tested?
[GitHub] [spark] viirya commented on pull request #32272: [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata
viirya commented on pull request #32272:
URL: https://github.com/apache/spark/pull/32272#issuecomment-846359351

Sorry for the delay. I will find some time over the weekend to look at this.
[GitHub] [spark] maropu commented on pull request #32605: [WIP][SPARK-35446] Override getJDBCType in MySQLDialect to map FloatType to FLOAT
maropu commented on pull request #32605:
URL: https://github.com/apache/spark/pull/32605#issuecomment-846359322

> Thank you for pinging me, @HyukjinKwon. btw, is it better to run these integration tests in GA, too?

See https://github.com/apache/spark/pull/32620 for the related work.
[GitHub] [spark] viirya commented on pull request #32559: [SPARK-35410][SQL] SubExpr elimination should not include redundant children exprs in conditional expression
viirya commented on pull request #32559:
URL: https://github.com/apache/spark/pull/32559#issuecomment-846359212

#32586 was merged. Could you take a look at this and see if it is good to go? Thanks. cc @cloud-fan @dongjoon-hyun @maropu
[GitHub] [spark] viirya commented on a change in pull request #32559: [SPARK-35410][SQL] SubExpr elimination should not include redundant children exprs in conditional expression
viirya commented on a change in pull request #32559:
URL: https://github.com/apache/spark/pull/32559#discussion_r637362691

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

@@ -83,20 +83,35 @@ class EquivalentExpressions {
    * Adds only expressions which are common in each of given expressions, in a recursive way.
    * For example, given two expressions `(a + (b + (c + 1)))` and `(d + (e + (c + 1)))`,
    * the common expression `(c + 1)` will be added into `equivalenceMap`.
+   *
+   * Note that as we don't know in advance if any child node of an expression will be common
+   * across all given expressions, we count all child nodes when looking through the given
+   * expressions. But when we call `addExprTree` to add common expressions into the map, we
+   * will add recursively the child nodes. So we need to filter the child expressions first.
+   * For example, if `((a + b) + c)` and `(a + b)` are common expressions, we only add
+   * `((a + b) + c)`.
    */
   private def addCommonExprs(
       exprs: Seq[Expression],
       addFunc: Expression => Boolean = addExpr): Unit = {
     val exprSetForAll = mutable.Set[Expr]()

Review comment:
   Both `1 + col` and `col + 1` will be replaced with the extracted subexpression during codegen. We don't just look up the key in `equivalenceMap` when replacing with the subexpression.
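A hedged sketch of the filtering rule the new doc comment describes (a standalone simplification; the function name is hypothetical, and the real code works on `Expr` wrappers and an `equivalenceMap`, both omitted here):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Given the expressions common to all branches, drop any that is a strict
// subexpression of another common one: adding ((a + b) + c) via addExprTree
// already adds (a + b) recursively, so adding (a + b) again would double-count.
def filterNestedCommonExprs(common: Seq[Expression]): Seq[Expression] =
  common.filterNot { e =>
    common.exists { other =>
      !other.fastEquals(e) && other.find(_.semanticEquals(e)).isDefined
    }
  }
```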
[GitHub] [spark] maropu commented on pull request #32620: [WIP][SPARK-35483][SQL][TESTS] Add a new GA test job for the docker integration tests
maropu commented on pull request #32620:
URL: https://github.com/apache/spark/pull/32620#issuecomment-846359132

I used the wrong JIRA number, so I fixed it.
[GitHub] [spark] viirya commented on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark
viirya commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846358670

> I'm not sure about the scenario of leveraging PVC as checkpoint location - at least that sounds to me as beyond the support of checkpoint in Structured Streaming.

I agree, and yes, this is the current status. That is to say, we are going to propose a new approach to support checkpointing in Structured Streaming. Unfortunately, due to the fact that scheduling is bound to stateful tasks (i.e. state store locations), we cannot achieve the goal without touching other modules, like core.

> I'm more likely novice on cloud/k8s, but from the common sense, I guess the actual storage of PVC should be still a sort of network storage to be resilient on "physical node down". I'm wondering how much benefits PVC approach gives compared to the existing approach as just directly use remote fault-tolerant file system. The benefits should be clear to cope with additional complexity.

Technically, a PVC is an abstract way to look at the volume mounted on the container running the executor. It could be local storage on k8s nodes; it depends on where the PVC is bound.

HDFS becomes a bottleneck for our streaming jobs. The throughput to HDFS and the load that the number of files puts on name nodes are serious issues when using it as the checkpoint destination for heavy streaming jobs at scale. Using PVCs for checkpoints could be a huge relief on the HDFS load. There are other benefits too, like better latency and a simplified streaming architecture. Personally I think these benefits are enough motivation for our proposal.
[GitHub] [spark] SparkQA commented on pull request #32630: [Minor][Followup] Update SHA for the oracle docker image
SparkQA commented on pull request #32630:
URL: https://github.com/apache/spark/pull/32630#issuecomment-846358506

**[Test build #138823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138823/testReport)** for PR 32630 at commit [`d745025`](https://github.com/apache/spark/commit/d7450257bb575cc568ed0bc5109373fe4ac5f7c1).
[GitHub] [spark] sarutak commented on pull request #32630: [Minor][Followup] Update SHA for the oracle docker image
sarutak commented on pull request #32630:
URL: https://github.com/apache/spark/pull/32630#issuecomment-846358208

cc: @dongjoon-hyun
[GitHub] [spark] sarutak opened a new pull request #32630: [Minor][Followup] Update SHA for the oracle docker image
sarutak opened a new pull request #32630:
URL: https://github.com/apache/spark/pull/32630

### What changes were proposed in this pull request?
This PR updates the SHA for the oracle docker image in the comment in `OracleIntegrationSuite` and `v2.OracleIntegrationSuite`. The SHA for the latest image is `3f422c4a35b423dfcdbcc57a84f01db6c82eb6c1`.

### Why are the changes needed?
The script name for creating the oracle docker image was changed in #32629 to follow the latest image, so we also need to update the corresponding SHA.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The value is from `git log`.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
AmplabJenkins removed a comment on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846357806

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138821/
[GitHub] [spark] AmplabJenkins commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
AmplabJenkins commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846357806

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138821/
[GitHub] [spark] sarutak commented on a change in pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.
sarutak commented on a change in pull request #32629:
URL: https://github.com/apache/spark/pull/32629#discussion_r637361238

## File path: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala

@@ -48,7 +48,7 @@ import org.apache.spark.tags.DockerTest
  * $ git clone https://github.com/oracle/docker-images.git
  * // Head SHA: 3e352a22618070595f823977a0fd1a3a8071a83c

Review comment:
   The SHA is `3f422c4a35b423dfcdbcc57a84f01db6c82eb6c1`.
[GitHub] [spark] sarutak commented on a change in pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.
sarutak commented on a change in pull request #32629:
URL: https://github.com/apache/spark/pull/32629#discussion_r637361014

## File path: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala

@@ -48,7 +48,7 @@ import org.apache.spark.tags.DockerTest
  * $ git clone https://github.com/oracle/docker-images.git
  * // Head SHA: 3e352a22618070595f823977a0fd1a3a8071a83c

Review comment:
   Ah, I hadn't paid attention to that. I'll fix it.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.
dongjoon-hyun commented on a change in pull request #32629:
URL: https://github.com/apache/spark/pull/32629#discussion_r637360891

## File path: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala

@@ -48,7 +48,7 @@ import org.apache.spark.tags.DockerTest
  * $ git clone https://github.com/oracle/docker-images.git
  * // Head SHA: 3e352a22618070595f823977a0fd1a3a8071a83c

Review comment:
   Oh, here is the SHA, @sarutak. In this case, we need to update this SHA too. What was the SHA you checked?
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.
dongjoon-hyun commented on a change in pull request #32629:
URL: https://github.com/apache/spark/pull/32629#discussion_r637360891

## File path: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala

@@ -48,7 +48,7 @@ import org.apache.spark.tags.DockerTest
  * $ git clone https://github.com/oracle/docker-images.git
  * // Head SHA: 3e352a22618070595f823977a0fd1a3a8071a83c

Review comment:
   Oh, here is the SHA, @sarutak. In this case, we need to update this SHA too.
[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation
SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846357014

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43344/
[GitHub] [spark] SparkQA removed a comment on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
SparkQA removed a comment on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846342264

**[Test build #138821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138821/testReport)** for PR 32628 at commit [`48292b7`](https://github.com/apache/spark/commit/48292b7a328b33071345b4d2686ae813ba6122d4).
[GitHub] [spark] dongjoon-hyun closed pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.
dongjoon-hyun closed pull request #32629:
URL: https://github.com/apache/spark/pull/32629
[GitHub] [spark] sarutak commented on pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.
sarutak commented on pull request #32629:
URL: https://github.com/apache/spark/pull/32629#issuecomment-846356330

> I added the link to the PR description.

Thanks!
[GitHub] [spark] dongjoon-hyun closed pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
dongjoon-hyun closed pull request #32628:
URL: https://github.com/apache/spark/pull/32628
[GitHub] [spark] SparkQA commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
SparkQA commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846355223

**[Test build #138821 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138821/testReport)** for PR 32628 at commit [`48292b7`](https://github.com/apache/spark/commit/48292b7a328b33071345b4d2686ae813ba6122d4).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] sarutak opened a new pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.
sarutak opened a new pull request #32629:
URL: https://github.com/apache/spark/pull/32629

### What changes were proposed in this pull request?
This PR changes the script name in the comment in `jdbc.OracleIntegrationSuite` and `v2.OracleIntegrationSuite`. The script is for creating the oracle docker image.

### Why are the changes needed?
The name of the script is now `buildContainerImage`, not `buildDockerImage`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
I confirmed that I can build the image with `./buildContainerImage.sh -v 18.4.0 -x`.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation
AmplabJenkins removed a comment on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846347991

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43342/
[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation
SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846351977

**[Test build #138822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138822/testReport)** for PR 32606 at commit [`514141a`](https://github.com/apache/spark/commit/514141a6790220454999a99623e274b02ac18f22).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
AmplabJenkins removed a comment on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846351707

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43343/
[GitHub] [spark] AmplabJenkins commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
AmplabJenkins commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846351707

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43343/
[GitHub] [spark] zhouyejoe commented on a change in pull request #32007: [SPARK-33350][SHUFFLE] Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data
zhouyejoe commented on a change in pull request #32007:
URL: https://github.com/apache/spark/pull/32007#discussion_r637355653

## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java

@@ -117,7 +116,7 @@ public ShuffleIndexInformation load(File file) throws IOException {
   private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo(
       AppShuffleId appShuffleId, int reduceId) {
-    File dataFile = getMergedShuffleDataFile(appShuffleId, reduceId);
+    File dataFile = getMergedShuffleDataFile(appShuffleId.appId, appShuffleId.shuffleId, reduceId);

Review comment:
   Added an attemptId check here. If the attemptId is different from the one recorded in AppAttemptPathInfo, it will return null and treat this block as a late block, so it will be ignored.

## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java

@@ -991,22 +1023,26 @@ int getNumIOExceptions() {
   /**
    * Wraps all the information related to the merge directory of an application.
    */
-  private static class AppPathsInfo {
+  private static class AppAttemptPathsInfo {
+    private final int attemptId;
     private final String[] activeLocalDirs;
     private final int subDirsPerLocalDir;

-    private AppPathsInfo(
+    private AppAttemptPathsInfo(
         String appId,
+        int attemptId,

Review comment:
   Changed to `AppAttemptShuffleId`.
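A hedged sketch of the late-block handling described in the first review comment above (the real resolver is Java; this is a simplified Scala rendering, and `appAttemptPathsInfo`/`createPartitionInfo` are assumed names for illustration):

```scala
// If the pushing client carries a stale attempt id, treat its blocks as
// "late" and drop them: the newer application attempt owns the merge
// directories now, so merging old-attempt data would corrupt the output.
def getOrCreateAppShufflePartitionInfo(
    appId: String, attemptId: Int, shuffleId: Int, reduceId: Int): AppShufflePartitionInfo = {
  val pathsInfo = appAttemptPathsInfo.get(appId)
  if (pathsInfo == null || pathsInfo.attemptId != attemptId) {
    null  // caller interprets null as "ignore this pushed block"
  } else {
    createPartitionInfo(appId, attemptId, shuffleId, reduceId)
  }
}
```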
[GitHub] [spark] dongjoon-hyun closed pull request #32624: [SPARK-35482][K8S][3.0] Use `spark.blockManager.port` not the wrong `spark.blockmanager.port` in `BasicExecutorFeatureStep`
dongjoon-hyun closed pull request #32624:
URL: https://github.com/apache/spark/pull/32624
[GitHub] [spark] SparkQA commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
SparkQA commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846348652

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43343/
[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation
SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846347985

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43342/
[GitHub] [spark] AmplabJenkins commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation
AmplabJenkins commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846347991

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43342/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.
AmplabJenkins removed a comment on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846347387

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43341/
[GitHub] [spark] AmplabJenkins commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.
AmplabJenkins commented on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846347387

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43341/
[GitHub] [spark] SparkQA commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.
SparkQA commented on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846347382

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43341/
[GitHub] [spark] HeartSaVioR commented on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark
HeartSaVioR commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846346176

Please correct me if I'm missing something here. If I understand stage-level scheduling correctly, you still need to specify "all" resources needed for "all" tasks in the StateRDD; while that may block Spark from scheduling when some resources are missing (like a lost executor with a PVC), I'm wondering how task-level scheduling would work as intended. Beyond that, locality is the only thing we can deal with, and it's not enforced, so we're back to the original problem.
[GitHub] [spark] SparkQA commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
SparkQA commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846345692

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43343/
[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation
SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846344825

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43342/
[GitHub] [spark] SparkQA commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
SparkQA commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846342264

**[Test build #138821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138821/testReport)** for PR 32628 at commit [`48292b7`](https://github.com/apache/spark/commit/48292b7a328b33071345b4d2686ae813ba6122d4).
[GitHub] [spark] sarutak opened a new pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0
sarutak opened a new pull request #32628:
URL: https://github.com/apache/spark/pull/32628

### What changes were proposed in this pull request?
This PR upgrades Dropwizard metrics to 4.2.0. I also modified the corresponding links in `docs/monitoring.md`.

### Why are the changes needed?
The latest version was released last week and it contains some improvements.
https://github.com/dropwizard/metrics/releases/tag/v4.2.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Build succeeds and all the modified links are reachable.
[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation
SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846341242

**[Test build #138820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138820/testReport)** for PR 32606 at commit [`22925c4`](https://github.com/apache/spark/commit/22925c441f22093c4e79c4bd563a05b6c1fa7ac9).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries
AmplabJenkins removed a comment on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846320283
[GitHub] [spark] AmplabJenkins commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries
AmplabJenkins commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846340882

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138817/
[GitHub] [spark] SparkQA removed a comment on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries
SparkQA removed a comment on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846305088

**[Test build #138817 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138817/testReport)** for PR 32303 at commit [`8e6cab3`](https://github.com/apache/spark/commit/8e6cab33e701a55d698a7239947ab85423010b7e).
[GitHub] [spark] SparkQA commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries
SparkQA commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846340285

**[Test build #138817 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138817/testReport)** for PR 32303 at commit [`8e6cab3`](https://github.com/apache/spark/commit/8e6cab33e701a55d698a7239947ab85423010b7e).

 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] sarutak commented on a change in pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation
sarutak commented on a change in pull request #32606:
URL: https://github.com/apache/spark/pull/32606#discussion_r637344435

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala

@@ -215,6 +217,27 @@ abstract class RemoveRedundantProjectsSuiteBase
         |LIMIT 10
         |""".stripMargin
     assertProjectExec(query, 0, 3)
+
+  }
+  Seq("true", "false").foreach { codegenEnabled =>
+    test("SPARK-35287: project generating unsafe row " +
+      s"should not be removed (codegen=$codegenEnabled)") {
+      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
+        SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> codegenEnabled,
+        SQLConf.LEAF_NODE_DEFAULT_PARALLELISM.key -> "1") {
+        withTempPath { path =>
+          val format = classOf[SimpleWritableDataSource].getName
+          spark.range(3).select($"id" as "i", $"id" as "j")
+            .write.format(format).mode("overwrite").save(path.getCanonicalPath)
+
+          val df = spark.read.format(format).load(path.getCanonicalPath)
+          val dfLeft = df.as("x")
+          val dfRight = df.as("y")
+          val join = dfLeft.filter(dfLeft("i") > 0).join(dfRight, "i")
+          assert(join.collect === Array(Row(1, 1, 1), Row(2, 2, 2)))

Review comment:
   I think so. `prepareRowVar` generates `UnsafeRow`.
   https://github.com/apache/spark/blob/b624b7e93f3ae07d9550ef3fd258d73f0eeb1bf6/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L132
   The test for `codegenEnabled = true` is there just in case.
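For context on why the surviving ProjectExec matters here, a hedged sketch of the non-codegen path, simplified from `ProjectExec.doExecute` (the codegen path reaches `UnsafeRow` through `prepareRowVar`, per the link above):

```scala
// ProjectExec evaluates its expressions through an UnsafeProjection, so even
// if the DataSourceV2 scan emits generic InternalRows, downstream operators
// that require UnsafeRow (exchanges, sorts, joins) still receive one.
child.execute().mapPartitionsWithIndexInternal { (index, iter) =>
  val project = UnsafeProjection.create(projectList, child.output)
  project.initialize(index)
  iter.map(project)  // each InternalRow comes out as an UnsafeRow
}
```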
[GitHub] [spark] sarutak commented on a change in pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation
sarutak commented on a change in pull request #32606:
URL: https://github.com/apache/spark/pull/32606#discussion_r637344188

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala

@@ -215,6 +217,27 @@ abstract class RemoveRedundantProjectsSuiteBase
         |LIMIT 10
         |""".stripMargin
     assertProjectExec(query, 0, 3)
+
+  }
+  Seq("true", "false").foreach { codegenEnabled =>
+    test("SPARK-35287: project generating unsafe row " +
+      s"should not be removed (codegen=$codegenEnabled)") {
+      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",

Review comment:
   orderBy can trigger this issue too, so I chose it for the test.
[GitHub] [spark] SparkQA commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.
SparkQA commented on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846339203

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43341/
[GitHub] [spark] yaooqinn commented on pull request #32618: [SPARK-28551][SQL][FOLLOWUP] Use the corrected hadoop conf
yaooqinn commented on pull request #32618:
URL: https://github.com/apache/spark/pull/32618#issuecomment-846336835

Nice catch. Merged to master. Thanks, all!
[GitHub] [spark] yaooqinn closed pull request #32618: [SPARK-28551][SQL][FOLLOWUP] Use the corrected hadoop conf
yaooqinn closed pull request #32618:
URL: https://github.com/apache/spark/pull/32618
[GitHub] [spark] Ngone51 commented on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark
Ngone51 commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846336246

> If the persistent volume is a resource, then it will have to be there on executor startup, so I guess a new executor checks for it on startup and advertises it. At that point, how does a task tell the difference between its state store and another one? The scheduler for resources only checks that the number of a particular resource match the task requirements, it doesn't differentiation different state store "ids". So it will assign a task to any executor with a PV.

@tgravescs To clarify, I think the state store, rather than the PV, is the resource here. And the use case here might be a bit different from the classic use case (where the resources must be specified for executor launch). In this case, the executor can tell the driver about a state store resource only the first time the state store instance is constructed. That means this special resource will be updated at runtime. (Streaming has the existing event `ReportActiveInstance` to report state store instances, and we can extend it to update the executor's state store resources.) To follow the existing custom resource management, the state store resource might be maintained as ("statestore" -> list of state store ids) as part of the custom resources; see the sketch below.

> another point - What if entire node is lost not just executor? I guess the discovery script starting up new executor would load it from HDFS???

As mentioned above, we don't need the discovery script here. But yes, the state store instances can be loaded from HDFS.
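A hedged sketch of the runtime bookkeeping proposed above (all names and structure are assumptions for illustration; `StateStoreProviderId` and `ReportActiveInstance` exist in Structured Streaming today, but this map does not):

```scala
import scala.collection.mutable
import org.apache.spark.sql.execution.streaming.state.StateStoreProviderId

// Driver-side map from executor id to the state store instances it hosts,
// updated whenever an executor reports an active instance. Unlike classic
// custom resources, this is discovered at runtime, not at executor launch.
val executorStateStores = mutable.Map.empty[String, mutable.Set[StateStoreProviderId]]

def onReportActiveInstance(executorId: String, providerId: StateStoreProviderId): Unit =
  executorStateStores.getOrElseUpdate(executorId, mutable.Set.empty) += providerId

// A scheduling plugin could then prefer executors already hosting the store:
def preferredExecutors(providerId: StateStoreProviderId): Seq[String] =
  executorStateStores.collect { case (exec, stores) if stores(providerId) => exec }.toSeq
```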
[GitHub] [spark] yaooqinn commented on pull request #32624: [SPARK-35482][K8S][3.0] Use `spark.blockManager.port` not the wrong `spark.blockmanager.port` in `BasicExecutorFeatureStep`
yaooqinn commented on pull request #32624: URL: https://github.com/apache/spark/pull/32624#issuecomment-846335887 cc @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.
AmplabJenkins commented on pull request #32627: URL: https://github.com/apache/spark/pull/32627#issuecomment-846334587 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138819/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.
SparkQA commented on pull request #32627: URL: https://github.com/apache/spark/pull/32627#issuecomment-846334582 **[Test build #138819 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138819/testReport)** for PR 32627 at commit [`57997ae`](https://github.com/apache/spark/commit/57997aed8595ed4627aba5f4f13d2363987b1e2e). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class SparkIndexOpsMethods(metaclass=ABCMeta):` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.
SparkQA commented on pull request #32627: URL: https://github.com/apache/spark/pull/32627#issuecomment-846334449 **[Test build #138819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138819/testReport)** for PR 32627 at commit [`57997ae`](https://github.com/apache/spark/commit/57997aed8595ed4627aba5f4f13d2363987b1e2e). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types
AmplabJenkins commented on pull request #32626: URL: https://github.com/apache/spark/pull/32626#issuecomment-846334351 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark
HeartSaVioR edited a comment on pull request #32136: URL: https://github.com/apache/spark/pull/32136#issuecomment-846329353 I'm not sure about the scenario of leveraging a PVC as the checkpoint location - at least that sounds to me like it goes beyond what checkpointing in Structured Streaming supports. We have clearly described the requirement for the checkpoint location in the Structured Streaming guide page, as follows: > Checkpoint location: For some output sinks where the end-to-end fault-tolerance can be guaranteed, specify the location where the system will write all the checkpoint information. This should be a directory in an HDFS-compatible fault-tolerant file system. The semantics of checkpointing is discussed in more detail in the next section. I know we allow custom checkpoint manager implementations to deal with non-HDFS-compatible file systems (like object stores which don't provide "atomic rename"), but they still deal with a "remote", "fault-tolerant" file system, and don't require the Spark scheduler to schedule a specific task to a specific executor based on the availability of the checkpoint location. In other words, only the checkpoint manager handles the complexity of checkpointing on a file system, nothing else does. And it sounds like that no longer holds true if we want to support PVC-based checkpoints. Please correct me if I'm missing something. I'm more of a novice on cloud/k8s, but common sense suggests the actual storage behind a PVC should still be a sort of network storage in order to be resilient to a physical node going down. I'm wondering how much benefit the PVC approach gives compared to the existing approach of just directly using a remote fault-tolerant file system. The benefits should clearly justify the additional complexity. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
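A minimal sketch of the documented requirement (assuming `df` is an existing streaming DataFrame; the HDFS paths and sink format are illustrative):

```python
# The checkpoint location must be a directory on an HDFS-compatible,
# fault-tolerant file system, supplied via the checkpointLocation
# option when starting the streaming query.
query = (
    df.writeStream
      .format("parquet")
      .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/my-query")
      .option("path", "hdfs://namenode:8020/output/my-query")
      .start()
)
```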
[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
ueshin commented on a change in pull request #32611: URL: https://github.com/apache/spark/pull/32611#discussion_r637338960 ## File path: python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py ## @@ -33,107 +35,148 @@ def pser(self): def psser(self): return ps.from_pandas(self.pser) +@property +def float_pser(self): +return pd.Series([1, 2, 3], dtype=float) + +@property +def float_psser(self): +return ps.from_pandas(self.float_pser) + def test_add(self): -self.assertRaises(TypeError, lambda: self.psser + 1) -self.assertRaises(TypeError, lambda: self.psser + 0.1) +self.assert_eq(self.pser + 1, self.psser + 1) +self.assert_eq(self.pser + 0.1, self.psser + 0.1) with option_context("compute.ops_on_diff_frames", True): -for psser in self.pssers: +for pser, psser in self.numeric_pser_psser_pairs: +self.assert_eq(self.pser + pser, (self.psser + psser).sort_index()) + +for psser in self.non_numeric_pssers.values(): self.assertRaises(TypeError, lambda: self.psser + psser) def test_sub(self): -self.assertRaises(TypeError, lambda: self.psser - 1) -self.assertRaises(TypeError, lambda: self.psser - 0.1) +self.assert_eq(self.pser - 1, self.psser - 1) +self.assert_eq(self.pser - 0.1, self.psser - 0.1) with option_context("compute.ops_on_diff_frames", True): -for psser in self.pssers: +for pser, psser in self.numeric_pser_psser_pairs: +self.assert_eq(self.pser - pser, (self.psser - psser).sort_index()) + +for psser in self.non_numeric_pssers.values(): self.assertRaises(TypeError, lambda: self.psser - psser) def test_mul(self): -self.assertRaises(TypeError, lambda: self.psser * 1) -self.assertRaises(TypeError, lambda: self.psser * 0.1) +self.assert_eq(self.pser * 1, self.psser * 1) +self.assert_eq(self.pser * 0.1, self.psser * 0.1) with option_context("compute.ops_on_diff_frames", True): -for psser in self.pssers: +for pser, psser in self.numeric_pser_psser_pairs: +self.assert_eq(self.pser * pser, (self.psser * psser).sort_index()) + +for psser in self.non_numeric_pssers.values(): self.assertRaises(TypeError, lambda: self.psser * psser) def test_truediv(self): -self.assertRaises(TypeError, lambda: self.psser / 1) -self.assertRaises(TypeError, lambda: self.psser / 0.1) +self.assert_eq(self.pser / 1, self.psser / 1) +self.assert_eq(self.pser / 0.1, self.psser / 0.1) with option_context("compute.ops_on_diff_frames", True): -for psser in self.pssers: +self.assert_eq( +self.pser / self.float_pser, (self.psser / self.float_psser).sort_index()) + +for psser in self.non_numeric_pssers.values(): self.assertRaises(TypeError, lambda: self.psser / psser) def test_floordiv(self): -self.assertRaises(TypeError, lambda: self.psser // 1) -self.assertRaises(TypeError, lambda: self.psser // 0.1) +# float is always returned in pandas-on-Spark +self.assert_eq((self.pser // 1).astype("float"), self.psser // 1) +# in pandas, 1 // 0.1 = 9.0; in pandas-on-Spark, 1 // 0.1 = 10.0 +# self.assert_eq(self.pser // 0.1, self.psser // 0.1) with option_context("compute.ops_on_diff_frames", True): -for psser in self.pssers: +self.assert_eq( +self.pser // self.float_pser, (self.psser // self.float_psser).sort_index() +) + +for psser in self.non_numeric_pssers.values(): self.assertRaises(TypeError, lambda: self.psser // psser) def test_mod(self): -self.assertRaises(TypeError, lambda: self.psser % 1) -self.assertRaises(TypeError, lambda: self.psser % 0.1) +self.assert_eq(self.pser % 1, self.psser % 1) +self.assert_eq(self.pser % 0.1, self.psser % 0.1) with option_context("compute.ops_on_diff_frames", True): -for psser in self.pssers: 
+for pser, psser in self.numeric_pser_psser_pairs: +self.assert_eq(self.pser % pser, (self.psser % psser).sort_index()) + +for psser in self.non_numeric_pssers.values(): self.assertRaises(TypeError, lambda: self.psser % psser) def test_pow(self): -self.assertRaises(TypeError, lambda: self.psser ** 1) -self.assertRaises(TypeError, lambda: self.psser ** 0.1) +# float is always returned in pandas-on-Spark +self.assert_eq((self.pser ** 1).astype("float"), self.psser ** 1) +self.assert_eq(self.pser ** 0.1, self.psser ** 0.1) with option_context("compute.ops_on_diff_frames",
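The diff above flips these tests from expecting `TypeError` to asserting pandas-equivalent results. A short sketch of the behavior the updated tests exercise (assuming pandas-on-Spark with this change applied):

```python
# With the change applied, bool operands are coerced to numeric
# (True/False as 1/0) instead of raising TypeError, matching pandas.
import pyspark.pandas as ps

psser = ps.Series([True, True, False])
print((psser + 1).to_pandas().tolist())    # [2, 2, 1]
print((psser * 0.1).to_pandas().tolist())  # [0.1, 0.1, 0.0]
```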
[GitHub] [spark] HeartSaVioR commented on a change in pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark
HeartSaVioR commented on a change in pull request #32136: URL: https://github.com/apache/spark/pull/32136#discussion_r637338852 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulingPlugin.scala ## @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +/** + * This trait provides a plugin interface for suggesting task scheduling to Spark + * scheduler. + */ +private[spark] trait TaskSchedulingPlugin { Review comment: > I'm not sure now if we want it to be open to implement outside Spark. IMHO I'm thinking the opposite on this. I'm not sure if we want to add an implementation to the Spark codebase unless the implementation of the plugin proves that it works well for all possible cases. I'd rather be OK with letting Spark users (including us) be hackers and customize the scheduler "at their own risk", but I have to be conservative if we want to change the behavior of Spark. If changing the behavior of Spark to cope with state by default is the final goal, I think the behavioral change itself is a major one requiring a SPIP, not something we can simply go ahead with after doing a POC on the happy case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
ueshin commented on a change in pull request #32611: URL: https://github.com/apache/spark/pull/32611#discussion_r637338469 ## File path: python/pyspark/pandas/data_type_ops/num_ops.py ## @@ -16,31 +16,46 @@ # import numbers -from typing import TYPE_CHECKING, Union +from typing import Any, TYPE_CHECKING, Union import numpy as np from pandas.api.types import CategoricalDtype from pyspark.sql import Column, functions as F from pyspark.sql.types import ( +BooleanType, NumericType, StringType, TimestampType, ) from pyspark.pandas.base import column_op, IndexOpsMixin, numpy_column_op -from pyspark.pandas.data_type_ops.base import DataTypeOps +from pyspark.pandas.data_type_ops.base import DataTypeOps, transform_boolean_operand_to_numeric from pyspark.pandas.spark import functions as SF +from pyspark.sql.column import Column + if TYPE_CHECKING: from pyspark.pandas.indexes import Index # noqa: F401 (SPARK-34943) from pyspark.pandas.series import Series # noqa: F401 (SPARK-34943) +def is_valid_operand_for_numeric_arithmetic(operand: Any) -> bool: +"""Check whether the operand is valid for arithmetic operations against numerics.""" +if isinstance(operand, numbers.Number): +return True +elif isinstance(operand, IndexOpsMixin): +if isinstance(operand.dtype, CategoricalDtype): +return False +else: +return isinstance(operand.spark.data_type, NumericType) or isinstance( +operand.spark.data_type, BooleanType) +else: +return isinstance(operand, Column) Review comment: Do we currently allow it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types
SparkQA removed a comment on pull request #32626: URL: https://github.com/apache/spark/pull/32626#issuecomment-846304945 **[Test build #138816 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138816/testReport)** for PR 32626 at commit [`a7dd95c`](https://github.com/apache/spark/commit/a7dd95c625809a7b443a9043b90bfb3a718d373d). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types
SparkQA commented on pull request #32626: URL: https://github.com/apache/spark/pull/32626#issuecomment-846329637 **[Test build #138816 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138816/testReport)** for PR 32626 at commit [`a7dd95c`](https://github.com/apache/spark/commit/a7dd95c625809a7b443a9043b90bfb3a718d373d). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class ComplexOps(DataTypeOps):` * `class ArrayOps(ComplexOps):` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin opened a new pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.
ueshin opened a new pull request #32627: URL: https://github.com/apache/spark/pull/32627 ### What changes were proposed in this pull request? Adds more type annotations in the files: - `python/pyspark/pandas/spark/accessors.py` - `python/pyspark/pandas/typedef/typehints.py` - `python/pyspark/pandas/utils.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more `disallow_untyped_defs` mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
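A small illustration (not the actual PR diff) of what `disallow_untyped_defs` enforces once enabled in the mypy configuration: every function signature must be fully annotated. The helper below is a hypothetical simplification of a utility in `python/pyspark/pandas/utils.py`:

```python
# Rejected by mypy under disallow_untyped_defs = True, because the
# signature carries no annotations:
#
#     def validate_axis(axis=0):
#         ...

# Accepted: both the parameter and the return type are annotated.
def validate_axis(axis: int = 0) -> int:
    if axis not in (0, 1):
        raise ValueError("axis should be either 0 or 1")
    return axis
```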
[GitHub] [spark] SparkQA commented on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types
SparkQA commented on pull request #32626: URL: https://github.com/apache/spark/pull/32626#issuecomment-846328869 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43338/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
AmplabJenkins commented on pull request #32611: URL: https://github.com/apache/spark/pull/32611#issuecomment-846328532 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43340/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
SparkQA commented on pull request #32611: URL: https://github.com/apache/spark/pull/32611#issuecomment-846328527 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43340/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
AmplabJenkins commented on pull request #32611: URL: https://github.com/apache/spark/pull/32611#issuecomment-846326242 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138818/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
ueshin commented on a change in pull request #32611: URL: https://github.com/apache/spark/pull/32611#discussion_r637332388 ## File path: python/pyspark/pandas/data_type_ops/boolean_ops.py ## @@ -15,7 +15,27 @@ # limitations under the License. # -from pyspark.pandas.data_type_ops.base import DataTypeOps +import numbers +from typing import TYPE_CHECKING, Union + +import numpy as np +from pandas.api.types import CategoricalDtype + +from pyspark.pandas.base import column_op, IndexOpsMixin, numpy_column_op +from pyspark.pandas.data_type_ops.base import DataTypeOps, transform_boolean_operand_to_numeric +from pyspark.pandas.typedef.typehints import as_spark_type +from pyspark.sql import Column, functions as F +from pyspark.sql.types import NumericType + +if TYPE_CHECKING: +from pyspark.pandas.indexes import Index # noqa: F401 (SPARK-34943) +from pyspark.pandas.series import Series # noqa: F401 (SPARK-34943) + + +def is_numeric_index_ops(index_ops: IndexOpsMixin) -> bool: +"""Check if the given index_ops is numeric IndexOpsMixin.""" +return isinstance(index_ops.spark.data_type, NumericType) and ( +not isinstance(index_ops.dtype, CategoricalDtype)) Review comment: I guess we can reuse `is_valid_operand_for_numeric_arithmetic` in `num_ops.py`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
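A sketch of the suggested reuse (method body abbreviated; only the guard is the point, and the exact call sites are an assumption):

```python
# boolean_ops.py could reuse the operand check already defined in
# num_ops.py instead of keeping its own is_numeric_index_ops helper.
from pyspark.pandas.data_type_ops.num_ops import (
    is_valid_operand_for_numeric_arithmetic,
)

def add(self, left, right):
    if not is_valid_operand_for_numeric_arithmetic(right):
        raise TypeError(
            "Addition can not be applied to %s and the given type."
            % self.pretty_name)
    ...  # proceed with the numeric coercion as before
```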
[GitHub] [spark] SparkQA commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
SparkQA commented on pull request #32611: URL: https://github.com/apache/spark/pull/32611#issuecomment-846324042 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43340/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
ueshin commented on a change in pull request #32611: URL: https://github.com/apache/spark/pull/32611#discussion_r637328527 ## File path: python/pyspark/pandas/data_type_ops/boolean_ops.py ## @@ -26,3 +46,183 @@ class BooleanOps(DataTypeOps): @property def pretty_name(self) -> str: return 'booleans' + +def add(self, left, right) -> Union["Series", "Index"]: +if isinstance(right, numbers.Number): +left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right)))) +return column_op(Column.__add__)(left, right) Review comment: I guess `return left + right` should be fine. The `DataTypeOps` associated with `left` after the conversion should know how to handle it better. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
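A sketch of the suggested simplification (assuming the surrounding imports of `numbers` and `as_spark_type`, and the `spark.transform` accessor shown in the diff above): once `left` has been cast to the scalar's numeric type, the plain `+` dispatches to the `DataTypeOps` of the converted operand, so no explicit `column_op` call is needed:

```python
import numbers

from pyspark.pandas.typedef.typehints import as_spark_type

def add(self, left, right):
    if isinstance(right, numbers.Number):
        # Cast the bool column to the numeric type of the scalar...
        left = left.spark.transform(
            lambda scol: scol.cast(as_spark_type(type(right))))
        # ...and let the numeric DataTypeOps of the converted `left`
        # handle the actual addition.
        return left + right
    ...
```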
[GitHub] [spark] SparkQA commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
SparkQA commented on pull request #32611: URL: https://github.com/apache/spark/pull/32611#issuecomment-846321739 **[Test build #138818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138818/testReport)** for PR 32611 at commit [`925e34b`](https://github.com/apache/spark/commit/925e34b3bffb6496d62425fa08d36177589cb17d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
ueshin commented on a change in pull request #32611: URL: https://github.com/apache/spark/pull/32611#discussion_r637323275 ## File path: python/pyspark/pandas/tests/data_type_ops/test_num_ops.py ## @@ -42,7 +54,12 @@ def test_add(self): self.assertRaises(TypeError, lambda: psser + self.non_numeric_pssers["datetime"]) self.assertRaises(TypeError, lambda: psser + self.non_numeric_pssers["date"]) self.assertRaises(TypeError, lambda: psser + self.non_numeric_pssers["categorical"]) -self.assertRaises(TypeError, lambda: psser + self.non_numeric_pssers["bool"]) +self.assert_eq( +(psser + self.non_numeric_pssers["bool"]).sort_index(), +pser + self.non_numeric_psers["bool"], +) + +self.assertRaises(TypeError, lambda: self.float_psser + True) Review comment: This is supposed to be out of for-loop? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
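A sketch of the reviewer's point (the indentation below is an assumption, since the archived diff flattens it): the scalar assertion does not use the loop variables, so it reads as if it belongs outside the for-loop:

```python
for pser, psser in self.numeric_pser_psser_pairs:
    self.assert_eq(pser + pser, psser + psser)

# Independent of the loop variables, so dedented out of the loop:
self.assertRaises(TypeError, lambda: self.float_psser + True)
```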
[GitHub] [spark] AmplabJenkins commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries
AmplabJenkins commented on pull request #32303: URL: https://github.com/apache/spark/pull/32303#issuecomment-846320283 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43339/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries
SparkQA commented on pull request #32303: URL: https://github.com/apache/spark/pull/32303#issuecomment-846320278 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43339/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
ueshin commented on a change in pull request #32611: URL: https://github.com/apache/spark/pull/32611#discussion_r637311351 ## File path: python/pyspark/pandas/tests/data_type_ops/test_num_ops.py ## @@ -30,6 +34,14 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils): returns float32. The underlying reason is the respective Spark operations return DoubleType always. """ +@property +def float_pser(self): +return pd.Series([1, 2, 3], dtype=float) + +@property +def float_psser(self): +return ps.from_pandas(self.float_pser) + def test_add(self): for pser, psser in self.numeric_pser_psser_pairs: self.assert_eq(pser + pser, psser + psser) Review comment: Isn't this PR supposed to support it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
SparkQA commented on pull request #32611: URL: https://github.com/apache/spark/pull/32611#issuecomment-846317781 **[Test build #138818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138818/testReport)** for PR 32611 at commit [`925e34b`](https://github.com/apache/spark/commit/925e34b3bffb6496d62425fa08d36177589cb17d). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #24801: [SPARK-27950][DSTREAMS][Kinesis] dynamoDBEndpointUrl and cloudWatchMetricsLevel for Kinesis
github-actions[bot] closed pull request #24801: URL: https://github.com/apache/spark/pull/24801 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #30681: [Spark-33657][SQL]After spark sql is executed to generate hdfs data, the relevant status information is printed
github-actions[bot] commented on pull request #30681: URL: https://github.com/apache/spark/pull/30681#issuecomment-846316749 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #31509: [SPARK-34396][SQL] Add a new build-in function delegate
github-actions[bot] closed pull request #31509: URL: https://github.com/apache/spark/pull/31509 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-databricks commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
xinrong-databricks commented on a change in pull request #32611: URL: https://github.com/apache/spark/pull/32611#discussion_r637305932 ## File path: python/pyspark/pandas/tests/data_type_ops/test_num_ops.py ## @@ -30,6 +34,14 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils): returns float32. The underlying reason is the respective Spark operations return DoubleType always. """ +@property +def float_pser(self): +return pd.Series([1, 2, 3], dtype=float) + +@property +def float_psser(self): +return ps.from_pandas(self.float_pser) + def test_add(self): for pser, psser in self.numeric_pser_psser_pairs: self.assert_eq(pser + pser, psser + psser) Review comment: Arithmetic operations involving `bool` are not supported yet; a `TypeError` is raised for these cases for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types
SparkQA commented on pull request #32626: URL: https://github.com/apache/spark/pull/32626#issuecomment-846314883 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43338/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries
SparkQA commented on pull request #32303: URL: https://github.com/apache/spark/pull/32303#issuecomment-846313915 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43339/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-databricks commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
xinrong-databricks commented on a change in pull request #32611: URL: https://github.com/apache/spark/pull/32611#discussion_r637301587 ## File path: python/pyspark/pandas/tests/data_type_ops/test_date_ops.py ## @@ -60,7 +61,9 @@ def test_sub(self): with option_context("compute.ops_on_diff_frames", True): for pser, psser in self.pser_psser_pairs: if isinstance(psser.spark.data_type, DateType): -self.assert_eq((self.pser - pser).dt.days, (self.psser - psser).sort_index()) +if LooseVersion(pd.__version__) >= LooseVersion("0.24.2"): +self.assert_eq( +(self.pser - pser).dt.days, (self.psser - psser).sort_index()) Review comment: It raises a weird TypeError in pandas in the `else` case, as shown in https://github.com/databricks/koalas/runs/2568962950. No, the change is not related to this PR. Let me remove it for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-databricks commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
xinrong-databricks commented on a change in pull request #32611: URL: https://github.com/apache/spark/pull/32611#discussion_r637299344 ## File path: python/pyspark/pandas/data_type_ops/boolean_ops.py ## @@ -26,3 +41,183 @@ class BooleanOps(DataTypeOps): @property def pretty_name(self) -> str: return 'booleans' + +def add(self, left, right): +if isinstance(right, numbers.Number): +left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right)))) +return column_op(Column.__add__)(left, right) +elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right): +left = transform_boolean_operand_to_numeric(left, right.spark.data_type) +return column_op(Column.__add__)(left, right) +else: +raise TypeError( +"Addition can not be applied to %s and the given type." % self.pretty_name) + +def sub(self, left, right): +if isinstance(right, numbers.Number): +left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right)))) +return column_op(Column.__sub__)(left, right) +elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right): +left = transform_boolean_operand_to_numeric(left, right.spark.data_type) +return column_op(Column.__sub__)(left, right) +else: +raise TypeError( +"Subtraction can not be applied to %s and the given type." % self.pretty_name) + +def mul(self, left, right): +if isinstance(right, numbers.Number): +left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right)))) +return column_op(Column.__mul__)(left, right) +elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right): +left = transform_boolean_operand_to_numeric(left, right.spark.data_type) +return column_op(Column.__mul__)(left, right) +else: +raise TypeError( +"Multiplication can not be applied to %s and the given type." % self.pretty_name) + +def truediv(self, left, right): +def truediv(left, right): +return F.when(F.lit(right != 0) | F.lit(right).isNull(), left.__div__(right)).otherwise( +F.when(F.lit(left == np.inf) | F.lit(left == -np.inf), left).otherwise( +F.lit(np.inf).__div__(left) +) +) + +if isinstance(right, numbers.Number): +left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right)))) +return numpy_column_op(truediv)(left, right) +elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right): +left = transform_boolean_operand_to_numeric(left, right.spark.data_type) +return numpy_column_op(truediv)(left, right) +else: +raise TypeError( +"True division can not be applied to %s and the given type." % self.pretty_name) + +def floordiv(self, left, right): +def floordiv(left, right): +return F.when(F.lit(right is np.nan), np.nan).otherwise( +F.when( +F.lit(right != 0) | F.lit(right).isNull(), F.floor(left.__div__(right)) +).otherwise( +F.when(F.lit(left == np.inf) | F.lit(left == -np.inf), left).otherwise( +F.lit(np.inf).__div__(left) +) +) +) + +if isinstance(right, numbers.Number): +left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right)))) +return numpy_column_op(floordiv)(left, right) +elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right): +left = transform_boolean_operand_to_numeric(left, right.spark.data_type) +return numpy_column_op(floordiv)(left, right) +else: +raise TypeError( +"Floor division can not be applied to %s and the given type." % self.pretty_name) + +def mod(self, left, right): +def mod(left, right): +return ((left % right) + right) % right + +if isinstance(right, numbers.Number): +left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right)))) +return column_op(mod)(left, right) +elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right): +left = transform_boolean_operand_to_numeric(left, right.spark.data_type) +return column_op(mod)(left, right) +else: +raise TypeError( +"Modulo can not be applied to %s and the given type." % self.pretty_name) + +def pow(self, left, right): +def pow_func(left, right): +return F.when(left == 1, left).otherwise(Column.__pow__(left, right)) + +if isinstance(right, numbers.Number): +