[GitHub] [spark] viirya commented on a change in pull request #32513: [SPARK-35378][SQL] Eagerly execute non-root Command

2021-05-21 Thread GitBox


viirya commented on a change in pull request #32513:
URL: https://github.com/apache/spark/pull/32513#discussion_r637365124



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
##
@@ -74,12 +75,26 @@ class QueryExecution(
 sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
   }
 
+  lazy val nonRootCommandExecuted: LogicalPlan = analyzed mapChildren { child =>
+    child transformDown {
+      // SPARK-35378: Eagerly execute non-root Command

Review comment:
   Instead of just a JIRA number, since this is not a TODO item, could you add a short comment here explaining why you need to eagerly execute a `Command` under the top plan node?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #32513: [SPARK-35378][SQL] Eagerly execute non-root Command

2021-05-21 Thread GitBox


viirya commented on a change in pull request #32513:
URL: https://github.com/apache/spark/pull/32513#discussion_r637364885



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
##
@@ -74,12 +75,26 @@ class QueryExecution(
 sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
   }
 
+  lazy val nonRootCommandExecuted: LogicalPlan = analyzed mapChildren { child =>
+    child transformDown {
+      // SPARK-35378: Eagerly execute non-root Command
+      case c: Command =>
+        val qe = sparkSession.sessionState.executePlan(c)
+        CommandResult(
+          qe.nonRootCommandExecuted.output,
+          qe.nonRootCommandExecuted,

Review comment:
   Can a `Command` contain another `Command`? Otherwise, can we just use `qe.analyzed.output` here?
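To make the change under review concrete, here is a minimal, self-contained sketch of the idea, using hypothetical stand-in classes (`Plan`, `Node`, `Command`, `CommandResult` below are simplifications, not the actual Catalyst API): commands found strictly below the root are eagerly "executed" and replaced by a `CommandResult` leaf holding their output, while a command at the root is left alone.

```scala
// Hypothetical stand-ins for Catalyst's LogicalPlan / Command / CommandResult.
sealed trait Plan
case class Node(name: String, children: Seq[Plan]) extends Plan
case class Command(name: String) extends Plan
case class CommandResult(output: String, executed: Plan) extends Plan

// Simplified transformDown: apply the rule to the node, then recurse into children.
def transformDown(p: Plan)(rule: PartialFunction[Plan, Plan]): Plan =
  rule.applyOrElse(p, identity[Plan]) match {
    case Node(n, cs) => Node(n, cs.map(transformDown(_)(rule)))
    case other       => other
  }

// Mirror of the `nonRootCommandExecuted` idea: the root itself is untouched
// (mapChildren), and only commands below it are eagerly "executed".
def eagerlyExecuteNonRootCommands(root: Plan): Plan = root match {
  case Node(n, cs) =>
    Node(n, cs.map(child => transformDown(child) {
      case c: Command => CommandResult(s"rows-of-${c.name}", c)
    }))
  case other => other // e.g. a root-level Command stays as-is
}

val plan = Node("Union", Seq(Command("insert"), Node("Project", Seq(Command("create")))))
println(eagerlyExecuteNonRootCommands(plan))
```

In this toy model the question raised above is visible: if a `Command` could nest another `Command`, the replacement would need to recurse into the executed plan as well; otherwise the output of the analyzed command alone suffices.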







[GitHub] [spark] viirya commented on a change in pull request #32513: [SPARK-35378][SQL] Eagerly execute non-root Command

2021-05-21 Thread GitBox


viirya commented on a change in pull request #32513:
URL: https://github.com/apache/spark/pull/32513#discussion_r637364733



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
##
@@ -74,12 +75,26 @@ class QueryExecution(
 sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
   }
 
+  lazy val nonRootCommandExecuted: LogicalPlan = analyzed mapChildren { child =>

Review comment:
   Oh, I see. You mean a `Command` that is not the top-level node. My first thought was a command not run as root...







[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation

2021-05-21 Thread GitBox


SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846363539


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43344/
   





[GitHub] [spark] viirya commented on a change in pull request #32513: [SPARK-35378][SQL] Eagerly execute non-root Command

2021-05-21 Thread GitBox


viirya commented on a change in pull request #32513:
URL: https://github.com/apache/spark/pull/32513#discussion_r637363738



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
##
@@ -74,12 +75,26 @@ class QueryExecution(
 sparkSession.sessionState.analyzer.executeAndCheck(logical, tracker)
   }
 
+  lazy val nonRootCommandExecuted: LogicalPlan = analyzed mapChildren { child =>

Review comment:
   I'm a bit confused. Can you define what a non-root command is?







[GitHub] [spark] sarutak edited a comment on pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.

2021-05-21 Thread GitBox


sarutak edited a comment on pull request #32631:
URL: https://github.com/apache/spark/pull/32631#issuecomment-846361308


   cc: @maropu 
   As I commented in JIRA, I've been working on this issue.
   This seems different from your solution. What do you think?





[GitHub] [spark] viirya commented on a change in pull request #32513: [SPARK-35378][SQL] Eagerly execute non-root Command

2021-05-21 Thread GitBox


viirya commented on a change in pull request #32513:
URL: https://github.com/apache/spark/pull/32513#discussion_r637363690



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/CommandResultExec.scala
##
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Attribute, UnsafeProjection}
+import org.apache.spark.sql.catalyst.plans.QueryPlan
+import org.apache.spark.sql.execution.metric.SQLMetrics
+
+/**
+ * Physical plan node for holding data from a command.
+ *
+ * `rows` may not be serializable and ideally we should not send `rows` and `unsafeRows`

Review comment:
   Regarding `rows` and `unsafeRows`: if you mean the classes `Row` and `UnsafeRow`, please use the correct class names.







[GitHub] [spark] sarutak commented on pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.

2021-05-21 Thread GitBox


sarutak commented on pull request #32631:
URL: https://github.com/apache/spark/pull/32631#issuecomment-846361308


   cc: @maropu 





[GitHub] [spark] sarutak opened a new pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.

2021-05-21 Thread GitBox


sarutak opened a new pull request #32631:
URL: https://github.com/apache/spark/pull/32631


   ### What changes were proposed in this pull request?
   
   This PR proposes to add `docker-integration-tests` to `run-tests.py` and GA.
   `docker-integration-tests` can't run if Docker is not installed, so it runs only if `docker-integration-tests` is specified with `--module`.
   
   ### Why are the changes needed?
   
   CI for `docker-integration-tests` is absent for now.
   
   ### Does this PR introduce _any_ user-facing change?
   
   GA.
   
   ### How was this patch tested?
   
   





[GitHub] [spark] viirya commented on pull request #32272: [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata

2021-05-21 Thread GitBox


viirya commented on pull request #32272:
URL: https://github.com/apache/spark/pull/32272#issuecomment-846359351


   Sorry for the delay. I will find some time over the weekend to look at this.





[GitHub] [spark] maropu commented on pull request #32605: [WIP][SPARK-35446] Override getJDBCType in MySQLDialect to map FloatType to FLOAT

2021-05-21 Thread GitBox


maropu commented on pull request #32605:
URL: https://github.com/apache/spark/pull/32605#issuecomment-846359322


   > Thank you for pinging me, @HyukjinKwon. BTW, is it better to run these integration tests in GA, too?
   
   See https://github.com/apache/spark/pull/32620 for the related work.





[GitHub] [spark] viirya commented on pull request #32559: [SPARK-35410][SQL] SubExpr elimination should not include redundant children exprs in conditional expression

2021-05-21 Thread GitBox


viirya commented on pull request #32559:
URL: https://github.com/apache/spark/pull/32559#issuecomment-846359212


   #32586 was merged. Could you take a look at this one and see if it is good to go? Thanks. cc @cloud-fan @dongjoon-hyun @maropu 





[GitHub] [spark] viirya commented on a change in pull request #32559: [SPARK-35410][SQL] SubExpr elimination should not include redundant children exprs in conditional expression

2021-05-21 Thread GitBox


viirya commented on a change in pull request #32559:
URL: https://github.com/apache/spark/pull/32559#discussion_r637362691



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
##
@@ -83,20 +83,35 @@ class EquivalentExpressions {
   * Adds only expressions which are common in each of given expressions, in a recursive way.
   * For example, given two expressions `(a + (b + (c + 1)))` and `(d + (e + (c + 1)))`,
   * the common expression `(c + 1)` will be added into `equivalenceMap`.
+   *
+   * Note that as we don't know in advance if any child node of an expression will be common
+   * across all given expressions, we count all child nodes when looking through the given
+   * expressions. But when we call `addExprTree` to add common expressions into the map, we
+   * will recursively add the child nodes. So we need to filter the child expressions first.
+   * For example, if `((a + b) + c)` and `(a + b)` are common expressions, we only add
+   * `((a + b) + c)`.
*/
   private def addCommonExprs(
       exprs: Seq[Expression],
       addFunc: Expression => Boolean = addExpr): Unit = {
     val exprSetForAll = mutable.Set[Expr]()

Review comment:
   Both `1 + col` and `col + 1` will be replaced with the extracted subexpression during codegen. We don't just look up the key in `equivalenceMap` when replacing with the subexpression.
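The filtering described in the quoted doc comment can be illustrated with a small, self-contained sketch (the `Expr`/`Leaf`/`Add` classes below are hypothetical simplifications, not the real Catalyst `Expression` or `EquivalentExpressions` API): collect all subtrees of each input expression, intersect the sets, then keep only the maximal common subtrees, so that if `((a + b) + c)` and `(a + b)` are both common, only `((a + b) + c)` is retained.

```scala
// Hypothetical simplified expression trees (not Catalyst's Expression API).
sealed trait Expr { def children: Seq[Expr] }
case class Leaf(name: String) extends Expr { def children: Seq[Expr] = Nil }
case class Add(l: Expr, r: Expr) extends Expr { def children: Seq[Expr] = Seq(l, r) }

// All subtrees of an expression, including the expression itself.
def subtrees(e: Expr): Set[Expr] = e.children.flatMap(subtrees).toSet + e

// Expressions common to every input, keeping only maximal ones: a common
// subtree contained in a larger common subtree is filtered out, because
// adding the larger tree (recursively) already covers its children.
def commonExprs(exprs: Seq[Expr]): Set[Expr] = {
  val common = exprs.map(subtrees).reduce(_ intersect _)
  common.filterNot(e => common.exists(o => o != e && subtrees(o).contains(e)))
}

// (a + (b + (c + 1))) and (d + (e + (c + 1))) share only (c + 1).
val cPlus1 = Add(Leaf("c"), Leaf("1"))
val e1 = Add(Leaf("a"), Add(Leaf("b"), cPlus1))
val e2 = Add(Leaf("d"), Add(Leaf("e"), cPlus1))
println(commonExprs(Seq(e1, e2)))
```

Note this sketch uses structural equality, whereas the comment above points out that the real implementation also canonicalizes semantically equal expressions such as `1 + col` and `col + 1`.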







[GitHub] [spark] maropu commented on pull request #32620: [WIP][SPARK-35483][SQL][TESTS] Add a new GA test job for the docker integration tests

2021-05-21 Thread GitBox


maropu commented on pull request #32620:
URL: https://github.com/apache/spark/pull/32620#issuecomment-846359132


   I used the wrong jira number, so I fixed it.





[GitHub] [spark] viirya commented on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark

2021-05-21 Thread GitBox


viirya commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846358670


   > I'm not sure about the scenario of leveraging PVC as checkpoint location - 
at least that sounds to me as beyond the support of checkpoint in Structured 
Streaming.
   
   I agree with this, and yes, this is the current status. That said, we are going to propose a new approach to support checkpointing in Structured Streaming. Unfortunately, because scheduling is bound to stateful tasks (i.e., state store locations), we cannot achieve the goal without touching other modules, like core.

   > I'm more likely novice on cloud/k8s, but from the common sense, I guess 
the actual storage of PVC should be still a sort of network storage to be 
resilient on "physical node down". I'm wondering how much benefits PVC approach 
gives compared to the existing approach as just directly use remote 
fault-tolerant file system. The benefits should be clear to cope with 
additional complexity.
   
   Technically, a PVC is an abstract way to refer to the volume mounted on the container running the executor. It could be local storage on k8s nodes; it depends on where the PVC is bound.
   
   HDFS becomes a bottleneck for our streaming jobs. The throughput to HDFS, and the load the number of files puts on the name nodes, are serious issues when using it as the checkpoint destination for heavy streaming jobs at scale. Using PVCs for checkpoints could be a huge relief for the HDFS load. There are also other benefits, like better latency and a simplified streaming architecture. Personally, I think these are enough benefits to motivate our proposal.
   
   
   





[GitHub] [spark] SparkQA commented on pull request #32630: [Minor][Followup] Update SHA for the oracle docker image

2021-05-21 Thread GitBox


SparkQA commented on pull request #32630:
URL: https://github.com/apache/spark/pull/32630#issuecomment-846358506


   **[Test build #138823 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138823/testReport)**
 for PR 32630 at commit 
[`d745025`](https://github.com/apache/spark/commit/d7450257bb575cc568ed0bc5109373fe4ac5f7c1).





[GitHub] [spark] sarutak commented on pull request #32630: [Minor][Followup] Update SHA for the oracle docker image

2021-05-21 Thread GitBox


sarutak commented on pull request #32630:
URL: https://github.com/apache/spark/pull/32630#issuecomment-846358208


   cc: @dongjoon-hyun 





[GitHub] [spark] sarutak opened a new pull request #32630: [Minor][Followup] Update SHA for the oracle docker image

2021-05-21 Thread GitBox


sarutak opened a new pull request #32630:
URL: https://github.com/apache/spark/pull/32630


   ### What changes were proposed in this pull request?
   
   This PR updates the SHA for the oracle docker image in the comments in `OracleIntegrationSuite` and `v2.OracleIntegrationSuite`.
   The SHA for the latest image is `3f422c4a35b423dfcdbcc57a84f01db6c82eb6c1`.
   
   ### Why are the changes needed?
   
   The script name for creating the oracle docker image was changed in #32629 to follow the latest image, so we also need to update the corresponding SHA.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   The value is from `git log`.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


AmplabJenkins removed a comment on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846357806


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138821/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


AmplabJenkins commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846357806


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138821/
   





[GitHub] [spark] sarutak commented on a change in pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.

2021-05-21 Thread GitBox


sarutak commented on a change in pull request #32629:
URL: https://github.com/apache/spark/pull/32629#discussion_r637361238



##
File path: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
##
@@ -48,7 +48,7 @@ import org.apache.spark.tags.DockerTest
  *  $ git clone https://github.com/oracle/docker-images.git
  *  // Head SHA: 3e352a22618070595f823977a0fd1a3a8071a83c

Review comment:
   The SHA is `3f422c4a35b423dfcdbcc57a84f01db6c82eb6c1`







[GitHub] [spark] sarutak commented on a change in pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.

2021-05-21 Thread GitBox


sarutak commented on a change in pull request #32629:
URL: https://github.com/apache/spark/pull/32629#discussion_r637361014



##
File path: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
##
@@ -48,7 +48,7 @@ import org.apache.spark.tags.DockerTest
  *  $ git clone https://github.com/oracle/docker-images.git
  *  // Head SHA: 3e352a22618070595f823977a0fd1a3a8071a83c

Review comment:
   Ah, I hadn't noticed that. I'll fix it.







[GitHub] [spark] dongjoon-hyun commented on a change in pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.

2021-05-21 Thread GitBox


dongjoon-hyun commented on a change in pull request #32629:
URL: https://github.com/apache/spark/pull/32629#discussion_r637360891



##
File path: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
##
@@ -48,7 +48,7 @@ import org.apache.spark.tags.DockerTest
  *  $ git clone https://github.com/oracle/docker-images.git
  *  // Head SHA: 3e352a22618070595f823977a0fd1a3a8071a83c

Review comment:
   Oh, here is the SHA, @sarutak. In this case, we need to update this SHA too.
   What was the SHA you checked?







[GitHub] [spark] dongjoon-hyun commented on a change in pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.

2021-05-21 Thread GitBox


dongjoon-hyun commented on a change in pull request #32629:
URL: https://github.com/apache/spark/pull/32629#discussion_r637360891



##
File path: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
##
@@ -48,7 +48,7 @@ import org.apache.spark.tags.DockerTest
  *  $ git clone https://github.com/oracle/docker-images.git
  *  // Head SHA: 3e352a22618070595f823977a0fd1a3a8071a83c

Review comment:
   Oh, here is the SHA, @sarutak. In this case, we need to update this SHA too.







[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation

2021-05-21 Thread GitBox


SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846357014


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43344/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


SparkQA removed a comment on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846342264


   **[Test build #138821 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138821/testReport)**
 for PR 32628 at commit 
[`48292b7`](https://github.com/apache/spark/commit/48292b7a328b33071345b4d2686ae813ba6122d4).





[GitHub] [spark] dongjoon-hyun closed pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.

2021-05-21 Thread GitBox


dongjoon-hyun closed pull request #32629:
URL: https://github.com/apache/spark/pull/32629


   





[GitHub] [spark] sarutak commented on pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.

2021-05-21 Thread GitBox


sarutak commented on pull request #32629:
URL: https://github.com/apache/spark/pull/32629#issuecomment-846356330


   > I added the link to the PR description.
   
   Thanks!





[GitHub] [spark] dongjoon-hyun closed pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


dongjoon-hyun closed pull request #32628:
URL: https://github.com/apache/spark/pull/32628


   





[GitHub] [spark] SparkQA commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


SparkQA commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846355223


   **[Test build #138821 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138821/testReport)**
 for PR 32628 at commit 
[`48292b7`](https://github.com/apache/spark/commit/48292b7a328b33071345b4d2686ae813ba6122d4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] sarutak opened a new pull request #32629: [Minor][SQL] Change the script name for creating oracle docker image.

2021-05-21 Thread GitBox


sarutak opened a new pull request #32629:
URL: https://github.com/apache/spark/pull/32629


   ### What changes were proposed in this pull request?
   
   This PR changes the script name in the comments in `jdbc.OracleIntegrationSuite` and `v2.OracleIntegrationSuite`.
   The script is for creating the oracle docker image.
   
   ### Why are the changes needed?
   
   The name of the script is now `buildContainerImage`, not `buildDockerImage`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   I confirmed that I can build the image with `./buildContainerImage.sh -v 
18.4.0 -x`.


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation

2021-05-21 Thread GitBox


AmplabJenkins removed a comment on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846347991


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43342/
   


[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation

2021-05-21 Thread GitBox


SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846351977


   **[Test build #138822 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138822/testReport)**
 for PR 32606 at commit 
[`514141a`](https://github.com/apache/spark/commit/514141a6790220454999a99623e274b02ac18f22).


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


AmplabJenkins removed a comment on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846351707


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43343/
   


[GitHub] [spark] AmplabJenkins commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


AmplabJenkins commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846351707


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43343/
   


[GitHub] [spark] zhouyejoe commented on a change in pull request #32007: [SPARK-33350][SHUFFLE] Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data

2021-05-21 Thread GitBox


zhouyejoe commented on a change in pull request #32007:
URL: https://github.com/apache/spark/pull/32007#discussion_r637355653



##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -117,7 +116,7 @@ public ShuffleIndexInformation load(File file) throws 
IOException {
   private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo(
   AppShuffleId appShuffleId,
   int reduceId) {
-File dataFile = getMergedShuffleDataFile(appShuffleId, reduceId);
+File dataFile = getMergedShuffleDataFile(appShuffleId.appId, 
appShuffleId.shuffleId, reduceId);

Review comment:
   Added the attemptId check here. If the attemptId differs from the one 
recorded in `AppAttemptPathsInfo`, it will return null and treat this block as 
a late block, so it will be ignored.

##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -991,22 +1023,26 @@ int getNumIOExceptions() {
   /**
* Wraps all the information related to the merge directory of an 
application.
*/
-  private static class AppPathsInfo {
+  private static class AppAttemptPathsInfo {
 
+private final int attemptId;
 private final String[] activeLocalDirs;
 private final int subDirsPerLocalDir;
 
-private AppPathsInfo(
+private AppAttemptPathsInfo(
 String appId,
+int attemptId,

Review comment:
   Changed to AppAttemptShuffleId




[GitHub] [spark] dongjoon-hyun closed pull request #32624: [SPARK-35482][K8S][3.0] Use `spark.blockManager.port` not the wrong `spark.blockmanager.port` in `BasicExecutorFeatureStep`

2021-05-21 Thread GitBox


dongjoon-hyun closed pull request #32624:
URL: https://github.com/apache/spark/pull/32624


   


[GitHub] [spark] SparkQA commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


SparkQA commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846348652


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43343/
   


[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation

2021-05-21 Thread GitBox


SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846347985


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43342/
   


[GitHub] [spark] AmplabJenkins commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation

2021-05-21 Thread GitBox


AmplabJenkins commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846347991


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43342/
   


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.

2021-05-21 Thread GitBox


AmplabJenkins removed a comment on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846347387


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43341/
   


[GitHub] [spark] AmplabJenkins commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.

2021-05-21 Thread GitBox


AmplabJenkins commented on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846347387


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43341/
   


[GitHub] [spark] SparkQA commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.

2021-05-21 Thread GitBox


SparkQA commented on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846347382


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43341/
   


[GitHub] [spark] HeartSaVioR commented on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark

2021-05-21 Thread GitBox


HeartSaVioR commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846346176


   Please correct me if I'm missing something here.
   
   If I understand stage-level scheduling correctly, you still need to 
specify "all" resources needed for "all" tasks in StateRDD; while that may 
block Spark from scheduling when some resources are missing (like a lost 
executor with a PVC), I'm wondering how task-level scheduling would work as 
intended. After that, locality is the only thing we can deal with, and it's 
not an enforcement, so we're back to the original problem.
   


[GitHub] [spark] SparkQA commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


SparkQA commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846345692


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43343/
   


[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation

2021-05-21 Thread GitBox


SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846344825


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43342/
   


[GitHub] [spark] SparkQA commented on pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


SparkQA commented on pull request #32628:
URL: https://github.com/apache/spark/pull/32628#issuecomment-846342264


   **[Test build #138821 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138821/testReport)**
 for PR 32628 at commit 
[`48292b7`](https://github.com/apache/spark/commit/48292b7a328b33071345b4d2686ae813ba6122d4).


[GitHub] [spark] sarutak opened a new pull request #32628: [SPARK-35487][BUILD] Upgrade dropwizard metrics to 4.2.0

2021-05-21 Thread GitBox


sarutak opened a new pull request #32628:
URL: https://github.com/apache/spark/pull/32628


   ### What changes were proposed in this pull request?
   
   This PR upgrades Dropwizard metrics to 4.2.0.
   I also modified the corresponding links in `docs/monitoring.md`.
   
   ### Why are the changes needed?
   
   The latest version was released last week and it contains some improvements.
   https://github.com/dropwizard/metrics/releases/tag/v4.2.0
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Build succeeds and all the modified links are reachable.


[GitHub] [spark] SparkQA commented on pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation

2021-05-21 Thread GitBox


SparkQA commented on pull request #32606:
URL: https://github.com/apache/spark/pull/32606#issuecomment-846341242


   **[Test build #138820 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138820/testReport)**
 for PR 32606 at commit 
[`22925c4`](https://github.com/apache/spark/commit/22925c441f22093c4e79c4bd563a05b6c1fa7ac9).


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-21 Thread GitBox


AmplabJenkins removed a comment on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846320283






[GitHub] [spark] AmplabJenkins commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-21 Thread GitBox


AmplabJenkins commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846340882


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138817/
   


[GitHub] [spark] SparkQA removed a comment on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-21 Thread GitBox


SparkQA removed a comment on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846305088


   **[Test build #138817 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138817/testReport)**
 for PR 32303 at commit 
[`8e6cab3`](https://github.com/apache/spark/commit/8e6cab33e701a55d698a7239947ab85423010b7e).


[GitHub] [spark] SparkQA commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-21 Thread GitBox


SparkQA commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846340285


   **[Test build #138817 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138817/testReport)**
 for PR 32303 at commit 
[`8e6cab3`](https://github.com/apache/spark/commit/8e6cab33e701a55d698a7239947ab85423010b7e).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.


[GitHub] [spark] sarutak commented on a change in pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation

2021-05-21 Thread GitBox


sarutak commented on a change in pull request #32606:
URL: https://github.com/apache/spark/pull/32606#discussion_r637344435



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala
##
@@ -215,6 +217,27 @@ abstract class RemoveRedundantProjectsSuiteBase
 |LIMIT 10
 |""".stripMargin
 assertProjectExec(query, 0, 3)
+
+  }
+  Seq("true", "false").foreach { codegenEnabled =>
+test("SPARK-35287: project generating unsafe row " +
+  s"should not be removed (codegen=$codegenEnabled)") {
+  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
+SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> codegenEnabled,
+SQLConf.LEAF_NODE_DEFAULT_PARALLELISM.key -> "1") {
+withTempPath { path =>
+  val format = classOf[SimpleWritableDataSource].getName
+  spark.range(3).select($"id" as "i", $"id" as "j")
+.write.format(format).mode("overwrite").save(path.getCanonicalPath)
+
+  val df = spark.read.format(format).load(path.getCanonicalPath)
+  val dfLeft = df.as("x")
+  val dfRight = df.as("y")
+  val join = dfLeft.filter(dfLeft("i") > 0).join(dfRight, "i")
+  assert(join.collect === Array(Row(1, 1, 1), Row(2, 2, 2)))

Review comment:
   I think so. `prepareRowVar` generates an `UnsafeRow`.
   
https://github.com/apache/spark/blob/b624b7e93f3ae07d9550ef3fd258d73f0eeb1bf6/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L132
   
   The test for `codegenEnabled = true` is just in case.
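The rule under discussion (keep a project whose output row format differs from its child's, drop it otherwise) can be sketched with a toy plan model. The class and field names below are hypothetical and deliberately simplified, not the actual `RemoveRedundantProjects` implementation:

```java
// Toy model of the pruning rule discussed above: a project node is redundant
// only if it does not change the row format produced by its child. A project
// that converts a scan's columnar/non-unsafe output to UnsafeRow must be
// preserved, since downstream operators such as sort-merge join expect
// UnsafeRow input. All names here are illustrative only.
public class ProjectPruneSketch {
    public enum RowFormat { COLUMNAR, UNSAFE_ROW }

    public static class Node {
        public final String name;
        public final RowFormat output;
        public final Node child;
        public Node(String name, RowFormat output, Node child) {
            this.name = name; this.output = output; this.child = child;
        }
    }

    /** Remove a project only when it is a true no-op over its child. */
    public static Node prune(Node plan) {
        if (plan == null) return null;
        Node child = prune(plan.child);
        boolean redundant = plan.name.equals("Project")
            && child != null
            && plan.output == child.output; // same row format: safe to drop
        return redundant ? child : new Node(plan.name, plan.output, child);
    }
}
```

In the real rule the redundancy check also compares output attributes and ordering; the sketch only captures the row-format aspect that this PR adds.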




[GitHub] [spark] sarutak commented on a change in pull request #32606: [SPARK-35287][SQL] Allow RemoveRedundantProjects to preserve ProjectExec which generates UnsafeRow for DataSourceV2ScanRelation

2021-05-21 Thread GitBox


sarutak commented on a change in pull request #32606:
URL: https://github.com/apache/spark/pull/32606#discussion_r637344188



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala
##
@@ -215,6 +217,27 @@ abstract class RemoveRedundantProjectsSuiteBase
 |LIMIT 10
 |""".stripMargin
 assertProjectExec(query, 0, 3)
+
+  }
+  Seq("true", "false").foreach { codegenEnabled =>
+test("SPARK-35287: project generating unsafe row " +
+  s"should not be removed (codegen=$codegenEnabled)") {
+  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",

Review comment:
   `orderBy` can trigger this issue, so I chose it for the test.




[GitHub] [spark] SparkQA commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.

2021-05-21 Thread GitBox


SparkQA commented on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846339203


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43341/
   


[GitHub] [spark] yaooqinn commented on pull request #32618: [SPARK-28551][SQL][FOLLOWUP] Use the corrected hadoop conf

2021-05-21 Thread GitBox


yaooqinn commented on pull request #32618:
URL: https://github.com/apache/spark/pull/32618#issuecomment-846336835


   nice catch and merged to master, thanks all


[GitHub] [spark] yaooqinn closed pull request #32618: [SPARK-28551][SQL][FOLLOWUP] Use the corrected hadoop conf

2021-05-21 Thread GitBox


yaooqinn closed pull request #32618:
URL: https://github.com/apache/spark/pull/32618


   


[GitHub] [spark] Ngone51 commented on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark

2021-05-21 Thread GitBox


Ngone51 commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846336246


   > If the persistent volume is a resource, then it will have to be there on 
executor startup, so I guess a new executor checks for it on startup and 
advertises it. At that point, how does a task tell the difference between its 
state store and another one? The scheduler for resources only checks that the 
number of a particular resource match the task requirements, it doesn't 
differentiation different state store "ids". So it will assign a task to any 
executor with a PV.
   
   @tgravescs To clarify, I think it's the state store rather than the PV that 
is the resource here. And the use case here might be a bit different from the 
classic use case (where the resources must be specified at executor launch). In 
this case, the executor can tell the driver about the state store resource only 
when a state store instance is first constructed. That means this special 
resource will be updated at runtime. (Streaming has the existing event 
`ReportActiveInstance` to report state store instances, and we can extend 
it to update the executor's state store resources.) To follow the existing 
custom resource management, the state store resource might be maintained as 
("statestore" -> list of state store ids) as part of the custom resources. 
   
   
   > another point - What if entire node is lost not just executor? I guess the 
discovery script starting up new executor would load it from HDFS???
   
   So as mentioned above, we don't need the discovery script here. But yes, the 
state store instances can be loaded from HDFS.
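As a rough sketch of the ("statestore" -> list of state store ids) mapping proposed above, assuming a hypothetical per-executor resource-tracking class (none of these names are Spark APIs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of the idea above: custom resources keyed by name, where
// the "statestore" entry is appended at runtime as the executor reports newly
// constructed state store instances (e.g. via an event such as
// ReportActiveInstance). Hypothetical sketch, not Spark's resource API.
public class ExecutorResourceSketch {
    private final Map<String, List<String>> resources = new HashMap<>();

    /** Called when the executor reports a newly loaded state store instance. */
    public void reportStateStore(String stateStoreId) {
        resources.computeIfAbsent("statestore", k -> new ArrayList<>())
                 .add(stateStoreId);
    }

    /** A scheduler could prefer executors that already host the wanted store. */
    public boolean hostsStateStore(String stateStoreId) {
        return resources.getOrDefault("statestore", List.of()).contains(stateStoreId);
    }
}
```

The difference from classic custom resources is visible here: the map starts empty and grows at runtime, instead of being fixed by a discovery script at executor launch.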


[GitHub] [spark] yaooqinn commented on pull request #32624: [SPARK-35482][K8S][3.0] Use `spark.blockManager.port` not the wrong `spark.blockmanager.port` in `BasicExecutorFeatureStep`

2021-05-21 Thread GitBox


yaooqinn commented on pull request #32624:
URL: https://github.com/apache/spark/pull/32624#issuecomment-846335887


   cc @dongjoon-hyun 


[GitHub] [spark] SparkQA removed a comment on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.

2021-05-21 Thread GitBox


SparkQA removed a comment on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846334449


   **[Test build #138819 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138819/testReport)**
 for PR 32627 at commit 
[`57997ae`](https://github.com/apache/spark/commit/57997aed8595ed4627aba5f4f13d2363987b1e2e).


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.

2021-05-21 Thread GitBox


AmplabJenkins removed a comment on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846334587


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138819/
   


[GitHub] [spark] AmplabJenkins commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.

2021-05-21 Thread GitBox


AmplabJenkins commented on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846334587


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138819/
   


[GitHub] [spark] SparkQA commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.

2021-05-21 Thread GitBox


SparkQA commented on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846334582


   **[Test build #138819 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138819/testReport)** for PR 32627 at commit [`57997ae`](https://github.com/apache/spark/commit/57997aed8595ed4627aba5f4f13d2363987b1e2e).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `class SparkIndexOpsMethods(metaclass=ABCMeta):`



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types

2021-05-21 Thread GitBox


AmplabJenkins removed a comment on pull request #32626:
URL: https://github.com/apache/spark/pull/32626#issuecomment-846334351







[GitHub] [spark] SparkQA commented on pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.

2021-05-21 Thread GitBox


SparkQA commented on pull request #32627:
URL: https://github.com/apache/spark/pull/32627#issuecomment-846334449


   **[Test build #138819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138819/testReport)** for PR 32627 at commit [`57997ae`](https://github.com/apache/spark/commit/57997aed8595ed4627aba5f4f13d2363987b1e2e).



[GitHub] [spark] AmplabJenkins commented on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types

2021-05-21 Thread GitBox


AmplabJenkins commented on pull request #32626:
URL: https://github.com/apache/spark/pull/32626#issuecomment-846334351







[GitHub] [spark] HeartSaVioR edited a comment on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark

2021-05-21 Thread GitBox


HeartSaVioR edited a comment on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846329353


   I'm not sure about the scenario of leveraging a PVC as the checkpoint location - at least that sounds to me like it is beyond what checkpointing in Structured Streaming supports.
   
   We have clearly described the requirement for the checkpoint location in the Structured Streaming guide page, like the following:
   
   > Checkpoint location: For some output sinks where the end-to-end fault-tolerance can be guaranteed, specify the location where the system will write all the checkpoint information. This should be a directory in an HDFS-compatible fault-tolerant file system. The semantics of checkpointing is discussed in more detail in the next section.
   
   I know we allow custom checkpoint manager implementations to deal with non-HDFS-compatible file systems (like object stores which don't provide "atomic rename"), but they still deal with a "remote", "fault-tolerant" file system, and don't require the Spark scheduler to schedule a specific task to a specific executor based on the availability of the checkpoint location.
   
   In other words, only the checkpoint manager handles the complexity of checkpointing on the file system, nothing else does. And it sounds like that no longer holds true if we want to support PVC-based checkpoints. Please correct me if I'm missing something.
   
   I'm more of a novice on cloud/k8s, but from common sense, I guess the actual storage behind a PVC should still be a sort of network storage to be resilient against a "physical node down". I'm wondering how much benefit the PVC approach gives compared to the existing approach of just directly using a remote fault-tolerant file system. The benefits should clearly justify the additional complexity.



[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


ueshin commented on a change in pull request #32611:
URL: https://github.com/apache/spark/pull/32611#discussion_r637338960



##
File path: python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py
##
@@ -33,107 +35,148 @@ def pser(self):
 def psser(self):
 return ps.from_pandas(self.pser)
 
+@property
+def float_pser(self):
+return pd.Series([1, 2, 3], dtype=float)
+
+@property
+def float_psser(self):
+return ps.from_pandas(self.float_pser)
+
 def test_add(self):
-self.assertRaises(TypeError, lambda: self.psser + 1)
-self.assertRaises(TypeError, lambda: self.psser + 0.1)
+self.assert_eq(self.pser + 1, self.psser + 1)
+self.assert_eq(self.pser + 0.1, self.psser + 0.1)
 
 with option_context("compute.ops_on_diff_frames", True):
-for psser in self.pssers:
+for pser, psser in self.numeric_pser_psser_pairs:
+self.assert_eq(self.pser + pser, (self.psser + psser).sort_index())
+
+for psser in self.non_numeric_pssers.values():
 self.assertRaises(TypeError, lambda: self.psser + psser)
 
 def test_sub(self):
-self.assertRaises(TypeError, lambda: self.psser - 1)
-self.assertRaises(TypeError, lambda: self.psser - 0.1)
+self.assert_eq(self.pser - 1, self.psser - 1)
+self.assert_eq(self.pser - 0.1, self.psser - 0.1)
 
 with option_context("compute.ops_on_diff_frames", True):
-for psser in self.pssers:
+for pser, psser in self.numeric_pser_psser_pairs:
+self.assert_eq(self.pser - pser, (self.psser - psser).sort_index())
+
+for psser in self.non_numeric_pssers.values():
 self.assertRaises(TypeError, lambda: self.psser - psser)
 
 def test_mul(self):
-self.assertRaises(TypeError, lambda: self.psser * 1)
-self.assertRaises(TypeError, lambda: self.psser * 0.1)
+self.assert_eq(self.pser * 1, self.psser * 1)
+self.assert_eq(self.pser * 0.1, self.psser * 0.1)
 
 with option_context("compute.ops_on_diff_frames", True):
-for psser in self.pssers:
+for pser, psser in self.numeric_pser_psser_pairs:
+self.assert_eq(self.pser * pser, (self.psser * psser).sort_index())
+
+for psser in self.non_numeric_pssers.values():
 self.assertRaises(TypeError, lambda: self.psser * psser)
 
 def test_truediv(self):
-self.assertRaises(TypeError, lambda: self.psser / 1)
-self.assertRaises(TypeError, lambda: self.psser / 0.1)
+self.assert_eq(self.pser / 1, self.psser / 1)
+self.assert_eq(self.pser / 0.1, self.psser / 0.1)
 
 with option_context("compute.ops_on_diff_frames", True):
-for psser in self.pssers:
+self.assert_eq(
+self.pser / self.float_pser, (self.psser / self.float_psser).sort_index())
+
+for psser in self.non_numeric_pssers.values():
 self.assertRaises(TypeError, lambda: self.psser / psser)
 
 def test_floordiv(self):
-self.assertRaises(TypeError, lambda: self.psser // 1)
-self.assertRaises(TypeError, lambda: self.psser // 0.1)
+# float is always returned in pandas-on-Spark
+self.assert_eq((self.pser // 1).astype("float"), self.psser // 1)
+# in pandas, 1 // 0.1 = 9.0; in pandas-on-Spark, 1 // 0.1 = 10.0
+# self.assert_eq(self.pser // 0.1, self.psser // 0.1)
 
 with option_context("compute.ops_on_diff_frames", True):
-for psser in self.pssers:
+self.assert_eq(
+self.pser // self.float_pser, (self.psser // self.float_psser).sort_index()
+)
+
+for psser in self.non_numeric_pssers.values():
 self.assertRaises(TypeError, lambda: self.psser // psser)
 
 def test_mod(self):
-self.assertRaises(TypeError, lambda: self.psser % 1)
-self.assertRaises(TypeError, lambda: self.psser % 0.1)
+self.assert_eq(self.pser % 1, self.psser % 1)
+self.assert_eq(self.pser % 0.1, self.psser % 0.1)
 
 with option_context("compute.ops_on_diff_frames", True):
-for psser in self.pssers:
+for pser, psser in self.numeric_pser_psser_pairs:
+self.assert_eq(self.pser % pser, (self.psser % psser).sort_index())
+
+for psser in self.non_numeric_pssers.values():
 self.assertRaises(TypeError, lambda: self.psser % psser)
 
 def test_pow(self):
-self.assertRaises(TypeError, lambda: self.psser ** 1)
-self.assertRaises(TypeError, lambda: self.psser ** 0.1)
+# float is always returned in pandas-on-Spark
+self.assert_eq((self.pser ** 1).astype("float"), self.psser ** 1)
+self.assert_eq(self.pser ** 0.1, self.psser ** 0.1)
 
 with option_context("compute.ops_on_diff_frames",
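The `floordiv` discrepancy noted in the diff above (`1 // 0.1` giving 9.0 in pandas but 10.0 in pandas-on-Spark) is a floating-point artifact rather than a logic bug. A quick standalone illustration:

```python
# 0.1 in binary floating point is slightly larger than 1/10, so the exact
# quotient 1 / 0.1 is just below 10. Python's float floor division floors
# that exact quotient (-> 9.0), while flooring the rounded true-division
# result gives floor(10.0) == 10.0, matching the difference noted above.
import math

true_div = 1 / 0.1                 # true division rounds to exactly 10.0
floor_div = 1 // 0.1               # floors the exact quotient -> 9.0
floor_of_div = math.floor(true_div)  # floor of the rounded result -> 10
print(true_div, floor_div, floor_of_div)
```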

[GitHub] [spark] HeartSaVioR commented on a change in pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark

2021-05-21 Thread GitBox


HeartSaVioR commented on a change in pull request #32136:
URL: https://github.com/apache/spark/pull/32136#discussion_r637338852



##
File path: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulingPlugin.scala
##
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+/**
+ * This trait provides a plugin interface for suggesting task scheduling to 
Spark
+ * scheduler.
+ */
+private[spark] trait TaskSchedulingPlugin {

Review comment:
   > I'm not sure now if we want it to be open to implement outside Spark.
   
   IMHO I'm thinking the opposite on this. I'm not sure if we want to add an implementation to the Spark codebase, unless the implementation of the plugin proves that it works well for all possible cases.
   
   I'd rather be OK with letting Spark users (including us) be hackers and customize the scheduler while "taking their own risks", but I have to be conservative if we want to change the behavior of Spark. If changing the behavior of Spark to cope with state by default is the final goal, I think the behavioral change itself is a major one requiring a SPIP, not something we can simply go ahead with after doing a POC on the happy case.





[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


ueshin commented on a change in pull request #32611:
URL: https://github.com/apache/spark/pull/32611#discussion_r637338469



##
File path: python/pyspark/pandas/data_type_ops/num_ops.py
##
@@ -16,31 +16,46 @@
 #
 
 import numbers
-from typing import TYPE_CHECKING, Union
+from typing import Any, TYPE_CHECKING, Union
 
 import numpy as np
 from pandas.api.types import CategoricalDtype
 
 from pyspark.sql import Column, functions as F
 from pyspark.sql.types import (
+BooleanType,
 NumericType,
 StringType,
 TimestampType,
 )
 
 from pyspark.pandas.base import column_op, IndexOpsMixin, numpy_column_op
-from pyspark.pandas.data_type_ops.base import DataTypeOps
+from pyspark.pandas.data_type_ops.base import DataTypeOps, transform_boolean_operand_to_numeric
 from pyspark.pandas.spark import functions as SF
+from pyspark.sql.column import Column
+
 
 if TYPE_CHECKING:
 from pyspark.pandas.indexes import Index  # noqa: F401 (SPARK-34943)
 from pyspark.pandas.series import Series  # noqa: F401 (SPARK-34943)
 
 
+def is_valid_operand_for_numeric_arithmetic(operand: Any) -> bool:
+"""Check whether the operand is valid for arithmetic operations against numerics."""
+if isinstance(operand, numbers.Number):
+return True
+elif isinstance(operand, IndexOpsMixin):
+if isinstance(operand.dtype, CategoricalDtype):
+return False
+else:
+return isinstance(operand.spark.data_type, NumericType) or isinstance(operand.spark.data_type, BooleanType)
+else:
+return isinstance(operand, Column)

Review comment:
   Do we currently allow it?





[GitHub] [spark] SparkQA removed a comment on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types

2021-05-21 Thread GitBox


SparkQA removed a comment on pull request #32626:
URL: https://github.com/apache/spark/pull/32626#issuecomment-846304945


   **[Test build #138816 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138816/testReport)** for PR 32626 at commit [`a7dd95c`](https://github.com/apache/spark/commit/a7dd95c625809a7b443a9043b90bfb3a718d373d).



[GitHub] [spark] SparkQA commented on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types

2021-05-21 Thread GitBox


SparkQA commented on pull request #32626:
URL: https://github.com/apache/spark/pull/32626#issuecomment-846329637


   **[Test build #138816 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138816/testReport)** for PR 32626 at commit [`a7dd95c`](https://github.com/apache/spark/commit/a7dd95c625809a7b443a9043b90bfb3a718d373d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `class ComplexOps(DataTypeOps):`
 * `class ArrayOps(ComplexOps):`



[GitHub] [spark] HeartSaVioR edited a comment on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark

2021-05-21 Thread GitBox


HeartSaVioR edited a comment on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846329353


   I'm not sure about the scenario of leveraging a PVC as the checkpoint location - at least that sounds to me like it is beyond what checkpointing in Structured Streaming supports.
   
   We have clearly described the requirement for the checkpoint location in the Structured Streaming guide page, like the following:
   
   > Checkpoint location: For some output sinks where the end-to-end fault-tolerance can be guaranteed, specify the location where the system will write all the checkpoint information. This should be a directory in an HDFS-compatible fault-tolerant file system. The semantics of checkpointing is discussed in more detail in the next section.
   
   I know we allow custom checkpoint manager implementations to deal with non-HDFS-compatible file systems (like object stores which don't provide "atomic rename"), but they still deal with a "remote", "fault-tolerant" file system, and don't require the Spark scheduler to schedule a specific task to a specific executor based on the availability of the checkpoint location.
   
   In other words, only the checkpoint manager handles the complexity of checkpointing on the file system, nothing else does. And it sounds like that no longer holds true if we want to support PVC-based checkpoints. Please correct me if I'm missing something.
   
   I'm more of a novice on cloud/k8s, but from common sense, I guess the actual storage behind a PVC should still be a sort of network storage to be resilient against a "node down". I'm wondering how much benefit the PVC approach gives compared to the existing approach of just directly using a remote fault-tolerant file system. The benefits should clearly justify the additional complexity.



[GitHub] [spark] HeartSaVioR edited a comment on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark

2021-05-21 Thread GitBox


HeartSaVioR edited a comment on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846329353


   I'm not sure about the scenario of leveraging a PVC as the checkpoint location - at least that sounds to me like it is beyond what checkpointing in Structured Streaming supports.
   
   We have clearly described the requirement for the checkpoint location in the Structured Streaming guide page, like the following:
   
   > Checkpoint location: For some output sinks where the end-to-end fault-tolerance can be guaranteed, specify the location where the system will write all the checkpoint information. This should be a directory in an HDFS-compatible fault-tolerant file system. The semantics of checkpointing is discussed in more detail in the next section.
   
   I know we allow custom checkpoint manager implementations to deal with non-HDFS-compatible file systems (like object stores which don't provide "atomic rename"), but they still deal with a "remote", "fault-tolerant" file system, and don't require the Spark scheduler to schedule a specific task to a specific executor based on the availability of the checkpoint.
   
   In other words, only the checkpoint manager handles the complexity of checkpointing on the file system, nothing else does. And it sounds like that no longer holds true if we want to support PVC-based checkpoints. Please correct me if I'm missing something.
   
   I'm more of a novice on cloud/k8s, but from common sense, I guess the actual storage behind a PVC should still be a sort of network storage to be resilient against a "node down". I'm wondering how much benefit the PVC approach gives compared to the existing approach of just directly using a remote fault-tolerant file system. The benefits should clearly justify the additional complexity.



[GitHub] [spark] HeartSaVioR commented on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark

2021-05-21 Thread GitBox


HeartSaVioR commented on pull request #32136:
URL: https://github.com/apache/spark/pull/32136#issuecomment-846329353


   I'm not sure about the scenario of leveraging a PVC as the checkpoint location - at least that sounds to me like it is beyond what checkpointing in Structured Streaming supports.
   
   We have clearly described the requirement for the checkpoint location in the Structured Streaming guide page, like the following:
   
   > Checkpoint location: For some output sinks where the end-to-end fault-tolerance can be guaranteed, specify the location where the system will write all the checkpoint information. This should be a directory in an HDFS-compatible fault-tolerant file system. The semantics of checkpointing is discussed in more detail in the next section.
   
   I know we allow custom checkpoint manager implementations to deal with non-HDFS-compatible file systems, but they still deal with a "remote", "fault-tolerant" file system, and don't require the Spark scheduler to schedule a specific task to a specific executor based on the availability of the checkpoint.
   
   In other words, only the checkpoint manager handles the complexity of checkpointing on the file system, nothing else does. And it sounds like that no longer holds true if we want to support PVC-based checkpoints. Please correct me if I'm missing something.
   
   I'm more of a novice on cloud/k8s, but from common sense, I guess the actual storage behind a PVC should still be a sort of network storage to be resilient against a "node down". I'm wondering how much benefit the PVC approach gives compared to the existing approach of just directly using a remote fault-tolerant file system. The benefits should clearly justify the additional complexity.





[GitHub] [spark] ueshin opened a new pull request #32627: [SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks.

2021-05-21 Thread GitBox


ueshin opened a new pull request #32627:
URL: https://github.com/apache/spark/pull/32627


   ### What changes were proposed in this pull request?
   
   Adds more type annotations in the files:
   
   - `python/pyspark/pandas/spark/accessors.py`
   - `python/pyspark/pandas/typedef/typehints.py`
   - `python/pyspark/pandas/utils.py`
   
   and fixes the mypy check failures.
   
   ### Why are the changes needed?
   
   We should enable more `disallow_untyped_defs` mypy checks.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   This PR adds more type annotations in pandas APIs on Spark module, which can 
impact interaction with development tools for users.
   
   ### How was this patch tested?
   
   The mypy check with a new configuration and existing tests should pass.
   
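As a hedged illustration of what this check enforces (the function below is hypothetical, not taken from the PR): mypy's `disallow_untyped_defs` rejects any `def` whose parameters or return value lack annotations, so enabling it per-module forces changes like:

```python
from typing import List, Tuple

# Before (rejected under disallow_untyped_defs with
# "Function is missing a type annotation"):
#
# def first_fields(rows):
#     return [row[0] for row in rows]

# After: fully annotated, so mypy accepts it under the strict flag.
def first_fields(rows: List[Tuple[int, str]]) -> List[int]:
    return [row[0] for row in rows]

print(first_fields([(1, "a"), (2, "b")]))
```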





[GitHub] [spark] SparkQA commented on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types

2021-05-21 Thread GitBox


SparkQA commented on pull request #32626:
URL: https://github.com/apache/spark/pull/32626#issuecomment-846328869


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43338/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


AmplabJenkins commented on pull request #32611:
URL: https://github.com/apache/spark/pull/32611#issuecomment-846328532


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43340/
   



[GitHub] [spark] SparkQA commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


SparkQA commented on pull request #32611:
URL: https://github.com/apache/spark/pull/32611#issuecomment-846328527


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43340/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


AmplabJenkins removed a comment on pull request #32611:
URL: https://github.com/apache/spark/pull/32611#issuecomment-846326242


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138818/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


AmplabJenkins commented on pull request #32611:
URL: https://github.com/apache/spark/pull/32611#issuecomment-846326242


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138818/
   



[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


ueshin commented on a change in pull request #32611:
URL: https://github.com/apache/spark/pull/32611#discussion_r637332388



##
File path: python/pyspark/pandas/data_type_ops/boolean_ops.py
##
@@ -15,7 +15,27 @@
 # limitations under the License.
 #
 
-from pyspark.pandas.data_type_ops.base import DataTypeOps
+import numbers
+from typing import TYPE_CHECKING, Union
+
+import numpy as np
+from pandas.api.types import CategoricalDtype
+
+from pyspark.pandas.base import column_op, IndexOpsMixin, numpy_column_op
+from pyspark.pandas.data_type_ops.base import DataTypeOps, transform_boolean_operand_to_numeric
+from pyspark.pandas.typedef.typehints import as_spark_type
+from pyspark.sql import Column, functions as F
+from pyspark.sql.types import NumericType
+
+if TYPE_CHECKING:
+from pyspark.pandas.indexes import Index  # noqa: F401 (SPARK-34943)
+from pyspark.pandas.series import Series  # noqa: F401 (SPARK-34943)
+
+
+def is_numeric_index_ops(index_ops: IndexOpsMixin) -> bool:
+"""Check if the given index_ops is numeric IndexOpsMixin."""
+return isinstance(index_ops.spark.data_type, NumericType) and (
+not isinstance(index_ops.dtype, CategoricalDtype))

Review comment:
   I guess we can reuse `is_valid_operand_for_numeric_arithmetic` in 
`num_ops.py`?
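   A rough standalone sketch of that reuse (the classes below are hypothetical simplifications of the pyspark types, not the real API):

```python
import numbers

# Simplified stand-ins for pyspark's NumericType/BooleanType and for an
# IndexOpsMixin carrying a Spark data type and a (possibly categorical) dtype.
class NumericType: ...
class BooleanType: ...

class FakeIndexOps:
    def __init__(self, spark_data_type, is_categorical=False):
        self.spark_data_type = spark_data_type
        self.is_categorical = is_categorical

def is_valid_operand_for_numeric_arithmetic(operand) -> bool:
    """num_ops.py predicate: numbers, or non-categorical numeric/boolean ops."""
    if isinstance(operand, numbers.Number):
        return True
    if isinstance(operand, FakeIndexOps):
        return not operand.is_categorical and isinstance(
            operand.spark_data_type, (NumericType, BooleanType))
    return False

def is_numeric_index_ops(index_ops: FakeIndexOps) -> bool:
    """boolean_ops.py check expressed via the shared predicate."""
    return is_valid_operand_for_numeric_arithmetic(index_ops) and isinstance(
        index_ops.spark_data_type, NumericType)

print(is_numeric_index_ops(FakeIndexOps(NumericType())))  # True
```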





[GitHub] [spark] SparkQA commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


SparkQA commented on pull request #32611:
URL: https://github.com/apache/spark/pull/32611#issuecomment-846324042


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43340/
   



[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


ueshin commented on a change in pull request #32611:
URL: https://github.com/apache/spark/pull/32611#discussion_r637328527



##
File path: python/pyspark/pandas/data_type_ops/boolean_ops.py
##
@@ -26,3 +46,183 @@ class BooleanOps(DataTypeOps):
 @property
 def pretty_name(self) -> str:
 return 'booleans'
+
+def add(self, left, right) -> Union["Series", "Index"]:
+if isinstance(right, numbers.Number):
+left = left.spark.transform(lambda scol: scol.cast(as_spark_type(type(right))))
+return column_op(Column.__add__)(left, right)

Review comment:
   I guess `return left + right` should be fine. The `DataTypeOps` related 
to `left` after converted should know how to handle it better.
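The dispatch idea behind this comment can be illustrated with a toy model (not the pyspark.pandas implementation; class and attribute names here are invented for the sketch): once `left` has been cast to a numeric type, `left + right` re-dispatches to the ops handler of the *new* type, so `BooleanOps` does not need to duplicate the numeric logic.

```python
# Toy sketch of per-type ops dispatch: BooleanOps only casts, then defers.
class NumOps:
    def add(self, left, right):
        return left.value + right          # numeric handler knows the details

class BoolOps:
    def add(self, left, right):
        left = Col(int(left.value))        # cast bool -> numeric, like the PR does
        return left + right                # re-dispatch to the new type's ops

class Col:
    def __init__(self, value):
        self.value = value
    @property
    def ops(self):
        return BoolOps() if isinstance(self.value, bool) else NumOps()
    def __add__(self, other):
        return self.ops.add(self, other)

print(Col(True) + 2)   # 3 -- BoolOps casts, then NumOps finishes the addition
```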







[GitHub] [spark] SparkQA removed a comment on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


SparkQA removed a comment on pull request #32611:
URL: https://github.com/apache/spark/pull/32611#issuecomment-846317781


   **[Test build #138818 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138818/testReport)**
 for PR 32611 at commit 
[`925e34b`](https://github.com/apache/spark/commit/925e34b3bffb6496d62425fa08d36177589cb17d).





[GitHub] [spark] SparkQA commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


SparkQA commented on pull request #32611:
URL: https://github.com/apache/spark/pull/32611#issuecomment-846321739


   **[Test build #138818 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138818/testReport)**
 for PR 32611 at commit 
[`925e34b`](https://github.com/apache/spark/commit/925e34b3bffb6496d62425fa08d36177589cb17d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


ueshin commented on a change in pull request #32611:
URL: https://github.com/apache/spark/pull/32611#discussion_r637323275



##
File path: python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
##
@@ -42,7 +54,12 @@ def test_add(self):
 self.assertRaises(TypeError, lambda: psser + self.non_numeric_pssers["datetime"])
 self.assertRaises(TypeError, lambda: psser + self.non_numeric_pssers["date"])
 self.assertRaises(TypeError, lambda: psser + self.non_numeric_pssers["categorical"])
-self.assertRaises(TypeError, lambda: psser + self.non_numeric_pssers["bool"])
+self.assert_eq(
+(psser + self.non_numeric_pssers["bool"]).sort_index(),
+pser + self.non_numeric_psers["bool"],
+)
+
+self.assertRaises(TypeError, lambda: self.float_psser + True)

Review comment:
   This is supposed to be out of for-loop?







[GitHub] [spark] AmplabJenkins commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-21 Thread GitBox


AmplabJenkins commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846320283


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43339/
   





[GitHub] [spark] SparkQA commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-21 Thread GitBox


SparkQA commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846320278


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43339/
   





[GitHub] [spark] ueshin commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


ueshin commented on a change in pull request #32611:
URL: https://github.com/apache/spark/pull/32611#discussion_r637311351



##
File path: python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
##
@@ -30,6 +34,14 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils):
 returns float32.
 The underlying reason is the respective Spark operations return DoubleType 
always.
 """
+@property
+def float_pser(self):
+return pd.Series([1, 2, 3], dtype=float)
+
+@property
+def float_psser(self):
+return ps.from_pandas(self.float_pser)
+
 def test_add(self):
 for pser, psser in self.numeric_pser_psser_pairs:
 self.assert_eq(pser + pser, psser + psser)

Review comment:
   Isn't this PR supposed to support it?







[GitHub] [spark] SparkQA commented on pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


SparkQA commented on pull request #32611:
URL: https://github.com/apache/spark/pull/32611#issuecomment-846317781


   **[Test build #138818 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138818/testReport)**
 for PR 32611 at commit 
[`925e34b`](https://github.com/apache/spark/commit/925e34b3bffb6496d62425fa08d36177589cb17d).





[GitHub] [spark] github-actions[bot] closed pull request #24801: [SPARK-27950][DSTREAMS][Kinesis] dynamoDBEndpointUrl and cloudWatchMetricsLevel for Kinesis

2021-05-21 Thread GitBox


github-actions[bot] closed pull request #24801:
URL: https://github.com/apache/spark/pull/24801


   





[GitHub] [spark] github-actions[bot] commented on pull request #30681: [Spark-33657][SQL]After spark sql is executed to generate hdfs data, the relevant status information is printed

2021-05-21 Thread GitBox


github-actions[bot] commented on pull request #30681:
URL: https://github.com/apache/spark/pull/30681#issuecomment-846316749


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





[GitHub] [spark] github-actions[bot] closed pull request #31509: [SPARK-34396][SQL] Add a new build-in function delegate

2021-05-21 Thread GitBox


github-actions[bot] closed pull request #31509:
URL: https://github.com/apache/spark/pull/31509


   





[GitHub] [spark] xinrong-databricks commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


xinrong-databricks commented on a change in pull request #32611:
URL: https://github.com/apache/spark/pull/32611#discussion_r637305932



##
File path: python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
##
@@ -30,6 +34,14 @@ class NumOpsTest(PandasOnSparkTestCase, TestCasesUtils):
 returns float32.
 The underlying reason is the respective Spark operations return DoubleType 
always.
 """
+@property
+def float_pser(self):
+return pd.Series([1, 2, 3], dtype=float)
+
+@property
+def float_psser(self):
+return ps.from_pandas(self.float_pser)
+
 def test_add(self):
 for pser, psser in self.numeric_pser_psser_pairs:
 self.assert_eq(pser + pser, psser + psser)

Review comment:
   Arithmetic operations involving `bool` are not supported now. TypeErrors 
are introduced for these cases for now.







[GitHub] [spark] SparkQA commented on pull request #32626: [WIP][SPARK-35452][PYTHON] Introduce ComplexOps for complex types

2021-05-21 Thread GitBox


SparkQA commented on pull request #32626:
URL: https://github.com/apache/spark/pull/32626#issuecomment-846314883


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43338/
   





[GitHub] [spark] SparkQA commented on pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-21 Thread GitBox


SparkQA commented on pull request #32303:
URL: https://github.com/apache/spark/pull/32303#issuecomment-846313915


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43339/
   





[GitHub] [spark] xinrong-databricks commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


xinrong-databricks commented on a change in pull request #32611:
URL: https://github.com/apache/spark/pull/32611#discussion_r637301587



##
File path: python/pyspark/pandas/tests/data_type_ops/test_date_ops.py
##
@@ -60,7 +61,9 @@ def test_sub(self):
 with option_context("compute.ops_on_diff_frames", True):
 for pser, psser in self.pser_psser_pairs:
 if isinstance(psser.spark.data_type, DateType):
-self.assert_eq((self.pser - pser).dt.days, (self.psser - psser).sort_index())
+if LooseVersion(pd.__version__) >= LooseVersion("0.24.2"):
+self.assert_eq(
+(self.pser - pser).dt.days, (self.psser - psser).sort_index())

Review comment:
   It raises weird TypeError for pandas in the `else` case, as 
https://github.com/databricks/koalas/runs/2568962950.
   
   No, the change is not related to this PR. Let me remove it for now.







[GitHub] [spark] xinrong-databricks commented on a change in pull request #32611: [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

2021-05-21 Thread GitBox


xinrong-databricks commented on a change in pull request #32611:
URL: https://github.com/apache/spark/pull/32611#discussion_r637299344



##
File path: python/pyspark/pandas/data_type_ops/boolean_ops.py
##
@@ -26,3 +41,183 @@ class BooleanOps(DataTypeOps):
 @property
 def pretty_name(self) -> str:
 return 'booleans'
+
+def add(self, left, right):
+if isinstance(right, numbers.Number):
+left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right))))
+return column_op(Column.__add__)(left, right)
+elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right):
+left = transform_boolean_operand_to_numeric(left, right.spark.data_type)
+return column_op(Column.__add__)(left, right)
+else:
+raise TypeError(
+"Addition can not be applied to %s and the given type." % self.pretty_name)
+
+def sub(self, left, right):
+if isinstance(right, numbers.Number):
+left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right))))
+return column_op(Column.__sub__)(left, right)
+elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right):
+left = transform_boolean_operand_to_numeric(left, right.spark.data_type)
+return column_op(Column.__sub__)(left, right)
+else:
+raise TypeError(
+"Subtraction can not be applied to %s and the given type." % self.pretty_name)
+
+def mul(self, left, right):
+if isinstance(right, numbers.Number):
+left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right))))
+return column_op(Column.__mul__)(left, right)
+elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right):
+left = transform_boolean_operand_to_numeric(left, right.spark.data_type)
+return column_op(Column.__mul__)(left, right)
+else:
+raise TypeError(
+"Multiplication can not be applied to %s and the given type." % self.pretty_name)
+
+def truediv(self, left, right):
+def truediv(left, right):
+return F.when(F.lit(right != 0) | F.lit(right).isNull(), left.__div__(right)).otherwise(
+F.when(F.lit(left == np.inf) | F.lit(left == -np.inf), left).otherwise(
+F.lit(np.inf).__div__(left)
+)
+)
+
+if isinstance(right, numbers.Number):
+left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right))))
+return numpy_column_op(truediv)(left, right)
+elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right):
+left = transform_boolean_operand_to_numeric(left, right.spark.data_type)
+return numpy_column_op(truediv)(left, right)
+else:
+raise TypeError(
+"True division can not be applied to %s and the given type." % self.pretty_name)
+
+def floordiv(self, left, right):
+def floordiv(left, right):
+return F.when(F.lit(right is np.nan), np.nan).otherwise(
+F.when(
+F.lit(right != 0) | F.lit(right).isNull(), F.floor(left.__div__(right))
+).otherwise(
+F.when(F.lit(left == np.inf) | F.lit(left == -np.inf), left).otherwise(
+F.lit(np.inf).__div__(left)
+)
+)
+)
+
+if isinstance(right, numbers.Number):
+left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right))))
+return numpy_column_op(floordiv)(left, right)
+elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right):
+left = transform_boolean_operand_to_numeric(left, right.spark.data_type)
+return numpy_column_op(floordiv)(left, right)
+else:
+raise TypeError(
+"Floor division can not be applied to %s and the given type." % self.pretty_name)
+
+def mod(self, left, right):
+def mod(left, right):
+return ((left % right) + right) % right
+
+if isinstance(right, numbers.Number):
+left = left.spark.apply(lambda scol: scol.cast(as_spark_type(type(right))))
+return column_op(mod)(left, right)
+elif isinstance(right, IndexOpsMixin) and is_numeric_index_ops(right):
+left = transform_boolean_operand_to_numeric(left, right.spark.data_type)
+return column_op(mod)(left, right)
+else:
+raise TypeError(
+"Modulo can not be applied to %s and the given type." % self.pretty_name)
+
+def pow(self, left, right):
+def pow_func(left, right):
+return F.when(left == 1, left).otherwise(Column.__pow__(left, right))
+
+if isinstance(right, numbers.Number):
+   
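One detail in the diff above worth unpacking is the `mod` helper, `((left % right) + right) % right`. Spark's `%` follows the sign of the dividend (C/Java semantics), while Python and pandas follow the sign of the divisor, so the extra add-and-wrap step aligns the result with pandas. A plain-Python illustration (these helper names are invented for the sketch; no Spark involved):

```python
# Illustrate why ((a % b) + b) % b converts a C-style remainder into
# Python/pandas-style modulo. c_style_mod mimics Spark's % operator.
def c_style_mod(a, b):
    # remainder takes the sign of the dividend, as in C, Java, and Spark SQL
    return a - b * int(a / b)

def pandas_style_mod(a, b):
    # the trick from the diff: fold the C-style remainder back into the
    # divisor's sign convention
    return (c_style_mod(a, b) + b) % b

print(c_style_mod(-7, 3))       # -1 (C/Spark semantics)
print(pandas_style_mod(-7, 3))  # 2  (Python/pandas semantics)
print(-7 % 3)                   # 2
```

For non-negative operands the two conventions agree, so the wrapper is a no-op there; it only matters when the operands' signs differ.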
