[GitHub] [spark] AmplabJenkins removed a comment on pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
AmplabJenkins removed a comment on pull request #30341: URL: https://github.com/apache/spark/pull/30341#issuecomment-727161445
[GitHub] [spark] c21 commented on pull request #30347: [SPARK-33209][SS] Refactor unit test of stream-stream join in UnsupportedOperationsSuite
c21 commented on pull request #30347: URL: https://github.com/apache/spark/pull/30347#issuecomment-727162216 Thanks @HeartSaVioR for the heads up, please take your time. Thanks.
[GitHub] [spark] SparkQA commented on pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
SparkQA commented on pull request #30341: URL: https://github.com/apache/spark/pull/30341#issuecomment-727161442 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35691/
[GitHub] [spark] AmplabJenkins commented on pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
AmplabJenkins commented on pull request #30341: URL: https://github.com/apache/spark/pull/30341#issuecomment-727161445
[GitHub] [spark] viirya commented on a change in pull request #30368: [SPARK-33442][SQL] Change Combine Limit to Eliminate limit using max row
viirya commented on a change in pull request #30368: URL: https://github.com/apache/spark/pull/30368#discussion_r523389535 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -1452,11 +1452,21 @@ object PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper { } /** - * Combines two adjacent [[Limit]] operators into one, merging the - * expressions into one single expression. + * 1. Eliminate [[Limit]] operators if it's child max row <= limit. + * 2. Combines two adjacent [[Limit]] operators into one, merging the + *expressions into one single expression. */ -object CombineLimits extends Rule[LogicalPlan] { - def apply(plan: LogicalPlan): LogicalPlan = plan transform { +object EliminateLimits extends Rule[LogicalPlan] { + private def canEliminate(limitExpr: Expression, childMaxRow: Option[Long]): Boolean = { +limitExpr.foldable && childMaxRow.exists { _ <= limitExpr.eval().toString.toLong } Review comment: nit, but we can put `limitExpr.eval()` out of `exists` loop?
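Editorial note: a minimal sketch of what the nit above suggests, i.e. hoisting the foldable `limitExpr.eval()` out of the `exists` loop. It assumes the `canEliminate` signature from the quoted diff and is illustrative only, not the code that was eventually merged.

```scala
// Sketch only: Expression and Option[Long] as in the quoted Optimizer.scala diff.
private def canEliminate(limitExpr: Expression, childMaxRow: Option[Long]): Boolean = {
  limitExpr.foldable && {
    // Evaluate the foldable limit exactly once, outside the exists loop.
    val limit = limitExpr.eval().toString.toLong
    childMaxRow.exists(_ <= limit)
  }
}
```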
[GitHub] [spark] Ngone51 commented on a change in pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external
Ngone51 commented on a change in pull request #30164: URL: https://github.com/apache/spark/pull/30164#discussion_r523388191 ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala ## @@ -657,6 +688,38 @@ class BlockManagerMasterEndpoint( } } + private def getShufflePushMergerLocations( + numMergersNeeded: Int, + hostsToFilter: Set[String]): Seq[BlockManagerId] = { +val blockManagersWithExecutors = blockManagerIdByExecutor.groupBy(_._2.host) + .mapValues(_.head).values.map(_._2).toSet +val filteredBlockManagersWithExecutors = blockManagersWithExecutors + .filterNot(x => hostsToFilter.contains(x.host)) +val filteredMergersWithExecutors = filteredBlockManagersWithExecutors.map( + x => BlockManagerId(x.executorId, x.host, StorageUtils.externalShuffleServicePort(conf))) + +// Enough mergers are available as part of active executors list +if (filteredMergersWithExecutors.size >= numMergersNeeded) { + filteredMergersWithExecutors.toSeq +} else { + // Delta mergers added from inactive mergers list to the active mergers list + val filteredMergersWithExecutorsHosts = filteredMergersWithExecutors.map(_.host) + // Pick random hosts instead of preferring the top of the list + val randomizedShuffleMergerLocations = Utils.randomize(shuffleMergerLocations.values.toSeq) Review comment: Yes! Thanks @mridulm
[GitHub] [spark] maropu commented on pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
maropu commented on pull request #30341: URL: https://github.com/apache/spark/pull/30341#issuecomment-727158698 > I think it should be added as a separate PR and we should merge the test PR first? Then we can run this PR against it to check that the results are the same. @maropu Yea, it looks fine. I'll review this again tonight; thanks for the update.
[GitHub] [spark] SparkQA commented on pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
SparkQA commented on pull request #30341: URL: https://github.com/apache/spark/pull/30341#issuecomment-727158547 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35691/
[GitHub] [spark] viirya commented on a change in pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
viirya commented on a change in pull request #30341: URL: https://github.com/apache/spark/pull/30341#discussion_r523295714 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EvaluationRuntime.scala ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.expressions + +import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.internal.SQLConf + +/** + * This class helps subexpression elimination for interpreted evaluation + * in `InterpretedUnsafeProjection`. It maintains an evaluation cache. + * This class wraps `ExpressionProxy` around given expressions. The `ExpressionProxy` + * intercepts expression evaluation and loads from the cache first. + */ +class EvaluationRuntime { + + val cache: LoadingCache[ExpressionProxy, ResultProxy] = CacheBuilder.newBuilder() +.maximumSize(SQLConf.get.subexpressionEliminationCacheMaxEntries) +.build( + new CacheLoader[ExpressionProxy, ResultProxy]() { +override def load(expr: ExpressionProxy): ResultProxy = { + ResultProxy(expr.proxyEval(currentInput)) +} + }) + + private var currentInput: InternalRow = null + + /** + * Sets given input row as current row for evaluating expressions. This cleans up the cache + * too as new input comes. + */ + def setInput(input: InternalRow = null): Unit = { +currentInput = input +cache.invalidateAll() + } + + /** + * Finds subexpressions and wraps them with `ExpressionProxy`. + */ + def proxyExpressions(expressions: Seq[Expression]): Seq[Expression] = { +val equivalentExpressions: EquivalentExpressions = new EquivalentExpressions + +expressions.foreach(equivalentExpressions.addExprTree(_)) + +var proxyMap = Map.empty[Expression, ExpressionProxy] + +val commonExprs = equivalentExpressions.getAllEquivalentExprs.filter(_.size > 1) +commonExprs.foreach { e => + val expr = e.head + val proxy = ExpressionProxy(expr, this) + + proxyMap ++= e.map(_ -> proxy).toMap +} + +// Only adding proxy if we find subexpressions. +if (proxyMap.nonEmpty) { + expressions.map { expr => +// `transform` will cause stackoverflow because it keeps transforming into +// `ExpressionProxy`. But we cannot use `transformUp` because we want to use +// subexpressions at higher level. So we `transformDown` until finding first +// subexpression. +var transformed = false +expr.transform { Review comment: Good point! I think I missed this point. I will add a test too. (actually there is a test)
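Editorial note: the pattern under review above is a top-down transform that stops at the first matched subexpression, so the inserted `ExpressionProxy` is never re-visited and wrapped again, while higher-level subexpressions still take precedence over nested ones. A minimal, illustrative sketch of that idea (assuming the `ExpressionProxy` class and `proxyMap` from the quoted diff; not the merged code):

```scala
// Illustrative only: wrap the first common subexpression found on the way down,
// then stop, so the replacement node is not itself transformed again.
def proxyOnce(expr: Expression, proxyMap: Map[Expression, ExpressionProxy]): Expression = {
  var transformed = false
  expr.transformDown {
    case e if !transformed && proxyMap.contains(e) =>
      transformed = true
      proxyMap(e) // the outermost (higher-level) subexpression wins
  }
}
```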
[GitHub] [spark] HeartSaVioR commented on pull request #30347: [SPARK-33209][SS] Refactor unit test of stream-stream join in UnsupportedOperationsSuite
HeartSaVioR commented on pull request #30347: URL: https://github.com/apache/spark/pull/30347#issuecomment-727156640 Hey, sorry. Please give me some more time to start looking into this, as it's already Saturday here. I'll probably take this up early next week. Thanks for understanding.
[GitHub] [spark] viirya commented on pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
viirya commented on pull request #30341: URL: https://github.com/apache/spark/pull/30341#issuecomment-727155028 Hmm, actually for https://github.com/apache/spark/pull/30341#discussion_r522904967, that is, adding some tests in `SQLQueryTestSuite` with `--CONFIG_DIM spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN`, I think it should be added as a separate PR and we should merge the test PR first? Then we can run this PR against it to check that the results are the same. @maropu
[GitHub] [spark] viirya commented on pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
viirya commented on pull request #30341: URL: https://github.com/apache/spark/pull/30341#issuecomment-727154709 I addressed most of the comments except for https://github.com/apache/spark/pull/30341#discussion_r522822589 and https://github.com/apache/spark/pull/30341#discussion_r522904967. For https://github.com/apache/spark/pull/30341#discussion_r522822589, I will wait for confirmation. For https://github.com/apache/spark/pull/30341#discussion_r522904967, I will add some more tests later.
[GitHub] [spark] viirya commented on a change in pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
viirya commented on a change in pull request #30341: URL: https://github.com/apache/spark/pull/30341#discussion_r523384442 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/EvaluationRuntimeSuite.scala ## @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.expressions + +import org.apache.spark.SparkFunSuite + +class EvaluationRuntimeSuite extends SparkFunSuite { Review comment: Okay, let me see how to add some tests for this.
[GitHub] [spark] viirya commented on a change in pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
viirya commented on a change in pull request #30341: URL: https://github.com/apache/spark/pull/30341#discussion_r523384163 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionProxy.scala ## @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.expressions + +import com.google.common.util.concurrent.{ExecutionError, UncheckedExecutionException} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode} +import org.apache.spark.sql.types.DataType + +/** + * A proxy for an catalyst `Expression`. Given a runtime object `EvaluationRuntime`, when this + * is asked to evaluate, it will load the evaluation cache in the runtime first. + */ +case class ExpressionProxy(child: Expression, runtime: EvaluationRuntime) extends Expression { + + final override def dataType: DataType = child.dataType + final override def nullable: Boolean = child.nullable + final override def children: Seq[Expression] = child :: Nil + + // `ExpressionProxy` is for interpreted expression evaluation only. So cannot `doGenCode`. + final override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = +throw new UnsupportedOperationException(s"Cannot generate code for expression: $this") + + def proxyEval(input: InternalRow = null): Any = { Review comment: Removed.
[GitHub] [spark] viirya commented on a change in pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
viirya commented on a change in pull request #30341: URL: https://github.com/apache/spark/pull/30341#discussion_r523384132 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EvaluationRuntime.scala ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.expressions + +import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.internal.SQLConf + +/** + * This class helps subexpression elimination for interpreted evaluation + * in `InterpretedUnsafeProjection`. It maintains an evaluation cache. + * This class wraps `ExpressionProxy` around given expressions. The `ExpressionProxy` + * intercepts expression evaluation and loads from the cache first. + */ +class EvaluationRuntime { Review comment: Renamed as `SubExprEvaluationRuntime` now.
[GitHub] [spark] viirya commented on a change in pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
viirya commented on a change in pull request #30341: URL: https://github.com/apache/spark/pull/30341#discussion_r523384138 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionProxy.scala ## @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.expressions + +import com.google.common.util.concurrent.{ExecutionError, UncheckedExecutionException} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode} +import org.apache.spark.sql.types.DataType + +/** + * A proxy for an catalyst `Expression`. Given a runtime object `EvaluationRuntime`, when this + * is asked to evaluate, it will load the evaluation cache in the runtime first. + */ +case class ExpressionProxy(child: Expression, runtime: EvaluationRuntime) extends Expression { Review comment: As a single file `SubExprEvaluationRuntime.scala` now.
[GitHub] [spark] viirya commented on a change in pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
viirya commented on a change in pull request #30341: URL: https://github.com/apache/spark/pull/30341#discussion_r523384143 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionProxy.scala ## @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.expressions + +import com.google.common.util.concurrent.{ExecutionError, UncheckedExecutionException} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode} +import org.apache.spark.sql.types.DataType + +/** + * A proxy for an catalyst `Expression`. Given a runtime object `EvaluationRuntime`, when this + * is asked to evaluate, it will load the evaluation cache in the runtime first. + */ +case class ExpressionProxy(child: Expression, runtime: EvaluationRuntime) extends Expression { + + final override def dataType: DataType = child.dataType + final override def nullable: Boolean = child.nullable + final override def children: Seq[Expression] = child :: Nil + + // `ExpressionProxy` is for interpreted expression evaluation only. So cannot `doGenCode`. + final override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = +throw new UnsupportedOperationException(s"Cannot generate code for expression: $this") + + def proxyEval(input: InternalRow = null): Any = { +child.eval(input) + } + + override def eval(input: InternalRow = null): Any = try { +runtime.cache.get(this).result + } catch { +// Cache.get() may wrap the original exception. See the following URL +// http://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/ +// Cache.html#get(K,%20java.util.concurrent.Callable) +case e @ (_: UncheckedExecutionException | _: ExecutionError) => Review comment: Moved.
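Editorial note on the exception handling quoted above: Guava's `LoadingCache.get` wraps unchecked exceptions thrown by the loader in `UncheckedExecutionException` and errors in `ExecutionError`, so callers typically unwrap and rethrow the original cause. The catch body is cut off in the quote; the sketch below shows the usual unwrapping pattern under that assumption and is not necessarily the exact code in the PR.

```scala
// Hedged sketch: rethrow the underlying cause so the original evaluation
// failure, not the Guava wrapper, reaches the caller.
override def eval(input: InternalRow = null): Any = try {
  runtime.cache.get(this).result
} catch {
  case e @ (_: UncheckedExecutionException | _: ExecutionError) =>
    throw e.getCause
}
```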
[GitHub] [spark] SparkQA commented on pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
SparkQA commented on pull request #30341: URL: https://github.com/apache/spark/pull/30341#issuecomment-727154090 **[Test build #131088 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131088/testReport)** for PR 30341 at commit [`4780b65`](https://github.com/apache/spark/commit/4780b653f079c2fd4c950d43758587875d89e41a).
[GitHub] [spark] viirya commented on a change in pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
viirya commented on a change in pull request #30341: URL: https://github.com/apache/spark/pull/30341#discussion_r523383340 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EvaluationRuntime.scala ## @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.expressions + +import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.internal.SQLConf + +/** + * This class helps subexpression elimination for interpreted evaluation + * in `InterpretedUnsafeProjection`. It maintains an evaluation cache. + * This class wraps `ExpressionProxy` around given expressions. The `ExpressionProxy` + * intercepts expression evaluation and loads from the cache first. + */ +class EvaluationRuntime { + + val cache: LoadingCache[ExpressionProxy, ResultProxy] = CacheBuilder.newBuilder() +.maximumSize(SQLConf.get.subexpressionEliminationCacheMaxEntries) +.build( + new CacheLoader[ExpressionProxy, ResultProxy]() { +override def load(expr: ExpressionProxy): ResultProxy = { + ResultProxy(expr.proxyEval(currentInput)) +} + }) + + private var currentInput: InternalRow = null + + /** + * Sets given input row as current row for evaluating expressions. This cleans up the cache + * too as new input comes. + */ + def setInput(input: InternalRow = null): Unit = { +currentInput = input +cache.invalidateAll() + } + + /** + * Finds subexpressions and wraps them with `ExpressionProxy`. + */ + def proxyExpressions(expressions: Seq[Expression]): Seq[Expression] = { +val equivalentExpressions: EquivalentExpressions = new EquivalentExpressions + +expressions.foreach(equivalentExpressions.addExprTree(_)) + +var proxyMap = Map.empty[Expression, ExpressionProxy] + +val commonExprs = equivalentExpressions.getAllEquivalentExprs.filter(_.size > 1) +commonExprs.foreach { e => + val expr = e.head + val proxy = ExpressionProxy(expr, this) + + proxyMap ++= e.map(_ -> proxy).toMap +} + +// Only adding proxy if we find subexpressions. +if (proxyMap.nonEmpty) { + expressions.map { expr => +// `transform` will cause stackoverflow because it keeps transforming into +// `ExpressionProxy`. But we cannot use `transformUp` because we want to use +// subexpressions at higher level. So we `transformDown` until finding first +// subexpression. +var transformed = false Review comment: Oh, good idea, I like it.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30358: [SPARK-33394][SQL] Throw `NoSuchNamespaceException` for not existing namespace in `InMemoryTableCatalog.listTables()`
dongjoon-hyun commented on a change in pull request #30358: URL: https://github.com/apache/spark/pull/30358#discussion_r523383165 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala ## @@ -57,6 +58,13 @@ trait ShowTablesSuiteBase extends QueryTest with SQLTestUtils { } } + test("show table in a not existing namespace") { +val msg = intercept[NoSuchNamespaceException] { Review comment: Got it~, @MaxGekk .
[GitHub] [spark] MaxGekk commented on a change in pull request #30358: [SPARK-33394][SQL] Throw `NoSuchNamespaceException` for not existing namespace in `InMemoryTableCatalog.listTables()`
MaxGekk commented on a change in pull request #30358: URL: https://github.com/apache/spark/pull/30358#discussion_r523382673 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala ## @@ -57,6 +58,13 @@ trait ShowTablesSuiteBase extends QueryTest with SQLTestUtils { } } + test("show table in a not existing namespace") { +val msg = intercept[NoSuchNamespaceException] { Review comment: I didn't touch v1; it behaves as before. `NoSuchDatabaseException` is a sub-class of `NoSuchNamespaceException`. Or do you mean the change in the test?
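Editorial note: a small illustration of the point above. Because `NoSuchDatabaseException` extends `NoSuchNamespaceException` (per the comment), a handler or a ScalaTest `intercept` for the parent type still matches what the v1 catalog throws, so the v1 behavior is unchanged. The constructor argument is hypothetical and the snippet is illustrative only.

```scala
// Illustrative only: catching the parent exception type also catches the v1 subclass.
try {
  throw new NoSuchDatabaseException("unknown_db") // what the v1 catalog raises
} catch {
  case e: NoSuchNamespaceException =>             // what the shared test intercepts
    println(s"caught: ${e.getMessage}")
}
```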
[GitHub] [spark] c21 commented on pull request #30347: [SPARK-33209][SS] Refactor unit test of stream-stream join in UnsupportedOperationsSuite
c21 commented on pull request #30347: URL: https://github.com/apache/spark/pull/30347#issuecomment-727151821 @HeartSaVioR - could you also help take a look when you have time? Thanks.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30358: [SPARK-33394][SQL] Throw `NoSuchNamespaceException` for not existing namespace in `InMemoryTableCatalog.listTables()`
dongjoon-hyun commented on a change in pull request #30358: URL: https://github.com/apache/spark/pull/30358#discussion_r523382095 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala ## @@ -57,6 +58,13 @@ trait ShowTablesSuiteBase extends QueryTest with SQLTestUtils { } } + test("show table in a not existing namespace") { +val msg = intercept[NoSuchNamespaceException] { Review comment: @cloud-fan and @MaxGekk . This means a behavior change for both v1 and v2, doesn't it? - `v1`: `NoSuchDatabaseException` -> `NoSuchNamespaceException` - `v2`: empty list -> `NoSuchNamespaceException`
[GitHub] [spark] viirya commented on a change in pull request #30341: [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
viirya commented on a change in pull request #30341: URL: https://github.com/apache/spark/pull/30341#discussion_r523381340 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionProxy.scala ## @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.expressions + +import com.google.common.util.concurrent.{ExecutionError, UncheckedExecutionException} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode} +import org.apache.spark.sql.types.DataType + +/** + * A proxy for an catalyst `Expression`. Given a runtime object `EvaluationRuntime`, when this + * is asked to evaluate, it will load the evaluation cache in the runtime first. + */ +case class ExpressionProxy(child: Expression, runtime: EvaluationRuntime) extends Expression { + + final override def dataType: DataType = child.dataType + final override def nullable: Boolean = child.nullable + final override def children: Seq[Expression] = child :: Nil + + // `ExpressionProxy` is for interpreted expression evaluation only. So cannot `doGenCode`. + final override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = +throw new UnsupportedOperationException(s"Cannot generate code for expression: $this") + + def proxyEval(input: InternalRow = null): Any = { +child.eval(input) + } + + override def eval(input: InternalRow = null): Any = try { +runtime.cache.get(this).result Review comment: I think it should be fine but want to make sure I get your point correctly.
[GitHub] [spark] BryanCutler commented on a change in pull request #30309: [SPARK-33407][PYTHON] Simplify the exception message from Python UDFs (disabled by default)
BryanCutler commented on a change in pull request #30309: URL: https://github.com/apache/spark/pull/30309#discussion_r523380710 ## File path: python/pyspark/util.py ## @@ -75,6 +79,144 @@ def wrapper(*args, **kwargs): return wrapper +def walk_tb(tb): +while tb is not None: +yield tb +tb = tb.tb_next + + +def try_simplify_traceback(tb): +""" +Simplify the traceback. It removes the tracebacks in the current package, and only +shows the traceback that is related to the thirdparty and user-specified codes. + +Returns +--- +TracebackType or None + Simplified traceback instance. It returns None if it fails to simplify. + +Notes +- +This keeps the tracebacks once it sees they are from a different file even +though the following tracebacks are from the current package. + +Examples + +>>> import importlib +>>> import sys +>>> import traceback +>>> import tempfile +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... with open("%s/dummy_module.py" % tmp_dir, "w") as f: +... _ = f.write( +... 'def raise_stop_iteration():\\n' +... 'raise StopIteration()\\n\\n' +... 'def simple_wrapper(f):\\n' +... 'def wrapper(*a, **k):\\n' +... 'return f(*a, **k)\\n' +... 'return wrapper\\n') +... f.flush() +... spec = importlib.util.spec_from_file_location( +... "dummy_module", "%s/dummy_module.py" % tmp_dir) +... dummy_module = importlib.util.module_from_spec(spec) +... spec.loader.exec_module(dummy_module) +>>> def skip_doctest_traceback(tb): +... import pyspark +... root = os.path.dirname(pyspark.__file__) +... pairs = zip(walk_tb(tb), traceback.extract_tb(tb)) +... for cur_tb, cur_frame in pairs: +... if cur_frame.filename.startswith(root): +... return cur_tb + +Regular exceptions should show the file name of the current package as below. + +>>> exc_info = None +>>> try: +...fail_on_stopiteration(dummy_module.raise_stop_iteration)() Review comment: indentation off here? ## File path: python/pyspark/util.py ## @@ -20,6 +20,10 @@ import re import sys import traceback +import types +import os +import platform +import itertools Review comment: nit: imports not in alphabetical order - not necessary though ## File path: python/pyspark/util.py ## @@ -75,6 +79,144 @@ def wrapper(*args, **kwargs): return wrapper +def walk_tb(tb): +while tb is not None: +yield tb +tb = tb.tb_next + + +def try_simplify_traceback(tb): +""" +Simplify the traceback. It removes the tracebacks in the current package, and only +shows the traceback that is related to the thirdparty and user-specified codes. + +Returns +--- +TracebackType or None + Simplified traceback instance. It returns None if it fails to simplify. + +Notes +- +This keeps the tracebacks once it sees they are from a different file even +though the following tracebacks are from the current package. + +Examples + +>>> import importlib +>>> import sys +>>> import traceback +>>> import tempfile +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... with open("%s/dummy_module.py" % tmp_dir, "w") as f: +... _ = f.write( +... 'def raise_stop_iteration():\\n' +... 'raise StopIteration()\\n\\n' +... 'def simple_wrapper(f):\\n' +... 'def wrapper(*a, **k):\\n' +... 'return f(*a, **k)\\n' +... 'return wrapper\\n') +... f.flush() +... spec = importlib.util.spec_from_file_location( +... "dummy_module", "%s/dummy_module.py" % tmp_dir) +... dummy_module = importlib.util.module_from_spec(spec) +... spec.loader.exec_module(dummy_module) +>>> def skip_doctest_traceback(tb): +... import pyspark +... root = os.path.dirname(pyspark.__file__) +... 
pairs = zip(walk_tb(tb), traceback.extract_tb(tb)) +... for cur_tb, cur_frame in pairs: +... if cur_frame.filename.startswith(root): +... return cur_tb + +Regular exceptions should show the file name of the current package as below. + +>>> exc_info = None +>>> try: +...fail_on_stopiteration(dummy_module.raise_stop_iteration)() +... except Exception as e: +... tb = sys.exc_info()[-1] +... e.__cause__ = None +... exc_info = "".join( +... traceback.format_exception(type(e), e, tb)) +>>> print(exc_info) # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS +Traceback
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxCh
AmplabJenkins removed a comment on pull request #30139: URL: https://github.com/apache/spark/pull/30139#issuecomment-727147876
[GitHub] [spark] viirya commented on a change in pull request #30309: [SPARK-33407][PYTHON] Simplify the exception message from Python UDFs (disabled by default)
viirya commented on a change in pull request #30309: URL: https://github.com/apache/spark/pull/30309#discussion_r523380970 ## File path: python/pyspark/util.py ## @@ -177,6 +319,8 @@ def __del__(self): if __name__ == "__main__": import doctest -(failure_count, test_count) = doctest.testmod() -if failure_count: -sys.exit(-1) + +if "pypy" not in platform.python_implementation().lower() and sys.version_info[:2] >= (3, 7): +(failure_count, test_count) = doctest.testmod() Review comment: Does this mean we also skip the other doctests in `python/pyspark/util.py`?
[GitHub] [spark] HeartSaVioR commented on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR commented on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727149697 I just initiated the discussion on the dev@ mailing list, which is what I should have done instead. https://lists.apache.org/thread.html/r30069e17f59e8d29267ae296d56840970905476019023f20164ee5a3%40%3Cdev.spark.apache.org%3E My apologies for making noise and making anyone unhappy.
[GitHub] [spark] dongjoon-hyun removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
dongjoon-hyun removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727125565
[GitHub] [spark] HeartSaVioR removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727138220 I was feeling odd and became feeling upset because my intention wasn't change from the first comment and the intention was disregarded (at least that's what I felt like) for a couple of early replies, even I emphasized whether you checked my comments were addressed or not is not the point of concern here. The replies were simply based on whether there was technically any fault or not, and in addition, trying to interpret my action on your view and framed me. I never said my approval is required because I'm domain expert of SS. I mentioned twos including me because twos are only the reviewers who left "actual" review comments, not saying we're dealing with SS hence we are gatekeeper. This is unchanged from the first comment and I never said about SS in my comments. Same happened for the statement I'm framing the Apache Spark community be the negative. I'm going to be a bad guy based on what I even didn't mention. How I can't be upset? Can we make sure we carefully read the comment and try not to interpret and just ask if we're unsure about the intention? Other than that I feel this is a waste of time. I have been trying to explain the committer's power to say "technically correct" is not always representing "we are doing good", but I also admit it's super hard to reconsider the practice we have been using for a while, and I feel the air that we're reluctant to even think about it. At least I shouldn't deal with individual, as it's against my voice as well and that feels someone to be educated. My apologize again.
[GitHub] [spark] HeartSaVioR removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727120756
[GitHub] [spark] HyukjinKwon removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HyukjinKwon removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-726476089
[GitHub] [spark] HeartSaVioR removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-726418896 That's not my point. There's no indication I have produced all review comments (while actually I produced all review comments), and review comment author would be the right one to verify whether the review comments are addressed as suggested. I planned to review the new change again today but it's already merged. I see we tend to ask reviewers to see whether they're happy with the last change before merging. At least we "were", and I tend to ensure that, unless reviewers are not responded for days. I'd hope we ensure everyone is happy with the change before merging, otherwise discuss/debate if necessary.
[GitHub] [spark] dongjoon-hyun removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
dongjoon-hyun removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-726811877 @HeartSaVioR . It seems that you tend to frame the Apache Spark community in negative ways. For your examples, it clearly looks like a straw-man attack. All of us didn't claim such funny things. Putting them aside, I want to say that I have been appreciate your contributions really, especially in this area, `Structural Streaming` because this area really needs lots of more care from the community. To be constructive, it would be great if - You can let us know what was your un-address comments in this PR (https://github.com/apache/spark/pull/30210#issuecomment-726413507) - You can do review after commits. Please do it now instead of stopping like this. Apache Spark community is widely open for that and I am also happy to accept any comments at any timeframe. - You can reply on the corresponding consensus email threads about this PR (which we already did based on your advice on JIRA and here) if you changed your mind. We are open for that, too. - You can raise another email thread about your specific Spark practice requests like you did before. I was also happy about your issue raising before and will be at this time again. I believe both of us want to collaborate in a productive way in this area, @HeartSaVioR .
[GitHub] [spark] dongjoon-hyun removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
dongjoon-hyun removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-726413507
[GitHub] [spark] HeartSaVioR removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-726410128 (I hope the PR can wait for reviewers' approvals who left valid review comments. I understand the PR can be merged "technically", but doesn't seem to be a good practice. If in doubt w.r.t reviewers being inactive, we can at least wait for a day based on timezone difference.)
[GitHub] [spark] HeartSaVioR removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-726428939 No that's not also my point. I don't claim about the domain owner (but in practice I see there's implicit domain owner). You can merge the PR in SS area without me if I didn't review. The concern is that I was the one of active reviewers and you're just assuming my review comments were addressed without my confirmation. Both my and @xuanyuanking's comments. I don't require others to read my mind, and anyone can't read my mind, so there should be the process of "confirmation". That would just bring delay of merging a day or so, wouldn't hurt anything.
[GitHub] [spark] HeartSaVioR removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-726569827 I'd rather avoid the chance of "post-review" whenever possible, but I'd admit everyone has different thoughts. I'm OK with it, and if that's considered here (and no one would be disappointed from post-review), I was clearly wrong. My bad. I hope we prefer to say about "good practice" of doing something instead of saying "whether someone is the right to do something". Everything is technically possible. e.g. I don't think Spark has some sort of BYLAWS which prevents committers to self +1 and push, but I don't leverage it even I still have PRs sitting around years which got LGTM from contributors, not just from committers. It's just because it's a bad practice. Agree to disagree about the different understanding of "details" about good practice of reviewing is OK (because I may be wrong), but I'd be unhappy if we keep saying about the "right". This says I should always start with explicitly vote -1 whenever I start reviewing, which is the only way about the claim of "my right" of reviewer, but no one would do this as we have a belief and know about the huge negative effect on putting veto. And sure I don't change my reviewing practice to start from -1. (That's gonna be just funny.)
[GitHub] [spark] HeartSaVioR removed a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR removed a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-726433238 What do you mean? I'm not saying you missed a comment. I'm saying the confirmation is better done by the reviewers themselves, not by someone else. The confirmation would only take a day or so (due to timezones) if the reviewers are reasonably active. You can even say "I'll merge tomorrow if there's no further feedback." to limit the delay to a day. You're emphasizing the 2 weeks, but I can enumerate PRs that have waited for half a year or even years. The code change is quite small (because the detection logic was already there), but given that it breaks backward compatibility, asking for consensus isn't an odd request. If the default were false, we would probably be less concerned. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR commented on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727149286 Sorry to everyone for all the noise. Please disregard the conversation above. I'll remove it now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxChunksBein
AmplabJenkins commented on pull request #30139: URL: https://github.com/apache/spark/pull/30139#issuecomment-727147876 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxChunksBe
SparkQA removed a comment on pull request #30139: URL: https://github.com/apache/spark/pull/30139#issuecomment-727132757 **[Test build #131087 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131087/testReport)** for PR 30139 at commit [`fab5557`](https://github.com/apache/spark/commit/fab5557e1e0a0f4b78868ab98e55b05b997864b2). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxChunksBeingTrans
SparkQA commented on pull request #30139: URL: https://github.com/apache/spark/pull/30139#issuecomment-727147710 **[Test build #131087 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131087/testReport)** for PR 30139 at commit [`fab5557`](https://github.com/apache/spark/commit/fab5557e1e0a0f4b78868ab98e55b05b997864b2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30374: [WIP][SPARK-33444][ML] Added support for Initial model in Gaussian Mixture Model in ML
AmplabJenkins removed a comment on pull request #30374: URL: https://github.com/apache/spark/pull/30374#issuecomment-727147056 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30374: [WIP][SPARK-33444][ML] Added support for Initial model in Gaussian Mixture Model in ML
AmplabJenkins commented on pull request #30374: URL: https://github.com/apache/spark/pull/30374#issuecomment-727147154 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahar603 opened a new pull request #30374: [WIP][SPARK-33444][ML] Added support for Initial model in Gaussian Mixture Model in ML
shahar603 opened a new pull request #30374: URL: https://github.com/apache/spark/pull/30374 ### What changes were proposed in this pull request? Added an optional `initialModel` parameter to the `GaussianMixture` class. ### Why are the changes needed? To allow for non-random initialization in ML's Gaussian Mixture. The algorithm's results can be sensitive to the initialization, and the existing random initialization may be limiting in some cases. This feature already exists in MLlib. ### Does this PR introduce _any_ user-facing change? Yes: * A new method: `setInitialModel` in the `GaussianMixture` class ### How was this patch tested? Tests were added to the `GaussianMixtureSuite.scala` file. The tests verify: * The ability to set the initial model * That the `GaussianMixture` object fits data using the initial parameters * That the `GaussianMixture` class validates that `K` is the same as the initial model's `K` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
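A rough usage sketch of the warm-start flow the PR describes — `setInitialModel` is the method proposed here, not an existing Spark ML API, and the data and parameter values are made up for illustration:

```scala
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object GmmWarmStartSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gmm-warm-start").getOrCreate()
    import spark.implicits._

    // Toy feature vectors; any DataFrame with a "features" column works.
    val data = Seq(
      Vectors.dense(0.1, 0.2), Vectors.dense(0.2, 0.1),
      Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)
    ).map(Tuple1.apply).toDF("features")

    // First fit: the existing random initialization, few iterations.
    val coarseModel = new GaussianMixture().setK(2).setMaxIter(5).fit(data)

    // Second fit: warm start from the previous model.
    // NOTE: setInitialModel is the API proposed in this PR (SPARK-33444), not a
    // released method; per the PR, K must match the initial model's K.
    val refined = new GaussianMixture()
      .setK(2)
      .setMaxIter(50)
      .setInitialModel(coarseModel) // hypothetical API from this PR
      .fit(data)

    refined.gaussians.foreach(g => println(g.mean))
    spark.stop()
  }
}
```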
[GitHub] [spark] AmplabJenkins commented on pull request #30374: [WIP][SPARK-33444][ML] Added support for Initial model in Gaussian Mixture Model in ML
AmplabJenkins commented on pull request #30374: URL: https://github.com/apache/spark/pull/30374#issuecomment-727147056 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
dongjoon-hyun commented on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727146418 Can we talk over the phone? It would be better for us. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
AmplabJenkins removed a comment on pull request #29414: URL: https://github.com/apache/spark/pull/29414#issuecomment-727143194 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
AmplabJenkins commented on pull request #29414: URL: https://github.com/apache/spark/pull/29414#issuecomment-727143194 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
SparkQA removed a comment on pull request #29414: URL: https://github.com/apache/spark/pull/29414#issuecomment-727093834 **[Test build #131086 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131086/testReport)** for PR 29414 at commit [`5b41fbc`](https://github.com/apache/spark/commit/5b41fbc028620a289cb8bc853e8de5861428da87). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
SparkQA commented on pull request #29414: URL: https://github.com/apache/spark/pull/29414#issuecomment-727143021 **[Test build #131086 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131086/testReport)** for PR 29414 at commit [`5b41fbc`](https://github.com/apache/spark/commit/5b41fbc028620a289cb8bc853e8de5861428da87). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30373: [SPARK-33183][SQL][FOLLOW-UP] Adjust RemoveRedundantSorts rule order and update config version
AmplabJenkins removed a comment on pull request #30373: URL: https://github.com/apache/spark/pull/30373#issuecomment-727142229 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30373: [SPARK-33183][SQL][FOLLOW-UP] Adjust RemoveRedundantSorts rule order and update config version
AmplabJenkins commented on pull request #30373: URL: https://github.com/apache/spark/pull/30373#issuecomment-727142229 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30373: [SPARK-33183][SQL][FOLLOW-UP] Adjust RemoveRedundantSorts rule order and update config version
SparkQA removed a comment on pull request #30373: URL: https://github.com/apache/spark/pull/30373#issuecomment-727092021 **[Test build #131085 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131085/testReport)** for PR 30373 at commit [`e6b0bb3`](https://github.com/apache/spark/commit/e6b0bb301d928ce8e9970ec8b887ff06e3110b6d). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30373: [SPARK-33183][SQL][FOLLOW-UP] Adjust RemoveRedundantSorts rule order and update config version
SparkQA commented on pull request #30373: URL: https://github.com/apache/spark/pull/30373#issuecomment-727142042 **[Test build #131085 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131085/testReport)** for PR 30373 at commit [`e6b0bb3`](https://github.com/apache/spark/commit/e6b0bb301d928ce8e9970ec8b887ff06e3110b6d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxCh
AmplabJenkins removed a comment on pull request #30139: URL: https://github.com/apache/spark/pull/30139#issuecomment-727140308 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/35690/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxCh
AmplabJenkins removed a comment on pull request #30139: URL: https://github.com/apache/spark/pull/30139#issuecomment-727140306 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxChunksBein
AmplabJenkins commented on pull request #30139: URL: https://github.com/apache/spark/pull/30139#issuecomment-727140306 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxChunksBeingTrans
SparkQA commented on pull request #30139: URL: https://github.com/apache/spark/pull/30139#issuecomment-727140293 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35690/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR edited a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727138220 I felt odd and then became upset because my intention hasn't changed from the first comment, yet it was disregarded (at least that's how it felt) in a couple of the early replies, even though I emphasized that whether you checked that my comments were addressed is not the point of concern here. The replies were based simply on whether there was technically any fault or not, and on top of that, they interpreted my actions from your point of view and framed me. I never said my approval is required because I'm a domain expert in SS. I mentioned the two of us, including me, because we were the only reviewers who left "actual" review comments — not because we deal with SS and are therefore gatekeepers. This is unchanged from my first comment, and I never mentioned SS in my comments. The same happened with the statement that I'm framing the Apache Spark community negatively. I'm being made a bad guy based on something I never even mentioned. How can I not be upset? Can we make sure we read comments carefully, avoid over-interpreting, and just ask if we're unsure about the intention? Other than that, I feel this is a waste of time. I have been trying to explain that a committer's power to say "technically correct" does not always mean "we are doing well", but I also admit it's very hard to reconsider a practice we have been using for a while, and I sense that we're reluctant to even think about it. At least I shouldn't have made this about an individual, as that goes against my own point and feels like lecturing someone. My apologies again. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR commented on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727138220 I felt odd and then became upset because my intention hasn't changed from the first comment, yet it was disregarded (at least that's how it felt) in a couple of the early replies, even though I emphasized that whether you checked that my comments were addressed is not the point of concern here. The replies were based simply on whether there was any fault or not, and on top of that, they interpreted my actions from your point of view and framed me. I never said my approval is required because I'm a domain expert in SS. I mentioned the two of us, including me, because we were the only reviewers who left "actual" review comments — not because we deal with SS and are therefore gatekeepers. This is unchanged from my first comment, and I never mentioned SS in my comments. The same happened with the statement that I'm framing the Apache Spark community negatively. I'm being made a bad guy based on something I never even mentioned. How can I not be upset? Can we make sure we read comments carefully, avoid over-interpreting, and just ask if we're unsure about the intention? Other than that, I feel this is a waste of time. I have been trying to explain that a committer's power to say "technically correct" does not always mean "we are doing well", but I also admit it's very hard to reconsider a practice we have been using for a while, and I sense that we're reluctant to even think about it. At least I shouldn't have made this about an individual, as that goes against my own point and feels like lecturing someone. My apologies again. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxChunksBeingTrans
SparkQA commented on pull request #30139: URL: https://github.com/apache/spark/pull/30139#issuecomment-727137606 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35690/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #30370: [SPARK-33446][CORE] Add config spark.executor.coresOverhead
mridulm commented on pull request #30370: URL: https://github.com/apache/spark/pull/30370#issuecomment-727133352 > I want an executor with 2 cores and 6 gb, but only 1 core used for task allocation, which means at most 1 task could be running on this executor. If I use existing configs, there would be at most 2 tasks. I want to give each task 6GB memory, but also use extra cpu time for gc or other things. GC is handled by the JVM - we don't need an additional core for it. Without a good use case, I am not in favor of adding additional configurations. Having said that, if the requirement is strictly what you mentioned: allocate two cores and 6gb, run only 1 task and leave 1 core around for "other things" (let me assume some background processing? Not sure what is happening): you can use spark.task.cpus=2 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
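As a concrete sketch of the suggestion above (the app name is a placeholder; the resource values simply mirror the example in the thread), the existing settings already give a 2-core, 6 GB executor that runs at most one task at a time:

```scala
import org.apache.spark.sql.SparkSession

// Each task claims both executor cores, so at most one task runs per
// 2-core / 6 GB executor and the second core is left for JVM-internal work.
val spark = SparkSession.builder()
  .appName("one-task-per-executor") // placeholder name
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "6g")
  .config("spark.task.cpus", "2")
  .getOrCreate()
```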
[GitHub] [spark] manuzhang commented on pull request #29797: [SPARK-32932][SQL] Do not use local shuffle reader at final stage on write command
manuzhang commented on pull request #29797: URL: https://github.com/apache/spark/pull/29797#issuecomment-727132781 @maryannxue thanks for pointing out the fundamental problems; I'm not familiar with the design behind this. When starting out, I was trying to solve the most urgent issue in our use cases: too many output files caused by the local shuffle reader (LSR), which broke the downstream jobs in the pipeline. Meanwhile, disabling LSR entirely degrades the performance of broadcast hash joins (BHJ) switched from sort-merge joins (SMJ). Hence this PR, with the help of @cloud-fan. > But again, this is a "repartition" issue rather than the write command issue. Introducing this kind of hack in AQE plan does not make sense. 1. Is it OK, or even desirable, for a repartition to be optimized away in cases other than a write command? 2. Moving forward, what's your suggestion? Do you want me to revert this PR? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
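For reference, the stop-gap mentioned above — disabling the local shuffle reader entirely — is a single AQE flag in Spark 3.0+; a minimal sketch, with the session setup being purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Keeps AQE on but turns off the local shuffle reader, avoiding the
// small-output-file explosion on write at the cost of losing the local-read
// optimization when a sort-merge join is planned into a broadcast hash join.
val spark = SparkSession.builder()
  .appName("disable-local-shuffle-reader") // placeholder name
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.localShuffleReader.enabled", "false")
  .getOrCreate()
```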
[GitHub] [spark] SparkQA commented on pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxChunksBeingTrans
SparkQA commented on pull request #30139: URL: https://github.com/apache/spark/pull/30139#issuecomment-727132757 **[Test build #131087 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131087/testReport)** for PR 30139 at commit [`fab5557`](https://github.com/apache/spark/commit/fab5557e1e0a0f4b78868ab98e55b05b997864b2). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting externa
AmplabJenkins removed a comment on pull request #30164: URL: https://github.com/apache/spark/pull/30164#issuecomment-727123181 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/131082/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuf
SparkQA removed a comment on pull request #30164: URL: https://github.com/apache/spark/pull/30164#issuecomment-726987157 **[Test build #131082 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131082/testReport)** for PR 30164 at commit [`9ba4dfb`](https://github.com/apache/spark/commit/9ba4dfbed66fab7badf279ce3e83d4db569d61e2). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting externa
AmplabJenkins removed a comment on pull request #30164: URL: https://github.com/apache/spark/pull/30164#issuecomment-727122853 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `ma
AngersZh commented on a change in pull request #30139: URL: https://github.com/apache/spark/pull/30139#discussion_r523364609 ## File path: common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java ## @@ -88,12 +88,14 @@ public void processFetchRequest( logger.trace("Received req from {} to fetch block {}", getRemoteAddress(channel), msg.streamChunkId); } -long chunksBeingTransferred = streamManager.chunksBeingTransferred(); -if (chunksBeingTransferred >= maxChunksBeingTransferred) { - logger.warn("The number of chunks being transferred {} is above {}, close the connection.", -chunksBeingTransferred, maxChunksBeingTransferred); - channel.close(); - return; +if (maxChunksBeingTransferred < Long.MAX_VALUE) { + long chunksBeingTransferred = streamManager.chunksBeingTransferred(); + if (chunksBeingTransferred >= maxChunksBeingTransferred) { +logger.warn("The number of chunks being transferred {} is above {}, close the connection.", +chunksBeingTransferred, maxChunksBeingTransferred); Review comment: > Nit: indentation should be 2 Done ## File path: common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java ## @@ -124,12 +124,14 @@ private void processStreamRequest(final StreamRequest req) { req.streamId); } -long chunksBeingTransferred = streamManager.chunksBeingTransferred(); -if (chunksBeingTransferred >= maxChunksBeingTransferred) { - logger.warn("The number of chunks being transferred {} is above {}, close the connection.", -chunksBeingTransferred, maxChunksBeingTransferred); - channel.close(); - return; +if (maxChunksBeingTransferred < Long.MAX_VALUE) { + long chunksBeingTransferred = streamManager.chunksBeingTransferred(); + if (chunksBeingTransferred >= maxChunksBeingTransferred) { +logger.warn("The number of chunks being transferred {} is above {}, close the connection.", +chunksBeingTransferred, maxChunksBeingTransferred); Review comment: > Nit: indentation should be 2 spaces Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
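The change under review boils down to a cheap short-circuit: only pay for the potentially expensive count of in-flight chunks when a finite limit is actually configured. A standalone sketch of the same pattern, with illustrative names rather than Spark's actual classes:

```scala
// Illustrative stand-in for a stream manager whose in-flight chunk count is
// costly to compute (in Spark's handler it walks every open stream).
class IllustrativeStreamManager(openChunksPerStream: Seq[Long]) {
  def chunksBeingTransferred(): Long = openChunksPerStream.sum // stands in for the expensive scan
}

// Mirrors the guarded check from the diff: skip the expensive count entirely
// when the limit is left at its "unlimited" default of Long.MaxValue.
def shouldCloseChannel(manager: IllustrativeStreamManager, maxChunksBeingTransferred: Long): Boolean =
  maxChunksBeingTransferred < Long.MaxValue &&
    manager.chunksBeingTransferred() >= maxChunksBeingTransferred
```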
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
dongjoon-hyun edited a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727130032 You are repeatedly saying here that I took your rights away. In this PR's context, that doesn't make sense to me either. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
dongjoon-hyun commented on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727130032 You are saying repeatedly that I took your rights away. In this PR's context, that doesn't make sense to me either. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
dongjoon-hyun edited a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727125565 Please remember that I also replied politely to your first comment, like this. I apologized first and tried to understand your request and fix it immediately, because I thought I had missed some of your comments. > Sorry, but could you be more specific? It seems that I missed some comments. But it isn't true, is it? You didn't give any signal here. Suddenly, you were upset because I didn't ask for your approval explicitly. After that, you ignored my approval step and suggested that I should say `I'll merge in tomorrow if there's no further feedback.`. That's why I feel you disregarded me. The rest of the discussion looks like the same pattern to me. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
dongjoon-hyun commented on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727125565 Please remember that I also replied politely to your first comment, like this. I apologized first and tried to understand your request and fix it immediately, because I thought I had missed some of your comments. > Sorry, but could you be more specific? It seems that I missed some comments. But it isn't true, is it? You didn't give any signal here. Suddenly, you were upset that I didn't ask for your approval explicitly. After that, you ignored my approval step and suggested that I should say `I'll merge in tomorrow if there's no further feedback.`. That's why I feel you disregarded me. The rest of the discussion looks like the same pattern to me. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on pull request #30242: [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends.
ueshin commented on pull request #30242: URL: https://github.com/apache/spark/pull/30242#issuecomment-727124941 gentle ping This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #30366: [WIP][SPARK-33440][CORE] Separate the calculation for the next renewal date for each delegation token
HeartSaVioR commented on pull request #30366: URL: https://github.com/apache/spark/pull/30366#issuecomment-727124742 I just changed the PR to a draft, as I want to make sure we consider a better change (than the PR's current approach) that accounts for my comments above. Reviewers, please treat the code change as a reference only, and go through the comments below and share your opinion. Thanks in advance! https://github.com/apache/spark/pull/30366#issuecomment-726531257 https://github.com/apache/spark/pull/30366#issuecomment-726534500 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on a change in pull request #30139: [SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxChunk
otterc commented on a change in pull request #30139: URL: https://github.com/apache/spark/pull/30139#discussion_r523353481 ## File path: common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java ## @@ -88,12 +88,14 @@ public void processFetchRequest( logger.trace("Received req from {} to fetch block {}", getRemoteAddress(channel), msg.streamChunkId); } -long chunksBeingTransferred = streamManager.chunksBeingTransferred(); -if (chunksBeingTransferred >= maxChunksBeingTransferred) { - logger.warn("The number of chunks being transferred {} is above {}, close the connection.", -chunksBeingTransferred, maxChunksBeingTransferred); - channel.close(); - return; +if (maxChunksBeingTransferred < Long.MAX_VALUE) { + long chunksBeingTransferred = streamManager.chunksBeingTransferred(); + if (chunksBeingTransferred >= maxChunksBeingTransferred) { +logger.warn("The number of chunks being transferred {} is above {}, close the connection.", +chunksBeingTransferred, maxChunksBeingTransferred); Review comment: Nit: indentation should be 2 ## File path: common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java ## @@ -124,12 +124,14 @@ private void processStreamRequest(final StreamRequest req) { req.streamId); } -long chunksBeingTransferred = streamManager.chunksBeingTransferred(); -if (chunksBeingTransferred >= maxChunksBeingTransferred) { - logger.warn("The number of chunks being transferred {} is above {}, close the connection.", -chunksBeingTransferred, maxChunksBeingTransferred); - channel.close(); - return; +if (maxChunksBeingTransferred < Long.MAX_VALUE) { + long chunksBeingTransferred = streamManager.chunksBeingTransferred(); + if (chunksBeingTransferred >= maxChunksBeingTransferred) { +logger.warn("The number of chunks being transferred {} is above {}, close the connection.", +chunksBeingTransferred, maxChunksBeingTransferred); Review comment: Nit: indentation should be 2 spaces This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR edited a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727121487 And I also admit I have a different view on post-review. I agree that post-review opens the possibility for reviewers who weren't active during the pre-merge review period to review later, but IMHO it's an over-use of post-review if it's used to give the merger a rationale to cut the review period short at any time, leaving other reviewers feeling **disregarded**. For sure, that's just my 2 cents. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffl
AmplabJenkins commented on pull request #30164: URL: https://github.com/apache/spark/pull/30164#issuecomment-727123177 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffle serv
SparkQA commented on pull request #30164: URL: https://github.com/apache/spark/pull/30164#issuecomment-727122995 **[Test build #131082 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131082/testReport)** for PR 30164 at commit [`9ba4dfb`](https://github.com/apache/spark/commit/9ba4dfbed66fab7badf279ce3e83d4db569d61e2). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30164: [SPARK-32919][SHUFFLE][test-maven][test-hadoop2.7] Driver side changes for coordinating push based shuffle by selecting external shuffl
AmplabJenkins commented on pull request #30164: URL: https://github.com/apache/spark/pull/30164#issuecomment-727122853 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #30366: [SPARK-33440][CORE] Separate the calculation for the next renewal date for each delegation token
HeartSaVioR edited a comment on pull request #30366: URL: https://github.com/apache/spark/pull/30366#issuecomment-727107472 @tgravescs Thanks for the comment. > I thought the way we did it was just got the earliest renewal so we didn't have to track all them individually because the common case is renewal only happens in hours time frame - like once every 24 hours but its obviously configurable. Yes, that's configurable, but I agree it normally happens on an hours-scale time frame. The problematic token also produced an expiration time 7 days out, even longer than the Hadoop delegation token's. The problem is simply that Spark assumes a token identifier has a valid issue date, whereas that isn't "guaranteed" for every implementation. Once that precondition is broken, the calculation goes completely off in an unexpected way. There is a safeguard, but the safeguard also makes things worse. That's why I had to change the approach. Before reaching this approach, I fixed the issue by simply discarding an invalid next renewal date (e.g. one earlier than now), but that really felt like just adding a workaround. (That workaround also can't handle the new case I left in the comment above.) And given we're here, I'd like to collect some ideas to make this "concrete" while tolerating the known cases of broken assumptions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR edited a comment on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727120756 I was writing a wall of text and Chrome happily (?) killed itself. Rewriting it. What I really asked you to do is exactly this. The same practice also happened the first time I made a contribution to Apache Spark. If you want, I can find that too. https://github.com/apache/spark/pull/30139#issuecomment-727103124 I'm really unhappy that you're now saying I'm framing the Apache Spark community in a negative way. I was never talking about the community here. Let me quote my first comment again — where is the strong voice, the attacking, the making things negative, or any mention of something representing the community? > (I hope the PR can wait for approval from the reviewers who left valid review comments. I understand the PR can be merged "technically", but that doesn't seem to be good practice. If in doubt about whether reviewers are inactive, we can at least wait for a day to account for timezone differences.) This comment explains exactly the same thing as the link I gave for reference. Can you find any difference? I didn't even say it that strongly. A PMC member is just "one of" the PMC members. Until you build consensus among PMC members and construct BYLAWS, you can't frame me as one guy complaining about the Spark community. The practice is not something "everyone" believes in, so questioning the practice shouldn't be interpreted that way. Makes sense? For sure, I mentioned -1 as a sort of joke, but what you asked me to do — blocking the merger from merging at any time — is "technically" the same thing. It's just a "representational" difference, and for me it's also not a welcome one if I always have to tell the merger to hold off before I approve. And this suggestion is your own, not from consensus. But overall, I have to admit I really did one bad thing: trying to lecture someone about something that isn't set down in any BYLAWS, regardless of the intention. I should avoid doing that and raise a discussion on the dev mailing list, as I have done before. My apologies. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR commented on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727121487 And I also admit I have a different view on post-review. I agree that post-review opens the possibility for reviewers who weren't active during the pre-merge review period to review later, but IMHO it's an over-use of post-review if it's used to give the merger a rationale to cut the review period short at any time, leaving other reviewers feeling **disregarded**. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #30210: [SPARK-33259][SS] Disable streaming query with possible correctness issue by default
HeartSaVioR commented on pull request #30210: URL: https://github.com/apache/spark/pull/30210#issuecomment-727120756 I was writing a wall of text and Chrome happily (?) killed itself. Rewriting it. What I really asked you to do is exactly this. The same practice also happened the first time I made a contribution to Apache Spark. If you want, I can find that too. https://github.com/apache/spark/pull/30139#issuecomment-727103124 I'm really unhappy that you're now saying I'm framing the Apache Spark community in a negative way. I was never talking about the community here. Let me quote my comment again — where is the strong voice, the attacking, the making things negative, or any mention of something representing the community? > (I hope the PR can wait for approval from the reviewers who left valid review comments. I understand the PR can be merged "technically", but that doesn't seem to be good practice. If in doubt about whether reviewers are inactive, we can at least wait for a day to account for timezone differences.) This comment explains exactly the same thing as the link I gave for reference. Can you find any difference? I didn't even say it that strongly. A PMC member is just "one of" the PMC members. Until you build consensus among PMC members and construct BYLAWS, you can't frame me as one guy complaining about the Spark community. The practice is not something "everyone" believes in, so questioning the practice shouldn't be interpreted that way. Makes sense? For sure, I mentioned -1 as a sort of joke, but what you asked me to do — blocking the merger from merging at any time — is "technically" the same thing. It's just a "representational" difference, and for me it's also not a welcome one if I always have to tell the merger to hold off before I approve. And this suggestion is your own, not from consensus. But overall, I have to admit I really did one bad thing: trying to lecture someone about something that isn't set down in any BYLAWS, regardless of the intention. I should avoid doing that and raise a discussion on the dev mailing list, as I have done before. My apologies. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30371: [SPARK-33337][SQL][FOLLOWUP] Prevent possible flakyness in SubexpressionEliminationSuite
AmplabJenkins removed a comment on pull request #30371: URL: https://github.com/apache/spark/pull/30371#issuecomment-727107898 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #30366: [SPARK-33440][CORE] Separate the calculation for the next renewal date for each delegation token
HeartSaVioR commented on pull request #30366: URL: https://github.com/apache/spark/pull/30366#issuecomment-727108543 @dongjoon-hyun Thanks for the review. Given that @tgravescs is revisiting the logic (and considering the comments above as well), I expect the code will change afterwards. I'll defer addressing the review comments until we agree on the direction and the PR reflects that consensus. In the meantime I'll look into how to add the test. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30371: [SPARK-33337][SQL][FOLLOWUP] Prevent possible flakyness in SubexpressionEliminationSuite
AmplabJenkins commented on pull request #30371: URL: https://github.com/apache/spark/pull/30371#issuecomment-727107898 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30371: [SPARK-33337][SQL][FOLLOWUP] Prevent possible flakyness in SubexpressionEliminationSuite
SparkQA removed a comment on pull request #30371: URL: https://github.com/apache/spark/pull/30371#issuecomment-727018690 **[Test build #131084 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131084/testReport)** for PR 30371 at commit [`9816402`](https://github.com/apache/spark/commit/9816402ffddaf9d7c28a5d734ef4411728089dd2). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30371: [SPARK-33337][SQL][FOLLOWUP] Prevent possible flakyness in SubexpressionEliminationSuite
SparkQA commented on pull request #30371: URL: https://github.com/apache/spark/pull/30371#issuecomment-727107533 **[Test build #131084 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131084/testReport)** for PR 30371 at commit [`9816402`](https://github.com/apache/spark/commit/9816402ffddaf9d7c28a5d734ef4411728089dd2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #30366: [SPARK-33440][CORE] Separate the calculation for the next renewal date for each delegation token
HeartSaVioR edited a comment on pull request #30366: URL: https://github.com/apache/spark/pull/30366#issuecomment-727107472 @tgravescs Thanks for the comment. > I thought the way we did it was just got the earliest renewal so we didn't have to track all them individually because the common case is renewal only happens in hours time frame - like once every 24 hours but its obviously configurable. Yes, that's configurable, but I agree it normally happens on an hours-scale time frame. The problematic token also produced an expiration time 7 days out, even longer than the Hadoop delegation token's. The problem is simply that Spark assumes a token identifier has a valid issue date, whereas that isn't "guaranteed" for every implementation. Once that precondition is broken, the calculation goes completely off in an unintended way. There is a safeguard, but the safeguard also makes things worse. That's why I had to change the approach. Before reaching this approach, I fixed the issue by simply discarding an invalid next renewal date (e.g. one earlier than now), but that really felt like just adding a workaround. (That workaround also can't handle the new case I left in the comment above.) And given we're here, I'd like to collect some ideas to make this "concrete" while tolerating the known cases of broken assumptions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
AmplabJenkins removed a comment on pull request #29414: URL: https://github.com/apache/spark/pull/29414#issuecomment-727107235 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #30366: [SPARK-33440][CORE] Separate the calculation for the next renewal date for each delegation token
HeartSaVioR commented on pull request #30366: URL: https://github.com/apache/spark/pull/30366#issuecomment-727107472 @tgravescs Thanks for the comment. > I thought the way we did it was just got the earliest renewal so we didn't have to track all them individually because the common case is renewal only happens in hours time frame - like once every 24 hours but its obviously configurable. Yes, that's configurable, but I agree renewal normally happens on a time frame of hours. The problematic token, however, reported an expiration time 7 days out, even longer than a Hadoop delegation token's. That is simply because Spark assumes every token identifier carries a valid issue date, which is not guaranteed for every implementation. Once that precondition is broken, the calculation goes in an unintended direction, and the existing safeguard only makes things worse. That's why I had to change the approach. Before reaching this approach, I fixed the issue by simply discarding any next renewal date earlier than now, but that really felt like just adding a workaround. (That workaround also can't handle the new case I described in the comment above.) And given we're here, I'd like to collect ideas on making this calculation robust while tolerating the known cases where the assumption is broken. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
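A minimal sketch of the per-token idea discussed above, assuming Hadoop's `Token` / `AbstractDelegationTokenIdentifier` API; the helper names and the 0.75 renewal ratio are illustrative assumptions, not the PR's actual code:

```scala
import org.apache.hadoop.security.token.Token
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier

// Compute a next renewal time for a single token, guarding against identifiers
// whose issue date is missing or invalid instead of trusting it blindly.
def nextRenewalTime(
    token: Token[_ <: AbstractDelegationTokenIdentifier],
    renewalRatio: Double = 0.75): Option[Long] = {
  val id = token.decodeIdentifier()
  val issueDate = id.getIssueDate
  val maxDate = id.getMaxDate
  if (issueDate > 0 && maxDate > issueDate) {
    // Normal case: renew after a fraction of the token's lifetime.
    Some(issueDate + (renewalRatio * (maxDate - issueDate)).toLong)
  } else {
    // Broken precondition: skip this token rather than derive a bogus date.
    None
  }
}

// Track each token separately, keep only renewal times that are still in the
// future, and schedule on the earliest remaining one.
def earliestValidRenewal(
    tokens: Seq[Token[_ <: AbstractDelegationTokenIdentifier]]): Option[Long] = {
  val now = System.currentTimeMillis()
  val candidates = tokens.flatMap(t => nextRenewalTime(t)).filter(_ > now)
  if (candidates.isEmpty) None else Some(candidates.min)
}
```

This only illustrates the shape of the calculation; the actual change in the PR, per its title, separates the next-renewal calculation per delegation token rather than collapsing everything to a single earliest date.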
[GitHub] [spark] SparkQA commented on pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
SparkQA commented on pull request #29414: URL: https://github.com/apache/spark/pull/29414#issuecomment-727107227 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35689/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29414: [SPARK-32106][SQL] Implement script transform in sql/core
AmplabJenkins commented on pull request #29414: URL: https://github.com/apache/spark/pull/29414#issuecomment-727107235 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30373: [SPARK-33183][SQL][FOLLOW-UP] Adjust RemoveRedundantSorts rule order and update config version
AmplabJenkins removed a comment on pull request #30373: URL: https://github.com/apache/spark/pull/30373#issuecomment-727106355 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30373: [SPARK-33183][SQL][FOLLOW-UP] Adjust RemoveRedundantSorts rule order and update config version
AmplabJenkins commented on pull request #30373: URL: https://github.com/apache/spark/pull/30373#issuecomment-727106355 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30373: [SPARK-33183][SQL][FOLLOW-UP] Adjust RemoveRedundantSorts rule order and update config version
SparkQA commented on pull request #30373: URL: https://github.com/apache/spark/pull/30373#issuecomment-727106353 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35688/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] warrenzhu25 commented on pull request #30370: [SPARK-33446][CORE] Add config spark.executor.coresOverhead
warrenzhu25 commented on pull request #30370: URL: https://github.com/apache/spark/pull/30370#issuecomment-727105908 > I am not sure I follow. > > * If you want an executor with 2 cores and 6 gb, you can allocate them with existing configs. > * If you want an executor with 1 core and 6 gb, you can do the same - if underlying cluster allocates in 3gb/1core block, 1 core will be wasted - which is by requirement here. > * If you want 1 core and 6 gb - then setting core overhead will not help - since it is an underlying cluster issue that you cant get this : if I understood it properly (and extrapolating based on what yarn used to do). > > In other words, spark application makes the request for what it needs - cluster provides the next higher multiple which satisfies the requirements. If this means extra memory or additional cores, spark wont use it. > > I am trying to understand what is the actual usecase we have, and how the existing configs wont help. Thanks ! I want an executor with 2 cores and 6 GB, but with only 1 core used for task allocation, so at most 1 task can run on that executor. With the existing configs, there could be up to 2 tasks. I want to give each task 6 GB of memory while keeping extra CPU time available for GC and other work. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
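As a hedged sketch of how the described use case might be approximated with only the existing configs (not the proposed `spark.executor.coresOverhead`), each task can be made to claim all of an executor's cores via `spark.task.cpus`, which caps the executor at one concurrent task:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: a 2-core / 6 GB executor where each task requires 2 CPUs,
// so at most one task runs per executor and the second core stays free for
// GC and other JVM work. In cluster deployments these settings would normally
// be passed at submit time rather than in application code.
val spark = SparkSession.builder()
  .appName("one-task-per-executor-sketch")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "6g")
  .config("spark.task.cpus", "2")
  .getOrCreate()
```

This is only an approximation: it raises the per-task CPU requirement seen by the scheduler rather than requesting extra cores as pure overhead from the cluster manager, which, as I read the PR, is what the proposed config is meant to express.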
[GitHub] [spark] mridulm edited a comment on pull request #30370: [SPARK-33446][CORE] Add config spark.executor.coresOverhead
mridulm edited a comment on pull request #30370: URL: https://github.com/apache/spark/pull/30370#issuecomment-727104096 I am not sure I follow. * If you want an executor with 2 cores and 6 GB, you can allocate it with the existing configs. * If you want an executor with 1 core and 6 GB, you can do the same - if the underlying cluster allocates in 3 GB / 1 core blocks, 1 core will be wasted, which follows from the requirement here. * If you want 1 core and 6 GB, then setting a core overhead will not help, since it is an underlying cluster limitation that you can't get this - if I understood it properly (and extrapolating from what YARN used to do). In other words, the Spark application requests what it needs, and the cluster provides the next higher multiple that satisfies the requirements. If that means extra memory or additional cores, Spark won't use them. I am trying to understand what the actual use case is and why the existing configs won't help. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org