[GitHub] [spark] AmplabJenkins removed a comment on pull request #30412: [SPARK-33480][SQL] Support char/varchar type
AmplabJenkins removed a comment on pull request #30412: URL: https://github.com/apache/spark/pull/30412#issuecomment-733202278 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30465: [SPARK-33045][SQL][FOLLOWUP] Support built-in function like_any and fix StackOverflowError issue.
AmplabJenkins commented on pull request #30465: URL: https://github.com/apache/spark/pull/30465#issuecomment-733202287 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30440: [SPARK-33496][SQL]Improve error message of ANSI explicit cast
AmplabJenkins commented on pull request #30440: URL: https://github.com/apache/spark/pull/30440#issuecomment-733202292 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30412: [SPARK-33480][SQL] Support char/varchar type
AmplabJenkins commented on pull request #30412: URL: https://github.com/apache/spark/pull/30412#issuecomment-733202282 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode
AmplabJenkins commented on pull request #29000: URL: https://github.com/apache/spark/pull/29000#issuecomment-733202296 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30403: [SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables
AmplabJenkins commented on pull request #30403: URL: https://github.com/apache/spark/pull/30403#issuecomment-733202277 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30478: [SPARK-33525][SQL] Update hive-service-rpc to 3.1.2
AmplabJenkins commented on pull request #30478: URL: https://github.com/apache/spark/pull/30478#issuecomment-733202279 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xkrogen commented on a change in pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path
xkrogen commented on a change in pull request #29966: URL: https://github.com/apache/spark/pull/29966#discussion_r529838764 ## File path: core/src/main/scala/org/apache/spark/util/Utils.scala ## @@ -2980,6 +2980,75 @@ private[spark] object Utils extends Logging { metadata.toString } + /** + * Download Ivy URIs dependent jars. + * + * @param uri Ivy uri need to be downloaded. + * @return Comma separated string list of URIs of downloaded jars + */ + def resolveMavenDependencies(uri: URI): String = { +val Seq(repositories, ivyRepoPath, ivySettingsPath) = + Seq( +"spark.jars.repositories", +"spark.jars.ivy", +"spark.jars.ivySettings" + ).map(sys.props.get(_).orNull) +// Create the IvySettings, either load from file or build defaults +val ivySettings = Option(ivySettingsPath) match { + case Some(path) => +SparkSubmitUtils.loadIvySettings(path, Option(repositories), Option(ivyRepoPath)) Review comment: Let's not use default Ivy settings... In my experience with some custom logic we have, it's very valuable to ensure that all of the Ivy resolution obeys the settings. Maybe we can pull this out into a common utility that can be leveraged here and `DriverWrapper`? Then there is no need for testing twice. But I don't think saying that the logic is tested elsewhere then copy-pasted is sufficient -- if there is drift in the two code paths, we will lose the validation. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xkrogen commented on a change in pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path
xkrogen commented on a change in pull request #29966: URL: https://github.com/apache/spark/pull/29966#discussion_r529837613 ## File path: core/src/main/scala/org/apache/spark/util/Utils.scala ## @@ -2980,6 +2980,75 @@ private[spark] object Utils extends Logging { metadata.toString } + /** + * Download Ivy URIs dependent jars. + * + * @param uri Ivy uri need to be downloaded. + * @return Comma separated string list of URIs of downloaded jars + */ + def resolveMavenDependencies(uri: URI): String = { +val Seq(repositories, ivyRepoPath, ivySettingsPath) = + Seq( +"spark.jars.repositories", +"spark.jars.ivy", +"spark.jars.ivySettings" + ).map(sys.props.get(_).orNull) +// Create the IvySettings, either load from file or build defaults +val ivySettings = Option(ivySettingsPath) match { + case Some(path) => +SparkSubmitUtils.loadIvySettings(path, Option(repositories), Option(ivyRepoPath)) Review comment: Let's not use default Ivy settings... In my experience with some custom logic we have, it's very valuable to ensure that all of the Ivy resolution obeys the settings. Maybe we can pull this out into a common utility that can be leveraged here and `DriverWrapper`? Then there is no need for testing twice. But I don't think saying that the logic is tested elsewhere then copy-pasted is sufficient -- if there is drift in the two code paths, we will lose the validation. ## File path: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ## @@ -1348,6 +1348,7 @@ private[spark] object SparkSubmitUtils { coordinates: String, ivySettings: IvySettings, exclusions: Seq[String] = Nil, + transitive: Boolean = true, Review comment: Nit: Add Scaladoc for this new parameter ## File path: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala ## @@ -159,6 +161,13 @@ class SessionResourceLoader(session: SparkSession) extends FunctionResourceLoade } } + protected def resolveJars(path: String): List[String] = { +new Path(path).toUri.getScheme match { Review comment: Shouldn't we use `URI.create(path)` a single time, then re-use the URI in this line and the one below? I also think I remember you mentioning elsewhere that `new Path(path).toUri` can lose information. ## File path: core/src/main/scala/org/apache/spark/util/Utils.scala ## @@ -2980,6 +2980,77 @@ private[spark] object Utils extends Logging { metadata.toString } + /** + * Download Ivy URIs dependent jars. + * + * @param uri Ivy uri need to be downloaded. + * @return Comma separated string list of URIs of downloaded jars + */ + def resolveMavenDependencies(uri: URI): String = { +val Seq(repositories, ivyRepoPath, ivySettingsPath) = + Seq( +"spark.jars.repositories", +"spark.jars.ivy", +"spark.jars.ivySettings" + ).map(sys.props.get(_).orNull) +// Create the IvySettings, either load from file or build defaults +val ivySettings = Option(ivySettingsPath) match { + case Some(path) => +SparkSubmitUtils.loadIvySettings(path, Option(repositories), Option(ivyRepoPath)) + + case None => +SparkSubmitUtils.buildIvySettings(Option(repositories), Option(ivyRepoPath)) +} +SparkSubmitUtils.resolveMavenCoordinates(uri.getAuthority, ivySettings, + parseExcludeList(uri.getQuery), parseTransitive(uri.getQuery)) + } + + private def parseURLQueryParameter(queryString: String, queryTag: String): Array[String] = { +if (queryString == null || queryString.isEmpty) { + Array.empty[String] +} else { + val mapTokens = queryString.split("&") + assert(mapTokens.forall(_.split("=").length == 2), "Invalid query string: " + queryString) Review comment: IIRC this will accept URLs that looks like `?=foo`, `?foo=`, or `?bar=&baz=foo`. It would be good to add tests for this to confirm, and adjust as necessary. Same comment for `parseExcludeList` below. ## File path: core/src/main/scala/org/apache/spark/util/Utils.scala ## @@ -2980,6 +2980,77 @@ private[spark] object Utils extends Logging { metadata.toString } + /** + * Download Ivy URIs dependent jars. + * + * @param uri Ivy uri need to be downloaded. + * @return Comma separated string list of URIs of downloaded jars Review comment: Should we be returning a `List[String]` instead of `String` (here and in `SparkSubmitUtils`)? It seems odd to have `SparkSubmitUtils` do a `mkString` to convert a list to string, then re-convert back to a list later. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please
[GitHub] [spark] otterc commented on a change in pull request #30433: [SPARK-32916][SHUFFLE][test-maven][test-hadoop2.7] Ensure the number of chunks in meta file and index file are equal
otterc commented on a change in pull request #30433: URL: https://github.com/apache/spark/pull/30433#discussion_r529837473 ## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java ## @@ -827,13 +833,16 @@ void resetChunkTracker() { void updateChunkInfo(long chunkOffset, int mapIndex) throws IOException { long idxStartPos = -1; try { -// update the chunk tracker to meta file before index file -writeChunkTracker(mapIndex); idxStartPos = indexFile.getFilePointer(); logger.trace("{} shuffleId {} reduceId {} updated index current {} updated {}", appShuffleId.appId, appShuffleId.shuffleId, reduceId, this.lastChunkOffset, chunkOffset); -indexFile.writeLong(chunkOffset); +indexFile.write(Longs.toByteArray(chunkOffset)); +// Chunk bitmap should be written to the meta file after the index file because if there are +// any exceptions during writing the offset to the index file, meta file should not be +// updated. If the update to the index file is successful but the update to meta file isn't +// then the index file position is reset in the catch clause. +writeChunkTracker(mapIndex); Review comment: > Besides, I'm thinking it would be good if we could dynamically change the merger location when such IO exception happens. Which one of these do you mean? 1. The executor should start pushing to a different server for the current shuffle that is going on. 2. Do you mean that for a future shuffle, the driver should decide the merger locations considering these IOExceptions. (1) is currently difficult. The driver decides the specific merger locations before a shuffle stage and all the executors work on the assumption that these are all the shuffle servers that they push the data to. This list of shuffle servers should be consistent across the shuffle servers because if it's not then they will be push the shuffle data belonging to a reducer to different shuffle servers. This is the code I am talking about- [code](https://github.com/linkedin/spark/blob/002cb69e8ddf49bbe114744b84c46e0fd452d852/core/src/main/scala/org/apache/spark/shuffle/ShuffleBlockPusher.scala#L360). (2) This could be a future improvement where we also include the metrics for merge reported back to driver when the driver triggers the finalize. The driver could use these metrics when deciding which shuffle server to use for the next push. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #30490: [SPARK-33543][SQL] Migrate SHOW COLUMNS command to use UnresolvedTableOrView to resolve the identifier
imback82 commented on a change in pull request #30490: URL: https://github.com/apache/spark/pull/30490#discussion_r529836248 ## File path: sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala ## @@ -467,25 +467,13 @@ class ResolveSessionCatalog( v1TableName.asTableIdentifier, partitionSpec) -case ShowColumnsStatement(tbl, ns) => - if (ns.isDefined && ns.get.length > 1) { -throw new AnalysisException( - s"Namespace name should have only one part if specified: ${ns.get.quoted}") - } Review comment: Note that this is not hit anymore now that namespace is always added to the table name in `AstBuilder` so that v2 tables can be resolved. If there are multi parts in the namespace name, it will fail while looking up in the session catalog: `The namespace in session catalog must have exactly one name part`. I added a test for this below. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 opened a new pull request #30490: [SPARK-33543][SQL] Migrate SHOW COLUMNS command to use UnresolvedTableOrView to resolve the identifier
imback82 opened a new pull request #30490: URL: https://github.com/apache/spark/pull/30490 ### What changes were proposed in this pull request? This PR proposes to migrate `SHOW COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `SHOW COLUMNS` is not yet supported for v2 tables. ### Why are the changes needed? To use `UnresolvedTableOrView` for table/view resolution. Note that `ShowColumnsCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existing tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on a change in pull request #30312: [SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle blocks after successful map task completion
otterc commented on a change in pull request #30312: URL: https://github.com/apache/spark/pull/30312#discussion_r529831878 ## File path: core/src/main/scala/org/apache/spark/shuffle/ShuffleWriter.scala ## @@ -17,18 +17,466 @@ package org.apache.spark.shuffle -import java.io.IOException +import java.io.{File, IOException} +import java.net.ConnectException +import java.nio.ByteBuffer +import java.util.concurrent.ExecutorService +import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue} + +import com.google.common.base.Throwables + +import org.apache.spark.{ShuffleDependency, SparkConf, SparkEnv} +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config._ +import org.apache.spark.launcher.SparkLauncher +import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, ManagedBuffer, NioManagedBuffer} +import org.apache.spark.network.netty.SparkTransportConf +import org.apache.spark.network.shuffle.{BlockFetchingListener, BlockStoreClient} +import org.apache.spark.network.shuffle.ErrorHandler.BlockPushErrorHandler +import org.apache.spark.network.util.TransportConf import org.apache.spark.scheduler.MapStatus +import org.apache.spark.shuffle.ShuffleWriter._ +import org.apache.spark.storage.{BlockId, BlockManagerId, ShufflePushBlockId} +import org.apache.spark.util.{ThreadUtils, Utils} /** - * Obtained inside a map task to write out records to the shuffle system. + * Obtained inside a map task to write out records to the shuffle system, and optionally + * initiate the block push process to remote shuffle services if push based shuffle is enabled. */ -private[spark] abstract class ShuffleWriter[K, V] { +private[spark] abstract class ShuffleWriter[K, V] extends Logging { + private[this] var maxBytesInFlight = 0L + private[this] var maxReqsInFlight = 0 + private[this] var maxBlocksInFlightPerAddress = 0 + private[this] var bytesInFlight = 0L + private[this] var reqsInFlight = 0 + private[this] val numBlocksInFlightPerAddress = new HashMap[BlockManagerId, Int]() + private[this] val deferredPushRequests = new HashMap[BlockManagerId, Queue[PushRequest]]() + private[this] val pushRequests = new Queue[PushRequest] + private[this] val errorHandler = createErrorHandler() + private[this] val unreachableBlockMgrs = new HashSet[BlockManagerId]() + /** Write a sequence of records to this task's output */ @throws[IOException] def write(records: Iterator[Product2[K, V]]): Unit /** Close this writer, passing along whether the map completed */ def stop(success: Boolean): Option[MapStatus] + + def getPartitionLengths(): Array[Long] + + // VisibleForTesting + private[shuffle] def createErrorHandler(): BlockPushErrorHandler = { +new BlockPushErrorHandler() { + // For a connection exception against a particular host, we will stop pushing any + // blocks to just that host and continue push blocks to other hosts. So, here push of + // all blocks will only stop when it is "Too Late". Also see updateStateAndCheckIfPushMore. + override def shouldRetryError(t: Throwable): Boolean = { +// If the block is too late, there is no need to retry it + !Throwables.getStackTraceAsString(t).contains(BlockPushErrorHandler.TOO_LATE_MESSAGE_SUFFIX) + } +} + } + + /** + * Initiate the block push process. This will be invoked after the shuffle writer + * finishes writing the shuffle file if push based shuffle is enabled. + * + * @param resolver block resolver used to locate mapper generated shuffle file + * @param partitionLengths array of shuffle block size so we can tell shuffle block + * boundaries within the shuffle file + * @param dep shuffle dependency to get shuffle ID and the location of remote shuffle + * services to push local shuffle blocks + * @param partitionId map index of the shuffle map task + * @param mapIdmapId of the shuffle map task + * @param conf spark configuration + */ + def initiateBlockPush( + resolver: IndexShuffleBlockResolver, + partitionLengths: Array[Long], + dep: ShuffleDependency[_, _, _], + partitionId: Int, + mapId: Long, + conf: SparkConf): Unit = { +val numPartitions = dep.partitioner.numPartitions +val dataFile = resolver.getDataFile(dep.shuffleId, mapId) +val transportConf = SparkTransportConf.fromSparkConf(conf, "shuffle") + +val maxBlockSizeToPush = conf.get(PUSH_SHUFFLE_MAX_BLOCK_SIZE_TO_PUSH) * 1024 +val maxBlockBatchSize = conf.get(PUSH_SHUFFLE_MAX_BLOCK_BATCH_SIZE) * 1024 * 1024 +val mergerLocs = dep.getMergerLocs.map(loc => + BlockManagerId("", loc.host, loc.port)) + +maxBytesInFlight = conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024 +maxReqsInFlight = conf.getInt("spark.reducer.maxReqsInFlight
[GitHub] [spark] otterc commented on a change in pull request #30312: [SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle blocks after successful map task completion
otterc commented on a change in pull request #30312: URL: https://github.com/apache/spark/pull/30312#discussion_r529831293 ## File path: core/src/main/scala/org/apache/spark/internal/config/package.scala ## @@ -1992,4 +1992,32 @@ package object config { .version("3.1.0") .doubleConf .createWithDefault(5) + + private[spark] val SHUFFLE_NUM_PUSH_THREADS = +ConfigBuilder("spark.shuffle.push.numPushThreads") + .doc("Specify the number of threads in the block pusher pool. These threads assist " + +"in creating connections and pushing blocks to remote shuffle services when push based " + +"shuffle is enabled. By default, the threadpool size is equal to the number of cores.") + .version("3.1.0") + .intConf + .createOptional + + private[spark] val SHUFFLE_MAX_BLOCK_SIZE_TO_PUSH = +ConfigBuilder("spark.shuffle.push.maxBlockSizeToPush") + .doc("The max size of an individual block to push to the remote shuffle services when push " + +"based shuffle is enabled. Blocks larger than this threshold are not pushed.") + .version("3.1.0") + .bytesConf(ByteUnit.KiB) + .createWithDefaultString("800k") + + private[spark] val SHUFFLE_MAX_BLOCK_BATCH_SIZE_FOR_PUSH = +ConfigBuilder("spark.shuffle.push.maxBlockBatchSize") + .doc("The max size of a batch of shuffle blocks to be grouped into a single push request " + +"when push based shuffle is enabled.") Review comment: done ## File path: core/src/main/scala/org/apache/spark/internal/config/package.scala ## @@ -1992,4 +1992,32 @@ package object config { .version("3.1.0") .doubleConf .createWithDefault(5) + + private[spark] val SHUFFLE_NUM_PUSH_THREADS = +ConfigBuilder("spark.shuffle.push.numPushThreads") + .doc("Specify the number of threads in the block pusher pool. These threads assist " + +"in creating connections and pushing blocks to remote shuffle services when push based " + +"shuffle is enabled. By default, the threadpool size is equal to the number of cores.") + .version("3.1.0") + .intConf + .createOptional + + private[spark] val SHUFFLE_MAX_BLOCK_SIZE_TO_PUSH = +ConfigBuilder("spark.shuffle.push.maxBlockSizeToPush") + .doc("The max size of an individual block to push to the remote shuffle services when push " + +"based shuffle is enabled. Blocks larger than this threshold are not pushed.") Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on a change in pull request #30312: [SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle blocks after successful map task completion
otterc commented on a change in pull request #30312: URL: https://github.com/apache/spark/pull/30312#discussion_r529830847 ## File path: core/src/main/scala/org/apache/spark/shuffle/PushShuffleWriterComponent.scala ## @@ -0,0 +1,464 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.shuffle + +import java.io.File +import java.net.ConnectException +import java.nio.ByteBuffer +import java.util.concurrent.ExecutorService + +import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue} + +import com.google.common.base.Throwables + +import org.apache.spark.{ShuffleDependency, SparkConf, SparkEnv} +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config._ +import org.apache.spark.launcher.SparkLauncher +import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, ManagedBuffer, NioManagedBuffer} +import org.apache.spark.network.netty.SparkTransportConf +import org.apache.spark.network.shuffle.BlockFetchingListener +import org.apache.spark.network.shuffle.ErrorHandler.BlockPushErrorHandler +import org.apache.spark.network.util.TransportConf +import org.apache.spark.shuffle.PushShuffleWriterComponent._ +import org.apache.spark.storage.{BlockId, BlockManagerId, ShufflePushBlockId} +import org.apache.spark.util.{ThreadUtils, Utils} + +/** + * Used for pushing shuffle blocks to remote shuffle services when push shuffle is enabled. + * When push shuffle is enabled, it is created after the shuffle writer finishes writing the shuffle + * file and initiates the block push process. + * + * @param dataFile mapper generated shuffle data file + * @param partitionLengths array of shuffle block size so we can tell shuffle block + * boundaries within the shuffle file + * @param dep shuffle dependency to get shuffle ID and the location of remote shuffle + * services to push local shuffle blocks + * @param partitionId map index of the shuffle map task + * @param conf spark configuration + */ +private[spark] class PushShuffleWriterComponent( Review comment: I have renamed it to `ShuffleBlockPusher`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on a change in pull request #30312: [SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle blocks after successful map task completion
otterc commented on a change in pull request #30312: URL: https://github.com/apache/spark/pull/30312#discussion_r529830296 ## File path: core/src/main/scala/org/apache/spark/shuffle/PushShuffleWriterComponent.scala ## @@ -0,0 +1,464 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.shuffle + +import java.io.File +import java.net.ConnectException +import java.nio.ByteBuffer +import java.util.concurrent.ExecutorService + +import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue} + +import com.google.common.base.Throwables + +import org.apache.spark.{ShuffleDependency, SparkConf, SparkEnv} +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config._ +import org.apache.spark.launcher.SparkLauncher +import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, ManagedBuffer, NioManagedBuffer} +import org.apache.spark.network.netty.SparkTransportConf +import org.apache.spark.network.shuffle.BlockFetchingListener +import org.apache.spark.network.shuffle.ErrorHandler.BlockPushErrorHandler +import org.apache.spark.network.util.TransportConf +import org.apache.spark.shuffle.PushShuffleWriterComponent._ +import org.apache.spark.storage.{BlockId, BlockManagerId, ShufflePushBlockId} +import org.apache.spark.util.{ThreadUtils, Utils} + +/** + * Used for pushing shuffle blocks to remote shuffle services when push shuffle is enabled. + * When push shuffle is enabled, it is created after the shuffle writer finishes writing the shuffle + * file and initiates the block push process. + * + * @param dataFile mapper generated shuffle data file + * @param partitionLengths array of shuffle block size so we can tell shuffle block + * boundaries within the shuffle file + * @param dep shuffle dependency to get shuffle ID and the location of remote shuffle + * services to push local shuffle blocks + * @param partitionId map index of the shuffle map task + * @param conf spark configuration + */ +private[spark] class PushShuffleWriterComponent( +dataFile: File, +partitionLengths: Array[Long], +dep: ShuffleDependency[_, _, _], +partitionId: Int, +conf: SparkConf) extends Logging { + private[this] var maxBytesInFlight = 0L + private[this] var maxReqsInFlight = 0 + private[this] var maxBlocksInFlightPerAddress = 0 + private[this] var bytesInFlight = 0L + private[this] var reqsInFlight = 0 + private[this] val numBlocksInFlightPerAddress = new HashMap[BlockManagerId, Int]() + private[this] val deferredPushRequests = new HashMap[BlockManagerId, Queue[PushRequest]]() + private[this] val pushRequests = new Queue[PushRequest] + private[this] val errorHandler = createErrorHandler() + private[this] val unreachableBlockMgrs = new HashSet[BlockManagerId]() + + // VisibleForTesting + private[shuffle] def createErrorHandler(): BlockPushErrorHandler = { +new BlockPushErrorHandler() { + // For a connection exception against a particular host, we will stop pushing any + // blocks to just that host and continue push blocks to other hosts. So, here push of + // all blocks will only stop when it is "Too Late". Also see updateStateAndCheckIfPushMore. + override def shouldRetryError(t: Throwable): Boolean = { +// If the block is too late, there is no need to retry it + !Throwables.getStackTraceAsString(t).contains(BlockPushErrorHandler.TOO_LATE_MESSAGE_SUFFIX) + } +} + } + + private[shuffle] def initiateBlockPush(): Unit = { +val numPartitions = dep.partitioner.numPartitions +val transportConf = SparkTransportConf.fromSparkConf(conf, "shuffle") + +val maxBlockSizeToPush = conf.get(SHUFFLE_MAX_BLOCK_SIZE_TO_PUSH) * 1024 +val maxBlockBatchSize = conf.get(SHUFFLE_MAX_BLOCK_BATCH_SIZE_FOR_PUSH) * 1024 * 1024 +val mergerLocs = dep.getMergerLocs.map(loc => + BlockManagerId("", loc.host, loc.port)) + +maxBytesInFlight = conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024 +maxReqsInFlight = conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue) +
[GitHub] [spark] SparkQA removed a comment on pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode
SparkQA removed a comment on pull request #29000: URL: https://github.com/apache/spark/pull/29000#issuecomment-733113801 **[Test build #131682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131682/testReport)** for PR 29000 at commit [`85aa12a`](https://github.com/apache/spark/commit/85aa12a618ceadfe510a4f9fc3718a746a1bc357). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30478: [SPARK-33525][SQL] Update hive-service-rpc to 3.1.2
SparkQA removed a comment on pull request #30478: URL: https://github.com/apache/spark/pull/30478#issuecomment-733112768 **[Test build #131679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131679/testReport)** for PR 30478 at commit [`43d90ca`](https://github.com/apache/spark/commit/43d90cafaf0aa4c8c4a355070d5c71008f6f3ea9). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on a change in pull request #30312: [SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle blocks after successful map task completion
otterc commented on a change in pull request #30312: URL: https://github.com/apache/spark/pull/30312#discussion_r529830075 ## File path: core/src/main/scala/org/apache/spark/internal/config/package.scala ## @@ -1992,4 +1992,32 @@ package object config { .version("3.1.0") .doubleConf .createWithDefault(5) + + private[spark] val SHUFFLE_NUM_PUSH_THREADS = +ConfigBuilder("spark.shuffle.push.numPushThreads") + .doc("Specify the number of threads in the block pusher pool. These threads assist " + +"in creating connections and pushing blocks to remote shuffle services when push based " + +"shuffle is enabled. By default, the threadpool size is equal to the number of cores.") + .version("3.1.0") + .intConf + .createOptional + + private[spark] val SHUFFLE_MAX_BLOCK_SIZE_TO_PUSH = +ConfigBuilder("spark.shuffle.push.maxBlockSizeToPush") + .doc("The max size of an individual block to push to the remote shuffle services when push " + +"based shuffle is enabled. Blocks larger than this threshold are not pushed.") + .version("3.1.0") + .bytesConf(ByteUnit.KiB) + .createWithDefaultString("800k") Review comment: @Victsm could you please address this one? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30478: [SPARK-33525][SQL] Update hive-service-rpc to 3.1.2
SparkQA commented on pull request #30478: URL: https://github.com/apache/spark/pull/30478#issuecomment-733189691 **[Test build #131679 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131679/testReport)** for PR 30478 at commit [`43d90ca`](https://github.com/apache/spark/commit/43d90cafaf0aa4c8c4a355070d5c71008f6f3ea9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on a change in pull request #30312: [SPARK-32917][SHUFFLE][CORE] Adds support for executors to push shuffle blocks after successful map task completion
otterc commented on a change in pull request #30312: URL: https://github.com/apache/spark/pull/30312#discussion_r529829788 ## File path: core/src/main/scala/org/apache/spark/shuffle/PushShuffleWriterComponent.scala ## @@ -0,0 +1,464 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.shuffle + +import java.io.File +import java.net.ConnectException +import java.nio.ByteBuffer +import java.util.concurrent.ExecutorService + +import scala.collection.mutable.{ArrayBuffer, HashMap, HashSet, Queue} + +import com.google.common.base.Throwables + +import org.apache.spark.{ShuffleDependency, SparkConf, SparkEnv} +import org.apache.spark.annotation.Since +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config._ +import org.apache.spark.launcher.SparkLauncher +import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, ManagedBuffer, NioManagedBuffer} +import org.apache.spark.network.netty.SparkTransportConf +import org.apache.spark.network.shuffle.BlockFetchingListener +import org.apache.spark.network.shuffle.ErrorHandler.BlockPushErrorHandler +import org.apache.spark.network.util.TransportConf +import org.apache.spark.shuffle.PushShuffleWriterComponent._ +import org.apache.spark.storage.{BlockId, BlockManagerId, ShufflePushBlockId} +import org.apache.spark.util.{ThreadUtils, Utils} + +/** + * Used for pushing shuffle blocks to remote shuffle services when push shuffle is enabled. + * When push shuffle is enabled, it is created after the shuffle writer finishes writing the shuffle + * file and initiates the block push process. + * + * @param dataFile mapper generated shuffle data file + * @param partitionLengths array of shuffle block size so we can tell shuffle block + * boundaries within the shuffle file + * @param dep shuffle dependency to get shuffle ID and the location of remote shuffle + * services to push local shuffle blocks + * @param partitionId map index of the shuffle map task + * @param conf spark configuration + */ +private[spark] class PushShuffleWriterComponent( +dataFile: File, +partitionLengths: Array[Long], +dep: ShuffleDependency[_, _, _], +partitionId: Int, +conf: SparkConf) extends Logging { + private[this] var maxBytesInFlight = 0L + private[this] var maxReqsInFlight = 0 + private[this] var maxBlocksInFlightPerAddress = 0 + private[this] var bytesInFlight = 0L + private[this] var reqsInFlight = 0 + private[this] val numBlocksInFlightPerAddress = new HashMap[BlockManagerId, Int]() + private[this] val deferredPushRequests = new HashMap[BlockManagerId, Queue[PushRequest]]() + private[this] val pushRequests = new Queue[PushRequest] + private[this] val errorHandler = createErrorHandler() + private[this] val unreachableBlockMgrs = new HashSet[BlockManagerId]() + + // VisibleForTesting + private[shuffle] def createErrorHandler(): BlockPushErrorHandler = { +new BlockPushErrorHandler() { + // For a connection exception against a particular host, we will stop pushing any + // blocks to just that host and continue push blocks to other hosts. So, here push of + // all blocks will only stop when it is "Too Late". Also see updateStateAndCheckIfPushMore. + override def shouldRetryError(t: Throwable): Boolean = { +// If the block is too late, there is no need to retry it + !Throwables.getStackTraceAsString(t).contains(BlockPushErrorHandler.TOO_LATE_MESSAGE_SUFFIX) + } +} + } + + private[shuffle] def initiateBlockPush(): Unit = { +val numPartitions = dep.partitioner.numPartitions +val transportConf = SparkTransportConf.fromSparkConf(conf, "shuffle") + +val maxBlockSizeToPush = conf.get(SHUFFLE_MAX_BLOCK_SIZE_TO_PUSH) * 1024 +val maxBlockBatchSize = conf.get(SHUFFLE_MAX_BLOCK_BATCH_SIZE_FOR_PUSH) * 1024 * 1024 +val mergerLocs = dep.getMergerLocs.map(loc => + BlockManagerId("", loc.host, loc.port)) Review comment: done This is an automated message from the Apache Git Service. To respond to the message,
[GitHub] [spark] SparkQA commented on pull request #29000: [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode
SparkQA commented on pull request #29000: URL: https://github.com/apache/spark/pull/29000#issuecomment-733189434 **[Test build #131682 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131682/testReport)** for PR 29000 at commit [`85aa12a`](https://github.com/apache/spark/commit/85aa12a618ceadfe510a4f9fc3718a746a1bc357). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #30413: [SPARK-33252][PYTHON][DOCS] Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
zero323 commented on a change in pull request #30413: URL: https://github.com/apache/spark/pull/30413#discussion_r529829185 ## File path: python/pyspark/mllib/util.py ## @@ -273,24 +317,30 @@ def convertVectorColumnsFromML(dataset, *cols): return callMLlibFunc("convertVectorColumnsFromML", dataset, list(cols)) @staticmethod -@since("2.0.0") def convertMatrixColumnsToML(dataset, *cols): """ Converts matrix columns in an input DataFrame from the :py:class:`pyspark.mllib.linalg.Matrix` type to the new :py:class:`pyspark.ml.linalg.Matrix` type under the `spark.ml` package. -:param dataset: - input dataset -:param cols: - a list of matrix columns to be converted. - New matrix columns will be ignored. If unspecified, all old - matrix columns will be converted excepted nested ones. -:return: - the input dataset with old matrix columns converted to the - new matrix type +.. versionadded:: 2.0.0 +dataset : :py:class:`pyspark.sql.DataFrame` +input dataset +cols : str Review comment: That's a good question. Looking at [`numpy.ndarray.item`](https://numpy.org/devdocs/reference/generated/numpy.ndarray.item.html) we should make varargs explicit *cols : ... As of type I was thinking `str` as we support variable number of `str` values, but looking at the code, a single `List[str]` would do as well. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rf972 commented on a change in pull request #29695: [SPARK-22390][SPARK-32833][SQL] [WIP]JDBC V2 Datasource aggregate push down
rf972 commented on a change in pull request #29695: URL: https://github.com/apache/spark/pull/29695#discussion_r529825209 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala ## @@ -73,33 +77,25 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] { postScanFilters) Aggregate(groupingExpressions, resultExpressions, plan) } else { -val resultAttributes = resultExpressions.map(_.toAttribute) - .map ( e => e match { case a: AttributeReference => a }) -var index = 0 val aggOutputBuilder = ArrayBuilder.make[AttributeReference] -for (a <- resultAttributes) { - val newName = if (a.name.contains("FILTER")) { -a.name.substring(0, a.name.indexOf("FILTER") - 1) - } else if (a.name.contains("DISTINCT")) { -a.name.replace("DISTINCT ", "") - } else { -a.name - } - - aggOutputBuilder += -a.copy(name = newName, - dataType = aggregates(index).dataType)(exprId = NamedExpression.newExprId, - qualifier = a.qualifier) - index += 1 +for (a <- aggregates) { +aggOutputBuilder += AttributeReference(toPrettySQL(a), a.dataType)() } val aggOutput = aggOutputBuilder.result -var newOutput = aggOutput -for (col <- output) { - if (!aggOutput.exists(_.name.contains(col.name))) { -newOutput = col +: newOutput +val newOutputBuilder = ArrayBuilder.make[AttributeReference] +for (col1 <- output) { + var found = false + for (col2 <- aggOutput) { + if (contains(col2.name, col1.name)) { Review comment: Thanks @huaxingao for incorporating our suggestions into the patch regarding aggregates containing expressions! With your changes, TPCH Q6 test shows a substantial reduction in data transfer. With just filter and projection pushdown, Q6 transfer size is about 1.5 MB. But with aggregate pushdown this now gets reduced to 17 bytes. In our testing we found a case that throws an error. Grouping by more than one column seems to cause an error. Here is an example that fails for us: ``` val df = sparkSession.sql("select BONUS, SUM(SALARY+BONUS), SALARY FROM h2.test.employee" + " GROUP BY SALARY, BONUS") df.show() ``` If you have any thoughts or suggestions on a solution for this, we would appreciate it. Thanks ! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on pull request #30403: [SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables
viirya commented on pull request #30403: URL: https://github.com/apache/spark/pull/30403#issuecomment-733185335 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zero323 commented on a change in pull request #30413: [SPARK-33252][PYTHON][DOCS] Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
zero323 commented on a change in pull request #30413: URL: https://github.com/apache/spark/pull/30413#discussion_r529824599 ## File path: python/pyspark/mllib/stat/_statistics.py ## @@ -65,11 +65,19 @@ def colStats(rdd): """ Computes column-wise summary statistics for the input RDD[Vector]. -:param rdd: an RDD[Vector] for which column-wise summary statistics -are to be computed. -:return: :class:`MultivariateStatisticalSummary` object containing - column-wise summary statistics. - +Parameters +-- +rdd : :py:class:`pyspark.RDD` +an RDD[Vector] for which column-wise summary statistics Review comment: Conveniently, that's syntax we use both for Scala and Python, with corresponding type hints looking like this: https://github.com/apache/spark/blob/048a9821c788b6796d52d1e2a0cd174377ebd0f0/python/pyspark/mllib/stat/_statistics.pyi#L44 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #30482: [SPARK-33529][SQL] Handle '__HIVE_DEFAULT_PARTITION__' while resolving V2 partition specs
MaxGekk commented on pull request #30482: URL: https://github.com/apache/spark/pull/30482#issuecomment-733184440 jenkins, retest this, please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on a change in pull request #30433: [SPARK-32916][SHUFFLE][test-maven][test-hadoop2.7] Ensure the number of chunks in meta file and index file are equal
otterc commented on a change in pull request #30433: URL: https://github.com/apache/spark/pull/30433#discussion_r529823126 ## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java ## @@ -827,13 +833,16 @@ void resetChunkTracker() { void updateChunkInfo(long chunkOffset, int mapIndex) throws IOException { long idxStartPos = -1; try { -// update the chunk tracker to meta file before index file -writeChunkTracker(mapIndex); idxStartPos = indexFile.getFilePointer(); logger.trace("{} shuffleId {} reduceId {} updated index current {} updated {}", appShuffleId.appId, appShuffleId.shuffleId, reduceId, this.lastChunkOffset, chunkOffset); -indexFile.writeLong(chunkOffset); +indexFile.write(Longs.toByteArray(chunkOffset)); +// Chunk bitmap should be written to the meta file after the index file because if there are +// any exceptions during writing the offset to the index file, meta file should not be +// updated. If the update to the index file is successful but the update to meta file isn't +// then the index file position is reset in the catch clause. +writeChunkTracker(mapIndex); Review comment: Had a discussion with @Victsm @mridulm yesterday. This is the approach currently we are thinking of: 1. Verify if `seek` is not recoverable. If it is not then let the clients know to stop pushing and stop merging this partition. 2. Have a threshold on number of `IOExceptions` from writes and when this threshold is reached for a single partition, inform the client to stop pushing and stop merging this partition. 3. When the update to metadata fails, not propagate this exception back to client so that they push the block again. The size of the current chunk may grow but with (2) in place it will still be of a manageable size. These changes will not be that complex. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30403: [SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables
SparkQA removed a comment on pull request #30403: URL: https://github.com/apache/spark/pull/30403#issuecomment-733112980 **[Test build #131681 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131681/testReport)** for PR 30403 at commit [`f22159c`](https://github.com/apache/spark/commit/f22159c1e5c4ea6f6d2838da77d43f6be18a00a5). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30403: [SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables
SparkQA commented on pull request #30403: URL: https://github.com/apache/spark/pull/30403#issuecomment-733183154 **[Test build #131681 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131681/testReport)** for PR 30403 at commit [`f22159c`](https://github.com/apache/spark/commit/f22159c1e5c4ea6f6d2838da77d43f6be18a00a5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TruncateTable(` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30421: [SPARK-33474][SQL] Support TypeConstructed partition spec value
AmplabJenkins removed a comment on pull request #30421: URL: https://github.com/apache/spark/pull/30421#issuecomment-733180683 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30421: [SPARK-33474][SQL] Support TypeConstructed partition spec value
AmplabJenkins commented on pull request #30421: URL: https://github.com/apache/spark/pull/30421#issuecomment-733180683 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30482: [SPARK-33529][SQL] Handle '__HIVE_DEFAULT_PARTITION__' while resolving V2 partition specs
AmplabJenkins removed a comment on pull request #30482: URL: https://github.com/apache/spark/pull/30482#issuecomment-733179070 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30482: [SPARK-33529][SQL] Handle '__HIVE_DEFAULT_PARTITION__' while resolving V2 partition specs
AmplabJenkins commented on pull request #30482: URL: https://github.com/apache/spark/pull/30482#issuecomment-733179070 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
SparkQA removed a comment on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733173446 **[Test build #131691 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131691/testReport)** for PR 30480 at commit [`3723389`](https://github.com/apache/spark/commit/3723389c6d9a3bd1a101cb97bf5e98406dbf6fd6). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
AmplabJenkins removed a comment on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733175455 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jkleckner commented on pull request #30283: [SPARK-24266][K8S][2.4] Restart the watcher when we receive a version changed from k8s
jkleckner commented on pull request #30283: URL: https://github.com/apache/spark/pull/30283#issuecomment-733175795 Thank you, @dongjoon-hyun ! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
SparkQA commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733175437 **[Test build #131691 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131691/testReport)** for PR 30480 at commit [`3723389`](https://github.com/apache/spark/commit/3723389c6d9a3bd1a101cb97bf5e98406dbf6fd6). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
AmplabJenkins commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733175455 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30421: [SPARK-33474][SQL] Support TypeConstructed partition spec value
AmplabJenkins removed a comment on pull request #30421: URL: https://github.com/apache/spark/pull/30421#issuecomment-733174299 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30421: [SPARK-33474][SQL] Support TypeConstructed partition spec value
AmplabJenkins commented on pull request #30421: URL: https://github.com/apache/spark/pull/30421#issuecomment-733174299 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30427: [SPARK-33224][SS] Add watermark gap information into SS UI page
SparkQA commented on pull request #30427: URL: https://github.com/apache/spark/pull/30427#issuecomment-733173573 **[Test build #131692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131692/testReport)** for PR 30427 at commit [`d19fd10`](https://github.com/apache/spark/commit/d19fd10dab7c4fc28d1c4a893a2db74405d4ff9f). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30475: [SPARK-33522][SQL] Improve exception messages while handling UnresolvedTableOrView
AmplabJenkins removed a comment on pull request #30475: URL: https://github.com/apache/spark/pull/30475#issuecomment-733173015 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30468: [SPARK-33518][ML][WIP] Improve performance of ML ALS recommendForAll by GEMV
AmplabJenkins removed a comment on pull request #30468: URL: https://github.com/apache/spark/pull/30468#issuecomment-733173014 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
SparkQA commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733173446 **[Test build #131691 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131691/testReport)** for PR 30480 at commit [`3723389`](https://github.com/apache/spark/commit/3723389c6d9a3bd1a101cb97bf5e98406dbf6fd6). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30470: [SPARK-33495][BUILD] Remove commons-logging.jar's dependency
AmplabJenkins removed a comment on pull request #30470: URL: https://github.com/apache/spark/pull/30470#issuecomment-733173016 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30427: [SPARK-33224][SS] Add watermark gap information into SS UI page
AmplabJenkins removed a comment on pull request #30427: URL: https://github.com/apache/spark/pull/30427#issuecomment-733173017 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30427: [SPARK-33224][SS] Add watermark gap information into SS UI page
AmplabJenkins commented on pull request #30427: URL: https://github.com/apache/spark/pull/30427#issuecomment-733173017 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30475: [SPARK-33522][SQL] Improve exception messages while handling UnresolvedTableOrView
AmplabJenkins commented on pull request #30475: URL: https://github.com/apache/spark/pull/30475#issuecomment-733173015 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30468: [SPARK-33518][ML][WIP] Improve performance of ML ALS recommendForAll by GEMV
AmplabJenkins commented on pull request #30468: URL: https://github.com/apache/spark/pull/30468#issuecomment-733173014 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30470: [SPARK-33495][BUILD] Remove commons-logging.jar's dependency
AmplabJenkins commented on pull request #30470: URL: https://github.com/apache/spark/pull/30470#issuecomment-733173016 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30468: [SPARK-33518][ML][WIP] Improve performance of ML ALS recommendForAll by GEMV
SparkQA removed a comment on pull request #30468: URL: https://github.com/apache/spark/pull/30468#issuecomment-733115921 **[Test build #131684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131684/testReport)** for PR 30468 at commit [`8ca7d56`](https://github.com/apache/spark/commit/8ca7d562c20812062e11e8f6961034157cc08ea8). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30468: [SPARK-33518][ML][WIP] Improve performance of ML ALS recommendForAll by GEMV
SparkQA commented on pull request #30468: URL: https://github.com/apache/spark/pull/30468#issuecomment-733163795 **[Test build #131684 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131684/testReport)** for PR 30468 at commit [`8ca7d56`](https://github.com/apache/spark/commit/8ca7d562c20812062e11e8f6961034157cc08ea8). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #30427: [SPARK-33224][SS] Add watermark gap information into SS UI page
dongjoon-hyun commented on pull request #30427: URL: https://github.com/apache/spark/pull/30427#issuecomment-733156769 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30475: [SPARK-33522][SQL] Improve exception messages while handling UnresolvedTableOrView
SparkQA removed a comment on pull request #30475: URL: https://github.com/apache/spark/pull/30475#issuecomment-733123537 **[Test build #131686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131686/testReport)** for PR 30475 at commit [`68ee277`](https://github.com/apache/spark/commit/68ee277cbde9ecb466b1480af676a2f831e11236). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30475: [SPARK-33522][SQL] Improve exception messages while handling UnresolvedTableOrView
SparkQA commented on pull request #30475: URL: https://github.com/apache/spark/pull/30475#issuecomment-733156509 **[Test build #131686 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131686/testReport)** for PR 30475 at commit [`68ee277`](https://github.com/apache/spark/commit/68ee277cbde9ecb466b1480af676a2f831e11236). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TruncateTable(` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28647: [SPARK-31828][SQL] Retain table properties at CreateTableLikeCommand
AmplabJenkins removed a comment on pull request #28647: URL: https://github.com/apache/spark/pull/28647#issuecomment-733150165 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30479: [WIP][SPARK-33527][SQL] Extend the function of decode so as consistent with mainstream databases
AmplabJenkins removed a comment on pull request #30479: URL: https://github.com/apache/spark/pull/30479#issuecomment-733149740 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30487: [SPARK-33535][INFRA][TESTS] Export LANG to en_US.UTF-8 in run-tests-jenkins script
AmplabJenkins removed a comment on pull request #30487: URL: https://github.com/apache/spark/pull/30487#issuecomment-733149187 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #30283: [SPARK-24266][K8S][2.4] Restart the watcher when we receive a version changed from k8s
dongjoon-hyun closed pull request #30283: URL: https://github.com/apache/spark/pull/30283 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30283: [SPARK-24266][K8S][2.4] Restart the watcher when we receive a version changed from k8s
dongjoon-hyun commented on a change in pull request #30283: URL: https://github.com/apache/spark/pull/30283#discussion_r529783873 ## File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala ## @@ -42,13 +44,20 @@ private[k8s] trait LoggingPodStatusWatcher extends Watcher[Pod] { */ private[k8s] class LoggingPodStatusWatcherImpl( appId: String, -maybeLoggingInterval: Option[Long]) +maybeLoggingInterval: Option[Long], +waitForCompletion: Boolean) Review comment: ~Please revert this change. This is inconsistent with Apache Spark 3.1 and 3.0.~ ## File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala ## @@ -230,7 +242,9 @@ private[spark] class KubernetesClientApplication extends SparkApplication { val master = KubernetesUtils.parseMasterUrl(sparkConf.get("spark.master")) val loggingInterval = if (waitForAppCompletion) Some(sparkConf.get(REPORT_INTERVAL)) else None -val watcher = new LoggingPodStatusWatcherImpl(kubernetesAppId, loggingInterval) +val watcher = new LoggingPodStatusWatcherImpl(kubernetesAppId, + loggingInterval, + waitForAppCompletion) Review comment: ~Please revert this change. This is inconsistent with Apache Spark 3.1 and 3.0.~ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30283: [SPARK-24266][K8S][2.4] Restart the watcher when we receive a version changed from k8s
dongjoon-hyun commented on a change in pull request #30283: URL: https://github.com/apache/spark/pull/30283#discussion_r529783812 ## File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala ## @@ -230,7 +242,9 @@ private[spark] class KubernetesClientApplication extends SparkApplication { val master = KubernetesUtils.parseMasterUrl(sparkConf.get("spark.master")) val loggingInterval = if (waitForAppCompletion) Some(sparkConf.get(REPORT_INTERVAL)) else None -val watcher = new LoggingPodStatusWatcherImpl(kubernetesAppId, loggingInterval) +val watcher = new LoggingPodStatusWatcherImpl(kubernetesAppId, + loggingInterval, + waitForAppCompletion) Review comment: Please revert this change. This is inconsistent with Apache Spark 3.1 and 3.0. ## File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala ## @@ -42,13 +44,20 @@ private[k8s] trait LoggingPodStatusWatcher extends Watcher[Pod] { */ private[k8s] class LoggingPodStatusWatcherImpl( appId: String, -maybeLoggingInterval: Option[Long]) +maybeLoggingInterval: Option[Long], +waitForCompletion: Boolean) Review comment: Please revert this change. This is inconsistent with Apache Spark 3.1 and 3.0. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28647: [SPARK-31828][SQL] Retain table properties at CreateTableLikeCommand
AmplabJenkins commented on pull request #28647: URL: https://github.com/apache/spark/pull/28647#issuecomment-733150165 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30479: [WIP][SPARK-33527][SQL] Extend the function of decode so as consistent with mainstream databases
AmplabJenkins commented on pull request #30479: URL: https://github.com/apache/spark/pull/30479#issuecomment-733149740 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30487: [SPARK-33535][INFRA][TESTS] Export LANG to en_US.UTF-8 in run-tests-jenkins script
AmplabJenkins commented on pull request #30487: URL: https://github.com/apache/spark/pull/30487#issuecomment-733149187 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30336: [SPARK-33287][SS][UI]Expose state custom metrics information on SS UI
SparkQA commented on pull request #30336: URL: https://github.com/apache/spark/pull/30336#issuecomment-733148850 **[Test build #131690 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131690/testReport)** for PR 30336 at commit [`e2f1594`](https://github.com/apache/spark/commit/e2f15947ef6e97fbcee5f9f328e6e8315a2eea66). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30408: [SPARK-33477][SQL] Hive Metastore should support filter by date type
SparkQA commented on pull request #30408: URL: https://github.com/apache/spark/pull/30408#issuecomment-733148701 **[Test build #131689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131689/testReport)** for PR 30408 at commit [`752eb8d`](https://github.com/apache/spark/commit/752eb8dab286aa78925851bb2ebd0b0ce816def6). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
SparkQA removed a comment on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733146296 **[Test build #131687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131687/testReport)** for PR 30480 at commit [`b9c43c5`](https://github.com/apache/spark/commit/b9c43c5550724970ebc6446cbcbb96d4ab9772ff). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
AmplabJenkins removed a comment on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733147893 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30487: [SPARK-33535][INFRA][TESTS] Export LANG to en_US.UTF-8 in run-tests-jenkins script
SparkQA removed a comment on pull request #30487: URL: https://github.com/apache/spark/pull/30487#issuecomment-733031613 **[Test build #131669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131669/testReport)** for PR 30487 at commit [`5c90411`](https://github.com/apache/spark/commit/5c9041103fff89089ca136e97d8181c359e76b7a). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
AmplabJenkins commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733147893 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
SparkQA commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733147874 **[Test build #131687 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131687/testReport)** for PR 30480 at commit [`b9c43c5`](https://github.com/apache/spark/commit/b9c43c5550724970ebc6446cbcbb96d4ab9772ff). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30487: [SPARK-33535][INFRA][TESTS] Export LANG to en_US.UTF-8 in run-tests-jenkins script
SparkQA commented on pull request #30487: URL: https://github.com/apache/spark/pull/30487#issuecomment-733147816 **[Test build #131669 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131669/testReport)** for PR 30487 at commit [`5c90411`](https://github.com/apache/spark/commit/5c9041103fff89089ca136e97d8181c359e76b7a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30440: [SPARK-33496][SQL]Improve error message of ANSI explicit cast
AmplabJenkins removed a comment on pull request #30440: URL: https://github.com/apache/spark/pull/30440#issuecomment-733146882 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30483: [WIP][SPARK-33449][SQL] Add File Metadata cache support for Parquet and Orc
dongjoon-hyun commented on a change in pull request #30483: URL: https://github.com/apache/spark/pull/30483#discussion_r529778176 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ## @@ -765,6 +765,11 @@ object SQLConf { .booleanConf .createWithDefault(false) + val PARQUET_META_CACHE_ENABLED = buildConf("spark.sql.parquet.metadataCache.enabled") Review comment: Please add `.doc("...")`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30440: [SPARK-33496][SQL]Improve error message of ANSI explicit cast
AmplabJenkins commented on pull request #30440: URL: https://github.com/apache/spark/pull/30440#issuecomment-733146882 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30440: [SPARK-33496][SQL]Improve error message of ANSI explicit cast
SparkQA removed a comment on pull request #30440: URL: https://github.com/apache/spark/pull/30440#issuecomment-732993024 **[Test build #131664 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131664/testReport)** for PR 30440 at commit [`e762162`](https://github.com/apache/spark/commit/e762162311e04c20bb06f9a4735514547050b832). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30440: [SPARK-33496][SQL]Improve error message of ANSI explicit cast
SparkQA commented on pull request #30440: URL: https://github.com/apache/spark/pull/30440#issuecomment-733146375 **[Test build #131664 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131664/testReport)** for PR 30440 at commit [`e762162`](https://github.com/apache/spark/commit/e762162311e04c20bb06f9a4735514547050b832). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
SparkQA commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733146296 **[Test build #131687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131687/testReport)** for PR 30480 at commit [`b9c43c5`](https://github.com/apache/spark/commit/b9c43c5550724970ebc6446cbcbb96d4ab9772ff). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
AmplabJenkins removed a comment on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-732756039 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30478: [SPARK-33525][SQL] Update hive-service-rpc to 3.1.2
SparkQA commented on pull request #30478: URL: https://github.com/apache/spark/pull/30478#issuecomment-733146309 **[Test build #131688 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131688/testReport)** for PR 30478 at commit [`43d90ca`](https://github.com/apache/spark/commit/43d90cafaf0aa4c8c4a355070d5c71008f6f3ea9). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #30408: [SPARK-33477][SQL] Hive Metastore should support filter by date type
wangyum commented on pull request #30408: URL: https://github.com/apache/spark/pull/30408#issuecomment-733146199 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
dongjoon-hyun commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-733145845 ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30480: [SPARK-32921][SHUFFLE][test-maven][test-hadoop2.7] MapOutputTracker extensions to support push-based shuffle
dongjoon-hyun commented on a change in pull request #30480: URL: https://github.com/apache/spark/pull/30480#discussion_r529776436 ## File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ## @@ -1011,4 +1333,47 @@ private[spark] object MapOutputTracker extends Logging { splitsByAddress.mapValues(_.toSeq).iterator } + + /** + * Given a shuffle ID, a partition ID, an array of map statuses, and bitmap corresponding + * to either a merged shuffle partition or a merged shuffle partition chunk, identify + * the metadata about the shuffle partition blocks that are merged into the merged shuffle + * partition or partition chunk represented by the bitmap. + * + * @param shuffleId Identifier for the shuffle + * @param partitionId The partition ID of the MergeStatus for which we look for the metadata + *of the merged shuffle partition blocks + * @param mapStatuses List of map statuses, indexed by map ID + * @param tracker bitmap containing mapIndexes that belong to the merged block or merged + *block chunk. + * @return A sequence of 2-item tuples, where the first item in the tuple is a BlockManagerId, + * and the second item is a sequence of (shuffle block ID, shuffle block size) tuples + * describing the shuffle blocks that are stored at that block manager. + */ + def getMapStatusesForMergeStatus( + shuffleId: Int, + partitionId: Int, + mapStatuses: Array[MapStatus], + tracker: RoaringBitmap): Seq[(BlockManagerId, Seq[(BlockId, Long, Int)])] = { +assert (mapStatuses != null && tracker != null) +val splitsByAddress = new HashMap[BlockManagerId, ListBuffer[(BlockId, Long, Int)]] +for ((status, mapIndex) <- mapStatuses.zipWithIndex) { + // Only add blocks that are merged + if (tracker.contains(mapIndex)) { +MapOutputTracker.validateStatus(status, shuffleId, partitionId) +splitsByAddress.getOrElseUpdate(status.location, ListBuffer()) += + ((ShuffleBlockId(shuffleId, status.mapId, partitionId), +status.getSizeForBlock(partitionId), mapIndex)) + } +} +splitsByAddress.toSeq Review comment: @Victsm . First of all, could you check this place? This seems to break Scala 2.13 compilation. ``` [error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala:1369:21: type mismatch; [error] found : Seq[(org.apache.spark.storage.BlockManagerId, scala.collection.mutable.ListBuffer[(org.apache.spark.storage.BlockId, Long, Int)])] [error] required: Seq[(org.apache.spark.storage.BlockManagerId, Seq[(org.apache.spark.storage.BlockId, Long, Int)])] [error] splitsByAddress.toSeq ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #30413: [SPARK-33252][PYTHON][DOCS] Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
viirya commented on a change in pull request #30413: URL: https://github.com/apache/spark/pull/30413#discussion_r529776310 ## File path: python/pyspark/mllib/util.py ## @@ -313,24 +363,30 @@ def convertMatrixColumnsToML(dataset, *cols): return callMLlibFunc("convertMatrixColumnsToML", dataset, list(cols)) @staticmethod -@since("2.0.0") def convertMatrixColumnsFromML(dataset, *cols): """ Converts matrix columns in an input DataFrame to the :py:class:`pyspark.mllib.linalg.Matrix` type from the new :py:class:`pyspark.ml.linalg.Matrix` type under the `spark.ml` package. -:param dataset: - input dataset -:param cols: - a list of matrix columns to be converted. - Old matrix columns will be ignored. If unspecified, all new - matrix columns will be converted except nested ones. -:return: - the input dataset with new matrix columns converted to the - old matrix type +.. versionadded:: 2.0.0 + +dataset : :py:class:`pyspark.sql.DataFrame` +input dataset +cols : str Review comment: ditto. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #30413: [SPARK-33252][PYTHON][DOCS] Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
viirya commented on a change in pull request #30413: URL: https://github.com/apache/spark/pull/30413#discussion_r529776085 ## File path: python/pyspark/mllib/util.py ## @@ -273,24 +317,30 @@ def convertVectorColumnsFromML(dataset, *cols): return callMLlibFunc("convertVectorColumnsFromML", dataset, list(cols)) @staticmethod -@since("2.0.0") def convertMatrixColumnsToML(dataset, *cols): """ Converts matrix columns in an input DataFrame from the :py:class:`pyspark.mllib.linalg.Matrix` type to the new :py:class:`pyspark.ml.linalg.Matrix` type under the `spark.ml` package. -:param dataset: - input dataset -:param cols: - a list of matrix columns to be converted. - New matrix columns will be ignored. If unspecified, all old - matrix columns will be converted excepted nested ones. -:return: - the input dataset with old matrix columns converted to the - new matrix type +.. versionadded:: 2.0.0 +dataset : :py:class:`pyspark.sql.DataFrame` +input dataset +cols : str Review comment: Is it a str or list of str? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30465: [SPARK-33045][SQL][FOLLOWUP] Support built-in function like_any and fix StackOverflowError issue.
AmplabJenkins removed a comment on pull request #30465: URL: https://github.com/apache/spark/pull/30465#issuecomment-733143722 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30482: [SPARK-33529][SQL] Handle '__HIVE_DEFAULT_PARTITION__' while resolving V2 partition specs
AmplabJenkins removed a comment on pull request #30482: URL: https://github.com/apache/spark/pull/30482#issuecomment-733143719 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30481: [SPARK-33526][SQL] Add config to control if cancel invoke interrupt task on thriftserver
AmplabJenkins removed a comment on pull request #30481: URL: https://github.com/apache/spark/pull/30481#issuecomment-733143723 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30467: [SPARK-32002][SQL]Support ExtractValue from nested ArrayStruct
AmplabJenkins removed a comment on pull request #30467: URL: https://github.com/apache/spark/pull/30467#issuecomment-733143724 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30465: [SPARK-33045][SQL][FOLLOWUP] Support built-in function like_any and fix StackOverflowError issue.
AmplabJenkins commented on pull request #30465: URL: https://github.com/apache/spark/pull/30465#issuecomment-733143722 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30481: [SPARK-33526][SQL] Add config to control if cancel invoke interrupt task on thriftserver
AmplabJenkins commented on pull request #30481: URL: https://github.com/apache/spark/pull/30481#issuecomment-733143723 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30482: [SPARK-33529][SQL] Handle '__HIVE_DEFAULT_PARTITION__' while resolving V2 partition specs
AmplabJenkins commented on pull request #30482: URL: https://github.com/apache/spark/pull/30482#issuecomment-733143719 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30467: [SPARK-32002][SQL]Support ExtractValue from nested ArrayStruct
AmplabJenkins commented on pull request #30467: URL: https://github.com/apache/spark/pull/30467#issuecomment-733143724 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #30336: [SPARK-33287][SS][UI]Expose state custom metrics information on SS UI
gaborgsomogyi commented on a change in pull request #30336: URL: https://github.com/apache/spark/pull/30336#discussion_r529772096 ## File path: sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala ## @@ -199,49 +209,100 @@ private[ui] class StreamingQueryStatisticsPage(parent: StreamingQueryTab) "records") graphUIDataForNumRowsDroppedByWatermark.generateDataJs(jsCollector) - // scalastyle:off - - - -Aggregated Number Of Total State Rows {SparkUIUtils.tooltip("Aggregated number of total state rows.", "right")} - - -{graphUIDataForNumberTotalRows.generateTimelineHtml(jsCollector)} -{graphUIDataForNumberTotalRows.generateHistogramHtml(jsCollector)} - - - - -Aggregated Number Of Updated State Rows {SparkUIUtils.tooltip("Aggregated number of updated state rows.", "right")} - - -{graphUIDataForNumberUpdatedRows.generateTimelineHtml(jsCollector)} -{graphUIDataForNumberUpdatedRows.generateHistogramHtml(jsCollector)} - - - - -Aggregated State Memory Used In Bytes {SparkUIUtils.tooltip("Aggregated state memory used in bytes.", "right")} - - -{graphUIDataForMemoryUsedBytes.generateTimelineHtml(jsCollector)} -{graphUIDataForMemoryUsedBytes.generateHistogramHtml(jsCollector)} - - - - -Aggregated Number Of Rows Dropped By Watermark {SparkUIUtils.tooltip("Accumulates all input rows being dropped in stateful operators by watermark. 'Inputs' are relative to operators.", "right")} - - -{graphUIDataForNumRowsDroppedByWatermark.generateTimelineHtml(jsCollector)} -{graphUIDataForNumRowsDroppedByWatermark.generateHistogramHtml(jsCollector)} - - // scalastyle:on + val result = +// scalastyle:off + + + + Aggregated Number Of Total State Rows {SparkUIUtils.tooltip("Aggregated number of total state rows.", "right")} + + + {graphUIDataForNumberTotalRows.generateTimelineHtml(jsCollector)} + {graphUIDataForNumberTotalRows.generateHistogramHtml(jsCollector)} + + + + + Aggregated Number Of Updated State Rows {SparkUIUtils.tooltip("Aggregated number of updated state rows.", "right")} + + + {graphUIDataForNumberUpdatedRows.generateTimelineHtml(jsCollector)} + {graphUIDataForNumberUpdatedRows.generateHistogramHtml(jsCollector)} + + + + + Aggregated State Memory Used In Bytes {SparkUIUtils.tooltip("Aggregated state memory used in bytes.", "right")} + + + {graphUIDataForMemoryUsedBytes.generateTimelineHtml(jsCollector)} + {graphUIDataForMemoryUsedBytes.generateHistogramHtml(jsCollector)} + + + + + Aggregated Number Of Rows Dropped By Watermark {SparkUIUtils.tooltip("Accumulates all input rows being dropped in stateful operators by watermark. 'Inputs' are relative to operators.", "right")} + + + {graphUIDataForNumRowsDroppedByWatermark.generateTimelineHtml(jsCollector)} + {graphUIDataForNumRowsDroppedByWatermark.generateHistogramHtml(jsCollector)} + +// scalastyle:on + + result ++= generateAggregatedCustomMetrics(query, minBatchTime, maxBatchTime, jsCollector) + result } else { new NodeBuffer() } } + def generateAggregatedCustomMetrics( + query: StreamingQueryUIData, + minBatchTime: Long, + maxBatchTime: Long, + jsCollector: JsCollector): NodeBuffer = { +val result: NodeBuffer = new NodeBuffer + +// This is made sure on caller side but put it here to be defensive +require(query.lastProgress.stateOperators.nonEmpty) +val enabledCustomMetrics = parent.parent.conf.get(ENABLED_STREAMING_UI_CUSTOM_METRIC_LIST) +logDebug(s"Enabled custom metrics: $enabledCustomMetrics") +query.lastProgress.stateOperators.head.customMetrics.keySet().asScala + .filter(enabledCustomMetrics.contains(_)).map { metricName => Review comment: OK, if that's so we I'll add it. I've checked similar list based configs where it's not done like that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For ad
[GitHub] [spark] viirya commented on a change in pull request #30413: [SPARK-33252][PYTHON][DOCS] Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
viirya commented on a change in pull request #30413: URL: https://github.com/apache/spark/pull/30413#discussion_r529771614 ## File path: python/pyspark/mllib/stat/_statistics.py ## @@ -65,11 +65,19 @@ def colStats(rdd): """ Computes column-wise summary statistics for the input RDD[Vector]. -:param rdd: an RDD[Vector] for which column-wise summary statistics -are to be computed. -:return: :class:`MultivariateStatisticalSummary` object containing - column-wise summary statistics. - +Parameters +-- +rdd : :py:class:`pyspark.RDD` +an RDD[Vector] for which column-wise summary statistics Review comment: `RDD[Vector]` looks Scala syntax? How about RDD of Vector? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #30478: [SPARK-33525][SQL] Update hive-service-rpc to 3.1.2
dongjoon-hyun commented on pull request #30478: URL: https://github.com/apache/spark/pull/30478#issuecomment-733141118 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #30429: [SPARK-33492][SQL] DSv2: Append/Overwrite/ReplaceTable should invalidate cache
dongjoon-hyun commented on pull request #30429: URL: https://github.com/apache/spark/pull/30429#issuecomment-733140355 Thank you, @rdblue , @cloud-fan , @sunchao . This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #30413: [SPARK-33252][PYTHON][DOCS] Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
viirya commented on a change in pull request #30413: URL: https://github.com/apache/spark/pull/30413#discussion_r529769785 ## File path: python/pyspark/mllib/regression.py ## @@ -224,11 +234,13 @@ def _regression_train_wrapper(train_func, modelClass, data, initial_weights): class LinearRegressionWithSGD(object): """ +Train a linear regression model with no regularization using Stochastic Gradient Descent. + .. versionadded:: 0.9.0 -.. note:: Deprecated in 2.0.0. Use ml.regression.LinearRegression. +.. deprecated:: 2.0.0. Review comment: 2.0.0. -> 2.0.0 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #30487: [SPARK-33535][INFRA][TESTS] Export LANG to en_US.UTF-8 in run-tests-jenkins script
dongjoon-hyun commented on pull request #30487: URL: https://github.com/apache/spark/pull/30487#issuecomment-733139251 Merged to master/branch-3.0/branch-2.4. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org