[GitHub] [spark] HyukjinKwon closed pull request #35137: [SPARK-37846][TESTS] Fix test logic by getting sleep interval in task
HyukjinKwon closed pull request #35137: URL: https://github.com/apache/spark/pull/35137

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #35137: [SPARK-37846][TESTS] Fix test logic by getting sleep interval in task
HyukjinKwon commented on pull request #35137: URL: https://github.com/apache/spark/pull/35137#issuecomment-1008239365

Merged to master
[GitHub] [spark] mridulm commented on a change in pull request #33446: [SPARK-36215][SHUFFLE] Add logging for slow fetches to diagnose external shuffle service issues
mridulm commented on a change in pull request #33446: URL: https://github.com/apache/spark/pull/33446#discussion_r780737708

## File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

```scala
@@ -334,6 +353,7 @@ final class ShuffleBlockFetcherIterator(
           s"due to Netty OOM, will retry")
       }
       remainingBlocks -= blockId
+      logFetchOnCompletionIfSlow()
```

Review comment: If we are calling `logFetchOnCompletionIfSlow` from here, we should recompute `totalRequestSize` and `blockCount`, since the fetch failed before all blocks were fetched. Essentially, we need to compute the stats based on `infoMap.keys -- remainingBlocks`. Given the failure, `remainingBlocks` would never be empty unless this was the last block, potentially losing out on all suffix requests taking time.
[GitHub] [spark] mridulm commented on a change in pull request #33446: [SPARK-36215][SHUFFLE] Add logging for slow fetches to diagnose external shuffle service issues
mridulm commented on a change in pull request #33446: URL: https://github.com/apache/spark/pull/33446#discussion_r780737343

## File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

```scala
@@ -281,6 +292,13 @@ final class ShuffleBlockFetcherIterator(
     }
   }

+  @inline def logFetchOnCompletionIfSlow(): Unit = {
+    if (remainingBlocks.isEmpty) {
+      logFetchIfSlow(TimeUnit.NANOSECONDS.toMillis(clock.nanoTime() - requestStartTime),
+        infoMap.values.map(_._1).sum, blockIds.size, address)
```

Review comment: Use `req.size` instead. (See more below)
[GitHub] [spark] mridulm commented on a change in pull request #33446: [SPARK-36215][SHUFFLE] Add logging for slow fetches to diagnose external shuffle service issues
mridulm commented on a change in pull request #33446: URL: https://github.com/apache/spark/pull/33446#discussion_r780737343

## File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

```scala
@@ -281,6 +292,13 @@ final class ShuffleBlockFetcherIterator(
     }
   }

+  @inline def logFetchOnCompletionIfSlow(): Unit = {
+    if (remainingBlocks.isEmpty) {
+      logFetchIfSlow(TimeUnit.NANOSECONDS.toMillis(clock.nanoTime() - requestStartTime),
+        infoMap.values.map(_._1).sum, blockIds.size, address)
```

Review comment: Use `req.size` instead.

## File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

```scala
@@ -342,6 +362,7 @@ final class ShuffleBlockFetcherIterator(
         val block = BlockId(blockId)
         if (block.isShuffleChunk) {
           remainingBlocks -= blockId
+          logFetchOnCompletionIfSlow()
```

Review comment: Same here as the other failed case.

## File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

```scala
@@ -291,6 +309,7 @@ final class ShuffleBlockFetcherIterator(
             // This needs to be released after use.
             buf.retain()
             remainingBlocks -= blockId
+            logFetchOnCompletionIfSlow()
             blockOOMRetryCounts.remove(blockId)
             results.put(new SuccessFetchResult(BlockId(blockId), infoMap(blockId)._2,
               address, infoMap(blockId)._1, buf, remainingBlocks.isEmpty))
```

Review comment: Make `remainingBlocks.isEmpty` into a local variable `networkReqDone` and use that both for `SuccessFetchResult` and for whether to invoke `logFetchOnCompletionIfSlow` (then you can remove the `remainingBlocks.isEmpty` check in that method).

## File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

```scala
@@ -334,6 +353,7 @@ final class ShuffleBlockFetcherIterator(
           s"due to Netty OOM, will retry")
       }
       remainingBlocks -= blockId
+      logFetchOnCompletionIfSlow()
```

Review comment: If we are calling `logFetchOnCompletionIfSlow` from here, we should recompute `totalRequestSize` and `blockCount`, since the fetch failed before all blocks were fetched. Essentially, we need to compute the stats based on `infoMap.keys -- remainingBlocks`.
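The accounting mridulm suggests ‒ tracking the remaining blocks per request, deriving a local `networkReqDone`, and computing the slow-fetch stats only over blocks that actually completed ‒ can be sketched as a toy model. This is an illustrative Python sketch, not Spark's Scala implementation; the class name, threshold constant, and return tuple are invented for the example.

```python
import time

SLOW_FETCH_THRESHOLD_MS = 1000  # hypothetical threshold, configurable in the real PR


class FetchRequestTracker:
    """Toy model of one fetch request: stats are computed over the
    fetched blocks (infoMap.keys -- remainingBlocks), not over the
    whole request, so partially failed requests report correct sizes."""

    def __init__(self, info_map, clock=time.monotonic):
        self.info_map = dict(info_map)   # blockId -> size in bytes
        self.remaining = set(info_map)   # blocks not yet fetched
        self.clock = clock
        self.start = clock()

    def on_block_fetched(self, block_id):
        self.remaining.discard(block_id)
        network_req_done = not self.remaining  # the suggested local variable
        if network_req_done:
            return self._maybe_log_slow()
        return None

    def _maybe_log_slow(self):
        fetched = set(self.info_map) - self.remaining
        elapsed_ms = (self.clock() - self.start) * 1000
        total_size = sum(self.info_map[b] for b in fetched)
        if elapsed_ms >= SLOW_FETCH_THRESHOLD_MS:
            # stand-in for logFetchIfSlow(elapsed, size, count, address)
            return (elapsed_ms, total_size, len(fetched))
        return None
```

Passing a fake clock makes the slow-path deterministic to exercise, which is essentially what the PR's test-side `clock` parameter enables in Scala.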
[GitHub] [spark] Yikun commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7
Yikun commented on pull request #35142: URL: https://github.com/apache/spark/pull/35142#issuecomment-1008233769

Agree @zero323. Besides, another piece of information: [CentOS will reach EOL](https://www.centos.org/centos-linux-eol/). But there are indeed some gaps that keep users from using Spark easily on different OSes, so I think some documentation or a Dockerfile reference might be necessary. I have also heard from local developers and customers that they need an initial Dockerfile template for a specific OS. Should we address these requirements somehow? Just some random thoughts: I think the problem is who would be responsible for maintaining these docs and Dockerfiles, and for their continuous evolution and quality. Perhaps a potential solution is for these scripts to be provided by the OS community's SIG, with the Spark community documentation linking to them.
[GitHub] [spark] wzhfy commented on pull request #35137: [SPARK-37846][TESTS] Fix test logic by getting sleep interval in task
wzhfy commented on pull request #35137: URL: https://github.com/apache/spark/pull/35137#issuecomment-1008231606

> (I think it's best to file a JIRA BTW)

Added JIRA [SPARK-37846](https://issues.apache.org/jira/browse/SPARK-37846), thanks for the reminder.
[GitHub] [spark] zero323 edited a comment on pull request #35083: [WIP][SPARK-37798] PySpark Pandas API: Cross and conditional merging
zero323 edited a comment on pull request #35083: URL: https://github.com/apache/spark/pull/35083#issuecomment-1008213524

> Also, is it possible to see the specific formatting issues caught by the black formatter in the CI?

Not at the moment.

> It seems to contain some different black configuration than when I run it dev/format-python locally

This seems to be your current offender:

```patch
diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index 949e3e601b..da91da59bc 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -359,7 +359,8 @@ RIGHT_MERGE_PREFIX = MERGE_PREFIX_TEMPLATE.format("right")
 
 def _resolve_internal_merge_frame(
-    internal: InternalFrame, side: Optional[str] = None,
+    internal: InternalFrame,
+    side: Optional[str] = None,
 ) -> InternalFrame:
     internal = internal.resolved_copy
     if side is None:
```

In general you might want to resync this with master (also, please make sure that you use a matching black version ‒ 21.12b0 at the moment. It should be detected automatically, but who knows?).
[GitHub] [spark] zero323 commented on pull request #35083: [WIP][SPARK-37798] PySpark Pandas API: Cross and conditional merging
zero323 commented on pull request #35083: URL: https://github.com/apache/spark/pull/35083#issuecomment-1008213524

> Also, is it possible to see the specific formatting issues caught by the black formatter in the CI?

Not at the moment.

> It seems to contain some different black configuration than when I run it dev/format-python locally

This seems to be your current offender:

```patch
diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index 949e3e601b..da91da59bc 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -359,7 +359,8 @@ RIGHT_MERGE_PREFIX = MERGE_PREFIX_TEMPLATE.format("right")
 
 def _resolve_internal_merge_frame(
-    internal: InternalFrame, side: Optional[str] = None,
+    internal: InternalFrame,
+    side: Optional[str] = None,
 ) -> InternalFrame:
     internal = internal.resolved_copy
     if side is None:
```

In general you might want to resync this with master.
[GitHub] [spark] HyukjinKwon commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7
HyukjinKwon commented on pull request #35142: URL: https://github.com/apache/spark/pull/35142#issuecomment-1008208411

Yeah, I doubt if we want this. cc @dongjoon-hyun @holdenk FYI
[GitHub] [spark] HyukjinKwon commented on pull request #35140: [SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values
HyukjinKwon commented on pull request #35140: URL: https://github.com/apache/spark/pull/35140#issuecomment-1008208298

cc @cloud-fan @viirya FYI
[GitHub] [spark] HyukjinKwon commented on pull request #35139: [SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values
HyukjinKwon commented on pull request #35139: URL: https://github.com/apache/spark/pull/35139#issuecomment-1008208142

cc @cloud-fan @viirya FYI
[GitHub] [spark] HyukjinKwon commented on pull request #35137: [TEST] Fix test logic by getting sleep interval in task
HyukjinKwon commented on pull request #35137: URL: https://github.com/apache/spark/pull/35137#issuecomment-1008208011

(I think it's best to file a JIRA BTW)
[GitHub] [spark] itholic commented on a change in pull request #35129: [SPARK-37835][SQL][DOCS] Fix the comments on SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala to more explicit
itholic commented on a change in pull request #35129: URL: https://github.com/apache/spark/pull/35129#discussion_r780721573

## File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala

```scala
@@ -36,6 +36,7 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.tags.ExtendedSQLTest
 import org.apache.spark.util.Utils
 
+// scalastyle:off line.size.limit
```

Review comment: Done!
[GitHub] [spark] itholic commented on a change in pull request #35129: [SPARK-37835][SQL][DOCS] Fix the comments on SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala to more explicit
itholic commented on a change in pull request #35129: URL: https://github.com/apache/spark/pull/35129#discussion_r780721366

## File path: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala

```scala
@@ -43,22 +43,22 @@ import org.apache.spark.sql.types._
  *
  * To run the entire test suite:
  * {{{
- *   build/sbt -Phive-thriftserver "hive-thriftserver/testOnly *ThriftServerQueryTestSuite"
+ *   build/sbt -Phive-thriftserver "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite"
  * }}}
  *
  * To run a single test file upon change:
  * {{{
- *   build/sbt -Phive-thriftserver "hive-thriftserver/testOnly *ThriftServerQueryTestSuite -- -z inline-table.sql"
+ *   build/sbt -Phive-thriftserver "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite -- -z inline-table.sql"
  * }}}
  *
  * This test suite won't generate golden files. To re-generate golden files for entire suite, run:
  * {{{
- *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite"
+ *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```

Review comment: Yeah, I think we use only one file for SQL tests.

```shell
$ find ./ -name "*SQLQueryTestSuite*"
.//sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
```
[GitHub] [spark] github-actions[bot] closed pull request #33174: [SPARK-35721][PYTHON] Path level discover for python unittests
github-actions[bot] closed pull request #33174: URL: https://github.com/apache/spark/pull/33174
[GitHub] [spark] github-actions[bot] closed pull request #34141: [SPARK-33887][SQL] Allow insert overwrite same table with static partition if using dynamic partition overwrite mode
github-actions[bot] closed pull request #34141: URL: https://github.com/apache/spark/pull/34141
[GitHub] [spark] github-actions[bot] closed pull request #34024: [SPARK-36784][SHUFFLE][WIP] Handle DNS issues on executor to prevent shuffle nodes from getting added to exclude list
github-actions[bot] closed pull request #34024: URL: https://github.com/apache/spark/pull/34024
[GitHub] [spark] zero323 commented on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)
zero323 commented on pull request #34083: URL: https://github.com/apache/spark/pull/34083#issuecomment-1008189532

> Sounds good, we can revisit the PR once we have validated things.

Shall we add a JIRA ticket for this?
[GitHub] [spark] zero323 commented on pull request #34354: [SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat
zero323 commented on pull request #34354: URL: https://github.com/apache/spark/pull/34354#issuecomment-1008188701

> It seems to be the best for now, until mypy itself supports looser typing conventions.

As @ueshin pointed out, using `Sequence` might provide a more general approach (and allow us to forget about variance), but there are two problems:

- It is a high-level interface that potentially covers more than lists and tuples. So this would require a change of the logic, and to provide a consistent UX, we should probably do the same in other places where sequence-ish input is accepted.
- There is a concern of ambiguity (`str` is a `Sequence[str]`), which causes my irrational fear that it can break in some hard-to-contain ways (most likely, I am overthinking it).
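The `str`-is-a-`Sequence[str]` ambiguity zero323 mentions is easy to demonstrate. The function below is hypothetical, purely to show why a parameter annotated `Sequence[str]` lets a bare string slip through where a list of names was intended:

```python
from collections.abc import Sequence


def count_columns(cols: Sequence[str]) -> int:
    """Hypothetical helper: expects a collection of column names."""
    return len(cols)


# A plain string type-checks as Sequence[str] and also passes at runtime,
# silently being treated as a sequence of one-character "column names".
assert isinstance("abc", Sequence)
assert count_columns("abc") == 3          # 3 single-character items, not 1 name
assert count_columns(["a", "b"]) == 2     # the intended usage
```

Restricting overloads to `list`/`tuple`, as this PR does, avoids that trap at the cost of generality.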
[GitHub] [spark] zero323 commented on pull request #34354: [SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat
zero323 commented on pull request #34354: URL: https://github.com/apache/spark/pull/34354#issuecomment-1008186488

Merged into master. Thanks everyone!
[GitHub] [spark] zero323 closed pull request #34354: [SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat
zero323 closed pull request #34354: URL: https://github.com/apache/spark/pull/34354
[GitHub] [spark] zero323 removed a comment on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7
zero323 removed a comment on pull request #35142: URL: https://github.com/apache/spark/pull/35142#issuecomment-1008181473

Sorry, but this doesn't seem like a good idea. Even if we hadn't already dropped Python 3.6 support along with its supporting code, maintaining another set of images just for this purpose wouldn't justify the maintenance cost. These images provide a good starting point, but are by no means comprehensive ‒ if you want to use different interpreters, virtual machines or native dependencies, it makes more sense to start from scratch with your own images that fit your requirements.
[GitHub] [spark] zero323 commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7
zero323 commented on pull request #35142: URL: https://github.com/apache/spark/pull/35142#issuecomment-1008181473

Sorry, but this doesn't seem like a good idea. Even if we hadn't already dropped Python 3.6 support along with its supporting code, maintaining another set of images just for this purpose wouldn't justify the maintenance cost. These images provide a good starting point, but are by no means comprehensive ‒ if you want to use different interpreters, virtual machines or native dependencies, it makes more sense to start from scratch with your own images that fit your requirements.
[GitHub] [spark] mridulm commented on pull request #35076: [SPARK-37793][CORE][SHUFFLE] Fallback to fetch original blocks when noLocalMergedBlockDataError
mridulm commented on pull request #35076: URL: https://github.com/apache/spark/pull/35076#issuecomment-1008181443

One possibility I was thinking of is that we are calling `finalizePartition` before any chunks have been fully written out for a reducer. Can you test with this change?

```diff
diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
index d0eb4aed65..0d3a3c7448 100644
--- a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
+++ b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
@@ -575,17 +575,23 @@ public class RemoteBlockPushResolver implements MergedShuffleFileManager {
     List sizes = new ArrayList<>(shuffleMergePartitions.size());
     for (AppShufflePartitionInfo partition: shuffleMergePartitions.values()) {
       synchronized (partition) {
+        boolean shouldDelete = false;
         try {
           // This can throw IOException which will marks this shuffle partition as not merged.
           partition.finalizePartition();
-          bitmaps.add(partition.mapTracker);
-          reduceIds.add(partition.reduceId);
-          sizes.add(partition.getLastChunkOffset());
+          long size = partition.getLastChunkOffset();
+          if (size > 0) {
+            bitmaps.add(partition.mapTracker);
+            reduceIds.add(partition.reduceId);
+            sizes.add(size);
+          } else {
+            shouldDelete = true;
+          }
         } catch (IOException ioe) {
           logger.warn("Exception while finalizing shuffle partition {}_{} {} {}", msg.appId,
             msg.appAttemptId, msg.shuffleId, partition.reduceId, ioe);
         } finally {
-          partition.closeAllFilesAndDeleteIfNeeded(false);
+          partition.closeAllFilesAndDeleteIfNeeded(shouldDelete);
         }
       }
     }
```

+CC @otterc
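The shape of the change mridulm proposes ‒ skip partitions whose last chunk offset is zero when reporting finalized merge results, and delete their files instead ‒ can be modeled in a few lines. This is a hedged Python sketch using made-up dict-based partitions, not the actual `RemoteBlockPushResolver` Java code:

```python
def finalize_partitions(partitions):
    """Toy model: only partitions with data (last_chunk_offset > 0) are
    reported back to the driver; empty ones are marked for deletion so
    the reducer falls back to fetching original blocks."""
    bitmaps, reduce_ids, sizes = [], [], []
    for p in partitions:
        should_delete = False
        size = p["last_chunk_offset"]
        if size > 0:
            bitmaps.append(p["map_tracker"])
            reduce_ids.append(p["reduce_id"])
            sizes.append(size)
        else:
            should_delete = True
        # stand-in for partition.closeAllFilesAndDeleteIfNeeded(shouldDelete)
        p["deleted"] = should_delete
    return bitmaps, reduce_ids, sizes
```

The key invariant is that every reduceId reported in the finalize response has a non-empty merged file behind it, so a fetch for that merged block can never hit "no local merged block data".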
[GitHub] [spark] mridulm commented on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)
mridulm commented on pull request #34083: URL: https://github.com/apache/spark/pull/34083#issuecomment-1008180057

Sounds good, we can revisit the PR once we have validated things.
[GitHub] [spark] bjornjorgensen commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7
bjornjorgensen commented on pull request #35142: URL: https://github.com/apache/spark/pull/35142#issuecomment-1008158090

You have problems with `ImportError: libffi.so.6`. It looks like many others have the same problem, as this post shows: https://stackoverflow.com/questions/61875869/ubuntu-20-04-upgrade-python-missing-libffi-so-6 In your Dockerfile you fix this by installing Python 3.9.0.
[GitHub] [spark] bjornjorgensen removed a comment on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7
bjornjorgensen removed a comment on pull request #35142: URL: https://github.com/apache/spark/pull/35142#issuecomment-1008155592

Why do you want to have support for CentOS 7, which is end of life? In your Dockerfile you are installing Python 3.9.0, but here you are using spark.pyspark.driver.python=python3.6/bin/python3
[GitHub] [spark] bjornjorgensen commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7
bjornjorgensen commented on pull request #35142: URL: https://github.com/apache/spark/pull/35142#issuecomment-1008155592

Why do you want to have support for CentOS 7, which is end of life? In your Dockerfile you are installing Python 3.9.0, but here you are using spark.pyspark.driver.python=python3.6/bin/python3
[GitHub] [spark] shardulm94 commented on pull request #33446: [SPARK-36215][SHUFFLE] Add logging for slow fetches to diagnose external shuffle service issues
shardulm94 commented on pull request #33446: URL: https://github.com/apache/spark/pull/33446#issuecomment-1008103255 @mridulm I have rebased to master and also addressed comments.
[GitHub] [spark] AmplabJenkins commented on pull request #35139: [SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values
AmplabJenkins commented on pull request #35139: URL: https://github.com/apache/spark/pull/35139#issuecomment-1008094771 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #35140: [SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values
AmplabJenkins commented on pull request #35140: URL: https://github.com/apache/spark/pull/35140#issuecomment-1008094749 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7
AmplabJenkins commented on pull request #35142: URL: https://github.com/apache/spark/pull/35142#issuecomment-1008094732 Can one of the admins verify this patch?
[GitHub] [spark] dongjoon-hyun commented on pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`
dongjoon-hyun commented on pull request #35143: URL: https://github.com/apache/spark/pull/35143#issuecomment-1008088119 I'm going to announce the following today. 1. The recovery of Maven test coverage 2. Apple Silicon test coverage
[GitHub] [spark] dongjoon-hyun commented on pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`
dongjoon-hyun commented on pull request #35143: URL: https://github.com/apache/spark/pull/35143#issuecomment-1008087344 Thank you again, @srowen and @viirya . To @srowen , I invited you to the new Jenkins infra as an admin to show you the progress. For now, I've been installing the required software and stabilizing the new M1 Mac. https://user-images.githubusercontent.com/9700541/148655033-f7fe8caf-cd27-40b9-9528-6ce6ec24ca78.png
[GitHub] [spark] viirya commented on pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`
viirya commented on pull request #35143: URL: https://github.com/apache/spark/pull/35143#issuecomment-1008072312 Thanks @dongjoon-hyun . lgtm. I found it last night too, was planning to submit a PR today.
[GitHub] [spark] dongjoon-hyun closed pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`
dongjoon-hyun closed pull request #35143: URL: https://github.com/apache/spark/pull/35143
[GitHub] [spark] dongjoon-hyun commented on pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`
dongjoon-hyun commented on pull request #35143: URL: https://github.com/apache/spark/pull/35143#issuecomment-1008071876 Thank you, @srowen . All tests passed, although the GitHub Action status here seems to show the first run. https://user-images.githubusercontent.com/9700541/148653678-9ae6594c-4f53-4cb7-9006-b64bf6088d11.png Merged to master.
[GitHub] [spark] srowen commented on pull request #35134: [SPARK-37842][SQL][MINIOR] Use `multi-catch` to simplify duplicate exception handling behavior in Java code
srowen commented on pull request #35134: URL: https://github.com/apache/spark/pull/35134#issuecomment-1008048012 These are good changes, but these are changes to copied Hive code. We can 'fix' it, but that might introduce tiny differences and merge conflicts if we update the Hive code. Are there any instances in the Spark code (not in org.apache.hive)? Those would be fine to change.
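For readers unfamiliar with the pattern this PR applies: Java's multi-catch collapses several catch blocks with identical bodies into one. A minimal sketch (the class and method names here are illustrative, not taken from the PR):

```java
public class MultiCatchDemo {
    // Before multi-catch, this would need two catch blocks with
    // identical bodies; `catch (A | B e)` handles both in one block.
    public static int parseOrDefault(String s) {
        try {
            return Integer.parseInt(s.trim());
        } catch (NumberFormatException | NullPointerException e) {
            // Identical handling for both failure modes in a single block.
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault(" 42 ")); // 42
        System.out.println(parseOrDefault(null));   // -1
    }
}
```

Note that in a multi-catch clause the caught variable is implicitly final, which is part of why the refactoring is safe when the duplicated bodies were already identical.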
[GitHub] [spark] srowen closed pull request #35144: Branch 2.4
srowen closed pull request #35144: URL: https://github.com/apache/spark/pull/35144
[GitHub] [spark] Franc-Z opened a new pull request #35144: Branch 2.4
Franc-Z opened a new pull request #35144: URL: https://github.com/apache/spark/pull/35144 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Hi, could anyone tell me whether Spark will have a 2.5.0 version in the future?
[GitHub] [spark] AngersZhuuuu commented on pull request #34326: [SPARK-37053][CORE] Add metrics to SparkHistoryServer
AngersZh commented on pull request #34326: URL: https://github.com/apache/spark/pull/34326#issuecomment-1007968668 > at a high level seems fine, I need to take a more detailed look at a few parts. It also needs docs. Doc added.
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34326: [SPARK-37053][CORE] Add metrics to SparkHistoryServer
AngersZh commented on a change in pull request #34326: URL: https://github.com/apache/spark/pull/34326#discussion_r780655043 ## File path: core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala ## @@ -157,6 +161,15 @@ class HistoryServer( contextHandler.setContextPath(HistoryServer.UI_PATH_PREFIX) contextHandler.addServlet(new ServletHolder(loaderServlet), "/*") attachHandler(contextHandler) + +// Register history server source to history server metrics system. +cacheMetrics.init() Review comment: > why are we calling init here? Doesn't this end up getting called twice? It's only called once here; before this, the method is never called. I am also confused about this.
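The double-init question above is the classic idempotent-initialization concern: registering the same metrics source twice usually fails or duplicates metrics. One hypothetical way to make an `init()` safe to call more than once (this is a generic sketch, not the HistoryServer code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class MetricsSource {
    private final AtomicBoolean initialized = new AtomicBoolean(false);
    private int registrations = 0;

    // Safe to call multiple times, even concurrently:
    // only the first call wins the compare-and-set and registers.
    public void init() {
        if (initialized.compareAndSet(false, true)) {
            registrations++; // stand-in for registering gauges/counters
        }
    }

    public int registrationCount() {
        return registrations;
    }
}
```

With a guard like this, the question of whether a caller already invoked `init()` elsewhere becomes harmless rather than a bug.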
[GitHub] [spark] AngersZhuuuu commented on pull request #33934: [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too
AngersZh commented on pull request #33934: URL: https://github.com/apache/spark/pull/33934#issuecomment-1007946564 > @holdenk who may have dealt with the python side before and have thoughts Thanks for your ping. It's a really useful feature, and my weird PR has been deployed to our prod. It makes it convenient for users to check the failure reason. I hope for some suggestions so this feature can be continued.
[GitHub] [spark] cloud-fan commented on a change in pull request #35130: [SPARK-37839][SQL] DS V2 supports partial aggregate push-down AVG
cloud-fan commented on a change in pull request #35130: URL: https://github.com/apache/spark/pull/35130#discussion_r780294560 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala ## @@ -212,7 +224,10 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper { private def supportPartialAggPushDown(agg: Aggregation): Boolean = { // We don't know the agg buffer of `GeneralAggregateFunc`, so can't do partial agg push down. -agg.aggregateExpressions().forall(!_.isInstanceOf[GeneralAggregateFunc]) +agg.aggregateExpressions().forall { aggregateFunc => + !aggregateFunc.isInstanceOf[GeneralAggregateFunc] || +aggregateFunc.asInstanceOf[GeneralAggregateFunc].name() == "AVG" Review comment: If `AVG` can be partially pushed, I'd like to create a dedicated class for it, instead of using `GeneralAggregateFunc`. ## File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala ## @@ -223,7 +225,11 @@ abstract class JdbcDialect extends Serializable with Logging{ case f: GeneralAggregateFunc if f.name() == "AVG" => assert(f.inputs().length == 1) val distinct = if (f.isDistinct) "DISTINCT " else "" -Some(s"AVG($distinct${f.inputs().head})") +if (supportCompletePushDown) { + Some(s"AVG($distinct${f.inputs().head})") +} else { + Some(s"Sum($distinct${f.inputs().head}), Count($distinct${f.inputs().head})") Review comment: why do we need it? It's Spark's responsibility to rewrite avg to sum/count, not the data source.
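The rewrite being discussed exists because AVG cannot be merged directly: each partition must report a (sum, count) pair, and the final average is total sum over total count — averaging the per-partition averages would be wrong for uneven partitions. A small sketch of that arithmetic (illustrative only, not Spark's implementation):

```java
public class PartialAvgDemo {
    // Partial state for AVG push-down: per-partition (sum, count).
    public static long[] partial(long[] xs) {
        long sum = 0;
        for (long x : xs) sum += x;
        return new long[] { sum, xs.length };
    }

    // Final merge: total sum / total count, NOT an average of averages.
    public static double merge(long[]... partials) {
        long sum = 0, count = 0;
        for (long[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        // Partitions of different sizes: global avg = (1+2+3+10)/4 = 4.0,
        // while the average of partition averages would be (2+10)/2 = 6.0.
        double avg = merge(partial(new long[] {1, 2, 3}), partial(new long[] {10}));
        System.out.println(avg); // 4.0
    }
}
```

This is why the review argues the source should only ship SUM and COUNT back, leaving the division to the engine.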
[GitHub] [spark] dongjoon-hyun closed pull request #35127: [SPARK-37837][INFRA] Enable black formatter in dev Python scripts
dongjoon-hyun closed pull request #35127: URL: https://github.com/apache/spark/pull/35127
[GitHub] [spark] beliefer commented on pull request #35130: [SPARK-37839][SQL] DS V2 supports partial aggregate push-down AVG
beliefer commented on pull request #35130: URL: https://github.com/apache/spark/pull/35130#issuecomment-1007849173 > What's the new algorithm? It's simple before: if agg funcs contain `GeneralAggregateFunc`, don't try partial pushdown. Otherwise, try complete pushdown first, then partial pushdown. Yes. But AVG is a commonly used aggregate function, and we hope to push it down whenever possible. Maybe we could construct Average in DS V2, so we keep it simple.
[GitHub] [spark] hiboyang commented on pull request #34672: [SPARK-37394][CORE] Skip registering with ESS if a customized shuffle manager is configured
hiboyang commented on pull request #34672: URL: https://github.com/apache/spark/pull/34672#issuecomment-1007611924 @tgravescs I am the author of Uber RSS. In terms of your question "I'm not familiar with the details of Uber RSS, can it just fake up a remote shuffle port and ignore what is sent? have you requested their version to support dynamic allocation?"... Uber RSS could support dynamic allocation on Kubernetes with a workaround, e.g. enabling the shuffle tracking feature and setting the shuffle timeout to zero. But it is a hack. The code change here will make that hack unnecessary. In terms of adding a config like "spark.shuffle.registration.enabled", any concerns?
[GitHub] [spark] huaxingao closed pull request #35125: [SPARK-37802][SQL][3.2] Composite field name should work with Aggregate push down
huaxingao closed pull request #35125: URL: https://github.com/apache/spark/pull/35125
[GitHub] [spark] cloud-fan commented on a change in pull request #35132: [SPARK-37841][SQL] BasicWriteTaskStatsTracker should not try get status for a skipped file
cloud-fan commented on a change in pull request #35132: URL: https://github.com/apache/spark/pull/35132#discussion_r780458052 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala ## @@ -142,8 +142,13 @@ class BasicWriteTaskStatsTracker( numSubmittedFiles += 1 } - override def closeFile(filePath: String): Unit = { -updateFileStats(filePath) + override def closeFile(filePath: String, isPathCreated: Boolean): Unit = { +if (isPathCreated) { + updateFileStats(filePath) +} else { + logDebug(s"$filePath is not pre-touched by writer, skipping update file stats") Review comment: ```suggestion logDebug(s"$filePath is not created due to no data, skip updating file stats") ``` ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ## @@ -305,6 +305,12 @@ private[orc] class OrcOutputWriter( recordWriter.close(Reporter.NULL) } } + + /** + * If `recordWriterInstantiated` is false, the output file is not pretouched. Review comment: Alternatively, can we fix the Hive ORC data source to always write the file? This seems wrong to me. We should at least write one file even if the input query is empty, to record the output schema.
[GitHub] [spark] dongjoon-hyun commented on pull request #35143: [SPARK-37844][CORE][TESTS] Remove slf4j-log4j12 dependency from hadoop-minikdc
dongjoon-hyun commented on pull request #35143: URL: https://github.com/apache/spark/pull/35143#issuecomment-1007943026 It seems that the GitHub Action notification is not forwarded to the user GitHub Action.
[GitHub] [spark] AngersZhuuuu commented on pull request #34982: [SPARK-37712][YARN] Spark request yarn cluster metrics slow cause delay
AngersZh commented on pull request #34982: URL: https://github.com/apache/spark/pull/34982#issuecomment-1007945556 > I guess I'm fine with this, I kind of assume this is because your yarn resource manager is overloaded or unresponsive though again? Yea, yarn's performance is not so stable. With this patch, performance becomes better.
[GitHub] [spark] dongjoon-hyun removed a comment on pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`
dongjoon-hyun removed a comment on pull request #35143: URL: https://github.com/apache/spark/pull/35143#issuecomment-1007943026 It seems that the GitHub Action notification is not forwarded to the user GitHub Action.
[GitHub] [spark] dongjoon-hyun closed pull request #35143: [SPARK-37844][CORE][TESTS] Remove slf4j-log4j12 dependency from hadoop-minikdc
dongjoon-hyun closed pull request #35143: URL: https://github.com/apache/spark/pull/35143
[GitHub] [spark] dongjoon-hyun commented on pull request #35141: [SPARK-37843][CORE] Suppress NoSuchFieldError at setMDCForTask
dongjoon-hyun commented on pull request #35141: URL: https://github.com/apache/spark/pull/35141#issuecomment-1007837947
[GitHub] [spark] cloud-fan commented on pull request #35138: [SPARK-35703][SQL][FOLLOWUP] Only eliminate shuffles if partition keys contain all the join keys
cloud-fan commented on pull request #35138: URL: https://github.com/apache/spark/pull/35138#issuecomment-1007635990 cc @sunchao @c21
[GitHub] [spark] github-actions[bot] closed pull request #31267: [SPARK-21195][CORE] MetricSystem should pick up dynamically registered metrics in sources
github-actions[bot] closed pull request #31267: URL: https://github.com/apache/spark/pull/31267
[GitHub] [spark] wzhfy commented on pull request #35137: [TEST] Fix test logic by getting sleep interval in task
wzhfy commented on pull request #35137: URL: https://github.com/apache/spark/pull/35137#issuecomment-1007875631
[GitHub] [spark] tgravescs edited a comment on pull request #34672: [SPARK-37394][CORE] Skip registering with ESS if a customized shuffle manager is configured
tgravescs edited a comment on pull request #34672: URL: https://github.com/apache/spark/pull/34672#issuecomment-1007641251 yes my concern is that its a config for something that isn't public. it doesn't make sense to me to have a public config for a non-public api and by itself without this 3rd party lib the config would not apply. I don't think my question was answered (can't you just open a port on one of your remote shuffle services and ignore the messages) and it sounds like it does work now with some hack (is that hack just setup and configs)? If so then I would much rather prefer a real solution like was already mentioned above because you can make it work now.
[GitHub] [spark] HyukjinKwon commented on pull request #35127: [SPARK-37837][INFRA] Enable black formatter in dev Python scripts
HyukjinKwon commented on pull request #35127: URL: https://github.com/apache/spark/pull/35127#issuecomment-1007171463 cc @zero323 @dongjoon-hyun FYI.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #35129: [SC-37835][SQL] Fix the comments on SQLQueryTestSuite.scala to more explicit
HyukjinKwon commented on a change in pull request #35129: URL: https://github.com/apache/spark/pull/35129#discussion_r780066447 ## File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ## @@ -44,22 +44,23 @@ import org.apache.spark.util.Utils * * To run the entire test suite: * {{{ - * build/sbt "sql/testOnly *SQLQueryTestSuite" + * build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" * }}} * * To run a single test file upon change: * {{{ - * build/sbt "~sql/testOnly *SQLQueryTestSuite -- -z inline-table.sql" + * build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z inline-table.sql" Review comment: This one shouldn't be updated because there's `sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala` to verify the results. We should be only explicit when we generate golden files. Can we document this since we're here? ## File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ## @@ -44,22 +44,23 @@ import org.apache.spark.util.Utils * * To run the entire test suite: * {{{ - * build/sbt "sql/testOnly *SQLQueryTestSuite" + * build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" * }}} * * To run a single test file upon change: * {{{ - * build/sbt "~sql/testOnly *SQLQueryTestSuite -- -z inline-table.sql" + * build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z inline-table.sql" * }}} * * To re-generate golden files for entire suite, run: * {{{ - * SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite" + * SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" * }}} * * To re-generate golden file for a single test, run: * {{{ - * SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z describe.sql" + * SPARK_GENERATE_GOLDEN_FILES=1 build/sbt Review comment: Can we add line breaks so we can run this via copying and pasting? ## File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala ## @@ -36,6 +36,7 @@ import org.apache.spark.sql.test.SharedSparkSession import org.apache.spark.tags.ExtendedSQLTest import org.apache.spark.util.Utils +// scalastyle:off line.size.limit Review comment: can you turn on this style back at the end of this Scaladoc?
[GitHub] [spark] yoda-mon commented on pull request #34896: [SPARK-37568][SQL] Support 2-arguments by the convert_timezone() function
yoda-mon commented on pull request #34896: URL: https://github.com/apache/spark/pull/34896#issuecomment-1007202017 @MaxGekk Sorry for the delayed response, I tried to follow your advice. However, with ``` def this(targetTz: Expression, sourceTs: Expression) = this(Literal("UTC"), targetTz, sourceTs) ``` `sourceTz` is not ignored. After the constructor I could use the `timeZoneId` copied in by the mix-in as `sourceTz`, but in that case it's difficult to distinguish `TimeStampLTZ` and `TimeStampNTZ`. Please let me know if there is something I missed about using the mix-in.
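The constructor question above is about the 2-argument form of `convert_timezone()`, where the source timezone defaults to UTC. This is not Spark's implementation, just a minimal Python sketch of the intended 2-argument semantics (the function name and defaulting behavior mirror the discussion; everything else is assumed):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def convert_timezone(target_tz: str, source_ts: datetime, source_tz: str = "UTC") -> datetime:
    """Interpret the naive timestamp `source_ts` in `source_tz` (UTC when the
    source timezone is omitted, i.e. the 2-argument form), then render the
    same instant as a naive timestamp in `target_tz`."""
    aware = source_ts.replace(tzinfo=ZoneInfo(source_tz))
    return aware.astimezone(ZoneInfo(target_tz)).replace(tzinfo=None)

# 2-argument form: the source timezone defaults to UTC.
print(convert_timezone("Asia/Seoul", datetime(2022, 1, 7, 0, 0)))  # 2022-01-07 09:00:00
```

The sketch shows why the delegated constructor supplies `Literal("UTC")` as the *source* zone: the two-argument caller only names the target zone and the timestamp.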
[GitHub] [spark] pan3793 edited a comment on pull request #35076: [SPARK-37793][CORE][SHUFFLE] Fallback to fetch original blocks when noLocalMergedBlockDataError
pan3793 edited a comment on pull request #35076: URL: https://github.com/apache/spark/pull/35076#issuecomment-1007354842
[GitHub] [spark] tgravescs commented on pull request #33934: [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too
tgravescs commented on pull request #33934: URL: https://github.com/apache/spark/pull/33934#issuecomment-1007523767 @holdenk, who may have dealt with the Python side before, may have thoughts
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #23348: [SPARK-25857][core] Add developer documentation regarding delegation tokens.
gaborgsomogyi commented on a change in pull request #23348: URL: https://github.com/apache/spark/pull/23348#discussion_r780256850 ## File path: core/src/main/scala/org/apache/spark/deploy/security/README.md ## @@ -0,0 +1,249 @@ +# Delegation Token Handling In Spark + +This document aims to explain and demystify delegation tokens as they are used by Spark, since +this topic is generally a huge source of confusion. + + +## What are delegation tokens and why use them? + +Delegation tokens (DTs from now on) are authentication tokens used by some services to replace +Kerberos service tokens. Many services in the Hadoop ecosystem have support for DTs, since they +have some very desirable advantages over Kerberos tokens: + +* No need to distribute Kerberos credentials + +In a distributed application, distributing Kerberos credentials is tricky. Not all users have +keytabs, and when they do, it's generally frowned upon to distribute them over the network as +part of application data. + +DTs allow for a single place (e.g. the Spark driver) to require Kerberos credentials. That entity +can then distribute the DTs to other parts of the distributed application (e.g. Spark executors), +so they can authenticate to services. + +* A single token per service is used for authentication + +If Kerberos authentication were used, each client connection to a server would require a trip +to the KDC and generation of a service ticket. In a distributed system, the number of service +tickets can balloon pretty quickly when you think about the number of client processes (e.g. Spark +executors) vs. the number of service processes (e.g. HDFS DataNodes). That generates unnecessary +extra load on the KDC, and may even run into usage limits set up by the KDC admin. + +* DTs are only used for authentication + +DTs, unlike TGTs, can only be used to authenticate to the specific service for which they were +issued. 
You cannot use an existing DT to create new DTs or to create DTs for a different service. + +So in short, DTs are *not* Kerberos tokens. They are used by many services to replace Kerberos +authentication, or even other forms of authentication, although there is nothing (aside from +maybe implementation details) that ties them to Kerberos or any other authentication mechanism. + + +## Lifecycle of DTs + +DTs, unlike Kerberos tokens, are service-specific. There is no centralized location you contact +to create a DT for a service. So, the first step needed to get a DT is being able to authenticate +to the service in question. In the Hadoop ecosystem, that is generally done using Kerberos. + +This requires Kerberos credentials to be available somewhere for the application to use. The user +is generally responsible for providing those credentials, which is most commonly done by logging +in to the KDC (e.g. using "kinit"). That generates a (Kerberos) "token cache" containing a TGT +(ticket granting ticket), which can then be used to request service tickets. + +There are other ways of obtaining TGTs, but, ultimately, you need a TGT to bootstrap the process. + +Once a TGT is available, the target service's client library can then be used to authenticate +to the service using the Kerberos credentials, and request the creation of a delegation token. +This token can now be sent to other processes and used to authenticate to different daemons +belonging to that service. + +And thus the first drawback of DTs becomes apparent: you need service-specific logic to create and +use them. + +Spark implements a (somewhat) pluggable, internal DT creation API. Support for new services can be +added by implementing a `HadoopDelegationTokenProvider` that is then called by Spark when generating +delegation tokens for an application. 
Spark makes the DTs available to code by stashing them in the +`UserGroupInformation` credentials, and it's up to the DT provider and the respective client library +to agree on how to use those tokens. + +Once they are created, the semantics of how DTs operate are also service-specific. But, in general, +they try to follow the semantics of Kerberos tokens: + +* A "renewable period" (equivalent to TGT's "lifetime"), which is how long the DT is valid + before it requires renewal. +* A "max lifetime" (equivalent to TGT's "renewable life"), which is how long the DT can be + renewed. + +Once the token reaches its "max lifetime", a new one needs to be created by contacting the +appropriate service, restarting the above process. + + +## DT Renewal, Renewers, and YARN + +This is the most confusing part of DT handling, and part of it is because much of the system was +designed with MapReduce, and later YARN, in mind. + +As seen above, DTs need to be renewed periodically until they finally expire for good. An example of +this is the default configuration of HDFS services: delegation tokens are valid for up to 7 days, +and need to be renewed every 24 hours. If 24 hours
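The renewal arithmetic in the HDFS example above (renew every 24 hours, hard expiry after 7 days) can be sketched as a toy Python model; this is illustrative only, not Spark or Hadoop code:

```python
from datetime import datetime, timedelta

def renewal_schedule(issued: datetime, renew_interval: timedelta, max_lifetime: timedelta):
    """Return (renewal_times, hard_expiry): the token must be renewed at each
    renewal time, and no renewal can extend it past issued + max_lifetime."""
    hard_expiry = issued + max_lifetime
    times = []
    t = issued + renew_interval
    while t < hard_expiry:
        times.append(t)
        t += renew_interval
    return times, hard_expiry

# HDFS-like defaults: renew every 24h, the token dies for good after 7 days.
times, expiry = renewal_schedule(datetime(2022, 1, 1), timedelta(hours=24), timedelta(days=7))
print(len(times))  # 6 renewals before a brand-new token is required
```

After the last renewal inside the window, the only option is to go back to the service (with Kerberos credentials) and mint a fresh token, which is exactly the "restarting the above process" step.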
[GitHub] [spark] LuciferYang commented on a change in pull request #34471: [SPARK-36879][SQL] Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path
LuciferYang commented on a change in pull request #34471: URL: https://github.com/apache/spark/pull/34471#discussion_r780024037 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ## @@ -76,6 +79,7 @@ object DataSourceReadBenchmark extends SqlBasedBenchmark { saveAsCsvTable(testDf, dir.getCanonicalPath + "/csv") saveAsJsonTable(testDf, dir.getCanonicalPath + "/json") saveAsParquetTable(testDf, dir.getCanonicalPath + "/parquet") +saveAsParquetV2Table(testDf, dir.getCanonicalPath + "/parquetV2") Review comment: I found that there are still unsupported encodings in Data Page V2, such as RLE for Boolean. It seems it is not yet time to update the benchmark; please ignore my previous comments.
[GitHub] [spark] github-actions[bot] closed pull request #34108: [SPARK-36638][SQL][TEST] Generalize OptimizeSkewedJoin - correctness
github-actions[bot] closed pull request #34108: URL: https://github.com/apache/spark/pull/34108
[GitHub] [spark] dongjoon-hyun commented on pull request #35121: [SPARK-37833][INFRA] Add `precondition` job to skip the main GitHub Action jobs
dongjoon-hyun commented on pull request #35121: URL: https://github.com/apache/spark/pull/35121#issuecomment-1007238145
[GitHub] [spark] github-actions[bot] closed pull request #32289: [SPARK-33357][K8S] Support Spark application managing with SparkAppHandle on Kubernetes
github-actions[bot] closed pull request #32289: URL: https://github.com/apache/spark/pull/32289
[GitHub] [spark] HyukjinKwon closed pull request #35133: [SPARK-37833][INFRA][FOLLOW-UP] Run checking modules of precondition only in forked repository
HyukjinKwon closed pull request #35133: URL: https://github.com/apache/spark/pull/35133
[GitHub] [spark] zero323 closed pull request #34674: [SPARK-37419][PYTHON][ML] Rewrite _shared_params_code_gen.py to inline type hints for ml/param/shared.py
zero323 closed pull request #34674: URL: https://github.com/apache/spark/pull/34674
[GitHub] [spark] cloud-fan commented on a change in pull request #35131: [SPARK-37827][SQL] Put the some built-in table properties into V1Table.propertie to adapt to V2 command
cloud-fan commented on a change in pull request #35131: URL: https://github.com/apache/spark/pull/35131#discussion_r780285249 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ## @@ -384,9 +384,12 @@ case class CatalogTable( def toLinkedHashMap: mutable.LinkedHashMap[String, String] = { val map = new mutable.LinkedHashMap[String, String]() val tableProperties = properties - .filterKeys(!_.startsWith(VIEW_PREFIX)) + .filter(!_._1.startsWith(VIEW_PREFIX)) + .filter(kv => !CatalogV2Util.TABLE_RESERVED_PROPERTIES.contains(kv._1)) + .filter(!_._1.startsWith(TableCatalog.OPTION_PREFIX)) + .filter(_._1 != TableCatalog.PROP_EXTERNAL) Review comment: hmm, isn't `PROP_EXTERNAL` included in `TABLE_RESERVED_PROPERTIES`? ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ## @@ -384,9 +384,12 @@ case class CatalogTable( def toLinkedHashMap: mutable.LinkedHashMap[String, String] = { val map = new mutable.LinkedHashMap[String, String]() val tableProperties = properties - .filterKeys(!_.startsWith(VIEW_PREFIX)) + .filter(!_._1.startsWith(VIEW_PREFIX)) + .filter(kv => !CatalogV2Util.TABLE_RESERVED_PROPERTIES.contains(kv._1)) + .filter(!_._1.startsWith(TableCatalog.OPTION_PREFIX)) + .filter(_._1 != TableCatalog.PROP_EXTERNAL) .toSeq.sortBy(_._1) - .map(p => p._1 + "=" + p._2).mkString("[", ", ", "]") Review comment: why this change? ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/V1Table.scala ## @@ -55,7 +56,8 @@ private[sql] case class V1Table(v1Table: CatalogTable) extends Table { } } - override lazy val properties: util.Map[String, String] = v1Table.properties.asJava + override lazy val properties: util.Map[String, String] = +V1Table.filterTableProperties(v1Table).asJava Review comment: I think we should generate these v2 table properties here, instead of when creating the table. 
It's a breaking change to generate v2 properties when creating tables, as it changes what is stored in the metastore, which may have unknown consequences.
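The property-filtering rule being reviewed in `CatalogTable.toLinkedHashMap` can be sketched language-agnostically. The constants below are hypothetical stand-ins for Spark's `VIEW_PREFIX`, `TableCatalog.OPTION_PREFIX`, and `CatalogV2Util.TABLE_RESERVED_PROPERTIES`; this is a sketch of the filtering logic, not the actual implementation:

```python
# Hypothetical stand-ins for Spark's constants (illustrative values only).
VIEW_PREFIX = "view."
OPTION_PREFIX = "option."
TABLE_RESERVED_PROPERTIES = {"provider", "location", "owner", "external"}

def displayable_properties(props: dict) -> dict:
    """Keep only user-facing table properties: drop view-internal keys,
    reserved keys, and option-backed keys, sorted by key."""
    return {
        k: v
        for k, v in sorted(props.items())
        if not k.startswith(VIEW_PREFIX)
        and not k.startswith(OPTION_PREFIX)
        and k not in TABLE_RESERVED_PROPERTIES
    }
```

Note that if `external` already sits in the reserved-properties set, a separate `PROP_EXTERNAL` filter is redundant, which is the reviewer's first point.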
[GitHub] [spark] github-actions[bot] commented on pull request #33174: [SPARK-35721][PYTHON] Path level discover for python unittests
github-actions[bot] commented on pull request #33174: URL: https://github.com/apache/spark/pull/33174#issuecomment-1007834106 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] cloud-fan commented on pull request #35135: [SPARK-35442][SQL] Support propagate empty relation through aggregate
cloud-fan commented on pull request #35135: URL: https://github.com/apache/spark/pull/35135#issuecomment-1007644925 can you also address https://github.com/apache/spark/pull/32602#discussion_r780377367 ?
[GitHub] [spark] tgravescs commented on a change in pull request #34326: [SPARK-37053][CORE] Add metrics to SparkHistoryServer
tgravescs commented on a change in pull request #34326: URL: https://github.com/apache/spark/pull/34326#discussion_r780369492 ## File path: core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala ## @@ -157,6 +161,15 @@ class HistoryServer( contextHandler.setContextPath(HistoryServer.UI_PATH_PREFIX) contextHandler.addServlet(new ServletHolder(loaderServlet), "/*") attachHandler(contextHandler) + +// Register history server source to history server metrics system. +cacheMetrics.init() Review comment: why are we calling init here? Doesn't this end up getting called twice?
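One common way to answer a "doesn't this get called twice?" concern is to make the registration idempotent. A hedged Python sketch of the pattern (not the actual `HistoryServer` code; class and field names are made up):

```python
class CacheMetricsSource:
    """Make init() a guarded no-op on repeated calls, so two setup paths
    calling it (the double-init concern above) cannot double-register."""
    def __init__(self):
        self._initialized = False
        self.registrations = 0

    def init(self):
        if self._initialized:
            return
        self._initialized = True
        self.registrations += 1  # e.g. register gauges with the metrics system

src = CacheMetricsSource()
src.init()
src.init()  # second call from another setup path is harmless
print(src.registrations)  # 1
```

The alternative, of course, is simply to remove the redundant call, which is what the reviewer is probing for.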
[GitHub] [spark] zero323 commented on pull request #34674: [SPARK-37419][PYTHON][ML] Rewrite _shared_params_code_gen.py to inline type hints for ml/param/shared.py
zero323 commented on pull request #34674: URL: https://github.com/apache/spark/pull/34674#issuecomment-1007510055 Merged into master.
[GitHub] [spark] Peng-Lei commented on a change in pull request #35131: [SPARK-37827][SQL] Put the some built-in table properties into V1Table.propertie to adapt to V2 command
Peng-Lei commented on a change in pull request #35131: URL: https://github.com/apache/spark/pull/35131#discussion_r780390548 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/V1Table.scala ## @@ -55,7 +56,8 @@ private[sql] case class V1Table(v1Table: CatalogTable) extends Table { } } - override lazy val properties: util.Map[String, String] = v1Table.properties.asJava + override lazy val properties: util.Map[String, String] = +V1Table.filterTableProperties(v1Table).asJava Review comment: Thank you very much, I will try to do it. In fact, precisely because these v2 table properties were stored in the metastore when creating the table, I spent a lot of time looking for failing test cases and trying to fix them. It is indeed a breaking change to generate v2 properties when creating tables and store them. Thanks for the reminder, @cloud-fan
[GitHub] [spark] imback82 commented on a change in pull request #35113: [SPARK-37636][SQL] Migrate CREATE NAMESPACE to use V2 command by default
imback82 commented on a change in pull request #35113: URL: https://github.com/apache/spark/pull/35113#discussion_r780069534 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/command/TestsV1AndV2Commands.scala ## @@ -28,14 +28,21 @@ import org.apache.spark.sql.internal.SQLConf trait TestsV1AndV2Commands extends DDLCommandTestUtils { private var _version: String = "" override def commandVersion: String = _version + def runningV1Command: Boolean = commandVersion == "V1" // Tests using V1 catalogs will run with `spark.sql.legacy.useV1Command` on and off // to test both V1 and V2 commands. override def test(testName: String, testTags: Tag*)(testFun: => Any) (implicit pos: Position): Unit = { Seq(true, false).foreach { useV1Command => - _version = if (useV1Command) "V1" else "V2" + def setCommandVersion(): Unit = { +_version = if (useV1Command) "V1" else "V2" + } + setCommandVersion() super.test(testName, testTags: _*) { +// Need to set command version inside this test function so that +// the correct command version is available in each test. +setCommandVersion() Review comment: This `def test` doesn't run `testFun` when it's invoked; it only registers the test. So by the time `testFun` is actually run, `_version` will always be set to "V2" (`useV1Command == false`). So we need to capture this inside the lambda that's passed into `super.test`.
Also, note that we need to call `setCommandVersion()` before calling `super.test`, because `super.test` uses `commandVersion` to set the right test name: https://github.com/apache/spark/blob/527e842ee6ed3bb1aabba61d02337ab81a56f87b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLCommandTestUtils.scala#L56-L57 ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ## @@ -189,8 +189,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat override def createDatabase( dbDefinition: CatalogDatabase, - ignoreIfExists: Boolean): Unit = withClient { -client.createDatabase(dbDefinition, ignoreIfExists) + ignoreIfExists: Boolean): Unit = { +try { + withClient { +client.createDatabase(dbDefinition, ignoreIfExists) + } +} catch { + case e: AnalysisException if e.message.contains("already exists") => +throw new DatabaseAlreadyExistsException(dbDefinition.name) Review comment: @cloud-fan I intercepted the exception here. Looks a bit hacky? ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ## @@ -189,8 +189,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat override def createDatabase( dbDefinition: CatalogDatabase, - ignoreIfExists: Boolean): Unit = withClient { -client.createDatabase(dbDefinition, ignoreIfExists) + ignoreIfExists: Boolean): Unit = { +try { + withClient { +client.createDatabase(dbDefinition, ignoreIfExists) + } +} catch { + case e: AnalysisException if e.message.contains("already exists") => +throw new DatabaseAlreadyExistsException(dbDefinition.name) Review comment: Two concerns if we move the logic to `withClient`: 1. How can we guarantee `org.apache.hadoop.hive.metastore.api.AlreadyExistsException` is thrown only by `createDatabase` and not by other calls like `createTable`? 2. Since the db name is not available, we need to parse the db name out of the message. Are you OK with these? ## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ## @@ -189,8 +189,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
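The registration-vs-execution pitfall described in the `TestsV1AndV2Commands` comment has a compact Python analogue. This is illustrative only; ScalaTest's `test(...)` behaves like `register_test` here in that it stores the body and runs it much later:

```python
registered = []

def register_test(name, body):
    # Like ScalaTest's test(...): registration only stores the body;
    # it does not run it.
    registered.append((name, body))

# Late binding: every body reads `version` when the tests finally RUN,
# so all of them see the last value written during registration.
version = ""
for use_v1 in (True, False):
    version = "V1" if use_v1 else "V2"
    register_test(f"cmd ({version})", lambda: version)

print([body() for _, body in registered])  # ['V2', 'V2']

# Fix (analogous to calling setCommandVersion() inside the test body):
# re-bind the value so each body carries its own copy.
registered.clear()
for use_v1 in (True, False):
    version = "V1" if use_v1 else "V2"
    register_test(f"cmd ({version})", lambda v=version: v)

print([body() for _, body in registered])  # ['V1', 'V2']
```

Note the test *name* is built eagerly at registration time in both variants, which is why the version must also be set before `super.test` is called.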
[GitHub] [spark] github-actions[bot] closed pull request #34083: Add docs about using Shiv for packaging (similar to PEX)
github-actions[bot] closed pull request #34083: URL: https://github.com/apache/spark/pull/34083
[GitHub] [spark] HyukjinKwon commented on pull request #35133: [SPARK-37833][INFRA][FOLLOW-UP] Run checking modules of precondition only in forked repository
HyukjinKwon commented on pull request #35133: URL: https://github.com/apache/spark/pull/35133#issuecomment-1007284852
[GitHub] [spark] zero323 commented on pull request #35126: [SPARK-37836][PYTHON][INFRA] Enable F841, E722, E305 and E226 for PEP 8 compliance
zero323 commented on pull request #35126: URL: https://github.com/apache/spark/pull/35126#issuecomment-1007350779 E305, E226 should be covered by black, if enabled for a particular path, but there is no harm in adding them. Personally, I have mixed feelings about E722 ‒ using `BaseException` or `Exception` doesn't really seem to add anything. But I guess there is no harm in that either :)
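For readers unfamiliar with E722: the rule flags a bare `except:`. A small Python illustration of what it steers code toward (the example function and values are made up):

```python
# A bare `except:` also swallows KeyboardInterrupt and SystemExit, which
# derive from BaseException rather than Exception, so Ctrl-C can get eaten.
def parse_port(value):
    try:
        return int(value)
    except ValueError:  # E722-clean: catch only what int() can actually raise
        return None

print(parse_port("8080"), parse_port("oops"))  # 8080 None
```

The "mixed feelings" above are fair: mechanically replacing `except:` with `except Exception:` silences the warning without narrowing the catch much; the real win is naming the specific exception, as here.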
[GitHub] [spark] dongjoon-hyun commented on pull request #35127: [SPARK-37837][INFRA] Enable black formatter in dev Python scripts
dongjoon-hyun commented on pull request #35127: URL: https://github.com/apache/spark/pull/35127#issuecomment-1007179964
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35121: [SPARK-37833][INFRA] Add `precondition` jobs to skip the main GitHub Action jobs
dongjoon-hyun commented on a change in pull request #35121: URL: https://github.com/apache/spark/pull/35121#discussion_r780047564 ## File path: .github/workflows/build_and_test.yml ## @@ -96,15 +96,39 @@ jobs: echo '::set-output name=hadoop::hadoop3' fi + build-precondition: +name: Check a code change +runs-on: ubuntu-20.04 +outputs: + required: ${{ steps.set-outputs.outputs.required }} +steps: +- name: Checkout Spark repository + uses: actions/checkout@v2 + with: +fetch-depth: 0 +repository: apache/spark +ref: master +- name: Sync the current branch with the latest in Apache Spark + if: github.repository != 'apache/spark' + run: | +echo "APACHE_SPARK_REF=$(git rev-parse HEAD)" >> $GITHUB_ENV +git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/} +git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' merge --no-commit --progress --squash FETCH_HEAD +git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' commit -m "Merged commit" +- name: Check all modules except 'docs' + id: set-outputs + run: | +echo "::set-output name=required::$(./dev/is-changed.py -m avro,build,catalyst,core,docker-integration-tests,examples,graphx,hadoop-cloud,hive,hive-thriftserver,kubernetes,kvstore,launcher,mesos,mllib,mllib-local,network-common,network-shuffle,pyspark-core,pyspark-ml,pyspark-mllib,pyspark-pandas,pyspark-pandas-slow,pyspark-resource,pyspark-sql,pyspark-streaming,repl,sketch,spark-ganglia-lgpl,sparkr,sql,sql-kafka-0-10,streaming,streaming-kafka-0-10,streaming-kinesis-asl,tags,unsafe,yarn)" Review comment: Yes, it's really a missing part. BTW, `is-changed.py` is not designed for multi-module support; each precondition has its own `-m` option here and the values are not the same. So, this is the best and most flexible way for now.
[GitHub] [spark] AmplabJenkins commented on pull request #35124: [WIP][SPARK-37398][PYTHON] Inline type hints for python/pyspark/ml/classification.py
AmplabJenkins commented on pull request #35124: URL: https://github.com/apache/spark/pull/35124#issuecomment-1007838795 Can one of the admins verify this patch?
[GitHub] [spark] aa1371 commented on pull request #35083: [WIP][SPARK-37798] PySpark Pandas API: Cross and conditional merging
aa1371 commented on pull request #35083: URL: https://github.com/apache/spark/pull/35083#issuecomment-1007846355 @HyukjinKwon - thoughts on the PR in its current state? Also, is it possible to see the specific formatting issues caught by the black formatter in the CI? It seems to use a different black configuration than when I run `dev/format-python` locally.
[GitHub] [spark] cloud-fan commented on pull request #35130: [SPARK-37839][SQL] DS V2 supports partial aggregate push-down AVG
cloud-fan commented on pull request #35130: URL: https://github.com/apache/spark/pull/35130#issuecomment-1007449447 What's the new algorithm? It was simple before: if the agg funcs contain `GeneralAggregateFunc`, don't try partial pushdown. Otherwise, try complete pushdown first, then partial pushdown.
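The decision order cloud-fan describes can be sketched as plain logic (illustrative Python, not the actual Catalyst code; the function and its string encoding of aggregate kinds are hypothetical):

```python
def choose_pushdown(agg_funcs, complete_pushdown_supported):
    """Sketch of the pre-change strategy described in the comment.

    agg_funcs: list of aggregate function kind names (hypothetical encoding).
    """
    # If any aggregate is a GeneralAggregateFunc, skip partial pushdown entirely.
    if "GeneralAggregateFunc" in agg_funcs:
        return "none"
    # Otherwise prefer complete pushdown, falling back to partial pushdown.
    return "complete" if complete_pushdown_supported else "partial"
```

The question in the comment is whether the PR's new algorithm preserves this simple ordering.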
[GitHub] [spark] sathiyapk commented on a change in pull request #34729: [SPARK-37475][SQL] Add scale parameter to floor and ceil functions
sathiyapk commented on a change in pull request #34729: URL: https://github.com/apache/spark/pull/34729#discussion_r780514098 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala ## @@ -658,6 +669,62 @@ class MathExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { val intResultsB: Seq[Int] = Seq(31400, 31420, 31416, 314159000, 314159300, 314159260) ++ Seq.fill(7)(314159265) +def doubleResultsFloor(i: Int): Any = { + val results = Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3, +3.1, 3.14, 3.141, 3.1415, 3.14159, 3.141592) + if (i <= 6) results(i).toLong else results(i) +} + +def doubleResultsCeil(i: Int): Any = { + val results = Seq(100.0, 10.0, 1.0, 1000.0, 100.0, 10.0, +4L, 3.2, 3.15, 3.142, 3.1416, 3.1416, 3.141593) + if (i <= 6) results(i).toLong else results(i) +} + +def floatResultsFloor(i: Int): Any = { + val results = Seq(0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 3L, +3.1d, 3.14d, 3.141d, 3.1414d, 3.14149d, 3.141499d) + if (i <= 6) results(i).toLong else results(i) +} + +def floatResultsCeil(i: Int): Any = { + val results = Seq(100.0f, 10.0f, 1.0f, 1000.0f, 100.0f, 10.0f, +4L, 3.2d, 3.15d, 3.142d, 3.1415d, 3.1415d, 3.1415d) + if (i <= 6) results(i).toLong else results(i) +} + +def shortResultsFloor(i: Int): Any = { + val results = Seq[Long](0L, 0L, 3L, 31000L, +31400L, 31410L, 31415L) ++ Seq.fill[Long](7)(31415) + results(i) +} + +def shortResultsCeil(i: Int): Any = { Review comment: Sorry, I am not sure I got it right. Actually, we test `RoundCeil` and `RoundFloor` directly for positive `scale` on big decimals; I can add long and double to that case if you prefer. But we can't test `RoundCeil` and `RoundFloor` directly for negative `scale`, because the cast to `Long` for a negative scale happens one step earlier. 
Maybe if I put the cast in the `RoundCeil` and `RoundFloor` objects and let the `ExpressionBuilder` call those objects as it did before, we would be able to test `RoundCeil` and `RoundFloor` directly for both negative and positive `scale`. What do you say?
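For readers following the scale semantics under test: floor/ceil with a scale parameter rounds toward negative/positive infinity at a given decimal position, where a negative scale zeroes digits before the decimal point. A small Python model of that behavior (my own sketch, not the Spark implementation):

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR

def floor_scale(x, scale):
    # Keep `scale` digits after the decimal point (a negative scale zeroes
    # digits before it), rounding toward negative infinity.
    q = Decimal(1).scaleb(-scale)
    return Decimal(str(x)).quantize(q, rounding=ROUND_FLOOR)

def ceil_scale(x, scale):
    # Same quantization, rounding toward positive infinity.
    q = Decimal(1).scaleb(-scale)
    return Decimal(str(x)).quantize(q, rounding=ROUND_CEILING)
```

For example, `floor_scale(3.1415, 2)` gives `3.14` while `floor_scale(31415, -3)` gives `31000`, which mirrors the expected values in the test tables above.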
[GitHub] [spark] tgravescs commented on pull request #34982: [SPARK-37712][YARN] Spark request yarn cluster metrics slow cause delay
tgravescs commented on pull request #34982: URL: https://github.com/apache/spark/pull/34982#issuecomment-1007504670 I guess I'm fine with this; I assume this is again because your YARN resource manager is overloaded or unresponsive?
[GitHub] [spark] tgravescs commented on a change in pull request #23348: [SPARK-25857][core] Add developer documentation regarding delegation tokens.
tgravescs commented on a change in pull request #23348: URL: https://github.com/apache/spark/pull/23348#discussion_r780320642 ## File path: core/src/main/scala/org/apache/spark/deploy/security/README.md ## @@ -0,0 +1,249 @@ +# Delegation Token Handling In Spark + +This document aims to explain and demystify delegation tokens as they are used by Spark, since +this topic is generally a huge source of confusion. + + +## What are delegation tokens and why use them? + +Delegation tokens (DTs from now on) are authentication tokens used by some services to replace +Kerberos service tokens. Many services in the Hadoop ecosystem have support for DTs, since they +have some very desirable advantages over Kerberos tokens: + +* No need to distribute Kerberos credentials + +In a distributed application, distributing Kerberos credentials is tricky. Not all users have +keytabs, and when they do, it's generally frowned upon to distribute them over the network as +part of application data. + +DTs allow for a single place (e.g. the Spark driver) to require Kerberos credentials. That entity +can then distribute the DTs to other parts of the distributed application (e.g. Spark executors), +so they can authenticate to services. + +* A single token per service is used for authentication + +If Kerberos authentication were used, each client connection to a server would require a trip +to the KDC and generation of a service ticket. In a distributed system, the number of service +tickets can balloon pretty quickly when you think about the number of client processes (e.g. Spark +executors) vs. the number of service processes (e.g. HDFS DataNodes). That generates unnecessary +extra load on the KDC, and may even run into usage limits set up by the KDC admin. + +* DTs are only used for authentication + +DTs, unlike TGTs, can only be used to authenticate to the specific service for which they were +issued. You cannot use an existing DT to create new DTs or to create DTs for a different service. 
+ +So in short, DTs are *not* Kerberos tokens. They are used by many services to replace Kerberos +authentication, or even other forms of authentication, although there is nothing (aside from +maybe implementation details) that ties them to Kerberos or any other authentication mechanism. + + +## Lifecycle of DTs + +DTs, unlike Kerberos tokens, are service-specific. There is no centralized location you contact +to create a DT for a service. So, the first step needed to get a DT is being able to authenticate +to the service in question. In the Hadoop ecosystem, that is generally done using Kerberos. + +This requires Kerberos credentials to be available somewhere for the application to use. The user +is generally responsible for providing those credentials, which is most commonly done by logging +in to the KDC (e.g. using "kinit"). That generates a (Kerberos) "token cache" containing a TGT +(ticket granting ticket), which can then be used to request service tickets. + +There are other ways of obtaining TGTs, but, ultimately, you need a TGT to bootstrap the process. + +Once a TGT is available, the target service's client library can then be used to authenticate +to the service using the Kerberos credentials, and request the creation of a delegation token. +This token can now be sent to other processes and used to authenticate to different daemons +belonging to that service. + +And thus the first drawback of DTs becomes apparent: you need service-specific logic to create and +use them. + +Spark implements a (somewhat) pluggable, internal DT creation API. Support for new services can be +added by implementing a `HadoopDelegationTokenProvider` that is then called by Spark when generating +delegation tokens for an application. Spark makes the DTs available to code by stashing them in the +`UserGroupInformation` credentials, and it's up to the DT provider and the respective client library +to agree on how to use those tokens. 
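The pluggable provider pattern described above — one service-specific plugin per token-issuing service, invoked by the driver when generating tokens — can be illustrated outside of Spark with a minimal registry. This is a Python sketch with simplified, hypothetical names; Spark's real API is the Scala `HadoopDelegationTokenProvider` trait mentioned in the text:

```python
class TokenProvider:
    """One provider per service; knows how to obtain that service's DT."""
    service_name = "base"

    def tokens_required(self, conf):
        return False

    def obtain_tokens(self, conf, credentials):
        # Returns an optional next-renewal interval; None means non-renewable.
        return None


class FakeHdfsProvider(TokenProvider):
    service_name = "hdfs"

    def tokens_required(self, conf):
        return conf.get("kerberos.enabled", False)

    def obtain_tokens(self, conf, credentials):
        # Stand-in for authenticating via Kerberos and requesting a real DT.
        credentials[self.service_name] = "DT-for-hdfs"
        return 24 * 3600  # pretend the token must be renewed daily


def obtain_all(providers, conf, credentials):
    # Mirrors the driver-side loop: ask each applicable provider for tokens.
    return [p.obtain_tokens(conf, credentials)
            for p in providers if p.tokens_required(conf)]
```

The key point the README makes survives in the sketch: the framework only orchestrates the loop; how a token is actually created and later used is private to each provider and its client library.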
+ +Once they are created, the semantics of how DTs operate are also service-specific. But, in general, +they try to follow the semantics of Kerberos tokens: + +* A "renewable period" (equivalent to TGT's "lifetime") which is for how long the DT is valid + before it requires renewal. +* A "max lifetime" (equivalent to TGT's "renewable life") which is for how long the DT can be + renewed. + +Once the token reaches its "max lifetime", a new one needs to be created by contacting the +appropriate service, restarting the above process. + + +## DT Renewal, Renewers, and YARN + +This is the most confusing part of DT handling, and part of it is because much of the system was +designed with MapReduce, and later YARN, in mind. + +As seen above, DTs need to be renewed periodically until they finally expire for good. An example of +this is the default configuration of HDFS services: delegation tokens are valid for up to 7 days, +and need to be renewed every 24 hours. If 24 hours pass
[GitHub] [spark] wzhfy removed a comment on pull request #35137: [TEST] Fix test logic by getting sleep interval in task
wzhfy removed a comment on pull request #35137: URL: https://github.com/apache/spark/pull/35137#issuecomment-1007875631 cc @Ngone51 @agrawaldevesh
[GitHub] [spark] zero323 commented on pull request #34439: [SPARK-37095][PYTHON] Inline type hints for files in python/pyspark/broadcast.py
zero323 commented on pull request #34439: URL: https://github.com/apache/spark/pull/34439#issuecomment-1007619850 Could you please resolve the conflicts @dchvn?
[GitHub] [spark] HyukjinKwon closed pull request #35126: [SPARK-37836][PYTHON][INFRA] Enable F841, E722, E305 and E226 for PEP 8 compliance
HyukjinKwon closed pull request #35126: URL: https://github.com/apache/spark/pull/35126
[GitHub] [spark] cloud-fan commented on a change in pull request #32602: [SPARK-35455][SQL] Unify empty relation optimization between normal and AQE optimizer
cloud-fan commented on a change in pull request #32602: URL: https://github.com/apache/spark/pull/32602#discussion_r780377367 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala ## @@ -137,3 +125,55 @@ object PropagateEmptyRelation extends Rule[LogicalPlan] with PredicateHelper wit } } } + +/** + * This rule runs in the normal optimizer and optimizes more cases + * compared to [[PropagateEmptyRelationBase]]: + * 1. Higher-node Logical Plans + *- Union with all empty children. + * 2. Unary-node Logical Plans + *- Project/Filter/Sample with all empty children. + * + * The reason why we don't apply this rule at AQE optimizer side is: the benefit is not big enough + * and it may introduce extra exchanges. Review comment: After more thought, I think this is a big performance issue if we can't propagate empty relations through project/filter, which are quite common. The risk of introducing new shuffles is relatively small compared to this. @ulysses-you can we move all the logic to `PropagateEmptyRelationBase`? `PropagateEmptyRelation` should not have any extra logic.
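The propagation being discussed — collapsing Project/Filter/Sample over an empty child, or a Union whose children are all empty, into an empty relation — can be sketched on a toy plan tree (illustrative Python, not Catalyst):

```python
def propagate_empty(plan):
    """plan: nested tuples, e.g. ("Project", ("Filter", ("Empty",)))."""
    kind = plan[0]
    if kind in ("Project", "Filter", "Sample"):
        child = propagate_empty(plan[1])
        # A unary node over an empty child produces no rows itself.
        return ("Empty",) if child == ("Empty",) else (kind, child)
    if kind == "Union":
        children = [propagate_empty(c) for c in plan[1]]
        # A union of all-empty children is empty.
        if all(c == ("Empty",) for c in children):
            return ("Empty",)
        return ("Union", children)
    return plan
```

The debate in the thread is not about this logic itself but about where it runs: in the normal optimizer only, or also in the AQE optimizer where collapsing nodes might introduce extra exchanges.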
[GitHub] [spark] yaooqinn commented on pull request #35132: [SPARK-37841][SQL] BasicWriteTaskStatsTracker should not try get status for a skipped file
yaooqinn commented on pull request #35132: URL: https://github.com/apache/spark/pull/35132#issuecomment-1007269119 cc @dongjoon-hyun @cloud-fan thanks
[GitHub] [spark] github-actions[bot] commented on pull request #34141: [SPARK-33887][SQL] Allow insert overwrite same table with static partition if using dynamic partition overwrite mode
github-actions[bot] commented on pull request #34141: URL: https://github.com/apache/spark/pull/34141#issuecomment-1007834075 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #33878: [SPARK-36303][SQL] Refactor fourthteenth set of 20 query execution errors to use error classes
github-actions[bot] closed pull request #33878: URL: https://github.com/apache/spark/pull/33878
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #35121: [SPARK-37833][INFRA] Add `precondition` job to skip the main GitHub Action jobs
dongjoon-hyun edited a comment on pull request #35121: URL: https://github.com/apache/spark/pull/35121#issuecomment-1007238145 Thank you. I addressed your comments. Could you review this once more, @HyukjinKwon ?
- The `precondition` job runs all precondition checks and builds a `json` file.
- Each job `needs` only the `precondition` job's result.
[GitHub] [spark] ulysses-you commented on pull request #34933: [SPARK-37674][SQL] Reduce the output partition of output stage to avoid producing small files.
ulysses-you commented on pull request #34933: URL: https://github.com/apache/spark/pull/34933#issuecomment-1007224052 I see the requirement, but there are some potential issues if we only use a new config for the write's final stage:
- If the final stage is heavy, making the partition size big will cause a regression, e.g. when the final stage is a join or even a multi-join.
- The input shuffle size is not equal to the output size; if the plan of the final stage changes the data size, this config is less meaningful.
- Not all queries contain a shuffle, in which case the semantics of this config are broken since the config is never used.
- Just increasing the partition size is not enough for dynamic partition writing; we should cluster rows with the same partition value together as much as possible.
- This config should also affect the rebalance.

I think it's a good idea to add a `RebalancePartitions` node for all write commands, as @wangyum is working on in SPARK-31264. Then we can consider adding a special partition-size config for the shuffle added by `RebalancePartitions`.
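The trade-off in the first point can be made concrete: a target partition size maps an input byte count to a partition count, so raising the target directly reduces parallelism for whatever runs in that stage. A simple model (my own sketch, not Spark's coalescing logic):

```python
import math

def num_partitions(total_bytes, target_partition_bytes):
    # Larger target size -> fewer, bigger partitions: fewer small output
    # files, but less parallelism for a heavy final stage.
    return max(1, math.ceil(total_bytes / target_partition_bytes))
```

For a 1 GiB shuffle, a 128 MiB target yields 8 partitions while a 512 MiB target yields only 2, which is exactly the regression risk raised for join-heavy final stages.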
[GitHub] [spark] HyukjinKwon commented on a change in pull request #35126: [SPARK-37836][PYTHON][INFRA] Enable F841, E722, E305 and E226 for PEP 8 compliance
HyukjinKwon commented on a change in pull request #35126: URL: https://github.com/apache/spark/pull/35126#discussion_r780048734 ## File path: python/pyspark/tests/test_readwrite.py ## @@ -231,9 +231,10 @@ def test_newhadoop(self): ) self.assertEqual(result, data) -fmt = "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat" Review comment: This is technically a follow-up of https://github.com/apache/spark/pull/35126, but I decided to piggy-back here since I'm touching arbitrary code here anyway.
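For reference, here are small illustrations of the kinds of patterns the newly enabled checks flag (my own examples, not code from the PR): F841 is an assigned-but-never-used local, E722 a bare `except:`, E226 missing whitespace around arithmetic operators, and E305 missing blank lines after a function or class definition.

```python
def parse_port(value):
    # E722 would flag a bare `except:` here; catching ValueError is the fix.
    try:
        return int(value)
    except ValueError:
        return None

def area(width, height):
    # E226-compliant: spaces around the operator (`width*height` is flagged).
    return width * height

def handler():
    # F841 flags a local like `unused = area(2, 3)` whose value is never
    # read; the fix is to drop the assignment.
    return parse_port("8080")
```

E305 concerns layout rather than logic: top-level code following these definitions must be separated from them by two blank lines.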
[GitHub] [spark] zero323 commented on pull request #35124: [WIP][SPARK-37398][PYTHON] Inline type hints for python/pyspark/ml/classification.py
zero323 commented on pull request #35124: URL: https://github.com/apache/spark/pull/35124#issuecomment-1007376991 Thanks @javierivanov. Let's revisit this after all prerequisites are done (core for starters, but there are also internal dependencies that have to be followed).