[GitHub] [spark] HyukjinKwon closed pull request #35137: [SPARK-37846][TESTS] Fix test logic by getting sleep interval in task

2022-01-08 Thread GitBox


HyukjinKwon closed pull request #35137:
URL: https://github.com/apache/spark/pull/35137


   





[GitHub] [spark] HyukjinKwon commented on pull request #35137: [SPARK-37846][TESTS] Fix test logic by getting sleep interval in task

2022-01-08 Thread GitBox


HyukjinKwon commented on pull request #35137:
URL: https://github.com/apache/spark/pull/35137#issuecomment-1008239365


   Merged to master





[GitHub] [spark] mridulm commented on a change in pull request #33446: [SPARK-36215][SHUFFLE] Add logging for slow fetches to diagnose external shuffle service issues

2022-01-08 Thread GitBox


mridulm commented on a change in pull request #33446:
URL: https://github.com/apache/spark/pull/33446#discussion_r780737708



##
File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
##
@@ -334,6 +353,7 @@ final class ShuffleBlockFetcherIterator(
 s"due to Netty OOM, will retry")
 }
 remainingBlocks -= blockId
+logFetchOnCompletionIfSlow()

Review comment:
   If we are calling `logFetchOnCompletionIfSlow` from here, we should 
recompute `totalRequestSize` and `blockCount`, since the fetch failed 
before all blocks were fetched.
   Essentially, we need to compute the stats based on `infoMap.keys -- remainingBlocks`.
   Given the failure, `remainingBlocks` would never be empty unless this was 
the last block, potentially losing out on all suffix requests taking time.







[GitHub] [spark] mridulm commented on a change in pull request #33446: [SPARK-36215][SHUFFLE] Add logging for slow fetches to diagnose external shuffle service issues

2022-01-08 Thread GitBox


mridulm commented on a change in pull request #33446:
URL: https://github.com/apache/spark/pull/33446#discussion_r780737343



##
File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
##
@@ -281,6 +292,13 @@ final class ShuffleBlockFetcherIterator(
   }
 }
 
+@inline def logFetchOnCompletionIfSlow(): Unit = {
+  if (remainingBlocks.isEmpty) {
+    logFetchIfSlow(TimeUnit.NANOSECONDS.toMillis(clock.nanoTime() - requestStartTime),
+      infoMap.values.map(_._1).sum, blockIds.size, address)

Review comment:
   Use `req.size` instead.
   (See more below)







[GitHub] [spark] mridulm commented on a change in pull request #33446: [SPARK-36215][SHUFFLE] Add logging for slow fetches to diagnose external shuffle service issues

2022-01-08 Thread GitBox


mridulm commented on a change in pull request #33446:
URL: https://github.com/apache/spark/pull/33446#discussion_r780737343



##
File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
##
@@ -281,6 +292,13 @@ final class ShuffleBlockFetcherIterator(
   }
 }
 
+@inline def logFetchOnCompletionIfSlow(): Unit = {
+  if (remainingBlocks.isEmpty) {
+    logFetchIfSlow(TimeUnit.NANOSECONDS.toMillis(clock.nanoTime() - requestStartTime),
+      infoMap.values.map(_._1).sum, blockIds.size, address)

Review comment:
   Use `req.size` instead.

##
File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
##
@@ -342,6 +362,7 @@ final class ShuffleBlockFetcherIterator(
   val block = BlockId(blockId)
   if (block.isShuffleChunk) {
 remainingBlocks -= blockId
+logFetchOnCompletionIfSlow()

Review comment:
   Same here as the other failed case.

##
File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
##
@@ -281,6 +292,13 @@ final class ShuffleBlockFetcherIterator(
   }
 }
 
+@inline def logFetchOnCompletionIfSlow(): Unit = {
+  if (remainingBlocks.isEmpty) {
+    logFetchIfSlow(TimeUnit.NANOSECONDS.toMillis(clock.nanoTime() - requestStartTime),
+      infoMap.values.map(_._1).sum, blockIds.size, address)

Review comment:
   Use `req.size` instead.

##
File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
##
@@ -291,6 +309,7 @@ final class ShuffleBlockFetcherIterator(
 // This needs to be released after use.
 buf.retain()
 remainingBlocks -= blockId
+logFetchOnCompletionIfSlow()
 blockOOMRetryCounts.remove(blockId)
 results.put(new SuccessFetchResult(BlockId(blockId), infoMap(blockId)._2,
   address, infoMap(blockId)._1, buf, remainingBlocks.isEmpty))

Review comment:
   Make `remainingBlocks.isEmpty` into a local variable `networkReqDone` 
and use that both for `SuccessFetchResult` and for whether to invoke 
`logFetchOnCompletionIfSlow` (you can then remove the `remainingBlocks.isEmpty` 
check in that method). A sketch follows below.
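   A rough sketch of the suggested shape, with the surrounding iterator state 
stubbed out as parameters (the names and signature here are illustrative, not 
the PR's actual code):
   
   ```scala
   import scala.collection.mutable
   
   // Capture remainingBlocks.isEmpty once and reuse it for both the fetch result
   // and the decision to log, so the helper no longer needs its own isEmpty check.
   def onBlockFetched(
       blockId: String,
       remainingBlocks: mutable.Set[String],
       emitResult: Boolean => Unit,   // stands in for results.put(new SuccessFetchResult(...))
       logFetchOnCompletionIfSlow: () => Unit): Unit = {
     remainingBlocks -= blockId
     val networkReqDone = remainingBlocks.isEmpty
     emitResult(networkReqDone)
     if (networkReqDone) logFetchOnCompletionIfSlow()
   }
   ```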

##
File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
##
@@ -334,6 +353,7 @@ final class ShuffleBlockFetcherIterator(
 s"due to Netty OOM, will retry")
 }
 remainingBlocks -= blockId
+logFetchOnCompletionIfSlow()

Review comment:
   If we are calling `logFetchOnCompletionIfSlow` from here, we should 
recompute `totalRequestSize` and `blockCount`, since the fetch failed 
before all blocks were fetched.
   Essentially, we need to compute the stats based on `infoMap.keys -- remainingBlocks`.







[GitHub] [spark] Yikun commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7

2022-01-08 Thread GitBox


Yikun commented on pull request #35142:
URL: https://github.com/apache/spark/pull/35142#issuecomment-1008233769


   Agreed @zero323; besides, another consideration is that [CentOS will reach 
EOL](https://www.centos.org/centos-linux-eol/).
   
   But there are indeed some gaps in letting users run Spark easily on 
different OSes, so I think some documentation or a Dockerfile reference might be 
necessary. I have also heard from local developers and customers that they need 
an initial Dockerfile template for a specific OS. Should we deal with these 
requirements appropriately?
   
   Just some random thoughts: I think the problem is who is responsible for 
maintaining these docs and Dockerfiles, i.e. the continuous evolution and quality 
of this code. Perhaps a potential solution is for these scripts to be provided by 
the OS community's SIG, with the Spark community documentation linking to them.





[GitHub] [spark] wzhfy commented on pull request #35137: [SPARK-37846][TESTS] Fix test logic by getting sleep interval in task

2022-01-08 Thread GitBox


wzhfy commented on pull request #35137:
URL: https://github.com/apache/spark/pull/35137#issuecomment-1008231606


   > (I think it's best to file a JIRA BTW)
   
   Added JIRA [SPARK-37846](https://issues.apache.org/jira/browse/SPARK-37846); 
thanks for the reminder.





[GitHub] [spark] zero323 edited a comment on pull request #35083: [WIP][SPARK-37798] PySpark Pandas API: Cross and conditional merging

2022-01-08 Thread GitBox


zero323 edited a comment on pull request #35083:
URL: https://github.com/apache/spark/pull/35083#issuecomment-1008213524


   > Also, is it possible to see the specific formatting issues caught by the 
black formatter in the CI?
   
   Not at the moment. 
   
   > It seems to contain some different black configuration than when I run 
dev/format-python locally
   
   This seems to be your current offender
   
   ```patch
   diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
   index 949e3e601b..da91da59bc 100644
   --- a/python/pyspark/pandas/frame.py
   +++ b/python/pyspark/pandas/frame.py
   @@ -359,7 +359,8 @@ RIGHT_MERGE_PREFIX = MERGE_PREFIX_TEMPLATE.format("right")
    
    
    def _resolve_internal_merge_frame(
   -    internal: InternalFrame, side: Optional[str] = None,
   +    internal: InternalFrame,
   +    side: Optional[str] = None,
    ) -> InternalFrame:
        internal = internal.resolved_copy
        if side is None:
   ```
   
   In general, you might want to resync this with master (also, please make sure 
that you use a matching black version ‒ 21.12b0 at the moment. It should be 
detected automatically, but who knows?).





[GitHub] [spark] zero323 commented on pull request #35083: [WIP][SPARK-37798] PySpark Pandas API: Cross and conditional merging

2022-01-08 Thread GitBox


zero323 commented on pull request #35083:
URL: https://github.com/apache/spark/pull/35083#issuecomment-1008213524


   > Also, is it possible to see the specific formatting issues caught by the 
black formatter in the CI?
   
   Not at the moment. 
   
   > It seems to contain some different black configuration than when I run 
dev/format-python locally
   
   This seems to be your current offender
   
   ```patch
   diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
   index 949e3e601b..da91da59bc 100644
   --- a/python/pyspark/pandas/frame.py
   +++ b/python/pyspark/pandas/frame.py
   @@ -359,7 +359,8 @@ RIGHT_MERGE_PREFIX = MERGE_PREFIX_TEMPLATE.format("right")
    
    
    def _resolve_internal_merge_frame(
   -    internal: InternalFrame, side: Optional[str] = None,
   +    internal: InternalFrame,
   +    side: Optional[str] = None,
    ) -> InternalFrame:
        internal = internal.resolved_copy
        if side is None:
   ```
   
   In general you might want to resync this with master.





[GitHub] [spark] HyukjinKwon commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7

2022-01-08 Thread GitBox


HyukjinKwon commented on pull request #35142:
URL: https://github.com/apache/spark/pull/35142#issuecomment-1008208411


   Yeah, I doubt if we want this. cc @dongjoon-hyun @holdenk FYI





[GitHub] [spark] HyukjinKwon commented on pull request #35140: [SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values

2022-01-08 Thread GitBox


HyukjinKwon commented on pull request #35140:
URL: https://github.com/apache/spark/pull/35140#issuecomment-1008208298


   cc @cloud-fan @viirya FYI





[GitHub] [spark] HyukjinKwon commented on pull request #35139: [SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values

2022-01-08 Thread GitBox


HyukjinKwon commented on pull request #35139:
URL: https://github.com/apache/spark/pull/35139#issuecomment-1008208142


   cc @cloud-fan @viirya FYI





[GitHub] [spark] HyukjinKwon commented on pull request #35137: [TEST] Fix test logic by getting sleep interval in task

2022-01-08 Thread GitBox


HyukjinKwon commented on pull request #35137:
URL: https://github.com/apache/spark/pull/35137#issuecomment-1008208011


   (I think it's best to file a JIRA BTW)





[GitHub] [spark] itholic commented on a change in pull request #35129: [SPARK-37835][SQL][DOCS] Fix the comments on SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala to more explicit

2022-01-08 Thread GitBox


itholic commented on a change in pull request #35129:
URL: https://github.com/apache/spark/pull/35129#discussion_r780721573



##
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
##
@@ -36,6 +36,7 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.tags.ExtendedSQLTest
 import org.apache.spark.util.Utils
 
+// scalastyle:off line.size.limit

Review comment:
   Done!







[GitHub] [spark] itholic commented on a change in pull request #35129: [SPARK-37835][SQL][DOCS] Fix the comments on SQLQueryTestSuite.scala/ThriftServerQueryTestSuite.scala to more explicit

2022-01-08 Thread GitBox


itholic commented on a change in pull request #35129:
URL: https://github.com/apache/spark/pull/35129#discussion_r780721366



##
File path: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala
##
@@ -43,22 +43,22 @@ import org.apache.spark.sql.types._
  *
  * To run the entire test suite:
  * {{{
- *   build/sbt -Phive-thriftserver "hive-thriftserver/testOnly *ThriftServerQueryTestSuite"
+ *   build/sbt -Phive-thriftserver "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite"
  * }}}
  *
  * To run a single test file upon change:
  * {{{
- *   build/sbt -Phive-thriftserver "hive-thriftserver/testOnly *ThriftServerQueryTestSuite -- -z inline-table.sql"
+ *   build/sbt -Phive-thriftserver "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite -- -z inline-table.sql"
  * }}}
  *
 * This test suite won't generate golden files. To re-generate golden files for entire suite, run:
 * {{{
- *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite"
+ *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"

Review comment:
   Yeah, I think we use only one file for SQL tests.
   
   ```shell
   $ find ./ -name "*SQLQueryTestSuite*"
   .//sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
   ```







[GitHub] [spark] github-actions[bot] closed pull request #33174: [SPARK-35721][PYTHON] Path level discover for python unittests

2022-01-08 Thread GitBox


github-actions[bot] closed pull request #33174:
URL: https://github.com/apache/spark/pull/33174


   





[GitHub] [spark] github-actions[bot] closed pull request #34141: [SPARK-33887][SQL] Allow insert overwrite same table with static partition if using dynamic partition overwrite mode

2022-01-08 Thread GitBox


github-actions[bot] closed pull request #34141:
URL: https://github.com/apache/spark/pull/34141


   





[GitHub] [spark] github-actions[bot] closed pull request #34024: [SPARK-36784][SHUFFLE][WIP] Handle DNS issues on executor to prevent shuffle nodes from getting added to exclude list

2022-01-08 Thread GitBox


github-actions[bot] closed pull request #34024:
URL: https://github.com/apache/spark/pull/34024


   





[GitHub] [spark] zero323 commented on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)

2022-01-08 Thread GitBox


zero323 commented on pull request #34083:
URL: https://github.com/apache/spark/pull/34083#issuecomment-1008189532


   
   
   > Sounds good, we can revisit the PR once we have validated things.
   
   Shall we add a JIRA ticket for this? 





[GitHub] [spark] zero323 commented on pull request #34354: [SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat

2022-01-08 Thread GitBox


zero323 commented on pull request #34354:
URL: https://github.com/apache/spark/pull/34354#issuecomment-1008188701


   > It seems to be the best for now, until mypy itself supports looser typing 
conventions.
   
   As @ueshin pointed out, using `Sequence` might provide a more general approach 
(and allow us to forget about variance), but there are two problems:
   
   - It is a high-level interface that potentially covers more than lists and 
tuples. So this would require a change to the logic, and to provide a consistent 
UX, we should probably do the same in other places where sequence-ish input is 
accepted.
   - There is a concern about ambiguity (`str` is a `Sequence[str]`), which causes 
my irrational fear that it can break in some hard-to-contain ways (most likely, 
I am overthinking it). 
   





[GitHub] [spark] zero323 commented on pull request #34354: [SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat

2022-01-08 Thread GitBox


zero323 commented on pull request #34354:
URL: https://github.com/apache/spark/pull/34354#issuecomment-1008186488


   Merged into master.
   
   Thanks everyone!





[GitHub] [spark] zero323 closed pull request #34354: [SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat

2022-01-08 Thread GitBox


zero323 closed pull request #34354:
URL: https://github.com/apache/spark/pull/34354


   





[GitHub] [spark] zero323 removed a comment on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7

2022-01-08 Thread GitBox


zero323 removed a comment on pull request #35142:
URL: https://github.com/apache/spark/pull/35142#issuecomment-1008181473


   Sorry, but this doesn't seem like a good idea.
   
   Even if we hadn't already dropped Python 3.6 support and its supporting code, 
maintaining another set of images just for this purpose wouldn't justify the 
maintenance cost. These images provide a good starting point, but are by no means 
comprehensive ‒ if you want to use different interpreters, virtual machines 
or native dependencies, it makes more sense to start from scratch with your own 
images that fit your requirements.





[GitHub] [spark] zero323 commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7

2022-01-08 Thread GitBox


zero323 commented on pull request #35142:
URL: https://github.com/apache/spark/pull/35142#issuecomment-1008181473


   Sorry, but this doesn't seem like a good idea.
   
   Even if we hadn't already dropped Python 3.6 support and its supporting code, 
maintaining another set of images just for this purpose wouldn't justify the 
maintenance cost. These images provide a good starting point, but are by no means 
comprehensive ‒ if you want to use different interpreters, virtual machines 
or native dependencies, it makes more sense to start from scratch with your own 
images that fit your requirements.





[GitHub] [spark] mridulm commented on pull request #35076: [SPARK-37793][CORE][SHUFFLE] Fallback to fetch original blocks when noLocalMergedBlockDataError

2022-01-08 Thread GitBox


mridulm commented on pull request #35076:
URL: https://github.com/apache/spark/pull/35076#issuecomment-1008181443


   One possibility I was thinking of: are we calling `finalizePartition` 
before any chunks have been fully written out for a reducer?
   
   Can you test with this change?
   ```
   diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
   index d0eb4aed65..0d3a3c7448 100644
   --- a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
   +++ b/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
   @@ -575,17 +575,23 @@ public class RemoteBlockPushResolver implements MergedShuffleFileManager {
        List<Long> sizes = new ArrayList<>(shuffleMergePartitions.size());
        for (AppShufflePartitionInfo partition: shuffleMergePartitions.values()) {
          synchronized (partition) {
   +        boolean shouldDelete = false;
            try {
              // This can throw IOException which will marks this shuffle partition as not merged.
              partition.finalizePartition();
   -          bitmaps.add(partition.mapTracker);
   -          reduceIds.add(partition.reduceId);
   -          sizes.add(partition.getLastChunkOffset());
   +          long size = partition.getLastChunkOffset();
   +          if (size > 0) {
   +            bitmaps.add(partition.mapTracker);
   +            reduceIds.add(partition.reduceId);
   +            sizes.add(size);
   +          } else {
   +            shouldDelete = true;
   +          }
            } catch (IOException ioe) {
              logger.warn("Exception while finalizing shuffle partition {}_{} {} {}", msg.appId,
                msg.appAttemptId, msg.shuffleId, partition.reduceId, ioe);
            } finally {
   -          partition.closeAllFilesAndDeleteIfNeeded(false);
   +          partition.closeAllFilesAndDeleteIfNeeded(shouldDelete);
            }
          }
        }
   ```
   
   +CC @otterc 
   





[GitHub] [spark] mridulm commented on pull request #34083: Add docs about using Shiv for packaging (similar to PEX)

2022-01-08 Thread GitBox


mridulm commented on pull request #34083:
URL: https://github.com/apache/spark/pull/34083#issuecomment-1008180057


   Sounds good, we can revisit the PR once we have validated things.





[GitHub] [spark] bjornjorgensen commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7

2022-01-08 Thread GitBox


bjornjorgensen commented on pull request #35142:
URL: https://github.com/apache/spark/pull/35142#issuecomment-1008158090


   You have problems with `ImportError: libffi.so.6`.
   It looks like many others have the same problem, as this post shows: 
https://stackoverflow.com/questions/61875869/ubuntu-20-04-upgrade-python-missing-libffi-so-6
 
   In your Dockerfile, you fix this by installing Python 3.9.0.




[GitHub] [spark] bjornjorgensen removed a comment on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7

2022-01-08 Thread GitBox


bjornjorgensen removed a comment on pull request #35142:
URL: https://github.com/apache/spark/pull/35142#issuecomment-1008155592


   Why do you want to support CentOS 7, which is end of life? 
   In your Dockerfile you are installing Python 3.9.0, but here you are using 
spark.pyspark.driver.python=python3.6/bin/python3 
   
   





[GitHub] [spark] bjornjorgensen commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7

2022-01-08 Thread GitBox


bjornjorgensen commented on pull request #35142:
URL: https://github.com/apache/spark/pull/35142#issuecomment-1008155592


   Why do you want to support CentOS 7, which is end of life? 
   In your Dockerfile you are installing Python 3.9.0, but here you are using 
spark.pyspark.driver.python=python3.6/bin/python3 
   
   





[GitHub] [spark] shardulm94 commented on pull request #33446: [SPARK-36215][SHUFFLE] Add logging for slow fetches to diagnose external shuffle service issues

2022-01-08 Thread GitBox


shardulm94 commented on pull request #33446:
URL: https://github.com/apache/spark/pull/33446#issuecomment-1008103255


   @mridulm I have rebased to master and also addressed comments.





[GitHub] [spark] AmplabJenkins commented on pull request #35139: [SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values

2022-01-08 Thread GitBox


AmplabJenkins commented on pull request #35139:
URL: https://github.com/apache/spark/pull/35139#issuecomment-1008094771


   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #35140: [SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values

2022-01-08 Thread GitBox


AmplabJenkins commented on pull request #35140:
URL: https://github.com/apache/spark/pull/35140#issuecomment-1008094749


   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #35142: [SPARK-37708][K8S] Provides basic images that support centos7

2022-01-08 Thread GitBox


AmplabJenkins commented on pull request #35142:
URL: https://github.com/apache/spark/pull/35142#issuecomment-1008094732


   Can one of the admins verify this patch?





[GitHub] [spark] dongjoon-hyun commented on pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`

2022-01-08 Thread GitBox


dongjoon-hyun commented on pull request #35143:
URL: https://github.com/apache/spark/pull/35143#issuecomment-1008088119


   I'm going to announce the following today:
   1. The recovery of Maven test coverage
   2. Apple Silicon test coverage





[GitHub] [spark] dongjoon-hyun commented on pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`

2022-01-08 Thread GitBox


dongjoon-hyun commented on pull request #35143:
URL: https://github.com/apache/spark/pull/35143#issuecomment-1008087344


   Thank you again, @srowen and @viirya .
   
   To @srowen , I invited you to the new Jenkins infra as an admin to show you 
the progress.
   
   For now, I've been installing the required software and stabilizing the new 
M1 Mac.
   
   https://user-images.githubusercontent.com/9700541/148655033-f7fe8caf-cd27-40b9-9528-6ce6ec24ca78.png





[GitHub] [spark] viirya commented on pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`

2022-01-08 Thread GitBox


viirya commented on pull request #35143:
URL: https://github.com/apache/spark/pull/35143#issuecomment-1008072312


   Thanks @dongjoon-hyun . LGTM. I found it last night too and was planning to 
submit a PR today. 





[GitHub] [spark] dongjoon-hyun closed pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`

2022-01-08 Thread GitBox


dongjoon-hyun closed pull request #35143:
URL: https://github.com/apache/spark/pull/35143


   





[GitHub] [spark] dongjoon-hyun commented on pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`

2022-01-08 Thread GitBox


dongjoon-hyun commented on pull request #35143:
URL: https://github.com/apache/spark/pull/35143#issuecomment-1008071876


   Thank you, @srowen . All tests passed, although the GitHub Actions status 
here seems to show only the first run.
   
   https://user-images.githubusercontent.com/9700541/148653678-9ae6594c-4f53-4cb7-9006-b64bf6088d11.png
   
   Merged to master.





[GitHub] [spark] srowen commented on pull request #35134: [SPARK-37842][SQL][MINIOR] Use `multi-catch` to simplify duplicate exception handling behavior in Java code

2022-01-08 Thread GitBox


srowen commented on pull request #35134:
URL: https://github.com/apache/spark/pull/35134#issuecomment-1008048012


   These are good changes, but these are changes to copied Hive code. We can 
'fix' it, but we might introduce tiny differences and merge conflicts if we 
update the Hive code. 
   
   Are there any instances in the Spark code (not in org.apache.hive)? Those 
would be fine to change.





[GitHub] [spark] srowen closed pull request #35144: Branch 2.4

2022-01-08 Thread GitBox


srowen closed pull request #35144:
URL: https://github.com/apache/spark/pull/35144


   





[GitHub] [spark] Franc-Z opened a new pull request #35144: Branch 2.4

2022-01-08 Thread GitBox


Franc-Z opened a new pull request #35144:
URL: https://github.com/apache/spark/pull/35144


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   Hi, could anyone tell me whether Spark will have a 2.5.0 version in the 
future?





[GitHub] [spark] AngersZhuuuu commented on pull request #34326: [SPARK-37053][CORE] Add metrics to SparkHistoryServer

2022-01-08 Thread GitBox


AngersZhuuuu commented on pull request #34326:
URL: https://github.com/apache/spark/pull/34326#issuecomment-1007968668


   > at a high level seems fine, I need to take a more detailed look at a few 
parts. It also needs docs.
   
   Doc added.





[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34326: [SPARK-37053][CORE] Add metrics to SparkHistoryServer

2022-01-08 Thread GitBox


AngersZhuuuu commented on a change in pull request #34326:
URL: https://github.com/apache/spark/pull/34326#discussion_r780655043



##
File path: core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
##
@@ -157,6 +161,15 @@ class HistoryServer(
 contextHandler.setContextPath(HistoryServer.UI_PATH_PREFIX)
 contextHandler.addServlet(new ServletHolder(loaderServlet), "/*")
 attachHandler(contextHandler)
+
+// Register history server source to history server metrics system.
+cacheMetrics.init()

Review comment:
   > why are we calling init here? Doesn't this end up getting called twice?
   
   Only called once here; before this point, the method is never called. I am 
also confused about this.

##
File path: core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
##
@@ -157,6 +161,15 @@ class HistoryServer(
 contextHandler.setContextPath(HistoryServer.UI_PATH_PREFIX)
 contextHandler.addServlet(new ServletHolder(loaderServlet), "/*")
 attachHandler(contextHandler)
+
+// Register history server source to history server metrics system.
+cacheMetrics.init()

Review comment:
   > why are we calling init here? Doesn't this end up getting called twice?
   
   Only called once here. Before this point, the method is never called. I am 
also confused about this.







[GitHub] [spark] AngersZhuuuu commented on pull request #33934: [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too

2022-01-08 Thread GitBox


AngersZhuuuu commented on pull request #33934:
URL: https://github.com/apache/spark/pull/33934#issuecomment-1007946564


   > @holdenk who may have dealt with the python side before and have thoughts
   
   Thanks for your ping. It's a really useful feature, and my PR has already 
been deployed to our prod, making it convenient for users to check the failure 
reason.
   
   I hope to get some suggestions so that this feature can move forward.





[GitHub] [spark] cloud-fan commented on a change in pull request #35130: [SPARK-37839][SQL] DS V2 supports partial aggregate push-down AVG

2022-01-08 Thread GitBox


cloud-fan commented on a change in pull request #35130:
URL: https://github.com/apache/spark/pull/35130#discussion_r780294560



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
##
@@ -212,7 +224,10 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper {
 
   private def supportPartialAggPushDown(agg: Aggregation): Boolean = {
     // We don't know the agg buffer of `GeneralAggregateFunc`, so can't do partial agg push down.
-    agg.aggregateExpressions().forall(!_.isInstanceOf[GeneralAggregateFunc])
+    agg.aggregateExpressions().forall { aggregateFunc =>
+      !aggregateFunc.isInstanceOf[GeneralAggregateFunc] ||
+        aggregateFunc.asInstanceOf[GeneralAggregateFunc].name() == "AVG"

Review comment:
   If `AVG` can be partially pushed, I'd like to create a dedicated class 
for it, instead of using `GeneralAggregateFunc`.

##
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
##
@@ -223,7 +225,11 @@ abstract class JdbcDialect extends Serializable with Logging{
   case f: GeneralAggregateFunc if f.name() == "AVG" =>
     assert(f.inputs().length == 1)
     val distinct = if (f.isDistinct) "DISTINCT " else ""
-    Some(s"AVG($distinct${f.inputs().head})")
+    if (supportCompletePushDown) {
+      Some(s"AVG($distinct${f.inputs().head})")
+    } else {
+      Some(s"Sum($distinct${f.inputs().head}), Count($distinct${f.inputs().head})")

Review comment:
   Why do we need this? It's Spark's responsibility to rewrite AVG to sum/count, 
not the data source's.
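   For illustration only (plain values rather than Catalyst expressions, and 
`PartialAvg` is a made-up name, not Spark's implementation): the source would 
return partial SUM/COUNT buffers, and Spark would merge them and finish the 
average.
   
   ```scala
   // Partial AVG state as (sum, count); merging and finishing happen in Spark.
   final case class PartialAvg(sum: Double, count: Long) {
     def merge(other: PartialAvg): PartialAvg =
       PartialAvg(sum + other.sum, count + other.count)
     def finish: Option[Double] = if (count == 0) None else Some(sum / count)
   }
   
   // e.g. two partitions' pushed-down partial results:
   // PartialAvg(10.0, 4).merge(PartialAvg(2.0, 1)).finish == Some(2.4)
   ```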







[GitHub] [spark] dongjoon-hyun closed pull request #35127: [SPARK-37837][INFRA] Enable black formatter in dev Python scripts

2022-01-08 Thread GitBox


dongjoon-hyun closed pull request #35127:
URL: https://github.com/apache/spark/pull/35127


   





[GitHub] [spark] beliefer commented on pull request #35130: [SPARK-37839][SQL] DS V2 supports partial aggregate push-down AVG

2022-01-08 Thread GitBox


beliefer commented on pull request #35130:
URL: https://github.com/apache/spark/pull/35130#issuecomment-1007849173


   > What's the new algorithm? It's simple before: if agg funcs contain 
`GeneralAggregateFunc`, don't try partial pushdown. Otherwise, try complete 
pushdown first, then partial pushdown.
   
   Yes. But AVG is a commonly used aggregate function, and we hope to do our best 
to push it down. Maybe we could construct `Average` in DS V2, so we keep it 
simple.





[GitHub] [spark] hiboyang commented on pull request #34672: [SPARK-37394][CORE] Skip registering with ESS if a customized shuffle manager is configured

2022-01-08 Thread GitBox


hiboyang commented on pull request #34672:
URL: https://github.com/apache/spark/pull/34672#issuecomment-1007611924


   @tgravescs I am the author of Uber RSS. In terms of your question "I'm not 
familiar with the details of Uber RSS, can it just fake up a remote shuffle 
port and ignore what is sent? have you requested their version to support 
dynamic allocation?"... Uber RSS can support dynamic allocation on Kubernetes 
with some workarounds, e.g. enabling the shuffle tracking feature and setting 
the shuffle timeout to zero, but it is a hack. The code change here will make 
that hack unnecessary. (A sketch of that workaround follows below.)
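   For reference, a hedged sketch of that workaround using the standard dynamic 
allocation shuffle-tracking settings (verify these config keys against your 
Spark version's docs before relying on them):
   
   ```scala
   import org.apache.spark.SparkConf
   
   // Dynamic allocation without relying on an external shuffle service:
   // track shuffle state on executors and let executors holding shuffle data
   // be reclaimed immediately (timeout = 0).
   val conf = new SparkConf()
     .set("spark.dynamicAllocation.enabled", "true")
     .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
     .set("spark.dynamicAllocation.shuffleTracking.timeout", "0")
   ```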
   
   In terms of adding a config like "spark.shuffle.registration.enabled", any 
concern?
   
   





[GitHub] [spark] huaxingao closed pull request #35125: [SPARK-37802][SQL][3.2] Composite field name should work with Aggregate push down

2022-01-08 Thread GitBox


huaxingao closed pull request #35125:
URL: https://github.com/apache/spark/pull/35125


   





[GitHub] [spark] cloud-fan commented on a change in pull request #35132: [SPARK-37841][SQL] BasicWriteTaskStatsTracker should not try get status for a skipped file

2022-01-08 Thread GitBox


cloud-fan commented on a change in pull request #35132:
URL: https://github.com/apache/spark/pull/35132#discussion_r780458052



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
##
@@ -142,8 +142,13 @@ class BasicWriteTaskStatsTracker(
     numSubmittedFiles += 1
   }
 
-  override def closeFile(filePath: String): Unit = {
-    updateFileStats(filePath)
+  override def closeFile(filePath: String, isPathCreated: Boolean): Unit = {
+    if (isPathCreated) {
+      updateFileStats(filePath)
+    } else {
+      logDebug(s"$filePath is not pre-touched by writer, skipping update file stats")

Review comment:
   ```suggestion
 logDebug(s"$filePath is not created due to no data, skip updating file 
stats")
   ```

##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
##
@@ -305,6 +305,12 @@ private[orc] class OrcOutputWriter(
   recordWriter.close(Reporter.NULL)
 }
   }
+
+  /**
+   * If `recordWriterInstantiated` is false, the output file is not pretouched.

Review comment:
   Alternatively, can we fix the Hive ORC data source to always write the 
file? This seems wrong to me. We should at least write one file even if the 
input query is empty, to record the output schema. 







[GitHub] [spark] dongjoon-hyun commented on pull request #35143: [SPARK-37844][CORE][TESTS] Remove slf4j-log4j12 dependency from hadoop-minikdc

2022-01-08 Thread GitBox


dongjoon-hyun commented on pull request #35143:
URL: https://github.com/apache/spark/pull/35143#issuecomment-1007943026


   It seems that the GitHub Action notification is not forwarded to the user 
GitHub Action.





[GitHub] [spark] AngersZhuuuu commented on pull request #34982: [SPARK-37712][YARN] Spark request yarn cluster metrics slow cause delay

2022-01-08 Thread GitBox


AngersZhuuuu commented on pull request #34982:
URL: https://github.com/apache/spark/pull/34982#issuecomment-1007945556


   > I guess I'm fine with this, I kind of assume this is because your yarn 
resource manager is overloaded or unresponsive though again?
   
   Yea, YARN's performance is not so stable. With this patch, performance becomes better.





[GitHub] [spark] dongjoon-hyun removed a comment on pull request #35143: [SPARK-37844][CORE][TESTS] Remove `slf4j-log4j12` transitive test dependency from `hadoop-minikdc`

2022-01-08 Thread GitBox


dongjoon-hyun removed a comment on pull request #35143:
URL: https://github.com/apache/spark/pull/35143#issuecomment-1007943026


   It seems that the GitHub Actions notification is not forwarded to the user's GitHub Actions.





[GitHub] [spark] dongjoon-hyun closed pull request #35143: [SPARK-37844][CORE][TESTS] Remove slf4j-log4j12 dependency from hadoop-minikdc

2022-01-08 Thread GitBox


dongjoon-hyun closed pull request #35143:
URL: https://github.com/apache/spark/pull/35143


   





[GitHub] [spark] dongjoon-hyun commented on pull request #35141: [SPARK-37843][CORE] Suppress NoSuchFieldError at setMDCForTask

2022-01-08 Thread GitBox


dongjoon-hyun commented on pull request #35141:
URL: https://github.com/apache/spark/pull/35141#issuecomment-1007837947









[GitHub] [spark] cloud-fan commented on pull request #35138: [SPARK-35703][SQL][FOLLOWUP] Only eliminate shuffles if partition keys contain all the join keys

2022-01-08 Thread GitBox


cloud-fan commented on pull request #35138:
URL: https://github.com/apache/spark/pull/35138#issuecomment-1007635990


   cc @sunchao @c21





[GitHub] [spark] github-actions[bot] closed pull request #31267: [SPARK-21195][CORE] MetricSystem should pick up dynamically registered metrics in sources

2022-01-08 Thread GitBox


github-actions[bot] closed pull request #31267:
URL: https://github.com/apache/spark/pull/31267


   





[GitHub] [spark] wzhfy commented on pull request #35137: [TEST] Fix test logic by getting sleep interval in task

2022-01-08 Thread GitBox


wzhfy commented on pull request #35137:
URL: https://github.com/apache/spark/pull/35137#issuecomment-1007875631









[GitHub] [spark] tgravescs edited a comment on pull request #34672: [SPARK-37394][CORE] Skip registering with ESS if a customized shuffle manager is configured

2022-01-08 Thread GitBox


tgravescs edited a comment on pull request #34672:
URL: https://github.com/apache/spark/pull/34672#issuecomment-1007641251


   Yes, my concern is that it's a config for something that isn't public. It doesn't make sense to me to have a public config for a non-public API, and by itself, without this 3rd-party lib, the config would not apply.
   
   I don't think my question was answered (can't you just open a port on one of your remote shuffle services and ignore the messages?), and it sounds like it does work now with some hack (is that hack just setup and configs?). If so, then I would much rather prefer a real solution like the one already mentioned above, because you can make it work now.





[GitHub] [spark] HyukjinKwon commented on pull request #35127: [SPARK-37837][INFRA] Enable black formatter in dev Python scripts

2022-01-08 Thread GitBox


HyukjinKwon commented on pull request #35127:
URL: https://github.com/apache/spark/pull/35127#issuecomment-1007171463


   cc @zero323 @dongjoon-hyun FYI.





[GitHub] [spark] HyukjinKwon commented on a change in pull request #35129: [SC-37835][SQL] Fix the comments on SQLQueryTestSuite.scala to more explicit

2022-01-08 Thread GitBox


HyukjinKwon commented on a change in pull request #35129:
URL: https://github.com/apache/spark/pull/35129#discussion_r780066447



##
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
##
@@ -44,22 +44,23 @@ import org.apache.spark.util.Utils
  *
  * To run the entire test suite:
  * {{{
- *   build/sbt "sql/testOnly *SQLQueryTestSuite"
+ *   build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
  * }}}
  *
  * To run a single test file upon change:
  * {{{
- *   build/sbt "~sql/testOnly *SQLQueryTestSuite -- -z inline-table.sql"
+ *   build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z 
inline-table.sql"

Review comment:
   This one shouldn't be updated because there's `sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala` to verify the results. We should only be explicit when we generate golden files. Can we document this since we're here?

##
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
##
@@ -44,22 +44,23 @@ import org.apache.spark.util.Utils
  *
  * To run the entire test suite:
  * {{{
- *   build/sbt "sql/testOnly *SQLQueryTestSuite"
+ *   build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
  * }}}
  *
  * To run a single test file upon change:
  * {{{
- *   build/sbt "~sql/testOnly *SQLQueryTestSuite -- -z inline-table.sql"
+ *   build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z 
inline-table.sql"
  * }}}
  *
  * To re-generate golden files for entire suite, run:
  * {{{
- *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite"
+ *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
org.apache.spark.sql.SQLQueryTestSuite"
  * }}}
  *
  * To re-generate golden file for a single test, run:
  * {{{
- *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite 
-- -z describe.sql"
+ *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt

Review comment:
   Can we add line breaks so we can run this via copying and pasting?
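   
   For reference, a copy-pastable multi-line form might look like this (hypothetical formatting, not the committed text):
   
   ```
   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt \
     "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z describe.sql"
   ```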

##
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
##
@@ -36,6 +36,7 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.tags.ExtendedSQLTest
 import org.apache.spark.util.Utils
 
+// scalastyle:off line.size.limit

Review comment:
   can you turn this style back on at the end of this Scaladoc?







[GitHub] [spark] yoda-mon commented on pull request #34896: [SPARK-37568][SQL] Support 2-arguments by the convert_timezone() function

2022-01-08 Thread GitBox


yoda-mon commented on pull request #34896:
URL: https://github.com/apache/spark/pull/34896#issuecomment-1007202017


   @MaxGekk 
   Sorry for the delayed response, I tried to follow your advice. 
   However at
   
   ```
 def this(targetTz: Expression, sourceTs: Expression) =
   this(Literal("UTC"), targetTz, sourceTs)
   ```
   
   `sourceTz` is not ignored. After the constructor I could use the `timeZoneId` copied in by the mix-in as `sourceTz`, but in that case it's difficult to distinguish `TimeStampLTZ` and `TimeStampNTZ`.
   
   Please let me know if there is something I missed about using the mix-in.





[GitHub] [spark] pan3793 edited a comment on pull request #35076: [SPARK-37793][CORE][SHUFFLE] Fallback to fetch original blocks when noLocalMergedBlockDataError

2022-01-08 Thread GitBox


pan3793 edited a comment on pull request #35076:
URL: https://github.com/apache/spark/pull/35076#issuecomment-1007354842









[GitHub] [spark] tgravescs commented on pull request #33934: [SPARK-36691][PYTHON] PythonRunner failed should pass error message to ApplicationMaster too

2022-01-08 Thread GitBox


tgravescs commented on pull request #33934:
URL: https://github.com/apache/spark/pull/33934#issuecomment-1007523767


   @holdenk who may have dealt with the python side before and have thoughts





[GitHub] [spark] gaborgsomogyi commented on a change in pull request #23348: [SPARK-25857][core] Add developer documentation regarding delegation tokens.

2022-01-08 Thread GitBox


gaborgsomogyi commented on a change in pull request #23348:
URL: https://github.com/apache/spark/pull/23348#discussion_r780256850



##
File path: core/src/main/scala/org/apache/spark/deploy/security/README.md
##
@@ -0,0 +1,249 @@
+# Delegation Token Handling In Spark
+
+This document aims to explain and demystify delegation tokens as they are used 
by Spark, since
+this topic is generally a huge source of confusion.
+
+
+## What are delegation tokens and why use them?
+
+Delegation tokens (DTs from now on) are authentication tokens used by some 
services to replace
+Kerberos service tokens. Many services in the Hadoop ecosystem have support 
for DTs, since they
+have some very desirable advantages over Kerberos tokens:
+
+* No need to distribute Kerberos credentials
+
+In a distributed application, distributing Kerberos credentials is tricky. Not 
all users have
+keytabs, and when they do, it's generally frowned upon to distribute them over 
the network as
+part of application data.
+
+DTs allow for a single place (e.g. the Spark driver) to require Kerberos 
credentials. That entity
+can then distribute the DTs to other parts of the distributed application 
(e.g. Spark executors),
+so they can authenticate to services.
+
+* A single token per service is used for authentication
+
+If Kerberos authentication were used, each client connection to a server would 
require a trip
+to the KDC and generation of a service ticket. In a distributed system, the 
number of service
+tickets can balloon pretty quickly when you think about the number of client 
processes (e.g. Spark
+executors) vs. the number of service processes (e.g. HDFS DataNodes). That 
generates unnecessary
+extra load on the KDC, and may even run into usage limits set up by the KDC 
admin.
+
+* DTs are only used for authentication
+
+DTs, unlike TGTs, can only be used to authenticate to the specific service for 
which they were
+issued. You cannot use an existing DT to create new DTs or to create DTs for a 
different service.
+
+So in short, DTs are *not* Kerberos tokens. They are used by many services to 
replace Kerberos
+authentication, or even other forms of authentication, although there is 
nothing (aside from
+maybe implementation details) that ties them to Kerberos or any other 
authentication mechanism.
+
+
+## Lifecycle of DTs
+
+DTs, unlike Kerberos tokens, are service-specific. There is no centralized 
location you contact
+to create a DT for a service. So, the first step needed to get a DT is being 
able to authenticate
+to the service in question. In the Hadoop ecosystem, that is generally done 
using Kerberos.
+
+This requires Kerberos credentials to be available somewhere for the 
application to use. The user
+is generally responsible for providing those credentials, which is most 
commonly done by logging
+in to the KDC (e.g. using "kinit"). That generates a (Kerberos) "token cache" 
containing a TGT
+(ticket granting ticket), which can then be used to request service tickets.
+
+There are other ways of obtaining TGTs, but, ultimately, you need a TGT to 
bootstrap the process.
+
+Once a TGT is available, the target service's client library can then be used 
to authenticate
+to the service using the Kerberos credentials, and request the creation of a 
delegation token.
+This token can now be sent to other processes and used to authenticate to 
different daemons
+belonging to that service.
+
+And thus the first drawback of DTs becomes apparent: you need service-specific 
logic to create and
+use them.
+
+Spark implements a (somewhat) pluggable, internal DT creation API. Support for 
new services can be
+added by implementing a `HadoopDelegationTokenProvider` that is then called by 
Spark when generating
+delegation tokens for an application. Spark makes the DTs available to code by 
stashing them in the
+`UserGroupInformation` credentials, and it's up to the DT provider and the 
respective client library
+to agree on how to use those tokens.
+
+Once they are created, the semantics of how DTs operate are also 
service-specific. But, in general,
+they try to follow the semantics of Kerberos tokens:
+
+* A "renewable period (equivalent to TGT's "lifetime") which is for how long 
the DT is valid
+  before it requires renewal.
+* A "max lifetime" (equivalent to TGT's "renewable life") which is for how 
long the DT can be
+  renewed.
+
+Once the token reaches its "max lifetime", a new one needs to be created by 
contacting the
+appropriate service, restarting the above process.
+
+
+## DT Renewal, Renewers, and YARN
+
+This is the most confusing part of DT handling, and part of it is because much 
of the system was
+designed with MapReduce, and later YARN, in mind.
+
+As seen above, DTs need to be renewed periodically until they finally expire 
for good. An example of
+this is the default configuration of HDFS services: delegation tokens are 
valid for up to 7 days,
+and need to be renewed every 24 hours. If 24 hours 
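
As a rough illustration of the pluggable provider hook this README excerpt describes, here is a hedged sketch of a provider for an imaginary "foo" service; the trait name matches the README, but the package path and exact signatures are recalled from memory and should be treated as approximate:

   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.security.Credentials
   import org.apache.spark.SparkConf
   import org.apache.spark.security.HadoopDelegationTokenProvider
   
   // Spark calls obtainDelegationTokens when generating tokens for an
   // application and stashes the result in the UGI credentials.
   class FooDelegationTokenProvider extends HadoopDelegationTokenProvider {
   
     override def serviceName: String = "foo"
   
     override def delegationTokensRequired(
         sparkConf: SparkConf,
         hadoopConf: Configuration): Boolean = {
       // e.g. only when the (hypothetical) service is actually configured
       sparkConf.contains("spark.foo.uri")
     }
   
     override def obtainDelegationTokens(
         hadoopConf: Configuration,
         sparkConf: SparkConf,
         creds: Credentials): Option[Long] = {
       // Authenticate to the service with the available Kerberos credentials,
       // request a DT, add it to `creds`, and return the next renewal time
       // in milliseconds if the token is renewable.
       None
     }
   }
   ```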

[GitHub] [spark] LuciferYang commented on a change in pull request #34471: [SPARK-36879][SQL] Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path

2022-01-08 Thread GitBox


LuciferYang commented on a change in pull request #34471:
URL: https://github.com/apache/spark/pull/34471#discussion_r780024037



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
##
@@ -76,6 +79,7 @@ object DataSourceReadBenchmark extends SqlBasedBenchmark {
 saveAsCsvTable(testDf, dir.getCanonicalPath + "/csv")
 saveAsJsonTable(testDf, dir.getCanonicalPath + "/json")
 saveAsParquetTable(testDf, dir.getCanonicalPath + "/parquet")
+saveAsParquetV2Table(testDf, dir.getCanonicalPath + "/parquetV2")

Review comment:
   I found that there are still unsupported encodings in Data Page V2, such as RLE for Boolean. It seems it is not yet time to update the benchmark; please ignore my previous comments.
   
   







[GitHub] [spark] github-actions[bot] closed pull request #34108: [SPARK-36638][SQL][TEST] Generalize OptimizeSkewedJoin - correctness

2022-01-08 Thread GitBox


github-actions[bot] closed pull request #34108:
URL: https://github.com/apache/spark/pull/34108


   





[GitHub] [spark] dongjoon-hyun commented on pull request #35121: [SPARK-37833][INFRA] Add `precondition` job to skip the main GitHub Action jobs

2022-01-08 Thread GitBox


dongjoon-hyun commented on pull request #35121:
URL: https://github.com/apache/spark/pull/35121#issuecomment-1007238145









[GitHub] [spark] github-actions[bot] closed pull request #32289: [SPARK-33357][K8S] Support Spark application managing with SparkAppHandle on Kubernetes

2022-01-08 Thread GitBox


github-actions[bot] closed pull request #32289:
URL: https://github.com/apache/spark/pull/32289


   





[GitHub] [spark] HyukjinKwon closed pull request #35133: [SPARK-37833][INFRA][FOLLOW-UP] Run checking modules of precondition only in forked repository

2022-01-08 Thread GitBox


HyukjinKwon closed pull request #35133:
URL: https://github.com/apache/spark/pull/35133


   





[GitHub] [spark] zero323 closed pull request #34674: [SPARK-37419][PYTHON][ML] Rewrite _shared_params_code_gen.py to inline type hints for ml/param/shared.py

2022-01-08 Thread GitBox


zero323 closed pull request #34674:
URL: https://github.com/apache/spark/pull/34674


   





[GitHub] [spark] cloud-fan commented on a change in pull request #35131: [SPARK-37827][SQL] Put the some built-in table properties into V1Table.propertie to adapt to V2 command

2022-01-08 Thread GitBox


cloud-fan commented on a change in pull request #35131:
URL: https://github.com/apache/spark/pull/35131#discussion_r780285249



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
##
@@ -384,9 +384,12 @@ case class CatalogTable(
   def toLinkedHashMap: mutable.LinkedHashMap[String, String] = {
 val map = new mutable.LinkedHashMap[String, String]()
 val tableProperties = properties
-  .filterKeys(!_.startsWith(VIEW_PREFIX))
+  .filter(!_._1.startsWith(VIEW_PREFIX))
+  .filter(kv => !CatalogV2Util.TABLE_RESERVED_PROPERTIES.contains(kv._1))
+  .filter(!_._1.startsWith(TableCatalog.OPTION_PREFIX))
+  .filter(_._1 != TableCatalog.PROP_EXTERNAL)

Review comment:
   hmm, isn't `PROP_EXTERNAL` included in `TABLE_RESERVED_PROPERTIES`?

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
##
@@ -384,9 +384,12 @@ case class CatalogTable(
   def toLinkedHashMap: mutable.LinkedHashMap[String, String] = {
 val map = new mutable.LinkedHashMap[String, String]()
 val tableProperties = properties
-  .filterKeys(!_.startsWith(VIEW_PREFIX))
+  .filter(!_._1.startsWith(VIEW_PREFIX))
+  .filter(kv => !CatalogV2Util.TABLE_RESERVED_PROPERTIES.contains(kv._1))
+  .filter(!_._1.startsWith(TableCatalog.OPTION_PREFIX))
+  .filter(_._1 != TableCatalog.PROP_EXTERNAL)
   .toSeq.sortBy(_._1)
-  .map(p => p._1 + "=" + p._2).mkString("[", ", ", "]")

Review comment:
   why this change?

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/V1Table.scala
##
@@ -55,7 +56,8 @@ private[sql] case class V1Table(v1Table: CatalogTable) 
extends Table {
 }
   }
 
-  override lazy val properties: util.Map[String, String] = 
v1Table.properties.asJava
+  override lazy val properties: util.Map[String, String] =
+V1Table.filterTableProperties(v1Table).asJava

Review comment:
   I think we should generate these v2 table properties here, instead of 
when creating the table. It's a breaking change to generate v2 properties when 
creating tables, as it changes what to store in the metastore, which may have 
unknown consequences.
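   
   A rough sketch of that direction, as a standalone helper with hypothetical names: compute the v2-facing properties from the v1 metadata when the table is read, instead of persisting them at CREATE TABLE time.
   
   ```scala
   import java.util
   import scala.collection.JavaConverters._
   
   def v2Properties(
       storedProps: Map[String, String],
       location: Option[String],
       isExternal: Boolean): util.Map[String, String] = {
     // Synthesized on access; nothing extra is written to the metastore.
     val synthesized =
       location.map("location" -> _).toMap ++
         (if (isExternal) Map("external" -> "true") else Map.empty[String, String])
     (storedProps ++ synthesized).asJava
   }
   ```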







[GitHub] [spark] github-actions[bot] commented on pull request #33174: [SPARK-35721][PYTHON] Path level discover for python unittests

2022-01-08 Thread GitBox


github-actions[bot] commented on pull request #33174:
URL: https://github.com/apache/spark/pull/33174#issuecomment-1007834106


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





[GitHub] [spark] cloud-fan commented on pull request #35135: [SPARK-35442][SQL] Support propagate empty relation through aggregate

2022-01-08 Thread GitBox


cloud-fan commented on pull request #35135:
URL: https://github.com/apache/spark/pull/35135#issuecomment-1007644925


   can you also address 
https://github.com/apache/spark/pull/32602#discussion_r780377367 ?





[GitHub] [spark] tgravescs commented on a change in pull request #34326: [SPARK-37053][CORE] Add metrics to SparkHistoryServer

2022-01-08 Thread GitBox


tgravescs commented on a change in pull request #34326:
URL: https://github.com/apache/spark/pull/34326#discussion_r780369492



##
File path: 
core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
##
@@ -157,6 +161,15 @@ class HistoryServer(
 contextHandler.setContextPath(HistoryServer.UI_PATH_PREFIX)
 contextHandler.addServlet(new ServletHolder(loaderServlet), "/*")
 attachHandler(contextHandler)
+
+// Register history server source to history server metrics system.
+cacheMetrics.init()

Review comment:
   why are we calling init here?  Doesn't this end up getting called twice?







[GitHub] [spark] zero323 commented on pull request #34674: [SPARK-37419][PYTHON][ML] Rewrite _shared_params_code_gen.py to inline type hints for ml/param/shared.py

2022-01-08 Thread GitBox


zero323 commented on pull request #34674:
URL: https://github.com/apache/spark/pull/34674#issuecomment-1007510055


   Merged into master.





[GitHub] [spark] Peng-Lei commented on a change in pull request #35131: [SPARK-37827][SQL] Put the some built-in table properties into V1Table.propertie to adapt to V2 command

2022-01-08 Thread GitBox


Peng-Lei commented on a change in pull request #35131:
URL: https://github.com/apache/spark/pull/35131#discussion_r780390548



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/V1Table.scala
##
@@ -55,7 +56,8 @@ private[sql] case class V1Table(v1Table: CatalogTable) 
extends Table {
 }
   }
 
-  override lazy val properties: util.Map[String, String] = 
v1Table.properties.asJava
+  override lazy val properties: util.Map[String, String] =
+V1Table.filterTableProperties(v1Table).asJava

Review comment:
   Thank you very much. I will try to do it. In fact, precisely because these v2 table properties were stored into the metastore when creating the table, I spent a lot of time looking for failing test cases and trying to fix them. It is indeed a breaking change to generate v2 properties when creating tables and store them. Thank you very much for the reminder. @cloud-fan 







[GitHub] [spark] imback82 commented on a change in pull request #35113: [SPARK-37636][SQL] Migrate CREATE NAMESPACE to use V2 command by default

2022-01-08 Thread GitBox


imback82 commented on a change in pull request #35113:
URL: https://github.com/apache/spark/pull/35113#discussion_r780069534



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/TestsV1AndV2Commands.scala
##
@@ -28,14 +28,21 @@ import org.apache.spark.sql.internal.SQLConf
 trait TestsV1AndV2Commands extends DDLCommandTestUtils {
   private var _version: String = ""
   override def commandVersion: String = _version
+  def runningV1Command: Boolean = commandVersion == "V1"
 
   // Tests using V1 catalogs will run with `spark.sql.legacy.useV1Command` on 
and off
   // to test both V1 and V2 commands.
   override def test(testName: String, testTags: Tag*)(testFun: => Any)
 (implicit pos: Position): Unit = {
 Seq(true, false).foreach { useV1Command =>
-  _version = if (useV1Command) "V1" else "V2"
+  def setCommandVersion(): Unit = {
+_version = if (useV1Command) "V1" else "V2"
+  }
+  setCommandVersion()
   super.test(testName, testTags: _*) {
+// Need to set command version inside this test function so that
+// the correct command version is available in each test.
+setCommandVersion()

Review comment:
   This `def test` doesn't run `testFun` when it's invoked; it only registers the test. So by the time `testFun` is actually run, `_version` will always be set to "V2" (`useV1Command == false`). So we need to capture this inside the lambda that's passed into `super.test`.
   
   Also, note that we need to call `setCommandVersion()` before calling `super.test`, because `super.test` uses `commandVersion` to set the right test name:
   
https://github.com/apache/spark/blob/527e842ee6ed3bb1aabba61d02337ab81a56f87b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLCommandTestUtils.scala#L56-L57
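   
   The capture problem can be reproduced in isolation; a minimal sketch in plain Scala (no ScalaTest) of registration-time vs. run-time state:
   
   ```scala
   var version = ""
   val registered = scala.collection.mutable.Buffer.empty[() => Unit]
   
   Seq(true, false).foreach { useV1Command =>
     version = if (useV1Command) "V1" else "V2"
     val captured = version // capture at registration time
     registered += (() => println(s"shared=$version captured=$captured"))
   }
   
   // The bodies run after the loop, like registered tests do:
   registered.foreach(_.apply())
   // prints: shared=V2 captured=V1
   //         shared=V2 captured=V2
   ```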

##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##
@@ -189,8 +189,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
 
   override def createDatabase(
   dbDefinition: CatalogDatabase,
-  ignoreIfExists: Boolean): Unit = withClient {
-client.createDatabase(dbDefinition, ignoreIfExists)
+  ignoreIfExists: Boolean): Unit = {
+try {
+  withClient {
+client.createDatabase(dbDefinition, ignoreIfExists)
+  }
+} catch {
+  case e: AnalysisException if e.message.contains("already exists") =>
+throw new DatabaseAlreadyExistsException(dbDefinition.name)

Review comment:
   @cloud-fan I intercepted the exception here. looks a bit hacky?

##
File path: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
##
@@ -189,8 +189,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, 
hadoopConf: Configurat
 
   override def createDatabase(
   dbDefinition: CatalogDatabase,
-  ignoreIfExists: Boolean): Unit = withClient {
-client.createDatabase(dbDefinition, ignoreIfExists)
+  ignoreIfExists: Boolean): Unit = {
+try {
+  withClient {
+client.createDatabase(dbDefinition, ignoreIfExists)
+  }
+} catch {
+  case e: AnalysisException if e.message.contains("already exists") =>
+throw new DatabaseAlreadyExistsException(dbDefinition.name)

Review comment:
   Two concerns if we move the logic to `withClient`:
   1. How can we guarantee 
`org.apache.hadoop.hive.metastore.api.AlreadyExistsException` is thrown only by 
`createDatabase` but not for other calls like `createTable`?
   2. Since db name is not available, we need to parse out the db name from the 
message.
   
   Are you OK with these?
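   
   For illustration, a hedged sketch of what mapping it inside `withClient` could look like, including the fragile message parsing that concern 2 refers to (the regex and fallback are hypothetical):
   
   ```scala
   private def withClient[T](body: => T): T = try {
     body
   } catch {
     case e: org.apache.hadoop.hive.metastore.api.AlreadyExistsException =>
       // The db name is not carried on the exception, so it has to be
       // parsed out of the message text.
       val db = "(?i)database (\\S+) already exists".r
         .findFirstMatchIn(e.getMessage)
         .map(_.group(1))
         .getOrElse("<unknown>")
       throw new DatabaseAlreadyExistsException(db)
   }
   ```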


[GitHub] [spark] github-actions[bot] closed pull request #34083: Add docs about using Shiv for packaging (similar to PEX)

2022-01-08 Thread GitBox


github-actions[bot] closed pull request #34083:
URL: https://github.com/apache/spark/pull/34083


   





[GitHub] [spark] HyukjinKwon commented on pull request #35133: [SPARK-37833][INFRA][FOLLOW-UP] Run checking modules of precondition only in forked repository

2022-01-08 Thread GitBox


HyukjinKwon commented on pull request #35133:
URL: https://github.com/apache/spark/pull/35133#issuecomment-1007284852









[GitHub] [spark] zero323 commented on pull request #35126: [SPARK-37836][PYTHON][INFRA] Enable F841, E722, E305 and E226 for PEP 8 compliance

2022-01-08 Thread GitBox


zero323 commented on pull request #35126:
URL: https://github.com/apache/spark/pull/35126#issuecomment-1007350779


   E305 and E226 should be covered by black, if it is enabled for a particular path, but there is no harm in adding them.
   
   Personally, I have mixed feelings about E722 ‒ using `BaseException` or `Exception` doesn't really seem to add anything. But I guess there is no harm in that either :)





[GitHub] [spark] dongjoon-hyun commented on pull request #35127: [SPARK-37837][INFRA] Enable black formatter in dev Python scripts

2022-01-08 Thread GitBox


dongjoon-hyun commented on pull request #35127:
URL: https://github.com/apache/spark/pull/35127#issuecomment-1007179964









[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35121: [SPARK-37833][INFRA] Add `precondition` jobs to skip the main GitHub Action jobs

2022-01-08 Thread GitBox


dongjoon-hyun commented on a change in pull request #35121:
URL: https://github.com/apache/spark/pull/35121#discussion_r780047564



##
File path: .github/workflows/build_and_test.yml
##
@@ -96,15 +96,39 @@ jobs:
   echo '::set-output name=hadoop::hadoop3'
 fi
 
+  build-precondition:
+name: Check a code change
+runs-on: ubuntu-20.04
+outputs:
+  required: ${{ steps.set-outputs.outputs.required }}
+steps:
+- name: Checkout Spark repository
+  uses: actions/checkout@v2
+  with:
+fetch-depth: 0
+repository: apache/spark
+ref: master
+- name: Sync the current branch with the latest in Apache Spark
+  if: github.repository != 'apache/spark'
+  run: |
+echo "APACHE_SPARK_REF=$(git rev-parse HEAD)" >> $GITHUB_ENV
+git fetch https://github.com/$GITHUB_REPOSITORY.git 
${GITHUB_REF#refs/heads/}
+git -c user.name='Apache Spark Test Account' -c 
user.email='sparktest...@gmail.com' merge --no-commit --progress --squash 
FETCH_HEAD
+git -c user.name='Apache Spark Test Account' -c 
user.email='sparktest...@gmail.com' commit -m "Merged commit"
+- name: Check all modules except 'docs'
+  id: set-outputs
+  run: |
+echo "::set-output name=required::$(./dev/is-changed.py -m 
avro,build,catalyst,core,docker-integration-tests,examples,graphx,hadoop-cloud,hive,hive-thriftserver,kubernetes,kvstore,launcher,mesos,mllib,mllib-local,network-common,network-shuffle,pyspark-core,pyspark-ml,pyspark-mllib,pyspark-pandas,pyspark-pandas-slow,pyspark-resource,pyspark-sql,pyspark-streaming,repl,sketch,spark-ganglia-lgpl,sparkr,sql,sql-kafka-0-10,streaming,streaming-kafka-0-10,streaming-kinesis-asl,tags,unsafe,yarn)"

Review comment:
   Yes, it's really a missing part.
   
   BTW, `is-changed.py` is not designed for multi-module support. Each precondition has its own `-m` option here and the values are not the same. So, this is the best and most flexible way for now.







[GitHub] [spark] AmplabJenkins commented on pull request #35124: [WIP][SPARK-37398][PYTHON] Inline type hints for python/pyspark/ml/classification.py

2022-01-08 Thread GitBox


AmplabJenkins commented on pull request #35124:
URL: https://github.com/apache/spark/pull/35124#issuecomment-1007838795


   Can one of the admins verify this patch?





[GitHub] [spark] aa1371 commented on pull request #35083: [WIP][SPARK-37798] PySpark Pandas API: Cross and conditional merging

2022-01-08 Thread GitBox


aa1371 commented on pull request #35083:
URL: https://github.com/apache/spark/pull/35083#issuecomment-1007846355


   @HyukjinKwon - thoughts on the PR in its current state?
   
   Also, is it possible to see the specific formatting issues caught by the black formatter in the CI? It seems to use a different black configuration than when I run dev/format-python locally.





[GitHub] [spark] cloud-fan commented on pull request #35130: [SPARK-37839][SQL] DS V2 supports partial aggregate push-down AVG

2022-01-08 Thread GitBox


cloud-fan commented on pull request #35130:
URL: https://github.com/apache/spark/pull/35130#issuecomment-1007449447


   What's the new algorithm? It was simple before: if the agg funcs contain `GeneralAggregateFunc`, don't try partial pushdown. Otherwise, try complete pushdown first, then partial pushdown.
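   
   Restated as a sketch (illustrative names, not the PR's new logic):
   
   ```scala
   if (aggFuncs.exists(_.isInstanceOf[GeneralAggregateFunc])) {
     // a general aggregate (e.g. AVG) is present -> skip partial pushdown
   } else if (scan.supportCompletePushDown(aggregation)) {
     // the source evaluates the whole aggregate; Spark just reads the result
   } else {
     // push partial aggregation down; Spark merges the partial results
   }
   ```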





[GitHub] [spark] sathiyapk commented on a change in pull request #34729: [SPARK-37475][SQL] Add scale parameter to floor and ceil functions

2022-01-08 Thread GitBox


sathiyapk commented on a change in pull request #34729:
URL: https://github.com/apache/spark/pull/34729#discussion_r780514098



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala
##
@@ -658,6 +669,62 @@ class MathExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 val intResultsB: Seq[Int] = Seq(31400, 31420, 31416, 
314159000, 314159300,
   314159260) ++ Seq.fill(7)(314159265)
 
+def doubleResultsFloor(i: Int): Any = {
+  val results = Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3,
+3.1, 3.14, 3.141, 3.1415, 3.14159, 3.141592)
+  if (i <= 6) results(i).toLong else results(i)
+}
+
+def doubleResultsCeil(i: Int): Any = {
+  val results = Seq(100.0, 10.0, 1.0, 1000.0, 100.0, 10.0,
+4L, 3.2, 3.15, 3.142, 3.1416, 3.1416, 3.141593)
+  if (i <= 6) results(i).toLong else results(i)
+}
+
+def floatResultsFloor(i: Int): Any = {
+  val results = Seq(0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 3L,
+3.1d, 3.14d, 3.141d, 3.1414d, 3.14149d, 3.141499d)
+  if (i <= 6) results(i).toLong else results(i)
+}
+
+def floatResultsCeil(i: Int): Any = {
+  val results = Seq(100.0f, 10.0f, 1.0f, 1000.0f, 100.0f, 
10.0f,
+4L, 3.2d, 3.15d, 3.142d, 3.1415d, 3.1415d, 3.1415d)
+  if (i <= 6) results(i).toLong else results(i)
+}
+
+def shortResultsFloor(i: Int): Any = {
+  val results = Seq[Long](0L, 0L, 3L, 31000L,
+31400L, 31410L, 31415L) ++ Seq.fill[Long](7)(31415)
+  results(i)
+}
+
+def shortResultsCeil(i: Int): Any = {

Review comment:
   Sorry, I am not sure I got it right.
   
   Actually, we test `RoundCeil` and `RoundFloor` directly for a positive `scale` on big decimals; I can add long and double in that case, if you prefer. But we can't test `RoundCeil` and `RoundFloor` directly for a negative `scale`, because the casting to `Long` for a negative scale happens one step before.
   
   Maybe if I put the casting in the `RoundCeil` and `RoundFloor` objects and let the `ExpressionBuilder` call these objects like it was before, we would be able to test `RoundCeil` and `RoundFloor` directly for both negative and positive `scale`. What do you say?
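   
   For what it's worth, a direct positive-scale check might look roughly like this, assuming `RoundCeil(child, scale)` stays directly constructible (hypothetical test lines):
   
   ```scala
   checkEvaluation(RoundCeil(Literal(BigDecimal("3.14159")), Literal(2)), BigDecimal("3.15"))
   checkEvaluation(RoundFloor(Literal(BigDecimal("3.14159")), Literal(2)), BigDecimal("3.14"))
   ```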





[GitHub] [spark] tgravescs commented on pull request #34982: [SPARK-37712][YARN] Spark request yarn cluster metrics slow cause delay

2022-01-08 Thread GitBox


tgravescs commented on pull request #34982:
URL: https://github.com/apache/spark/pull/34982#issuecomment-1007504670


   I guess I'm fine with this, I kind of assume this is because your yarn 
resource manager is overloaded or unresponsive though again?  





[GitHub] [spark] tgravescs commented on a change in pull request #23348: [SPARK-25857][core] Add developer documentation regarding delegation tokens.

2022-01-08 Thread GitBox


tgravescs commented on a change in pull request #23348:
URL: https://github.com/apache/spark/pull/23348#discussion_r780320642



##
File path: core/src/main/scala/org/apache/spark/deploy/security/README.md
##
@@ -0,0 +1,249 @@
+# Delegation Token Handling In Spark
+
+This document aims to explain and demystify delegation tokens as they are used 
by Spark, since
+this topic is generally a huge source of confusion.
+
+
+## What are delegation tokens and why use them?
+
+Delegation tokens (DTs from now on) are authentication tokens used by some 
services to replace
+Kerberos service tokens. Many services in the Hadoop ecosystem have support 
for DTs, since they
+have some very desirable advantages over Kerberos tokens:
+
+* No need to distribute Kerberos credentials
+
+In a distributed application, distributing Kerberos credentials is tricky. Not 
all users have
+keytabs, and when they do, it's generally frowned upon to distribute them over 
the network as
+part of application data.
+
+DTs allow for a single place (e.g. the Spark driver) to require Kerberos 
credentials. That entity
+can then distribute the DTs to other parts of the distributed application 
(e.g. Spark executors),
+so they can authenticate to services.
+
+* A single token per service is used for authentication
+
+If Kerberos authentication were used, each client connection to a server would 
require a trip
+to the KDC and generation of a service ticket. In a distributed system, the 
number of service
+tickets can balloon pretty quickly when you think about the number of client 
processes (e.g. Spark
+executors) vs. the number of service processes (e.g. HDFS DataNodes). That 
generates unnecessary
+extra load on the KDC, and may even run into usage limits set up by the KDC 
admin.
+
+* DTs are only used for authentication
+
+DTs, unlike TGTs, can only be used to authenticate to the specific service for 
which they were
+issued. You cannot use an existing DT to create new DTs or to create DTs for a 
different service.
+
+So in short, DTs are *not* Kerberos tokens. They are used by many services to 
replace Kerberos
+authentication, or even other forms of authentication, although there is 
nothing (aside from
+maybe implementation details) that ties them to Kerberos or any other 
authentication mechanism.
+
+
+## Lifecycle of DTs
+
+DTs, unlike Kerberos tokens, are service-specific. There is no centralized 
location you contact
+to create a DT for a service. So, the first step needed to get a DT is being 
able to authenticate
+to the service in question. In the Hadoop ecosystem, that is generally done 
using Kerberos.
+
+This requires Kerberos credentials to be available somewhere for the 
application to use. The user
+is generally responsible for providing those credentials, which is most 
commonly done by logging
+in to the KDC (e.g. using "kinit"). That generates a (Kerberos) "token cache" 
containing a TGT
+(ticket granting ticket), which can then be used to request service tickets.
+
+There are other ways of obtaining TGTs, but, ultimately, you need a TGT to 
bootstrap the process.
+
+Once a TGT is available, the target service's client library can then be used 
to authenticate
+to the service using the Kerberos credentials, and request the creation of a 
delegation token.
+This token can now be sent to other processes and used to authenticate to 
different daemons
+belonging to that service.
+
+And thus the first drawback of DTs becomes apparent: you need service-specific 
logic to create and
+use them.
+
+Spark implements a (somewhat) pluggable, internal DT creation API. Support for 
new services can be
+added by implementing a `HadoopDelegationTokenProvider` that is then called by 
Spark when generating
+delegation tokens for an application. Spark makes the DTs available to code by 
stashing them in the
+`UserGroupInformation` credentials, and it's up to the DT provider and the 
respective client library
+to agree on how to use those tokens.
+
+Once they are created, the semantics of how DTs operate are also 
service-specific. But, in general,
+they try to follow the semantics of Kerberos tokens:
+
+* A "renewable period (equivalent to TGT's "lifetime") which is for how long 
the DT is valid
+  before it requires renewal.
+* A "max lifetime" (equivalent to TGT's "renewable life") which is for how 
long the DT can be
+  renewed.
+
+Once the token reaches its "max lifetime", a new one needs to be created by 
contacting the
+appropriate service, restarting the above process.
+
+
+## DT Renewal, Renewers, and YARN
+
+This is the most confusing part of DT handling, and part of it is because much 
of the system was
+designed with MapReduce, and later YARN, in mind.
+
+As seen above, DTs need to be renewed periodically until they finally expire 
for good. An example of
+this is the default configuration of HDFS services: delegation tokens are 
valid for up to 7 days,
and need to be renewed every 24 hours. If 24 hours pass without the token 
being renewed, the token cannot be used anymore.
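
   To make the provider API described above concrete, here is a minimal 
sketch. It follows the shape of the public 
`org.apache.spark.security.HadoopDelegationTokenProvider` trait from later 
Spark releases; the trait was internal when this document was written, so the 
package, the signatures, and the renewer principal below are assumptions 
rather than a drop-in implementation:

   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.FileSystem
   import org.apache.hadoop.security.{Credentials, UserGroupInformation}
   import org.apache.spark.SparkConf
   import org.apache.spark.security.HadoopDelegationTokenProvider

   // Sketch of a provider for an HDFS-like service. Spark invokes it while
   // running as the Kerberos-authenticated user; tokens added to `creds` are
   // then shipped to the rest of the application (e.g. the executors).
   class MyServiceDelegationTokenProvider extends HadoopDelegationTokenProvider {

     override def serviceName: String = "my-service"

     override def delegationTokensRequired(
         sparkConf: SparkConf,
         hadoopConf: Configuration): Boolean = {
       // Typically tokens are only needed when Kerberos is actually enabled.
       UserGroupInformation.isSecurityEnabled
     }

     override def obtainDelegationTokens(
         hadoopConf: Configuration,
         sparkConf: SparkConf,
         creds: Credentials): Option[Long] = {
       // Ask the service's client library for a DT; for HDFS this is a single
       // FileSystem call. The renewer principal here is hypothetical.
       val fs = FileSystem.get(hadoopConf)
       fs.addDelegationTokens("yarn/_HOST@EXAMPLE.COM", creds)
       // Optionally return the next renewal deadline in epoch millis.
       None
     }
   }
   ```

   In recent releases such providers are discovered through 
`java.util.ServiceLoader`, and it is up to the provider and the service's 
client library to agree on how the stashed tokens are used.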

[GitHub] [spark] wzhfy removed a comment on pull request #35137: [TEST] Fix test logic by getting sleep interval in task

2022-01-08 Thread GitBox


wzhfy removed a comment on pull request #35137:
URL: https://github.com/apache/spark/pull/35137#issuecomment-1007875631


   cc @Ngone51 @agrawaldevesh


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on pull request #34439: [SPARK-37095][PYTHON] Inline type hints for files in python/pyspark/broadcast.py

2022-01-08 Thread GitBox


zero323 commented on pull request #34439:
URL: https://github.com/apache/spark/pull/34439#issuecomment-1007619850


   Could you please resolve the conflicts, @dchvn?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #35126: [SPARK-37836][PYTHON][INFRA] Enable F841, E722, E305 and E226 for PEP 8 compliance

2022-01-08 Thread GitBox


HyukjinKwon closed pull request #35126:
URL: https://github.com/apache/spark/pull/35126


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #32602: [SPARK-35455][SQL] Unify empty relation optimization between normal and AQE optimizer

2022-01-08 Thread GitBox


cloud-fan commented on a change in pull request #32602:
URL: https://github.com/apache/spark/pull/32602#discussion_r780377367



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
##
@@ -137,3 +125,55 @@ object PropagateEmptyRelation extends Rule[LogicalPlan] 
with PredicateHelper wit
 }
   }
 }
+
+/**
+ * This rule runs in the normal optimizer and optimizes more cases
+ * compared to [[PropagateEmptyRelationBase]]:
+ * 1. Higher-node Logical Plans
+ *- Union with all empty children.
+ * 2. Unary-node Logical Plans
+ *- Project/Filter/Sample with all empty children.
+ *
+ * The reason why we don't apply this rule at AQE optimizer side is: the 
benefit is not big enough
+ * and it may introduce extra exchanges.

Review comment:
   After more thought, I think this is a big performance issue if we can't 
propagate empty relations through project/filter, which are quite common. The 
risk of introducing new shuffles is relatively small compared to this.
   
   @ulysses-you, can we move all the logic to `PropagateEmptyRelationBase`? 
`PropagateEmptyRelation` should not have any extra logic.
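
   To make the trade-off concrete, here is a minimal sketch of the kind of 
propagation being discussed; it is a simplification, not the actual rule, 
which guards many more cases (joins, unions, aggregates, etc.):

   ```scala
   import org.apache.spark.sql.catalyst.plans.logical.{Filter, LocalRelation, LogicalPlan, Project}
   import org.apache.spark.sql.catalyst.rules.Rule

   // Simplified sketch: a Project/Filter over a provably empty child can be
   // folded into an empty LocalRelation that keeps the node's output schema.
   object PropagateEmptyThroughUnarySketch extends Rule[LogicalPlan] {
     private def isEmpty(plan: LogicalPlan): Boolean = plan match {
       case l: LocalRelation => l.data.isEmpty
       case _ => false
     }

     override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
       case p @ (_: Project | _: Filter) if p.children.forall(isEmpty) =>
         LocalRelation(p.output, data = Seq.empty, isStreaming = p.isStreaming)
     }
   }
   ```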




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on pull request #35132: [SPARK-37841][SQL] BasicWriteTaskStatsTracker should not try get status for a skipped file

2022-01-08 Thread GitBox


yaooqinn commented on pull request #35132:
URL: https://github.com/apache/spark/pull/35132#issuecomment-1007269119


   cc @dongjoon-hyun @cloud-fan thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #34141: [SPARK-33887][SQL] Allow insert overwrite same table with static partition if using dynamic partition overwrite mode

2022-01-08 Thread GitBox


github-actions[bot] commented on pull request #34141:
URL: https://github.com/apache/spark/pull/34141#issuecomment-1007834075


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #33878: [SPARK-36303][SQL] Refactor fourthteenth set of 20 query execution errors to use error classes

2022-01-08 Thread GitBox


github-actions[bot] closed pull request #33878:
URL: https://github.com/apache/spark/pull/33878


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun edited a comment on pull request #35121: [SPARK-37833][INFRA] Add `precondition` job to skip the main GitHub Action jobs

2022-01-08 Thread GitBox


dongjoon-hyun edited a comment on pull request #35121:
URL: https://github.com/apache/spark/pull/35121#issuecomment-1007238145


   Thank you. I addressed your comments. Could you review this once more, 
@HyukjinKwon?
   - The `precondition` job runs all precondition checks and builds the `json` file.
   - Each job `needs` only the `precondition` job's result.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on pull request #34933: [SPARK-37674][SQL] Reduce the output partition of output stage to avoid producing small files.

2022-01-08 Thread GitBox


ulysses-you commented on pull request #34933:
URL: https://github.com/apache/spark/pull/34933#issuecomment-1007224052


   I see the requirement, but there are some potential issues if we only use a 
new config for the write's final stage.
   
   - if the final stage is heavy, making the partition size big can cause a 
regression, e.g. when the final stage is a join or even a multi-join
   - the input shuffle size is not equal to the output size; if the plan of 
the final stage changes the data size, this config is less meaningful
   - not all queries contain a shuffle, and when there is none the semantics 
of this config break since the config is never used
   - just increasing the partition size is not enough for dynamic partition 
writing; we should cluster rows with the same partition value together in as 
few partitions as possible
   - this config should also affect the rebalance
   
   I think it's a good idea to add a `RebalancePartitions` node for all write 
commands, as @wangyum is working on in SPARK-31264. Then we can consider 
adding a dedicated partition size config for the shuffle introduced by 
`RebalancePartitions` (a rough sketch follows below).
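
   For context, this can be approximated today with AQE plus the `REBALANCE` 
hint. A hedged sketch, assuming a live `SparkSession` named `spark` and 
hypothetical `source`/`target` tables (the size value is only illustrative):

   ```scala
   // Illustrative only: rebalance on the dynamic partition column before the
   // final write, so AQE can coalesce (or split) the output-stage shuffle
   // partitions toward the advisory size.
   spark.conf.set("spark.sql.adaptive.enabled", "true")
   spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256m")
   spark.sql(
     """INSERT OVERWRITE TABLE target PARTITION (dt)
       |SELECT /*+ REBALANCE(dt) */ * FROM source""".stripMargin)
   ```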


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #35126: [SPARK-37836][PYTHON][INFRA] Enable F841, E722, E305 and E226 for PEP 8 compliance

2022-01-08 Thread GitBox


HyukjinKwon commented on a change in pull request #35126:
URL: https://github.com/apache/spark/pull/35126#discussion_r780048734



##
File path: python/pyspark/tests/test_readwrite.py
##
@@ -231,9 +231,10 @@ def test_newhadoop(self):
 )
 self.assertEqual(result, data)
 
-fmt = "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat"

Review comment:
   this is technically a followup of 
https://github.com/apache/spark/pull/35126, but I just decided to piggy-back 
here since I'm already touching arbitrary code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on pull request #35124: [WIP][SPARK-37398][PYTHON] Inline type hints for python/pyspark/ml/classification.py

2022-01-08 Thread GitBox


zero323 commented on pull request #35124:
URL: https://github.com/apache/spark/pull/35124#issuecomment-1007376991


   Thanks @javierivanov. Let's revisit this after all prerequisites are done 
(core for starters, but there are also internal dependencies that have to be 
followed).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


