[GitHub] [spark] venkata91 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


venkata91 commented on a change in pull request #33034:
URL: https://github.com/apache/spark/pull/33034#discussion_r677140773



##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -128,49 +151,100 @@ protected AppShuffleInfo validateAndGetAppShuffleInfo(String appId) {
   }
 
   /**
-   * Given the appShuffleInfo, shuffleId and reduceId that uniquely identifies a given shuffle
-   * partition of an application, retrieves the associated metadata. If not present and the
-   * corresponding merged shuffle does not exist, initializes the metadata.
+   * Given the appShuffleInfo, shuffleId, shuffleMergeId and reduceId that uniquely identifies
+   * a given shuffle partition of an application, retrieves the associated metadata. If not
+   * present and the corresponding merged shuffle does not exist, initializes the metadata.
    */
   private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo(
       AppShuffleInfo appShuffleInfo,
       int shuffleId,
-      int reduceId) {
-    File dataFile = appShuffleInfo.getMergedShuffleDataFile(shuffleId, reduceId);
-    ConcurrentMap<Integer, Map<Integer, AppShufflePartitionInfo>> partitions =
+      int shuffleMergeId,
+      int reduceId) throws RuntimeException {
+    ConcurrentMap<Integer, Map<Integer, Map<Integer, AppShufflePartitionInfo>>> partitions =
       appShuffleInfo.partitions;
-    Map<Integer, AppShufflePartitionInfo> shufflePartitions =
-      partitions.compute(shuffleId, (id, map) -> {
-        if (map == null) {
-          // If this partition is already finalized then the partitions map will not contain the
-          // shuffleId but the data file would exist. In that case the block is considered late.
-          if (dataFile.exists()) {
-            return null;
-          }
-          return new ConcurrentHashMap<>();
+    AtomicReference<Map<Integer, Map<Integer, AppShufflePartitionInfo>>> shuffleMergePartitionsRef
+      = new AtomicReference<>(null);
+    partitions.compute(shuffleId, (id, shuffleMergePartitionsMap) -> {
+      if (shuffleMergePartitionsMap == null) {
+        logger.info("Creating a new attempt for shuffle blocks push request for"
+          + " shuffle {} with shuffleMergeId {} for application {}_{}", shuffleId,
+          shuffleMergeId, appShuffleInfo.appId, appShuffleInfo.attemptId);
+        Map<Integer, Map<Integer, AppShufflePartitionInfo>> newShuffleMergePartitions
+          = new ConcurrentHashMap<>();
+        Map<Integer, AppShufflePartitionInfo> newPartitionsMap = new ConcurrentHashMap<>();
+        newShuffleMergePartitions.put(shuffleMergeId, newPartitionsMap);
+        shuffleMergePartitionsRef.set(newShuffleMergePartitions);
+        return newShuffleMergePartitions;
+      } else if (shuffleMergePartitionsMap.containsKey(shuffleMergeId)) {
+        shuffleMergePartitionsRef.set(shuffleMergePartitionsMap);
+        return shuffleMergePartitionsMap;
+      } else {
+        int latestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+          .mapToInt(x -> x).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+        int secondLatestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+          .mapToInt(x -> x).filter(x -> x != latestShuffleMergeId)
+          .max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+        if (latestShuffleMergeId > shuffleMergeId) {
+          // Reject the request as we have already seen a higher shuffleMergeId than the
+          // current incoming one
+          throw new RuntimeException(String.format("Rejecting shuffle blocks push request for"
+            + " shuffle %s with shuffleMergeId %s for application %s_%s as a higher"
+            + " shuffleMergeId %s request is already seen", shuffleId, shuffleMergeId,
+            appShuffleInfo.appId, appShuffleInfo.attemptId, latestShuffleMergeId));
         } else {
-          return map;
+          // Higher shuffleMergeId seen for the shuffle ID meaning new stage attempt is being
+          // run for the shuffle ID. Close and clean up old shuffleMergeId files,
+          // happens in the non-deterministic stage retries
+          logger.info("Creating a new attempt for shuffle blocks push request for shuffle {} with"
+            + " shuffleMergeId {} for application {}_{} since it is higher than the latest"
+            + " shuffleMergeId {} already seen", shuffleId, shuffleMergeId, appShuffleInfo.appId,
+            appShuffleInfo.attemptId, latestShuffleMergeId);
+          if (latestShuffleMergeId != UNDEFINED_SHUFFLE_MERGE_ID &&
+              shuffleMergePartitionsMap.containsKey(latestShuffleMergeId)) {
+            Map<Integer, AppShufflePartitionInfo> latestShufflePartitions =
+              shuffleMergePartitionsMap.get(latestShuffleMergeId);
+            mergedShuffleCleaner.execute(() ->
+              closeAndDeletePartitionFiles(latestShufflePartitions));
+            shuffleMergePartitionsMap.put(latestShuffleMergeId, STALE_SHUFFLE_PARTITIONS);
+          }
+          // Remove older shuffleMergeIds which won't be required anymore
+          if (secondLatestShuffleMergeId != UNDEFINED_SHUFFLE_MERGE_ID) {
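The accept/reject decision in the hunk above can be isolated: an incoming shuffleMergeId is accepted if it matches a tracked attempt or is newer than everything seen so far, and rejected as late if a higher id has already been observed. The sketch below is a simplified, hypothetical model of just that comparison (the names `decide`, `Decision`, and the plain `Map` values are illustrative, not the actual Spark classes; the real resolver keeps the superseded id around as a stale sentinel rather than removing it):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ShuffleMergeIdPolicy {
  static final int UNDEFINED_SHUFFLE_MERGE_ID = Integer.MIN_VALUE;

  enum Decision { ACCEPT_EXISTING, ACCEPT_NEW_ATTEMPT, REJECT_LATE }

  // Given the shuffleMergeIds already tracked for one shuffleId, classify an
  // incoming push request's shuffleMergeId.
  static Decision decide(Map<Integer, Object> shuffleMergePartitionsMap, int shuffleMergeId) {
    if (shuffleMergePartitionsMap.containsKey(shuffleMergeId)) {
      return Decision.ACCEPT_EXISTING;
    }
    // Highest shuffleMergeId seen so far for this shuffleId.
    int latest = shuffleMergePartitionsMap.keySet().stream()
      .mapToInt(x -> x).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
    // A lower incoming id belongs to an older, superseded stage attempt.
    return latest > shuffleMergeId ? Decision.REJECT_LATE : Decision.ACCEPT_NEW_ATTEMPT;
  }

  public static void main(String[] args) {
    Map<Integer, Object> tracked = new ConcurrentHashMap<>();
    tracked.put(0, new Object());
    // Same attempt: blocks keep merging into the existing partitions.
    System.out.println(decide(tracked, 0));  // ACCEPT_EXISTING
    // Indeterminate stage retry produced a higher shuffleMergeId: new attempt wins.
    System.out.println(decide(tracked, 1));  // ACCEPT_NEW_ATTEMPT
    // Simplification: drop the old attempt once the retry is registered.
    tracked.remove(0);
    tracked.put(1, new Object());
    // A straggler push from the old attempt arrives after the retry: rejected.
    System.out.println(decide(tracked, 0));  // REJECT_LATE
  }
}
```

This is why the PR can throw for late pushes instead of silently merging them: once a retry's id is tracked, any smaller id is provably from a superseded attempt.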

[GitHub] [spark] cloud-fan commented on pull request #33468: [SPARK-36247][SQL] Check string length for char/varchar and apply type coercion in UPDATE/MERGE command

2021-07-26 Thread GitBox


cloud-fan commented on pull request #33468:
URL: https://github.com/apache/spark/pull/33468#issuecomment-887232141


   thanks for the review, merging to master/3.2!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #33468: [SPARK-36247][SQL] Check string length for char/varchar and apply type coercion in UPDATE/MERGE command

2021-07-26 Thread GitBox


cloud-fan closed pull request #33468:
URL: https://github.com/apache/spark/pull/33468


   





[GitHub] [spark] Ngone51 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


Ngone51 commented on a change in pull request #33034:
URL: https://github.com/apache/spark/pull/33034#discussion_r677134315



##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -78,6 +80,27 @@
   public static final String MERGE_DIR_KEY = "mergeDir";
   public static final String ATTEMPT_ID_KEY = "attemptId";
   private static final int UNDEFINED_ATTEMPT_ID = -1;
+  private static final int UNDEFINED_SHUFFLE_MERGE_ID = Integer.MIN_VALUE;
+
+  // ConcurrentHashMap doesn't allow null for keys or values which is why this is required.

Review comment:
   The comment is outdated.







[GitHub] [spark] SparkQA commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source

2021-07-26 Thread GitBox


SparkQA commented on pull request #33413:
URL: https://github.com/apache/spark/pull/33413#issuecomment-887225073


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46196/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #33413:
URL: https://github.com/apache/spark/pull/33413#issuecomment-887225094


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46196/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #33034:
URL: https://github.com/apache/spark/pull/33034#issuecomment-887224594


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46197/
   





[GitHub] [spark] SparkQA commented on pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


SparkQA commented on pull request #33034:
URL: https://github.com/apache/spark/pull/33034#issuecomment-887224576


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46197/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #33034:
URL: https://github.com/apache/spark/pull/33034#issuecomment-887224594


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46197/
   





[GitHub] [spark] SparkQA commented on pull request #33533: Revert "[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package

2021-07-26 Thread GitBox


SparkQA commented on pull request #33533:
URL: https://github.com/apache/spark/pull/33533#issuecomment-887222147


   **[Test build #141686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141686/testReport)** for PR 33533 at commit [`a733179`](https://github.com/apache/spark/commit/a73317946a5fe08edfdd516f6b9a8f7e0ad4f079).





[GitHub] [spark] viirya commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package

2021-07-26 Thread GitBox


viirya commented on pull request #33350:
URL: https://github.com/apache/spark/pull/33350#issuecomment-887221527


   Created #33533 to revert it first. @cloud-fan @sunchao @dongjoon-hyun 





[GitHub] [spark] viirya opened a new pull request #33533: Revert "[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package

2021-07-26 Thread GitBox


viirya opened a new pull request #33533:
URL: https://github.com/apache/spark/pull/33533


   This reverts commit 634f96dde40639df5a2ef246884bedbd48b3dc69.
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887220339


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141679/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887220339


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141679/
   





[GitHub] [spark] SparkQA removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


SparkQA removed a comment on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887180729


   **[Test build #141679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141679/testReport)** for PR 31517 at commit [`7b360a7`](https://github.com/apache/spark/commit/7b360a7d577ea379db3487d451b0c7a744d1dc02).





[GitHub] [spark] SparkQA commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


SparkQA commented on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887219770


   **[Test build #141679 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141679/testReport)** for PR 31517 at commit [`7b360a7`](https://github.com/apache/spark/commit/7b360a7d577ea379db3487d451b0c7a744d1dc02).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887218994


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46193/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887218994


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46193/
   





[GitHub] [spark] SparkQA commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


SparkQA commented on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887218975


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46193/
   





[GitHub] [spark] SparkQA commented on pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark/SparkR/Docker GHA job

2021-07-26 Thread GitBox


SparkQA commented on pull request #33532:
URL: https://github.com/apache/spark/pull/33532#issuecomment-887218117


   **[Test build #141684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141684/testReport)** for PR 33532 at commit [`d83429e`](https://github.com/apache/spark/commit/d83429e2721f3c1b01131edf34bf839cc364eb67).





[GitHub] [spark] SparkQA commented on pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field

2021-07-26 Thread GitBox


SparkQA commented on pull request #33531:
URL: https://github.com/apache/spark/pull/33531#issuecomment-887218140


   **[Test build #141685 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141685/testReport)** for PR 33531 at commit [`eed72e1`](https://github.com/apache/spark/commit/eed72e1160a5d3a061e9c977230149453cee56d4).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887217818


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141683/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #33413:
URL: https://github.com/apache/spark/pull/33413#issuecomment-887217814


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141681/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887217812


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141675/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #33441:
URL: https://github.com/apache/spark/pull/33441#issuecomment-887217813


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46194/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887217818


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141683/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887217812


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141675/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #33413:
URL: https://github.com/apache/spark/pull/33413#issuecomment-887217814


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141681/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #33441:
URL: https://github.com/apache/spark/pull/33441#issuecomment-887217813


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46194/
   





[GitHub] [spark] Ngone51 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


Ngone51 commented on a change in pull request #33034:
URL: https://github.com/apache/spark/pull/33034#discussion_r677125870



##
File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -128,49 +151,100 @@ protected AppShuffleInfo validateAndGetAppShuffleInfo(String appId) {
   }
 
   /**
-   * Given the appShuffleInfo, shuffleId and reduceId that uniquely identifies a given shuffle
-   * partition of an application, retrieves the associated metadata. If not present and the
-   * corresponding merged shuffle does not exist, initializes the metadata.
+   * Given the appShuffleInfo, shuffleId, shuffleMergeId and reduceId that uniquely identifies
+   * a given shuffle partition of an application, retrieves the associated metadata. If not
+   * present and the corresponding merged shuffle does not exist, initializes the metadata.
    */
   private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo(
       AppShuffleInfo appShuffleInfo,
       int shuffleId,
-      int reduceId) {
-    File dataFile = appShuffleInfo.getMergedShuffleDataFile(shuffleId, reduceId);
-    ConcurrentMap> partitions =
+      int shuffleMergeId,
+      int reduceId) throws RuntimeException {
+    ConcurrentMap>> partitions =
       appShuffleInfo.partitions;
-    Map shufflePartitions =
-      partitions.compute(shuffleId, (id, map) -> {
-        if (map == null) {
-          // If this partition is already finalized then the partitions map will not contain the
-          // shuffleId but the data file would exist. In that case the block is considered late.
-          if (dataFile.exists()) {
-            return null;
-          }
-          return new ConcurrentHashMap<>();
+    AtomicReference>> shuffleMergePartitionsRef
+      = new AtomicReference<>(null);
+    partitions.compute(shuffleId, (id, shuffleMergePartitionsMap) -> {
+      if (shuffleMergePartitionsMap == null) {
+        logger.info("Creating a new attempt for shuffle blocks push request for"
+          + " shuffle {} with shuffleMergeId {} for application {}_{}", shuffleId,
+          shuffleMergeId, appShuffleInfo.appId, appShuffleInfo.attemptId);
+        Map> newShuffleMergePartitions
+          = new ConcurrentHashMap<>();
+        Map newPartitionsMap = new ConcurrentHashMap<>();
+        newShuffleMergePartitions.put(shuffleMergeId, newPartitionsMap);
+        shuffleMergePartitionsRef.set(newShuffleMergePartitions);
+        return newShuffleMergePartitions;
+      } else if (shuffleMergePartitionsMap.containsKey(shuffleMergeId)) {
+        shuffleMergePartitionsRef.set(shuffleMergePartitionsMap);
+        return shuffleMergePartitionsMap;
+      } else {
+        int latestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+          .mapToInt(x -> x).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+        int secondLatestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+          .mapToInt(x -> x).filter(x -> x != latestShuffleMergeId)
+          .max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+        if (latestShuffleMergeId > shuffleMergeId) {
+          // Reject the request as we have already seen a higher shuffleMergeId than the
+          // current incoming one
+          throw new RuntimeException(String.format("Rejecting shuffle blocks push request for"
+            + " shuffle %s with shuffleMergeId %s for application %s_%s as a higher"
+            + " shuffleMergeId %s request is already seen", shuffleId, shuffleMergeId,
+            appShuffleInfo.appId, appShuffleInfo.attemptId, latestShuffleMergeId));
         } else {
-          return map;
+          // Higher shuffleMergeId seen for the shuffle ID meaning new stage attempt is being
+          // run for the shuffle ID. Close and clean up old shuffleMergeId files,
+          // happens in the non-deterministic stage retries
+          logger.info("Creating a new attempt for shuffle blocks push request for shuffle {} with"
+            + " shuffleMergeId {} for application {}_{} since it is higher than the latest"
+            + " shuffleMergeId {} already seen", shuffleId, shuffleMergeId, appShuffleInfo.appId,
+            appShuffleInfo.attemptId, latestShuffleMergeId);
+          if (latestShuffleMergeId != UNDEFINED_SHUFFLE_MERGE_ID &&
+              shuffleMergePartitionsMap.containsKey(latestShuffleMergeId)) {
+            Map latestShufflePartitions =
+              shuffleMergePartitionsMap.get(latestShuffleMergeId);
+            mergedShuffleCleaner.execute(() ->
+              closeAndDeletePartitionFiles(latestShufflePartitions));
+            shuffleMergePartitionsMap.put(latestShuffleMergeId, STALE_SHUFFLE_PARTITIONS);
+          }
+          // Remove older shuffleMergeIds which won't be required anymore
+          if (secondLatestShuffleMergeId != UNDEFINED_SHUFFLE_MERGE_ID) {
+
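The shuffleMergeId bookkeeping in the hunk above reduces to picking the latest and second-latest merge ids from the set of known ids. A minimal, self-contained sketch of just that selection follows; the class and method names (`MergeIdDemo`, `pickLatest`) are illustrative and not part of the PR:

```java
import java.util.Set;

public class MergeIdDemo {
    // Sentinel used when no merge id is known yet, mirroring the PR's constant.
    static final int UNDEFINED_SHUFFLE_MERGE_ID = -1;

    // Returns {latestMergeId, secondLatestMergeId} for the known merge ids.
    static int[] pickLatest(Set<Integer> mergeIds) {
        int latest = mergeIds.stream().mapToInt(x -> x).max()
            .orElse(UNDEFINED_SHUFFLE_MERGE_ID);
        // Second-latest: maximum over everything except the latest id.
        int secondLatest = mergeIds.stream().mapToInt(x -> x)
            .filter(x -> x != latest).max()
            .orElse(UNDEFINED_SHUFFLE_MERGE_ID);
        return new int[]{latest, secondLatest};
    }

    public static void main(String[] args) {
        int[] picked = pickLatest(Set.of(1, 3));
        System.out.println(picked[0] + " " + picked[1]); // prints: 3 1
    }
}
```

In the PR's logic, an incoming push with a merge id below `latest` is rejected, while one above it starts a new attempt and schedules cleanup of the previous attempt's partition files.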

[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

2021-07-26 Thread GitBox


SparkQA removed a comment on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887202286


   **[Test build #141683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141683/testReport)** for PR 32332 at commit [`8ddf243`](https://github.com/apache/spark/commit/8ddf24309623be15ae6e4a5a9ea52dcfe8069792).





[GitHub] [spark] SparkQA commented on pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too

2021-07-26 Thread GitBox


SparkQA commented on pull request #33441:
URL: https://github.com/apache/spark/pull/33441#issuecomment-887214851


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46194/





[GitHub] [spark] SparkQA commented on pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field

2021-07-26 Thread GitBox


SparkQA commented on pull request #33531:
URL: https://github.com/apache/spark/pull/33531#issuecomment-887212972


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46195/





[GitHub] [spark] viirya commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package

2021-07-26 Thread GitBox


viirya commented on pull request #33350:
URL: https://github.com/apache/spark/pull/33350#issuecomment-887212407


   Is it flaky, or did other merged PRs change the result?





[GitHub] [spark] SparkQA commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source

2021-07-26 Thread GitBox


SparkQA commented on pull request #33413:
URL: https://github.com/apache/spark/pull/33413#issuecomment-887211886


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46196/





[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

2021-07-26 Thread GitBox


SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887211807


   **[Test build #141683 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141683/testReport)** for PR 32332 at commit [`8ddf243`](https://github.com/apache/spark/commit/8ddf24309623be15ae6e4a5a9ea52dcfe8069792).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] [spark] viirya commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package

2021-07-26 Thread GitBox


viirya commented on pull request #33350:
URL: https://github.com/apache/spark/pull/33350#issuecomment-887211632


   We can revert it if the investigation takes some time.





[GitHub] [spark] SparkQA commented on pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


SparkQA commented on pull request #33034:
URL: https://github.com/apache/spark/pull/33034#issuecomment-887211478


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46197/





[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field

2021-07-26 Thread GitBox


AngersZh commented on a change in pull request #33531:
URL: https://github.com/apache/spark/pull/33531#discussion_r677120085



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
##
@@ -3028,6 +3028,24 @@ class HiveDDLSuite
     }
   }
 
+  test("SPARK-36312: ParquetWriteSupport should check inner field") {
+    withView("v") {
+      spark.range(1).createTempView("v")
+      withTempPath { path =>
+        val e = intercept[AnalysisException] {
+          spark.sql(
+            """
+              | SELECT

Review comment:
   Done







[GitHub] [spark] dongjoon-hyun commented on a change in pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark GHA job

2021-07-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #33532:
URL: https://github.com/apache/spark/pull/33532#discussion_r677119887



##
File path: .github/workflows/build_and_test.yml
##
@@ -206,6 +206,7 @@ jobs:
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
   SKIP_UNIDOC: true
+  SKIP_MIMA: true

Review comment:
   +1 for the idea.







[GitHub] [spark] mridulm commented on a change in pull request #33385: [SPARK-36173][CORE] Support getting CPU number in TaskContext

2021-07-26 Thread GitBox


mridulm commented on a change in pull request #33385:
URL: https://github.com/apache/spark/pull/33385#discussion_r677117650



##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
##
@@ -57,6 +57,7 @@ private[spark] class TaskDescription(
 val addedJars: Map[String, Long],
 val addedArchives: Map[String, Long],
 val properties: Properties,
+val cpus: Int,

Review comment:
   @xwu99 Please add a check to make sure we don't have an invalid value for `cpus` coming in - `assert(cpus > 0)`.
   There should be some validation either here, or where it is getting passed in from.
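The suggested guard can be sketched as a constructor-time check. This is a hypothetical stand-alone class for illustration, not Spark's actual `TaskDescription`:

```java
public class TaskCpusCheck {
    private final int cpus;

    // Hypothetical guard mirroring the reviewer's suggestion: reject
    // non-positive cpus values where they are first passed in.
    public TaskCpusCheck(int cpus) {
        if (cpus <= 0) {
            throw new IllegalArgumentException("cpus must be > 0, got " + cpus);
        }
        this.cpus = cpus;
    }

    public int cpus() {
        return cpus;
    }
}
```

Validating at construction time means every downstream consumer of the value can rely on the invariant instead of re-checking it.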







[GitHub] [spark] mridulm commented on a change in pull request #33385: [SPARK-36173][CORE] Support getting CPU number in TaskContext

2021-07-26 Thread GitBox


mridulm commented on a change in pull request #33385:
URL: https://github.com/apache/spark/pull/33385#discussion_r677116925



##
File path: core/src/main/scala/org/apache/spark/TaskContextImpl.scala
##
@@ -54,6 +54,7 @@ private[spark] class TaskContextImpl(
 @transient private val metricsSystem: MetricsSystem,
 // The default value is only used in tests.
 override val taskMetrics: TaskMetrics = TaskMetrics.empty,
+override val cpus: Int = 0,

Review comment:
   @tgravescs For the default value, CPUS_PER_TASK is simply "spark.task.cpus" with default 1 ... which should be available from `SparkEnv.conf`?
   `SparkEnv.conf.get(config.CPUS_PER_TASK)`?
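The fallback being discussed is just a keyed config lookup with a default of 1. A minimal stand-in, with illustrative names (Spark's real API goes through `SparkConf`/`ConfigEntry`, not a plain map):

```java
import java.util.Map;

public class ConfDemo {
    // Hypothetical stand-in for looking up "spark.task.cpus" with its
    // documented default of 1 when the key is unset.
    static int cpusPerTask(Map<String, String> conf) {
        return Integer.parseInt(conf.getOrDefault("spark.task.cpus", "1"));
    }
}
```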








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field

2021-07-26 Thread GitBox


dongjoon-hyun commented on a change in pull request #33531:
URL: https://github.com/apache/spark/pull/33531#discussion_r677119186



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
##
@@ -3028,6 +3028,24 @@ class HiveDDLSuite
     }
   }
 
+  test("SPARK-36312: ParquetWriteSupport should check inner field") {
+    withView("v") {
+      spark.range(1).createTempView("v")
+      withTempPath { path =>
+        val e = intercept[AnalysisException] {
+          spark.sql(
+            """
+              | SELECT

Review comment:
   nit. Could you remove the redundant space here, @AngersZh ?







[GitHub] [spark] HyukjinKwon commented on a change in pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark GHA job

2021-07-26 Thread GitBox


HyukjinKwon commented on a change in pull request #33532:
URL: https://github.com/apache/spark/pull/33532#discussion_r677118693



##
File path: .github/workflows/build_and_test.yml
##
@@ -206,6 +206,7 @@ jobs:
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
   SKIP_UNIDOC: true
+  SKIP_MIMA: true

Review comment:
   We might have to skip here too: https://github.com/apache/spark/blob/9f20c62ff0c94a790bfd1fd925fd62fcf1ecc05a/.github/workflows/build_and_test.yml#L721







[GitHub] [spark] Ngone51 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


Ngone51 commented on a change in pull request #33034:
URL: https://github.com/apache/spark/pull/33034#discussion_r677118599




[GitHub] [spark] HyukjinKwon commented on a change in pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark GHA job

2021-07-26 Thread GitBox


HyukjinKwon commented on a change in pull request #33532:
URL: https://github.com/apache/spark/pull/33532#discussion_r677118544



##
File path: .github/workflows/build_and_test.yml
##
@@ -206,6 +206,7 @@ jobs:
   GITHUB_PREV_SHA: ${{ github.event.before }}
   SPARK_LOCAL_IP: localhost
   SKIP_UNIDOC: true
+  SKIP_MIMA: true

Review comment:
   Can we skip this in SparkR build too?










[GitHub] [spark] HyukjinKwon commented on pull request #33385: [SPARK-36173][CORE] Support getting CPU number in TaskContext

2021-07-26 Thread GitBox


HyukjinKwon commented on pull request #33385:
URL: https://github.com/apache/spark/pull/33385#issuecomment-887208391


   Ohhh, got it. Thanks for the clarification!







[GitHub] [spark] williamhyun commented on pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark GHA job

2021-07-26 Thread GitBox


williamhyun commented on pull request #33532:
URL: https://github.com/apache/spark/pull/33532#issuecomment-887207431


   cc: @HyukjinKwon , @dongjoon-hyun 





[GitHub] [spark] williamhyun opened a new pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark GHA job

2021-07-26 Thread GitBox


williamhyun opened a new pull request #33532:
URL: https://github.com/apache/spark/pull/33532


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] SparkQA removed a comment on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source

2021-07-26 Thread GitBox


SparkQA removed a comment on pull request #33413:
URL: https://github.com/apache/spark/pull/33413#issuecomment-887197252


   **[Test build #141681 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141681/testReport)** for PR 33413 at commit [`fdecefc`](https://github.com/apache/spark/commit/fdecefc1c4db1803d2e9683fae3b4d2172904563).





[GitHub] [spark] SparkQA commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source

2021-07-26 Thread GitBox


SparkQA commented on pull request #33413:
URL: https://github.com/apache/spark/pull/33413#issuecomment-887206619


   **[Test build #141681 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141681/testReport)** for PR 33413 at commit [`fdecefc`](https://github.com/apache/spark/commit/fdecefc1c4db1803d2e9683fae3b4d2172904563).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] mridulm commented on a change in pull request #33385: [SPARK-36173][CORE] Support getting CPU number in TaskContext

2021-07-26 Thread GitBox


mridulm commented on a change in pull request #33385:
URL: https://github.com/apache/spark/pull/33385#discussion_r677115269



##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
##
@@ -429,6 +429,7 @@ private[spark] class TaskSetManager(
   execId: String,
   host: String,
   maxLocality: TaskLocality.TaskLocality,
+  taskCpuAssignments: Int = 1,

Review comment:
   If we can require explicitly passing it, that would be ideal, I agree.
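   The concern in this thread is that a defaulted parameter such as `taskCpuAssignments: Int = 1` can be silently omitted at call sites. As a minimal sketch of the trade-off (illustrative only, with hypothetical names, not Spark's `TaskSetManager` API), a required keyword-only argument forces every caller to state the value explicitly:

   ```python
   # Hypothetical stand-ins for the scheduler method under review.
   def resource_offer_defaulted(exec_id, host, task_cpus=1):
       # A default of 1 can slip in unnoticed when a caller forgets the argument.
       return task_cpus

   def resource_offer_explicit(exec_id, host, *, task_cpus):
       # Keyword-only with no default: omitting task_cpus is a TypeError.
       return task_cpus

   assert resource_offer_defaulted("exec-1", "host-1") == 1  # default slips in
   assert resource_offer_explicit("exec-1", "host-1", task_cpus=4) == 4

   raised = False
   try:
       resource_offer_explicit("exec-1", "host-1")  # forgot task_cpus
   except TypeError:
       raised = True
   assert raised  # the mistake is caught at the call site
   ```

   Scala achieves the same effect simply by declaring the parameter without a default value.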







[GitHub] [spark] mridulm commented on pull request #33385: [SPARK-36173][CORE] Support getting CPU number in TaskContext

2021-07-26 Thread GitBox


mridulm commented on pull request #33385:
URL: https://github.com/apache/spark/pull/33385#issuecomment-887205429


   @HyukjinKwon CPU support in resource profiles allows users to override the default of 1 core per task at the stage level, so it makes sense to expose it.





[GitHub] [spark] SparkQA removed a comment on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


SparkQA removed a comment on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887156766


   **[Test build #141675 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141675/testReport)** for PR 33526 at commit [`6e73fb3`](https://github.com/apache/spark/commit/6e73fb32a533c384f41bc13014bfdf511b3d07dc).





[GitHub] [spark] SparkQA commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


SparkQA commented on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887204625


   **[Test build #141675 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141675/testReport)** for PR 33526 at commit [`6e73fb3`](https://github.com/apache/spark/commit/6e73fb32a533c384f41bc13014bfdf511b3d07dc).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] Ngone51 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


Ngone51 commented on a change in pull request #33034:
URL: https://github.com/apache/spark/pull/33034#discussion_r677061512



##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -150,6 +155,22 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
 }
   }
 
+  def newShuffleMergeState(): Unit = {
+_shuffleMergeEnabled = canShuffleMergeBeEnabled()

Review comment:
   Isn't this a constant value? Why do we need to reset it every time? 

##
File path: core/src/main/scala/org/apache/spark/Dependency.scala
##
@@ -124,6 +122,13 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: 
ClassTag](
*/
   private[this] var _shuffleMergedFinalized: Boolean = false
 
+  /**
+   * shuffleMergeId is used to uniquely identify a indeterminate stage attempt 
of a shuffle Id.

Review comment:
   How about:
   ```suggestion
  * shuffleMergeId is used to uniquely identify a merging process of an 
indeterminate stage attempt.
   ```

##
File path: 
common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java
##
@@ -206,12 +206,15 @@ public long sendRpc(ByteBuffer message, 
RpcResponseCallback callback) {
*
* @param appId applicationId.
* @param shuffleId shuffle id.
+   * @param shuffleMergeId shuffleMergeId is used to uniquely identify a 
indeterminate stage
+   *   attempt of a shuffle Id.

Review comment:
   nit: "... identify  a merging process of an indeterminate stage attempt."

##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockStoreClient.java
##
@@ -185,6 +188,8 @@ public void finalizeShuffleMerge(
* @param host the host of the remote node.
* @param port the port of the remote node.
* @param shuffleId shuffle id.
+   * @param shuffleMergeId shuffleMergeId is used to uniquely identify a 
indeterminate stage
+   *   attempt of a shuffle Id.

Review comment:
   ditto.

##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java
##
@@ -125,63 +125,87 @@ private AbstractFetchShuffleBlocks 
createFetchShuffleBlocksOrChunksMsg(
   String execId,
   String[] blockIds) {
 if (blockIds[0].startsWith(SHUFFLE_CHUNK_PREFIX)) {
-  return createFetchShuffleMsgAndBuildBlockIds(appId, execId, blockIds, 
true);
+  return createFetchShuffleChunksMsg(appId, execId, blockIds);
 } else {
-  return createFetchShuffleMsgAndBuildBlockIds(appId, execId, blockIds, 
false);
+  return createFetchShuffleBlocksMsg(appId, execId, blockIds);
 }
   }
 
-  /**
-   * Create FetchShuffleBlocks/FetchShuffleBlockChunks message and rebuild 
internal blockIds by
-   * analyzing the passed in blockIds.
-   */
-  private AbstractFetchShuffleBlocks createFetchShuffleMsgAndBuildBlockIds(
+  private AbstractFetchShuffleBlocks createFetchShuffleBlocksMsg(
   String appId,
   String execId,
-  String[] blockIds,
-  boolean areMergedChunks) {
+  String[] blockIds) {
 String[] firstBlock = splitBlockId(blockIds[0]);
 int shuffleId = Integer.parseInt(firstBlock[1]);
 boolean batchFetchEnabled = firstBlock.length == 5;
-
-// In case of FetchShuffleBlocks, primaryId is mapId. For 
FetchShuffleBlockChunks, primaryId
-// is reduceId.
-LinkedHashMap primaryIdToBlocksInfo = new 
LinkedHashMap<>();
+Map mapIdToBlocksInfo = new LinkedHashMap<>();
 for (String blockId : blockIds) {
   String[] blockIdParts = splitBlockId(blockId);
   if (Integer.parseInt(blockIdParts[1]) != shuffleId) {
-throw new IllegalArgumentException("Expected shuffleId=" + shuffleId +
-  ", got:" + blockId);
-  }
-  Number primaryId;
-  if (!areMergedChunks) {
-primaryId = Long.parseLong(blockIdParts[2]);
-  } else {
-primaryId = Integer.parseInt(blockIdParts[2]);
+throw new IllegalArgumentException("Expected shuffleId=" + shuffleId + 
", got:" + blockId);
   }
-  BlocksInfo blocksInfoByPrimaryId = 
primaryIdToBlocksInfo.computeIfAbsent(primaryId,
-id -> new BlocksInfo());
-  blocksInfoByPrimaryId.blockIds.add(blockId);
-  // If blockId is a regular shuffle block, then blockIdParts[3] = 
reduceId. If blockId is a
-  // shuffleChunk block, then blockIdParts[3] = chunkId
-  blocksInfoByPrimaryId.ids.add(Integer.parseInt(blockIdParts[3]));
+
+  long mapId = Long.parseLong(blockIdParts[2]);
+  BlocksInfo blocksInfoByMapId = mapIdToBlocksInfo.computeIfAbsent(mapId,
+  id -> new BlocksInfo());
+  blocksInfoByMapId.blockIds.add(blockId);
+  blocksInfoByMapId.ids.add(Integer.parseInt(blockIdParts[3]));
+
   if (batchFetchEnabled) {
-// It comes here only if the blockId is a regular shuffle block not a 
shuffleChunk block.
 // When we read continuous shuffle 
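The refactor above splits the old combined method so that `createFetchShuffleBlocksMsg` groups regular shuffle block IDs (of the form `shuffle_<shuffleId>_<mapId>_<reduceId>`) by map ID, preserving insertion order and rejecting IDs from a different shuffle. A minimal sketch of that grouping logic (an illustration in Python, not the Java implementation itself):

```python
from collections import OrderedDict

def group_shuffle_blocks(block_ids):
    """Group block IDs like 'shuffle_<shuffleId>_<mapId>_<reduceId>' by map ID."""
    shuffle_id = int(block_ids[0].split("_")[1])
    map_id_to_reduce_ids = OrderedDict()  # mirrors the LinkedHashMap above
    for block_id in block_ids:
        parts = block_id.split("_")
        # All blocks in one fetch request must belong to the same shuffle.
        if int(parts[1]) != shuffle_id:
            raise ValueError(f"Expected shuffleId={shuffle_id}, got:{block_id}")
        map_id = int(parts[2])
        map_id_to_reduce_ids.setdefault(map_id, []).append(int(parts[3]))
    return shuffle_id, map_id_to_reduce_ids

sid, groups = group_shuffle_blocks(
    ["shuffle_0_5_1", "shuffle_0_5_2", "shuffle_0_7_1"])
# sid is 0; groups keeps map IDs 5 and 7 in first-seen order.
```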

[GitHub] [spark] sunchao commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package

2021-07-26 Thread GitBox


sunchao commented on pull request #33350:
URL: https://github.com/apache/spark/pull/33350#issuecomment-887204038


   Sorry. Let me check the failed tests.





[GitHub] [spark] gengliangwang commented on a change in pull request #33308: [SPARK-35918][AVRO] Unify schema mismatch handling for read/write and enhance error messages

2021-07-26 Thread GitBox


gengliangwang commented on a change in pull request #33308:
URL: https://github.com/apache/spark/pull/33308#discussion_r677112782



##
File path: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSerdeSuite.scala
##
@@ -86,18 +86,18 @@ class AvroSerdeSuite extends SparkFunSuite {
 // deserialize should have no issues when 'bar' is nullable but fail when 
it is nonnull
 Deserializer.create(CATALYST_STRUCT, avro, BY_NAME)
 assertFailedConversionMessage(avro, Deserializer, BY_NAME,
-  "Cannot find non-nullable field 'foo.bar' (at position 0) in Avro 
schema.",
+  "Cannot find field 'foo.bar' in Avro schema",
   nonnullCatalyst)
 assertFailedConversionMessage(avro, Deserializer, BY_POSITION,
-  "Cannot find non-nullable field at position 1 (field 'foo.baz') in Avro 
schema.",
+  "Cannot find field at position 1 of field 'foo' from Avro schema",

Review comment:
   Shall we mention by-position match?







[GitHub] [spark] cloud-fan commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package

2021-07-26 Thread GitBox


cloud-fan commented on pull request #33350:
URL: https://github.com/apache/spark/pull/33350#issuecomment-887203356


   hmm, they are also failing in master: 
https://github.com/apache/spark/pull/33498





[GitHub] [spark] gengliangwang commented on a change in pull request #33308: [SPARK-35918][AVRO] Unify schema mismatch handling for read/write and enhance error messages

2021-07-26 Thread GitBox


gengliangwang commented on a change in pull request #33308:
URL: https://github.com/apache/spark/pull/33308#discussion_r677112492



##
File path: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSerdeSuite.scala
##
@@ -86,18 +86,18 @@ class AvroSerdeSuite extends SparkFunSuite {
 // deserialize should have no issues when 'bar' is nullable but fail when 
it is nonnull
 Deserializer.create(CATALYST_STRUCT, avro, BY_NAME)
 assertFailedConversionMessage(avro, Deserializer, BY_NAME,
-  "Cannot find non-nullable field 'foo.bar' (at position 0) in Avro 
schema.",
+  "Cannot find field 'foo.bar' in Avro schema",

Review comment:
   Shall we mention "by-name" match here?
   ```
   Cannot find field 'foo.bar' in Avro schema with by-name match
   ```







[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

2021-07-26 Thread GitBox


SparkQA commented on pull request #32332:
URL: https://github.com/apache/spark/pull/32332#issuecomment-887202286


   **[Test build #141683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141683/testReport)** for PR 32332 at commit [`8ddf243`](https://github.com/apache/spark/commit/8ddf24309623be15ae6e4a5a9ea52dcfe8069792).





[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe

2021-07-26 Thread GitBox


darcy-shen commented on a change in pull request #32332:
URL: https://github.com/apache/spark/pull/32332#discussion_r677110817



##
File path: python/pyspark/sql/session.py
##
@@ -697,12 +712,19 @@ def prepare(obj):
 verify_func(obj)
 return obj,
 else:
+no_need_to_prepare = True
 prepare = lambda obj: obj
 
-if isinstance(data, RDD):
-rdd, schema = self._createFromRDD(data.map(prepare), schema, 
samplingRatio)
+if isinstance(data, RDD) and no_need_to_prepare:
+rdd, schema = self._createFromRDD(
+data, schema, verifySchema, samplingRatio)
+elif isinstance(data, RDD) and (not no_need_to_prepare):
+rdd, schema = self._createFromRDD(
+data.map(prepare), schema, verifySchema, samplingRatio)

Review comment:
   Refactored, thanks!
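   The review point is that the two `isinstance(data, RDD)` branches above differ only in whether `prepare` is mapped over the data, so they can collapse into one branch. A minimal sketch of the refactored shape (using a hypothetical `FakeRDD` stand-in, not pyspark's actual `RDD` or `_createFromRDD`):

   ```python
   class FakeRDD:
       """Tiny stand-in for pyspark's RDD, supporting only map()."""
       def __init__(self, rows):
           self.rows = list(rows)

       def map(self, f):
           return FakeRDD(f(r) for r in self.rows)

   def create_from_rdd(data, prepare=None):
       # One branch: map prepare only when a prepare step is actually needed.
       rdd = data.map(prepare) if prepare is not None else data
       return rdd.rows

   rdd = FakeRDD([1, 2, 3])
   assert create_from_rdd(rdd) == [1, 2, 3]                  # no prepare needed
   assert create_from_rdd(rdd, lambda x: x * 10) == [10, 20, 30]
   ```

   This avoids the `no_need_to_prepare` flag entirely by keying the branch off whether a prepare function exists.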







[GitHub] [spark] SparkQA removed a comment on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


SparkQA removed a comment on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887139023


   **[Test build #141674 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141674/testReport)** for PR 33526 at commit [`21ef1a1`](https://github.com/apache/spark/commit/21ef1a1c9e39bcb04cb4f180c80536edbf8f50be).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887200904


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141674/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887200904


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141674/
   





[GitHub] [spark] SparkQA commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


SparkQA commented on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887200656


   **[Test build #141674 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141674/testReport)** for PR 33526 at commit [`21ef1a1`](https://github.com/apache/spark/commit/21ef1a1c9e39bcb04cb4f180c80536edbf8f50be).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] cloud-fan commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package

2021-07-26 Thread GitBox


cloud-fan commented on pull request #33350:
URL: https://github.com/apache/spark/pull/33350#issuecomment-887200130


   Seems they are also failing in the 3.2 branch.





[GitHub] [spark] SparkQA commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


SparkQA commented on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887199134


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46193/
   





[GitHub] [spark] SparkQA commented on pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


SparkQA commented on pull request #33034:
URL: https://github.com/apache/spark/pull/33034#issuecomment-887197367


   **[Test build #141682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141682/testReport)** for PR 33034 at commit [`11af40d`](https://github.com/apache/spark/commit/11af40d9ea3550aa1014f155dbc8cdc201b06145).





[GitHub] [spark] SparkQA commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source

2021-07-26 Thread GitBox


SparkQA commented on pull request #33413:
URL: https://github.com/apache/spark/pull/33413#issuecomment-887197252


   **[Test build #141681 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141681/testReport)** for PR 33413 at commit [`fdecefc`](https://github.com/apache/spark/commit/fdecefc1c4db1803d2e9683fae3b4d2172904563).





[GitHub] [spark] SparkQA commented on pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field

2021-07-26 Thread GitBox


SparkQA commented on pull request #33531:
URL: https://github.com/apache/spark/pull/33531#issuecomment-887197148


   **[Test build #141680 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141680/testReport)** for PR 33531 at commit [`d79bdbf`](https://github.com/apache/spark/commit/d79bdbfc4e666be736f0c2080c39848db77634c5).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887196180


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46191/
   





[GitHub] [spark] AmplabJenkins commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887196180


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46191/
   





[GitHub] [spark] SparkQA commented on pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too

2021-07-26 Thread GitBox


SparkQA commented on pull request #33441:
URL: https://github.com/apache/spark/pull/33441#issuecomment-887195858


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46194/
   





[GitHub] [spark] AngersZhuuuu commented on pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field

2021-07-26 Thread GitBox


AngersZhuuuu commented on pull request #33531:
URL: https://github.com/apache/spark/pull/33531#issuecomment-887191536


   ping @cloud-fan @dongjoon-hyun @HyukjinKwon 





[GitHub] [spark] AngersZhuuuu opened a new pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field

2021-07-26 Thread GitBox


AngersZhuuuu opened a new pull request #33531:
URL: https://github.com/apache/spark/pull/33531


   ### What changes were proposed in this pull request?
   The last PR only added the inner field check for Hive DDL; this PR adds the check for the Parquet data source write API.
   
   
   ### Why are the changes needed?
   To fail earlier, with a clearer error.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added Ut
   
   Without this fix the UT failed as:
   ```
   [info] - SPARK-36312: ParquetWriteSupport should check inner field *** 
FAILED *** (8 seconds, 29 milliseconds)
   [info]   Expected exception org.apache.spark.sql.AnalysisException to be 
thrown, but org.apache.spark.SparkException was thrown (HiveDDLSuite.scala:3035)
   [info]   org.scalatest.exceptions.TestFailedException:
   [info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
   [info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
   [info]   at org.scalatest.Assertions.intercept(Assertions.scala:756)
   [info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563)
   [info]   at 
org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396(HiveDDLSuite.scala:3035)
   [info]   at 
org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396$adapted(HiveDDLSuite.scala:3034)
   [info]   at 
org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
   [info]   at 
org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
   [info]   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:34)
   [info]   at 
org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$395(HiveDDLSuite.scala:3034)
   [info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   [info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
   [info]   at 
org.apache.spark.sql.test.SQLTestUtilsBase.withView(SQLTestUtils.scala:316)
   [info]   at 
org.apache.spark.sql.test.SQLTestUtilsBase.withView$(SQLTestUtils.scala:314)
   [info]   at 
org.apache.spark.sql.hive.execution.HiveDDLSuite.withView(HiveDDLSuite.scala:396)
   [info]   at 
org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$394(HiveDDLSuite.scala:3032)
   [info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
   [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
   [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
   [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
   [info]   at 
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
   [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
   [info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
   [info]   at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
   [info]   at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
   [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
   [info]   at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
   [info]   at scala.collection.immutable.List.foreach(List.scala:431)
   [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
   [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
   [info]   at org.scalatest.Suite.run(Suite.scala:1112)
   [info]   at org.scalatest.Suite.run$(Suite.scala:1094)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
   [info]   at 

[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too

2021-07-26 Thread GitBox


AngersZhuuuu commented on a change in pull request #33441:
URL: https://github.com/apache/spark/pull/33441#discussion_r677094314



##
File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
##
@@ -153,6 +154,27 @@ private[sql] class AvroFileFormat extends FileFormat
   }
 
   override def supportDataType(dataType: DataType): Boolean = 
AvroUtils.supportsDataType(dataType)
+
+  override def supportFieldName(name: String): Unit = {

Review comment:
   How about the current version?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

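The initial-character rule being added in `supportFieldName` is the one Avro itself enforces on field names. A rough standalone sketch of that rule, independent of Spark and Avro (`InvalidColumnName` is a hypothetical stand-in for the `AnalysisException` built by `QueryCompilationErrors.columnNameContainsInvalidCharactersError`, and `Character.isLetter` is an approximation of Avro's ASCII-only rule):

```scala
// Hypothetical stand-in for the AnalysisException produced by
// QueryCompilationErrors.columnNameContainsInvalidCharactersError.
case class InvalidColumnName(name: String) extends RuntimeException(
  s"""Column name "$name" contains invalid character(s). Please use alias to rename it.""")

// Sketch of Avro's name rule: non-empty, first character a letter or '_',
// remaining characters letters, digits or '_'.
def supportFieldName(name: String): Unit = {
  if (name.isEmpty) throw InvalidColumnName(name)
  val first = name.charAt(0)
  if (!Character.isLetter(first) && first != '_') throw InvalidColumnName(name)
  if (!name.drop(1).forall(c => Character.isLetterOrDigit(c) || c == '_')) {
    throw InvalidColumnName(name)
  }
}
```

Under this rule a name such as `(IF((ID = 1), 1, 0))` is rejected at its first character, which matches Avro's `Illegal initial character` failure mode.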


[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too

2021-07-26 Thread GitBox


AngersZhuuuu commented on a change in pull request #33441:
URL: https://github.com/apache/spark/pull/33441#discussion_r677094233



##
File path: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
##
@@ -153,6 +154,27 @@ private[sql] class AvroFileFormat extends FileFormat
   }
 
   override def supportDataType(dataType: DataType): Boolean = 
AvroUtils.supportsDataType(dataType)
+
+  override def supportFieldName(name: String): Unit = {
+val length = name.length
+if (length == 0) {
+  throw 
QueryCompilationErrors.columnNameContainsInvalidCharactersError(name)
+} else {
+  val first = name.charAt(0)
+  if (!Character.isLetter(first) && first != '_') {

Review comment:
   > do you have some reference doc to prove this is indeed a limitation of 
avro? or can you run some local tests like `df.write.format("avro").save(path)`?
   
   For `df.write.format(source).save(path)`
   
   Avro error message
   ```
   [info]   org.apache.avro.SchemaParseException: Illegal initial character: 
(IF((ID = 1), 1, 0))
   [info]   at org.apache.avro.Schema.validateName(Schema.java:1562)
   [info]   at org.apache.avro.Schema.access$400(Schema.java:91)
   [info]   at org.apache.avro.Schema$Field.<init>(Schema.java:546)
   [info]   at 
org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2240)
   [info]   at 
org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2236)
   [info]   at 
org.apache.avro.SchemaBuilder$FieldBuilder.access$5100(SchemaBuilder.java:2150)
   [info]   at 
org.apache.avro.SchemaBuilder$GenericDefault.noDefault(SchemaBuilder.java:2539)
   [info]   at 
org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$1(SchemaConverters.scala:194)
   [info]   at scala.collection.Iterator.foreach(Iterator.scala:943)
   [info]   at scala.collection.Iterator.foreach$(Iterator.scala:943)
   [info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
   [info]   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
   [info]   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
   [info]   at 
org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
   [info]   at 
org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:191)
   [info]   at 
org.apache.spark.sql.avro.AvroUtils$.$anonfun$prepareWrite$1(AvroUtils.scala:98)
   [info]   at scala.Option.getOrElse(Option.scala:189)
   [info]   at 
org.apache.spark.sql.avro.AvroUtils$.prepareWrite(AvroUtils.scala:97)
   [info]   at 
org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:74)
   [info]   at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:142)
   [info]   at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
   [info]   at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
   [info]   at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
   [info]   at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
   [info]   at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
   [info]   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
   [info]   at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
   [info]   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
   [info]   at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   ```
   
   parquet error message
   ```
   [info] - SPARK-33865: Hive DDL with avro should check col name *** FAILED 
*** (4 seconds, 554 milliseconds)
   [info]   org.apache.spark.sql.AnalysisException: Column name "(IF((ID = 1), 
1, 0))" contains invalid character(s). Please use alias to rename it.
   [info]   at 
org.apache.spark.sql.errors.QueryCompilationErrors$.columnNameContainsInvalidCharactersError(QueryCompilationErrors.scala:2105)
   [info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:590)
   [info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.$anonfun$setSchema$2(ParquetWriteSupport.scala:485)
   [info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.$anonfun$setSchema$2$adapted(ParquetWriteSupport.scala:485)
   [info]   at scala.collection.immutable.List.foreach(List.scala:431)
   [info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:485)
   [info]   at 

[GitHub] [spark] beliefer commented on a change in pull request #33317: [SPARK-36095][CORE] Grouping exception in core/rdd

2021-07-26 Thread GitBox


beliefer commented on a change in pull request #33317:
URL: https://github.com/apache/spark/pull/33317#discussion_r677093918



##
File path: core/src/main/scala/org/apache/spark/errors/SparkCoreErrors.scala
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.errors
+
+import java.io.IOException
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.storage.{BlockId, RDDBlockId}
+
+/**
+ * Object for grouping error messages from (most) exceptions thrown during 
query execution.
+ */
+private[spark] object SparkCoreErrors {
+  def rddBlockNotFoundError(blockId: BlockId, id: Int): Throwable = {
+new Exception(s"Could not compute split, block $blockId of RDD $id not 
found")
+  }
+
+  def blockHaveBeenRemovedError(string: String): Throwable = {
+new SparkException(s"Attempted to use $string after its blocks have been 
removed!")
+  }
+
+  def histogramOnEmptyRDDOrContainingInfinityOrNaNError(): Throwable = {
+new UnsupportedOperationException(
+  "Histogram on either an empty RDD or RDD containing +/-infinity or NaN")
+  }
+
+  def emptyRDDError(): Throwable = {
+new UnsupportedOperationException("empty RDD")
+  }
+
+  def pathNotSupportedError(path: String): Throwable = {
+new IOException(s"Path: ${path} is a directory, which is not supported by 
the " +
+  s"record reader when 
`mapreduce.input.fileinputformat.input.dir.recursive` is false.")
+  }
+
+  def checkpointRDDBlockIdNotFoundError(rddBlockId: RDDBlockId): Throwable = {
+new SparkException(s"Checkpoint block $rddBlockId not found! Either the 
executor " +
+  s"that originally checkpointed this partition is no longer alive, or the 
original RDD is " +
+  s"unpersisted. If this problem persists, you may consider using 
`rdd.checkpoint()` " +
+  s"instead, which is slower than local checkpointing but more 
fault-tolerant.")
+  }
+
+  def endOfStreamError(): Throwable = {
+new java.util.NoSuchElementException("End of stream")
+  }
+
+  def cannotUseMapSideCombiningWithArrayKeyError(): Throwable = {
+new SparkException("Cannot use map-side combining with array keys.")
+  }
+
+  def hashPartitionerCannotPartitionArrayKeyError(): Throwable = {
+new SparkException("HashPartitioner cannot partition array keys.")
+  }
+
+  def reduceByKeyLocallyNotSupportArrayKeysError(): Throwable = {
+new SparkException("reduceByKeyLocally() does not support array keys")
+  }
+
+  def noSuchElementException(): Throwable = {
+new NoSuchElementException()
+  }
+
+  def rddLacksSparkContextError(): Throwable = {
+new SparkException("This RDD lacks a SparkContext. It could happen in the 
following cases: " +
+  "\n(1) RDD transformations and actions are NOT invoked by the driver, 
but inside of other " +
+  "transformations; for example, rdd1.map(x => rdd2.values.count() * x) is 
invalid " +
+  "because the values transformation and count action cannot be performed 
inside of the " +
+  "rdd1.map transformation. For more information, see SPARK-5063.\n(2) 
When a Spark " +
+  "Streaming job recovers from checkpoint, this exception will be hit if a 
reference to " +
+  "an RDD not defined by the streaming job is used in DStream operations. 
For more " +
+  "information, See SPARK-13758.")

Review comment:
   > can we split paragraph like """ ... """.stripMargin.replaceAll("\n", " 
") + "\n" + """ ... """.stripMargin.replaceAll("\n", " ")
   
   Seems good!
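The suggestion above — build each paragraph as a triple-quoted block and flatten it with `stripMargin.replaceAll("\n", " ")` — can be tried on its own. A minimal sketch using the text of the quoted `checkpointRDDBlockIdNotFoundError` (the `rddBlockId` value is just a placeholder for the real `RDDBlockId` argument):

```scala
// Flatten a multi-line triple-quoted message into a single line:
// stripMargin removes the leading '|' margins, then the remaining
// newlines are replaced with spaces.
val rddBlockId = "rdd_0_0"  // placeholder for the RDDBlockId argument
val msg =
  s"""|Checkpoint block $rddBlockId not found! Either the executor
      |that originally checkpointed this partition is no longer alive,
      |or the original RDD is unpersisted.
      |""".stripMargin.replaceAll("\n", " ").trim
// msg is now a single line with no embedded newlines.
```

Joining two such blocks with `+ "\n" + ` then yields exactly one newline per paragraph break, which is what the review comment proposes.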







[GitHub] [spark] SparkQA commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

2021-07-26 Thread GitBox


SparkQA commented on pull request #33526:
URL: https://github.com/apache/spark/pull/33526#issuecomment-887182859


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46191/
   





[GitHub] [spark] beliefer commented on a change in pull request #33317: [SPARK-36095][CORE] Grouping exception in core/rdd

2021-07-26 Thread GitBox


beliefer commented on a change in pull request #33317:
URL: https://github.com/apache/spark/pull/33317#discussion_r677092283



##
File path: core/src/main/scala/org/apache/spark/errors/SparkCoreErrors.scala
##
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.errors
+
+import java.io.IOException
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.storage.{BlockId, RDDBlockId}
+
+/**
+ * Object for grouping error messages from (most) exceptions thrown during 
query execution.
+ */
+private[spark] object SparkCoreErrors {
+  def rddBlockNotFoundError(blockId: BlockId, id: Int): Throwable = {
+new Exception(s"Could not compute split, block $blockId of RDD $id not 
found")
+  }
+
+  def blockHaveBeenRemovedError(string: String): Throwable = {
+new SparkException(s"Attempted to use $string after its blocks have been 
removed!")
+  }
+
+  def histogramOnEmptyRDDOrContainingInfinityOrNaNError(): Throwable = {
+new UnsupportedOperationException(
+  "Histogram on either an empty RDD or RDD containing +/-infinity or NaN")
+  }
+
+  def emptyRDDError(): Throwable = {
+new UnsupportedOperationException("empty RDD")
+  }
+
+  def pathNotSupportedError(path: String): Throwable = {
+new IOException(s"Path: ${path} is a directory, which is not supported by 
the " +
+  "record reader when 
`mapreduce.input.fileinputformat.input.dir.recursive` is false.")
+  }
+
+  def checkpointRDDBlockIdNotFoundError(rddBlockId: RDDBlockId): Throwable = {
+new SparkException(
+  s"""
+ |Checkpoint block $rddBlockId not found! Either the executor
+ |that originally checkpointed this partition is no longer alive, or 
the original RDD is
+ |unpersisted. If this problem persists, you may consider using 
`rdd.checkpoint()`
+ |instead, which is slower than local checkpointing but more 
fault-tolerant.
+   """.stripMargin.replaceAll("\n", " "))
+  }
+
+  def endOfStreamError(): Throwable = {
+new java.util.NoSuchElementException("End of stream")
+  }
+
+  def cannotUseMapSideCombiningWithArrayKeyError(): Throwable = {
+new SparkException("Cannot use map-side combining with array keys.")
+  }
+
+  def hashPartitionerCannotPartitionArrayKeyError(): Throwable = {
+new SparkException("HashPartitioner cannot partition array keys.")
+  }
+
+  def reduceByKeyLocallyNotSupportArrayKeysError(): Throwable = {
+new SparkException("reduceByKeyLocally() does not support array keys")
+  }
+
+  def noSuchElementException(): Throwable = {
+new NoSuchElementException()
+  }
+
+  def rddLacksSparkContextError(): Throwable = {
+new SparkException("This RDD lacks a SparkContext. It could happen in the 
following cases: " +

Review comment:
   Yeah, I got it. I made a mistake.







[GitHub] [spark] SparkQA commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


SparkQA commented on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887180729


   **[Test build #141679 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141679/testReport)**
 for PR 31517 at commit 
[`7b360a7`](https://github.com/apache/spark/commit/7b360a7d577ea379db3487d451b0c7a744d1dc02).





[GitHub] [spark] SparkQA commented on pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too

2021-07-26 Thread GitBox


SparkQA commented on pull request #33441:
URL: https://github.com/apache/spark/pull/33441#issuecomment-887180308


   **[Test build #141678 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141678/testReport)**
 for PR 33441 at commit 
[`5b99ab6`](https://github.com/apache/spark/commit/5b99ab6df314578bf3ba923e89f24c2765cb089b).





[GitHub] [spark] HyukjinKwon closed pull request #33519: [SPARK-36288][DOCS][PYTHON] Update API usage on pyspark pandas documents

2021-07-26 Thread GitBox


HyukjinKwon closed pull request #33519:
URL: https://github.com/apache/spark/pull/33519


   





[GitHub] [spark] HyukjinKwon commented on pull request #33519: [SPARK-36288][DOCS][PYTHON] Update API usage on pyspark pandas documents

2021-07-26 Thread GitBox


HyukjinKwon commented on pull request #33519:
URL: https://github.com/apache/spark/pull/33519#issuecomment-887179955


   Merged to master and branch-3.2.





[GitHub] [spark] LuciferYang commented on a change in pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


LuciferYang commented on a change in pull request #31517:
URL: https://github.com/apache/spark/pull/31517#discussion_r677090585



##
File path: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala
##
@@ -136,11 +136,14 @@ private[spark] object BlockManagerId {
* The max cache size is hardcoded to 10000, since the size of a BlockManagerId
* object is about 48B, the total memory cost should be below 1MB which is feasible.
*/
-  val blockManagerIdCache = CacheBuilder.newBuilder()
-.maximumSize(10000)
-.build(new CacheLoader[BlockManagerId, BlockManagerId]() {
-  override def load(id: BlockManagerId) = id
-})
+  val blockManagerIdCache = {

Review comment:
   7b360a7 reverts this change.







[GitHub] [spark] LuciferYang commented on a change in pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


LuciferYang commented on a change in pull request #31517:
URL: https://github.com/apache/spark/pull/31517#discussion_r677090543



##
File path: core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
##
@@ -84,16 +84,18 @@ private[spark] class ReliableCheckpointRDD[T: ClassTag](
   }
 
   // Cache of preferred locations of checkpointed files.
-  @transient private[spark] lazy val cachedPreferredLocations = 
CacheBuilder.newBuilder()
-.expireAfterWrite(
-  SparkEnv.get.conf.get(CACHE_CHECKPOINT_PREFERRED_LOCS_EXPIRE_TIME).get,
-  TimeUnit.MINUTES)
-.build(
-  new CacheLoader[Partition, Seq[String]]() {
-override def load(split: Partition): Seq[String] = {
-  getPartitionBlockLocations(split)
-}
-  })
+  @transient private[spark] lazy val cachedPreferredLocations = {
+val builder = Caffeine.newBuilder()
+  .expireAfterWrite(
+SparkEnv.get.conf.get(CACHE_CHECKPOINT_PREFERRED_LOCS_EXPIRE_TIME).get,
+TimeUnit.MINUTES)
+val loader = new CacheLoader[Partition, Seq[String]]() {
+  override def load(split: Partition): Seq[String] = {
+getPartitionBlockLocations(split)
+  }
+}
+builder.build[Partition, Seq[String]](loader)

Review comment:
   7b360a7 reverts this change.







[GitHub] [spark] venkata91 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle

2021-07-26 Thread GitBox


venkata91 commented on a change in pull request #33034:
URL: https://github.com/apache/spark/pull/33034#discussion_r677090442



##
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -135,51 +150,87 @@ protected AppShuffleInfo 
validateAndGetAppShuffleInfo(String appId) {
   private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo(
   AppShuffleInfo appShuffleInfo,
   int shuffleId,
+  int shuffleMergeId,
   int reduceId) {
-File dataFile = appShuffleInfo.getMergedShuffleDataFile(shuffleId, 
reduceId);
-ConcurrentMap<Integer, Map<Integer, AppShufflePartitionInfo>> partitions =
+ConcurrentMap<Integer, Map<Integer, Map<Integer, AppShufflePartitionInfo>>> partitions =
    appShuffleInfo.partitions;
-Map<Integer, AppShufflePartitionInfo> shufflePartitions =
-  partitions.compute(shuffleId, (id, map) -> {
-if (map == null) {
-  // If this partition is already finalized then the partitions map 
will not contain the
-  // shuffleId but the data file would exist. In that case the block 
is considered late.
-  if (dataFile.exists()) {
+Map<Integer, Map<Integer, AppShufflePartitionInfo>> shuffleMergePartitions =
+  partitions.compute(shuffleId, (id, shuffleMergePartitionsMap) -> {
+if (shuffleMergePartitionsMap == null) {
+  logger.info("Creating a new attempt for shuffle blocks push request 
for"
+  + " shuffle {} with shuffleMergeId {} for application {}_{}", 
shuffleId,
+  shuffleMergeId, appShuffleInfo.appId, appShuffleInfo.attemptId));
+  Map<Integer, Map<Integer, AppShufflePartitionInfo>> newShuffleMergePartitions
+= new ConcurrentHashMap<>();
+  Map<Integer, AppShufflePartitionInfo> newPartitionsMap = new ConcurrentHashMap<>();
+  newShuffleMergePartitions.put(shuffleMergeId, newPartitionsMap);
+  return newShuffleMergePartitions;
+} else if (shuffleMergePartitionsMap.containsKey(shuffleMergeId)) {
+  return shuffleMergePartitionsMap;
+} else {
+  int latestShuffleMergeId = 
shuffleMergePartitionsMap.keySet().stream()
+.mapToInt(v -> v).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+  if (latestShuffleMergeId > shuffleMergeId) {
+logger.info("Rejecting shuffle blocks push request for shuffle {} 
with"
++ " shuffleMergeId {} for application {}_{} as a higher 
shuffleMergeId"
++ " {} request is already seen", shuffleId, shuffleMergeId,
+appShuffleInfo.appId, appShuffleInfo.attemptId, 
latestShuffleMergeId));
+// Reject the request as we have already seen a higher 
shuffleMergeId than the
+// current incoming one
 return null;
+  } else {
+// Higher shuffleMergeId seen for the shuffle ID meaning new stage 
attempt is being
+// run for the shuffle ID. Close and clean up old shuffleMergeId 
files,
+// happens in the non-deterministic stage retries
+logger.info("Creating a new attempt for shuffle blocks push 
request for"
+   + " shuffle {} with shuffleMergeId {} for application {}_{} 
since it is"
+   + " higher than the latest shuffleMergeId {} already seen", 
shuffleId,
+   shuffleMergeId, appShuffleInfo.appId, appShuffleInfo.attemptId,
+   latestShuffleMergeId));
+if (null != shuffleMergePartitionsMap.get(latestShuffleMergeId)) {
+  Map<Integer, AppShufflePartitionInfo> shufflePartitions =
+shuffleMergePartitionsMap.get(latestShuffleMergeId);
+  mergedShuffleCleaner.execute(() ->
+closeAndDeletePartitionFiles(shufflePartitions));
+}
+shuffleMergePartitionsMap.put(latestShuffleMergeId, 
STALE_SHUFFLE_PARTITIONS);
+Map<Integer, AppShufflePartitionInfo> newPartitionsMap = new ConcurrentHashMap<>();
+shuffleMergePartitionsMap.put(shuffleMergeId, newPartitionsMap);
+return shuffleMergePartitionsMap;
   }
-  return new ConcurrentHashMap<>();
-} else {
-  return map;
 }
   });
-if (shufflePartitions == null) {
+
+Map<Integer, AppShufflePartitionInfo> shufflePartitions = shuffleMergePartitions.get(shuffleMergeId);
+if (shufflePartitions == FINALIZED_SHUFFLE_PARTITIONS
+|| shufflePartitions == STALE_SHUFFLE_PARTITIONS) {
+  // It only gets here when shufflePartitions is either 
FINALIZED_SHUFFLE_PARTITIONS or STALE_SHUFFLE_PARTITIONS.
+  // This happens in 2 cases:
+  // 1. Incoming block request is for an older shuffleMergeId of a shuffle (i.e.,
+  // blocks for a higher shuffleMergeId are already being merged for this shuffle Id).
+  // 2. Shuffle for the current shuffleMergeId is already finalized.
   return null;
 }
 
+File dataFile = appShuffleInfo.getMergedShuffleDataFile(shuffleId, 
shuffleMergeId, reduceId);
 return shufflePartitions.computeIfAbsent(reduceId, key -> {
-  // It only gets here when the key is not present in the map. This could 
either
-  // be the first time the merge manager receives a pushed block for a 
given application
-  

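The branching in the quoted `getOrCreateAppShufflePartitionInfo` diff — reuse the current attempt if the incoming shuffleMergeId is already tracked, reject the push if a higher shuffleMergeId has been seen, otherwise start a new attempt and clean up the previous one — can be sketched in isolation. All names below are illustrative, not the PR's API, and the stale/finalized sentinel handling is deliberately omitted:

```scala
// Decision taken for an incoming block push with a given shuffleMergeId,
// given the set of shuffleMergeIds currently tracked for the shuffle.
sealed trait PushDecision
case object Reuse extends PushDecision        // same attempt, keep merging
case object RejectStale extends PushDecision  // older attempt, drop the push
case class NewAttempt(previous: Option[Int]) extends PushDecision  // clean up previous

val UNDEFINED_SHUFFLE_MERGE_ID = -1

def decide(seenMergeIds: Set[Int], incoming: Int): PushDecision =
  if (seenMergeIds.contains(incoming)) {
    Reuse
  } else {
    // Same max-or-default computation as the Java stream in the diff:
    // keySet().stream().mapToInt(v -> v).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID)
    val latest =
      if (seenMergeIds.isEmpty) UNDEFINED_SHUFFLE_MERGE_ID else seenMergeIds.max
    if (latest > incoming) RejectStale
    else NewAttempt(Option(latest).filter(_ != UNDEFINED_SHUFFLE_MERGE_ID))
  }
```

In the actual diff the new-attempt branch additionally marks the previous map with `STALE_SHUFFLE_PARTITIONS` and schedules `closeAndDeletePartitionFiles` on `mergedShuffleCleaner`.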
[GitHub] [spark] LuciferYang commented on a change in pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


LuciferYang commented on a change in pull request #31517:
URL: https://github.com/apache/spark/pull/31517#discussion_r677087470



##
File path: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala
##
@@ -136,11 +136,14 @@ private[spark] object BlockManagerId {
* The max cache size is hardcoded to 10000, since the size of a BlockManagerId
* object is about 48B, the total memory cost should be below 1MB which is feasible.
*/
-  val blockManagerIdCache = CacheBuilder.newBuilder()
-.maximumSize(10000)
-.build(new CacheLoader[BlockManagerId, BlockManagerId]() {
-  override def load(id: BlockManagerId) = id
-})
+  val blockManagerIdCache = {

Review comment:
   Similar to `getPartitionBlockLocations`, it does not compile with 
`.build[BlockManagerId, BlockManagerId](identity)`.







[GitHub] [spark] dgd-contributor commented on a change in pull request #33317: [SPARK-36095][CORE] Grouping exception in core/rdd

2021-07-26 Thread GitBox


dgd-contributor commented on a change in pull request #33317:
URL: https://github.com/apache/spark/pull/33317#discussion_r677089979



##
File path: core/src/main/scala/org/apache/spark/errors/SparkCoreErrors.scala
##
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.errors
+
+import java.io.IOException
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.storage.{BlockId, RDDBlockId}
+
+/**
+ * Object for grouping error messages from (most) exceptions thrown during query execution.
+ */
+private[spark] object SparkCoreErrors {
+  def rddBlockNotFoundError(blockId: BlockId, id: Int): Throwable = {
+new Exception(s"Could not compute split, block $blockId of RDD $id not found")
+  }
+
+  def blockHaveBeenRemovedError(string: String): Throwable = {
+new SparkException(s"Attempted to use $string after its blocks have been removed!")
+  }
+
+  def histogramOnEmptyRDDOrContainingInfinityOrNaNError(): Throwable = {
+new UnsupportedOperationException(
+  "Histogram on either an empty RDD or RDD containing +/-infinity or NaN")
+  }
+
+  def emptyRDDError(): Throwable = {
+new UnsupportedOperationException("empty RDD")
+  }
+
+  def pathNotSupportedError(path: String): Throwable = {
+new IOException(s"Path: ${path} is a directory, which is not supported by the " +
+  "record reader when `mapreduce.input.fileinputformat.input.dir.recursive` is false.")
+  }
+
+  def checkpointRDDBlockIdNotFoundError(rddBlockId: RDDBlockId): Throwable = {
+new SparkException(
+  s"""
+ |Checkpoint block $rddBlockId not found! Either the executor
+ |that originally checkpointed this partition is no longer alive, or the original RDD is
+ |unpersisted. If this problem persists, you may consider using `rdd.checkpoint()`
+ |instead, which is slower than local checkpointing but more fault-tolerant.
+   """.stripMargin.replaceAll("\n", " "))
+  }
+
+  def endOfStreamError(): Throwable = {
+new java.util.NoSuchElementException("End of stream")
+  }
+
+  def cannotUseMapSideCombiningWithArrayKeyError(): Throwable = {
+new SparkException("Cannot use map-side combining with array keys.")
+  }
+
+  def hashPartitionerCannotPartitionArrayKeyError(): Throwable = {
+new SparkException("HashPartitioner cannot partition array keys.")
+  }
+
+  def reduceByKeyLocallyNotSupportArrayKeysError(): Throwable = {
+new SparkException("reduceByKeyLocally() does not support array keys")
+  }
+
+  def noSuchElementException(): Throwable = {
+new NoSuchElementException()
+  }
+
+  def rddLacksSparkContextError(): Throwable = {
+new SparkException("This RDD lacks a SparkContext. It could happen in the following cases: " +

Review comment:
   If we do this, the message will end up spanning multiple lines, since `stripMargin` only removes the leading `|` characters and does not join the lines.
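   A small self-contained illustration of this behaviour, runnable in any Scala REPL: `stripMargin` removes only the margin (leading whitespace plus `|`), so the embedded newlines survive until they are explicitly replaced, which is why the quoted code chains `.replaceAll("\n", " ")`:

```scala
// Simplified stand-in for the checkpoint error message in the quoted diff.
val raw =
  s"""
     |Checkpoint block not found! Either the executor
     |that originally checkpointed this partition is no longer alive,
     |or the original RDD is unpersisted.
   """.stripMargin

// stripMargin strips only the '|' margins; newlines are still present:
assert(raw.contains("\n"))

// Joining the lines, as the quoted code does, collapses it to one line:
val oneLine = raw.replaceAll("\n", " ")
assert(!oneLine.contains("\n"))
```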







[GitHub] [spark] AmplabJenkins removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887178881


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46192/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


AmplabJenkins commented on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887178881


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46192/
   





[GitHub] [spark] SparkQA removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


SparkQA removed a comment on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887177451


   **[Test build #141677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141677/testReport)** for PR 31517 at commit [`9cd9c35`](https://github.com/apache/spark/commit/9cd9c35872094b0f60f5175dc85494c45cde10d8).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

2021-07-26 Thread GitBox


AmplabJenkins removed a comment on pull request #31517:
URL: https://github.com/apache/spark/pull/31517#issuecomment-887178755


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141677/
   




