[GitHub] [spark] venkata91 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
venkata91 commented on a change in pull request #33034: URL: https://github.com/apache/spark/pull/33034#discussion_r677140773

## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java

## @@ -128,49 +151,100 @@ protected AppShuffleInfo validateAndGetAppShuffleInfo(String appId) {
   }

   /**
-   * Given the appShuffleInfo, shuffleId and reduceId that uniquely identifies a given shuffle
-   * partition of an application, retrieves the associated metadata. If not present and the
-   * corresponding merged shuffle does not exist, initializes the metadata.
+   * Given the appShuffleInfo, shuffleId, shuffleMergeId and reduceId that uniquely identifies
+   * a given shuffle partition of an application, retrieves the associated metadata. If not
+   * present and the corresponding merged shuffle does not exist, initializes the metadata.
    */
   private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo(
       AppShuffleInfo appShuffleInfo,
       int shuffleId,
-      int reduceId) {
-    File dataFile = appShuffleInfo.getMergedShuffleDataFile(shuffleId, reduceId);
-    ConcurrentMap<Integer, Map<Integer, AppShufflePartitionInfo>> partitions =
+      int shuffleMergeId,
+      int reduceId) throws RuntimeException {
+    ConcurrentMap<Integer, Map<Integer, Map<Integer, AppShufflePartitionInfo>>> partitions =
       appShuffleInfo.partitions;
-    Map<Integer, AppShufflePartitionInfo> shufflePartitions =
-      partitions.compute(shuffleId, (id, map) -> {
-        if (map == null) {
-          // If this partition is already finalized then the partitions map will not contain the
-          // shuffleId but the data file would exist. In that case the block is considered late.
-          if (dataFile.exists()) {
-            return null;
-          }
-          return new ConcurrentHashMap<>();
+    AtomicReference<Map<Integer, Map<Integer, AppShufflePartitionInfo>>> shuffleMergePartitionsRef
+      = new AtomicReference<>(null);
+    partitions.compute(shuffleId, (id, shuffleMergePartitionsMap) -> {
+      if (shuffleMergePartitionsMap == null) {
+        logger.info("Creating a new attempt for shuffle blocks push request for"
+          + " shuffle {} with shuffleMergeId {} for application {}_{}", shuffleId,
+          shuffleMergeId, appShuffleInfo.appId, appShuffleInfo.attemptId);
+        Map<Integer, Map<Integer, AppShufflePartitionInfo>> newShuffleMergePartitions
+          = new ConcurrentHashMap<>();
+        Map<Integer, AppShufflePartitionInfo> newPartitionsMap = new ConcurrentHashMap<>();
+        newShuffleMergePartitions.put(shuffleMergeId, newPartitionsMap);
+        shuffleMergePartitionsRef.set(newShuffleMergePartitions);
+        return newShuffleMergePartitions;
+      } else if (shuffleMergePartitionsMap.containsKey(shuffleMergeId)) {
+        shuffleMergePartitionsRef.set(shuffleMergePartitionsMap);
+        return shuffleMergePartitionsMap;
+      } else {
+        int latestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+          .mapToInt(x -> x).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+        int secondLatestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+          .mapToInt(x -> x).filter(x -> x != latestShuffleMergeId)
+          .max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+        if (latestShuffleMergeId > shuffleMergeId) {
+          // Reject the request as we have already seen a higher shuffleMergeId than the
+          // current incoming one
+          throw new RuntimeException(String.format("Rejecting shuffle blocks push request for"
+            + " shuffle %s with shuffleMergeId %s for application %s_%s as a higher"
+            + " shuffleMergeId %s request is already seen", shuffleId, shuffleMergeId,
+            appShuffleInfo.appId, appShuffleInfo.attemptId, latestShuffleMergeId));
         } else {
-          return map;
+          // Higher shuffleMergeId seen for the shuffle ID meaning new stage attempt is being
+          // run for the shuffle ID. Close and clean up old shuffleMergeId files,
+          // happens in the non-deterministic stage retries
+          logger.info("Creating a new attempt for shuffle blocks push request for shuffle {} with"
+            + " shuffleMergeId {} for application {}_{} since it is higher than the latest"
+            + " shuffleMergeId {} already seen", shuffleId, shuffleMergeId, appShuffleInfo.appId,
+            appShuffleInfo.attemptId, latestShuffleMergeId);
+          if (latestShuffleMergeId != UNDEFINED_SHUFFLE_MERGE_ID &&
+              shuffleMergePartitionsMap.containsKey(latestShuffleMergeId)) {
+            Map<Integer, AppShufflePartitionInfo> latestShufflePartitions =
+              shuffleMergePartitionsMap.get(latestShuffleMergeId);
+            mergedShuffleCleaner.execute(() ->
+              closeAndDeletePartitionFiles(latestShufflePartitions));
+            shuffleMergePartitionsMap.put(latestShuffleMergeId, STALE_SHUFFLE_PARTITIONS);
+          }
+          // Remove older shuffleMergeIds which won't be required anymore
+          if (secondLatestShuffleMergeId != UNDEFINED_SHUFFLE_MERGE_ID) {
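The hunk above atomically gets or creates the per-shuffleMergeId partition map inside `ConcurrentHashMap.compute`, rejecting pushes whose `shuffleMergeId` is older than the latest one already seen. A minimal standalone sketch of that pattern follows; the class and method names are illustrative, `String` stands in for `AppShufflePartitionInfo`, and file cleanup is elided, so this is not the actual Spark API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the compute-based get-or-create in the diff:
// shuffleId -> (shuffleMergeId -> (reduceId -> partition info)).
public class MergeIdDemo {
  static final int UNDEFINED_SHUFFLE_MERGE_ID = Integer.MIN_VALUE;

  // Get or create the reduceId map for (shuffleId, shuffleMergeId),
  // rejecting a shuffleMergeId older than the latest one already seen.
  static Map<Integer, String> getOrCreate(
      ConcurrentMap<Integer, Map<Integer, Map<Integer, String>>> partitions,
      int shuffleId, int shuffleMergeId) {
    AtomicReference<Map<Integer, Map<Integer, String>>> ref = new AtomicReference<>();
    partitions.compute(shuffleId, (id, mergeMap) -> {
      if (mergeMap == null) {
        mergeMap = new ConcurrentHashMap<>();          // first push for this shuffle
      }
      if (!mergeMap.containsKey(shuffleMergeId)) {
        int latest = mergeMap.keySet().stream()
            .mapToInt(x -> x).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
        if (latest > shuffleMergeId) {                 // stale attempt: reject the push
          throw new RuntimeException("stale shuffleMergeId " + shuffleMergeId);
        }
        // Newer attempt: drop older attempts (real code also closes/deletes files)
        mergeMap.keySet().removeIf(k -> k < shuffleMergeId);
        mergeMap.put(shuffleMergeId, new ConcurrentHashMap<>());
      }
      ref.set(mergeMap);
      return mergeMap;
    });
    return ref.get().get(shuffleMergeId);
  }

  public static void main(String[] args) {
    ConcurrentMap<Integer, Map<Integer, Map<Integer, String>>> partitions =
        new ConcurrentHashMap<>();
    getOrCreate(partitions, 0, 1).put(7, "reduce-7");  // first attempt
    getOrCreate(partitions, 0, 2);                     // stage retry supersedes it
    try {
      getOrCreate(partitions, 0, 1);                   // stale push is rejected
      System.out.println("accepted");
    } catch (RuntimeException e) {
      System.out.println("rejected");                  // prints "rejected"
    }
  }
}
```

The `AtomicReference` mirrors the diff's trick for passing a value out of the `compute` lambda while keeping the whole decision under the per-shuffleId lock that `compute` provides.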
[GitHub] [spark] cloud-fan commented on pull request #33468: [SPARK-36247][SQL] Check string length for char/varchar and apply type coercion in UPDATE/MERGE command
cloud-fan commented on pull request #33468: URL: https://github.com/apache/spark/pull/33468#issuecomment-887232141 thanks for the review, merging to master/3.2! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #33468: [SPARK-36247][SQL] Check string length for char/varchar and apply type coercion in UPDATE/MERGE command
cloud-fan closed pull request #33468: URL: https://github.com/apache/spark/pull/33468
[GitHub] [spark] Ngone51 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
Ngone51 commented on a change in pull request #33034: URL: https://github.com/apache/spark/pull/33034#discussion_r677134315

## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java

## @@ -78,6 +80,27 @@
   public static final String MERGE_DIR_KEY = "mergeDir";
   public static final String ATTEMPT_ID_KEY = "attemptId";
   private static final int UNDEFINED_ATTEMPT_ID = -1;
+  private static final int UNDEFINED_SHUFFLE_MERGE_ID = Integer.MIN_VALUE;
+
+  // ConcurrentHashMap doesn't allow null for keys or values which is why this is required.

Review comment: The comment is outdated.
[GitHub] [spark] SparkQA commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source
SparkQA commented on pull request #33413: URL: https://github.com/apache/spark/pull/33413#issuecomment-887225073 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46196/
[GitHub] [spark] AmplabJenkins commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source
AmplabJenkins commented on pull request #33413: URL: https://github.com/apache/spark/pull/33413#issuecomment-887225094 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46196/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
AmplabJenkins removed a comment on pull request #33034: URL: https://github.com/apache/spark/pull/33034#issuecomment-887224594 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46197/
[GitHub] [spark] SparkQA commented on pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
SparkQA commented on pull request #33034: URL: https://github.com/apache/spark/pull/33034#issuecomment-887224576 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46197/
[GitHub] [spark] AmplabJenkins commented on pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
AmplabJenkins commented on pull request #33034: URL: https://github.com/apache/spark/pull/33034#issuecomment-887224594 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46197/
[GitHub] [spark] SparkQA commented on pull request #33533: Revert "[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
SparkQA commented on pull request #33533: URL: https://github.com/apache/spark/pull/33533#issuecomment-887222147 **[Test build #141686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141686/testReport)** for PR 33533 at commit [`a733179`](https://github.com/apache/spark/commit/a73317946a5fe08edfdd516f6b9a8f7e0ad4f079).
[GitHub] [spark] viirya commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
viirya commented on pull request #33350: URL: https://github.com/apache/spark/pull/33350#issuecomment-887221527 Created #33533 to revert it first. @cloud-fan @sunchao @dongjoon-hyun
[GitHub] [spark] viirya opened a new pull request #33533: Revert "[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
viirya opened a new pull request #33533: URL: https://github.com/apache/spark/pull/33533 This reverts commit 634f96dde40639df5a2ef246884bedbd48b3dc69.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
AmplabJenkins removed a comment on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887220339 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141679/
[GitHub] [spark] AmplabJenkins commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
AmplabJenkins commented on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887220339 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141679/
[GitHub] [spark] SparkQA removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
SparkQA removed a comment on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887180729 **[Test build #141679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141679/testReport)** for PR 31517 at commit [`7b360a7`](https://github.com/apache/spark/commit/7b360a7d577ea379db3487d451b0c7a744d1dc02).
[GitHub] [spark] SparkQA commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
SparkQA commented on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887219770 **[Test build #141679 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141679/testReport)** for PR 31517 at commit [`7b360a7`](https://github.com/apache/spark/commit/7b360a7d577ea379db3487d451b0c7a744d1dc02).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
AmplabJenkins removed a comment on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887218994 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46193/
[GitHub] [spark] AmplabJenkins commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
AmplabJenkins commented on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887218994 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46193/
[GitHub] [spark] SparkQA commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
SparkQA commented on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887218975 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46193/
[GitHub] [spark] SparkQA commented on pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark/SparkR/Docker GHA job
SparkQA commented on pull request #33532: URL: https://github.com/apache/spark/pull/33532#issuecomment-887218117 **[Test build #141684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141684/testReport)** for PR 33532 at commit [`d83429e`](https://github.com/apache/spark/commit/d83429e2721f3c1b01131edf34bf839cc364eb67).
[GitHub] [spark] SparkQA commented on pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field
SparkQA commented on pull request #33531: URL: https://github.com/apache/spark/pull/33531#issuecomment-887218140 **[Test build #141685 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141685/testReport)** for PR 33531 at commit [`eed72e1`](https://github.com/apache/spark/commit/eed72e1160a5d3a061e9c977230149453cee56d4).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe
AmplabJenkins removed a comment on pull request #32332: URL: https://github.com/apache/spark/pull/32332#issuecomment-887217818 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141683/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source
AmplabJenkins removed a comment on pull request #33413: URL: https://github.com/apache/spark/pull/33413#issuecomment-887217814 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141681/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
AmplabJenkins removed a comment on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887217812 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141675/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too
AmplabJenkins removed a comment on pull request #33441: URL: https://github.com/apache/spark/pull/33441#issuecomment-887217813 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46194/
[GitHub] [spark] AmplabJenkins commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe
AmplabJenkins commented on pull request #32332: URL: https://github.com/apache/spark/pull/32332#issuecomment-887217818 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141683/
[GitHub] [spark] AmplabJenkins commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
AmplabJenkins commented on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887217812 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141675/
[GitHub] [spark] AmplabJenkins commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source
AmplabJenkins commented on pull request #33413: URL: https://github.com/apache/spark/pull/33413#issuecomment-887217814 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141681/
[GitHub] [spark] AmplabJenkins commented on pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too
AmplabJenkins commented on pull request #33441: URL: https://github.com/apache/spark/pull/33441#issuecomment-887217813 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46194/
[GitHub] [spark] Ngone51 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
Ngone51 commented on a change in pull request #33034: URL: https://github.com/apache/spark/pull/33034#discussion_r677125870

## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -128,49 +151,100 @@ protected AppShuffleInfo validateAndGetAppShuffleInfo(String appId) {
   }

   /**
-   * Given the appShuffleInfo, shuffleId and reduceId that uniquely identifies a given shuffle
-   * partition of an application, retrieves the associated metadata. If not present and the
-   * corresponding merged shuffle does not exist, initializes the metadata.
+   * Given the appShuffleInfo, shuffleId, shuffleMergeId and reduceId that uniquely identifies
+   * a given shuffle partition of an application, retrieves the associated metadata. If not
+   * present and the corresponding merged shuffle does not exist, initializes the metadata.
    */
   private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo(
       AppShuffleInfo appShuffleInfo,
       int shuffleId,
-      int reduceId) {
-    File dataFile = appShuffleInfo.getMergedShuffleDataFile(shuffleId, reduceId);
-    ConcurrentMap<Integer, Map<Integer, AppShufflePartitionInfo>> partitions =
+      int shuffleMergeId,
+      int reduceId) throws RuntimeException {
+    ConcurrentMap<Integer, Map<Integer, Map<Integer, AppShufflePartitionInfo>>> partitions =
       appShuffleInfo.partitions;
-    Map<Integer, AppShufflePartitionInfo> shufflePartitions =
-      partitions.compute(shuffleId, (id, map) -> {
-        if (map == null) {
-          // If this partition is already finalized then the partitions map will not contain the
-          // shuffleId but the data file would exist. In that case the block is considered late.
-          if (dataFile.exists()) {
-            return null;
-          }
-          return new ConcurrentHashMap<>();
+    AtomicReference<Map<Integer, Map<Integer, AppShufflePartitionInfo>>> shuffleMergePartitionsRef
+      = new AtomicReference<>(null);
+    partitions.compute(shuffleId, (id, shuffleMergePartitionsMap) -> {
+      if (shuffleMergePartitionsMap == null) {
+        logger.info("Creating a new attempt for shuffle blocks push request for"
+          + " shuffle {} with shuffleMergeId {} for application {}_{}", shuffleId,
+          shuffleMergeId, appShuffleInfo.appId, appShuffleInfo.attemptId);
+        Map<Integer, Map<Integer, AppShufflePartitionInfo>> newShuffleMergePartitions
+          = new ConcurrentHashMap<>();
+        Map<Integer, AppShufflePartitionInfo> newPartitionsMap = new ConcurrentHashMap<>();
+        newShuffleMergePartitions.put(shuffleMergeId, newPartitionsMap);
+        shuffleMergePartitionsRef.set(newShuffleMergePartitions);
+        return newShuffleMergePartitions;
+      } else if (shuffleMergePartitionsMap.containsKey(shuffleMergeId)) {
+        shuffleMergePartitionsRef.set(shuffleMergePartitionsMap);
+        return shuffleMergePartitionsMap;
+      } else {
+        int latestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+          .mapToInt(x -> x).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+        int secondLatestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+          .mapToInt(x -> x).filter(x -> x != latestShuffleMergeId)
+          .max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+        if (latestShuffleMergeId > shuffleMergeId) {
+          // Reject the request as we have already seen a higher shuffleMergeId than the
+          // current incoming one
+          throw new RuntimeException(String.format("Rejecting shuffle blocks push request for"
+            + " shuffle %s with shuffleMergeId %s for application %s_%s as a higher"
+            + " shuffleMergeId %s request is already seen", shuffleId, shuffleMergeId,
+            appShuffleInfo.appId, appShuffleInfo.attemptId, latestShuffleMergeId));
         } else {
-          return map;
+          // Higher shuffleMergeId seen for the shuffle ID meaning new stage attempt is being
+          // run for the shuffle ID. Close and clean up old shuffleMergeId files,
+          // happens in the non-deterministic stage retries
+          logger.info("Creating a new attempt for shuffle blocks push request for shuffle {} with"
+            + " shuffleMergeId {} for application {}_{} since it is higher than the latest"
+            + " shuffleMergeId {} already seen", shuffleId, shuffleMergeId, appShuffleInfo.appId,
+            appShuffleInfo.attemptId, latestShuffleMergeId);
+          if (latestShuffleMergeId != UNDEFINED_SHUFFLE_MERGE_ID &&
+              shuffleMergePartitionsMap.containsKey(latestShuffleMergeId)) {
+            Map<Integer, AppShufflePartitionInfo> latestShufflePartitions =
+              shuffleMergePartitionsMap.get(latestShuffleMergeId);
+            mergedShuffleCleaner.execute(() ->
+              closeAndDeletePartitionFiles(latestShufflePartitions));
+            shuffleMergePartitionsMap.put(latestShuffleMergeId, STALE_SHUFFLE_PARTITIONS);
+          }
+          // Remove older shuffleMergeIds which won't be required anymore
+          if (secondLatestShuffleMergeId != UNDEFINED_SHUFFLE_MERGE_ID) {
+
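The latest/second-latest `shuffleMergeId` selection in the diff above can be exercised in isolation. The following is a minimal sketch under hypothetical names (`MergeIdSelection`, `latest`, `secondLatest` are illustrative, not Spark's API); only the stream logic mirrors the patch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical standalone sketch of the latest / second-latest shuffleMergeId
// selection used in the RemoteBlockPushResolver diff. Class and method names
// are illustrative only.
public class MergeIdSelection {
  // Sentinel mirroring the patch's UNDEFINED_SHUFFLE_MERGE_ID; the value here
  // is an assumption for this sketch.
  public static final int UNDEFINED_SHUFFLE_MERGE_ID = -1;

  // Highest merge id currently tracked for a shuffle, or the sentinel if none.
  public static int latest(Map<Integer, ?> mergePartitions) {
    return mergePartitions.keySet().stream()
        .mapToInt(x -> x).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
  }

  // Highest merge id other than `latest`, or the sentinel if none remains.
  public static int secondLatest(Map<Integer, ?> mergePartitions, int latest) {
    return mergePartitions.keySet().stream()
        .mapToInt(x -> x).filter(x -> x != latest)
        .max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
  }

  public static void main(String[] args) {
    Map<Integer, String> m = new ConcurrentHashMap<>();
    m.put(0, "partitions-for-merge-id-0");
    m.put(2, "partitions-for-merge-id-2");
    int latest = latest(m);               // 2
    int second = secondLatest(m, latest); // 0
    System.out.println(latest + " " + second);
  }
}
```

With only one merge id tracked, `secondLatest` falls back to the sentinel, which is what lets the patch skip the "remove older shuffleMergeIds" branch on the first retry.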
[GitHub] [spark] SparkQA removed a comment on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe
SparkQA removed a comment on pull request #32332: URL: https://github.com/apache/spark/pull/32332#issuecomment-887202286 **[Test build #141683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141683/testReport)** for PR 32332 at commit [`8ddf243`](https://github.com/apache/spark/commit/8ddf24309623be15ae6e4a5a9ea52dcfe8069792). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too
SparkQA commented on pull request #33441: URL: https://github.com/apache/spark/pull/33441#issuecomment-887214851 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46194/
[GitHub] [spark] SparkQA commented on pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field
SparkQA commented on pull request #33531: URL: https://github.com/apache/spark/pull/33531#issuecomment-887212972 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46195/
[GitHub] [spark] viirya commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
viirya commented on pull request #33350: URL: https://github.com/apache/spark/pull/33350#issuecomment-887212407 Is it flaky or do other merged PRs cause the result different?
[GitHub] [spark] SparkQA commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source
SparkQA commented on pull request #33413: URL: https://github.com/apache/spark/pull/33413#issuecomment-887211886 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46196/
[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe
SparkQA commented on pull request #32332: URL: https://github.com/apache/spark/pull/32332#issuecomment-887211807 **[Test build #141683 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141683/testReport)** for PR 32332 at commit [`8ddf243`](https://github.com/apache/spark/commit/8ddf24309623be15ae6e4a5a9ea52dcfe8069792). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] viirya commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
viirya commented on pull request #33350: URL: https://github.com/apache/spark/pull/33350#issuecomment-887211632 We can revert it if it needs taking some time to investigate.
[GitHub] [spark] SparkQA commented on pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
SparkQA commented on pull request #33034: URL: https://github.com/apache/spark/pull/33034#issuecomment-887211478 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46197/
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field
AngersZh commented on a change in pull request #33531: URL: https://github.com/apache/spark/pull/33531#discussion_r677120085

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
##
@@ -3028,6 +3028,24 @@ class HiveDDLSuite
     }
   }

+  test("SPARK-36312: ParquetWriteSupport should check inner field") {
+    withView("v") {
+      spark.range(1).createTempView("v")
+      withTempPath { path =>
+        val e = intercept[AnalysisException] {
+          spark.sql(
+            """
+              | SELECT

Review comment: Done
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark GHA job
dongjoon-hyun commented on a change in pull request #33532: URL: https://github.com/apache/spark/pull/33532#discussion_r677119887

## File path: .github/workflows/build_and_test.yml
##
@@ -206,6 +206,7 @@ jobs:
       GITHUB_PREV_SHA: ${{ github.event.before }}
       SPARK_LOCAL_IP: localhost
       SKIP_UNIDOC: true
+      SKIP_MIMA: true

Review comment: +1 for the idea.
[GitHub] [spark] mridulm commented on a change in pull request #33385: [SPARK-36173][CORE] Support getting CPU number in TaskContext
mridulm commented on a change in pull request #33385: URL: https://github.com/apache/spark/pull/33385#discussion_r677117650

## File path: core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
##
@@ -57,6 +57,7 @@ private[spark] class TaskDescription(
     val addedJars: Map[String, Long],
     val addedArchives: Map[String, Long],
     val properties: Properties,
+    val cpus: Int,

Review comment: @xwu99 Please add a check to make sure we don't have an invalid value for `cpus` coming in - `assert(cpus > 0)`. There should be some validation either here, or where it is getting passed in from.
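The validation mridulm asks for above — failing fast on a non-positive `cpus` at construction rather than trusting every caller — can be sketched as follows. This is a Java rendering with a hypothetical class name, not the actual Scala `TaskDescription`:

```java
// Hypothetical Java sketch of the suggested precondition: reject an invalid
// cpus value at construction time, so a bad scheduler-side value surfaces
// immediately instead of causing confusing failures later.
public class TaskDescriptionSketch {
  public final int cpus;

  public TaskDescriptionSketch(int cpus) {
    if (cpus <= 0) {
      // Fail fast with a message naming the offending value.
      throw new IllegalArgumentException("cpus must be > 0, got " + cpus);
    }
    this.cpus = cpus;
  }
}
```

Validating in the constructor covers every construction site at once, which is the usual argument for putting the check "here" rather than at each place the value is passed in from.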
[GitHub] [spark] mridulm commented on a change in pull request #33385: [SPARK-36173][CORE] Support getting CPU number in TaskContext
mridulm commented on a change in pull request #33385: URL: https://github.com/apache/spark/pull/33385#discussion_r677116925

## File path: core/src/main/scala/org/apache/spark/TaskContextImpl.scala
##
@@ -54,6 +54,7 @@ private[spark] class TaskContextImpl(
     @transient private val metricsSystem: MetricsSystem,
     // The default value is only used in tests.
     override val taskMetrics: TaskMetrics = TaskMetrics.empty,
+    override val cpus: Int = 0,

Review comment: @tgravescs For default value, CPUS_PER_TASK is simply "spark.task.cpus" with default 1 ... which should be available from `SparkEnv.conf`? `SparkEnv.conf.get(config.CPUS_PER_TASK)`?
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field
dongjoon-hyun commented on a change in pull request #33531: URL: https://github.com/apache/spark/pull/33531#discussion_r677119186

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
##
@@ -3028,6 +3028,24 @@ class HiveDDLSuite
     }
   }

+  test("SPARK-36312: ParquetWriteSupport should check inner field") {
+    withView("v") {
+      spark.range(1).createTempView("v")
+      withTempPath { path =>
+        val e = intercept[AnalysisException] {
+          spark.sql(
+            """
+              | SELECT

Review comment: nit. Could you remove the redundant space here, @AngersZh ?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark GHA job
HyukjinKwon commented on a change in pull request #33532: URL: https://github.com/apache/spark/pull/33532#discussion_r677118693

## File path: .github/workflows/build_and_test.yml
##
@@ -206,6 +206,7 @@ jobs:
       GITHUB_PREV_SHA: ${{ github.event.before }}
       SPARK_LOCAL_IP: localhost
       SKIP_UNIDOC: true
+      SKIP_MIMA: true

Review comment: We might have to skip here too: https://github.com/apache/spark/blob/9f20c62ff0c94a790bfd1fd925fd62fcf1ecc05a/.github/workflows/build_and_test.yml#L721
[GitHub] [spark] Ngone51 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
Ngone51 commented on a change in pull request #33034: URL: https://github.com/apache/spark/pull/33034#discussion_r677118599

## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##
@@ -128,49 +151,100 @@ protected AppShuffleInfo validateAndGetAppShuffleInfo(String appId) {
   }

   /**
-   * Given the appShuffleInfo, shuffleId and reduceId that uniquely identifies a given shuffle
-   * partition of an application, retrieves the associated metadata. If not present and the
-   * corresponding merged shuffle does not exist, initializes the metadata.
+   * Given the appShuffleInfo, shuffleId, shuffleMergeId and reduceId that uniquely identifies
+   * a given shuffle partition of an application, retrieves the associated metadata. If not
+   * present and the corresponding merged shuffle does not exist, initializes the metadata.
    */
   private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo(
       AppShuffleInfo appShuffleInfo,
       int shuffleId,
-      int reduceId) {
-    File dataFile = appShuffleInfo.getMergedShuffleDataFile(shuffleId, reduceId);
-    ConcurrentMap<Integer, Map<Integer, AppShufflePartitionInfo>> partitions =
+      int shuffleMergeId,
+      int reduceId) throws RuntimeException {
+    ConcurrentMap<Integer, Map<Integer, Map<Integer, AppShufflePartitionInfo>>> partitions =
       appShuffleInfo.partitions;
-    Map<Integer, AppShufflePartitionInfo> shufflePartitions =
-      partitions.compute(shuffleId, (id, map) -> {
-        if (map == null) {
-          // If this partition is already finalized then the partitions map will not contain the
-          // shuffleId but the data file would exist. In that case the block is considered late.
-          if (dataFile.exists()) {
-            return null;
-          }
-          return new ConcurrentHashMap<>();
+    AtomicReference<Map<Integer, Map<Integer, AppShufflePartitionInfo>>> shuffleMergePartitionsRef
+      = new AtomicReference<>(null);
+    partitions.compute(shuffleId, (id, shuffleMergePartitionsMap) -> {
+      if (shuffleMergePartitionsMap == null) {
+        logger.info("Creating a new attempt for shuffle blocks push request for"
+          + " shuffle {} with shuffleMergeId {} for application {}_{}", shuffleId,
+          shuffleMergeId, appShuffleInfo.appId, appShuffleInfo.attemptId);
+        Map<Integer, Map<Integer, AppShufflePartitionInfo>> newShuffleMergePartitions
+          = new ConcurrentHashMap<>();
+        Map<Integer, AppShufflePartitionInfo> newPartitionsMap = new ConcurrentHashMap<>();
+        newShuffleMergePartitions.put(shuffleMergeId, newPartitionsMap);
+        shuffleMergePartitionsRef.set(newShuffleMergePartitions);
+        return newShuffleMergePartitions;
+      } else if (shuffleMergePartitionsMap.containsKey(shuffleMergeId)) {
+        shuffleMergePartitionsRef.set(shuffleMergePartitionsMap);
+        return shuffleMergePartitionsMap;
+      } else {
+        int latestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+          .mapToInt(x -> x).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+        int secondLatestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+          .mapToInt(x -> x).filter(x -> x != latestShuffleMergeId)
+          .max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+        if (latestShuffleMergeId > shuffleMergeId) {
+          // Reject the request as we have already seen a higher shuffleMergeId than the
+          // current incoming one
+          throw new RuntimeException(String.format("Rejecting shuffle blocks push request for"
+            + " shuffle %s with shuffleMergeId %s for application %s_%s as a higher"
+            + " shuffleMergeId %s request is already seen", shuffleId, shuffleMergeId,
+            appShuffleInfo.appId, appShuffleInfo.attemptId, latestShuffleMergeId));
         } else {
-          return map;
+          // Higher shuffleMergeId seen for the shuffle ID meaning new stage attempt is being
+          // run for the shuffle ID. Close and clean up old shuffleMergeId files,
+          // happens in the non-deterministic stage retries
+          logger.info("Creating a new attempt for shuffle blocks push request for shuffle {} with"
+            + " shuffleMergeId {} for application {}_{} since it is higher than the latest"
+            + " shuffleMergeId {} already seen", shuffleId, shuffleMergeId, appShuffleInfo.appId,
+            appShuffleInfo.attemptId, latestShuffleMergeId);
+          if (latestShuffleMergeId != UNDEFINED_SHUFFLE_MERGE_ID &&
+              shuffleMergePartitionsMap.containsKey(latestShuffleMergeId)) {
+            Map<Integer, AppShufflePartitionInfo> latestShufflePartitions =
+              shuffleMergePartitionsMap.get(latestShuffleMergeId);
+            mergedShuffleCleaner.execute(() ->
+              closeAndDeletePartitionFiles(latestShufflePartitions));
+            shuffleMergePartitionsMap.put(latestShuffleMergeId, STALE_SHUFFLE_PARTITIONS);
+          }
+          // Remove older shuffleMergeIds which won't be required anymore
+          if (secondLatestShuffleMergeId != UNDEFINED_SHUFFLE_MERGE_ID) {
+
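The diff above leans on `ConcurrentMap.compute` running its remapping function atomically per key, with an `AtomicReference` used to carry the chosen inner map out of the lambda. A minimal standalone sketch of that get-or-create pattern, with hypothetical names and simplified value types:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the compute-based get-or-create used in the diff:
// the remapping function runs atomically for the key, and the AtomicReference
// exposes whichever inner map was selected or created inside the lambda.
public class GetOrCreateSketch {
  public static Map<Integer, String> getOrCreate(
      ConcurrentMap<Integer, Map<Integer, String>> partitions, int shuffleId) {
    AtomicReference<Map<Integer, String>> ref = new AtomicReference<>(null);
    partitions.compute(shuffleId, (id, inner) -> {
      if (inner == null) {
        // First push seen for this shuffleId: initialize its partition map.
        inner = new ConcurrentHashMap<>();
      }
      ref.set(inner);
      return inner;
    });
    return ref.get();
  }

  public static void main(String[] args) {
    ConcurrentMap<Integer, Map<Integer, String>> partitions = new ConcurrentHashMap<>();
    Map<Integer, String> first = getOrCreate(partitions, 1);
    Map<Integer, String> second = getOrCreate(partitions, 1);
    System.out.println(first == second); // repeat calls return the same inner map
  }
}
```

Because `compute` holds the bin lock for the key while the lambda runs, two concurrent pushers for the same `shuffleId` cannot both create a fresh inner map, which is why the patch can also safely throw from inside the lambda to reject stale `shuffleMergeId` requests.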
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark GHA job
HyukjinKwon commented on a change in pull request #33532: URL: https://github.com/apache/spark/pull/33532#discussion_r677118544

## File path: .github/workflows/build_and_test.yml
##
@@ -206,6 +206,7 @@ jobs:
       GITHUB_PREV_SHA: ${{ github.event.before }}
       SPARK_LOCAL_IP: localhost
       SKIP_UNIDOC: true
+      SKIP_MIMA: true

Review comment: Can we skip this in SparkR build too?
[GitHub] [spark] HyukjinKwon commented on pull request #33385: [SPARK-36173][CORE] Support getting CPU number in TaskContext
HyukjinKwon commented on pull request #33385: URL: https://github.com/apache/spark/pull/33385#issuecomment-887208391 Ohhh got it. Thanks for clarification!
[GitHub] [spark] williamhyun commented on pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark GHA job
williamhyun commented on pull request #33532: URL: https://github.com/apache/spark/pull/33532#issuecomment-887207431 cc: @HyukjinKwon , @dongjoon-hyun
[GitHub] [spark] williamhyun opened a new pull request #33532: [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark GHA job
williamhyun opened a new pull request #33532: URL: https://github.com/apache/spark/pull/33532

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] SparkQA removed a comment on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source
SparkQA removed a comment on pull request #33413: URL: https://github.com/apache/spark/pull/33413#issuecomment-887197252 **[Test build #141681 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141681/testReport)** for PR 33413 at commit [`fdecefc`](https://github.com/apache/spark/commit/fdecefc1c4db1803d2e9683fae3b4d2172904563).
[GitHub] [spark] SparkQA commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source
SparkQA commented on pull request #33413: URL: https://github.com/apache/spark/pull/33413#issuecomment-887206619 **[Test build #141681 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141681/testReport)** for PR 33413 at commit [`fdecefc`](https://github.com/apache/spark/commit/fdecefc1c4db1803d2e9683fae3b4d2172904563). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] mridulm commented on a change in pull request #33385: [SPARK-36173][CORE] Support getting CPU number in TaskContext
mridulm commented on a change in pull request #33385: URL: https://github.com/apache/spark/pull/33385#discussion_r677115269

## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
##
@@ -429,6 +429,7 @@ private[spark] class TaskSetManager(
       execId: String,
       host: String,
       maxLocality: TaskLocality.TaskLocality,
+      taskCpuAssignments: Int = 1,

Review comment: If we can require explicitly passing it, that would be ideal, I agree.
[GitHub] [spark] mridulm commented on pull request #33385: [SPARK-36173][CORE] Support getting CPU number in TaskContext
mridulm commented on pull request #33385: URL: https://github.com/apache/spark/pull/33385#issuecomment-887205429 @HyukjinKwon cpu support in resource profile allows users to override the default 1 core per task at a stage level - and so makes sense to expose.
[GitHub] [spark] SparkQA removed a comment on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
SparkQA removed a comment on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887156766 **[Test build #141675 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141675/testReport)** for PR 33526 at commit [`6e73fb3`](https://github.com/apache/spark/commit/6e73fb32a533c384f41bc13014bfdf511b3d07dc).
[GitHub] [spark] SparkQA commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
SparkQA commented on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887204625 **[Test build #141675 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141675/testReport)** for PR 33526 at commit [`6e73fb3`](https://github.com/apache/spark/commit/6e73fb32a533c384f41bc13014bfdf511b3d07dc).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] Ngone51 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
Ngone51 commented on a change in pull request #33034: URL: https://github.com/apache/spark/pull/33034#discussion_r677061512

## File path: core/src/main/scala/org/apache/spark/Dependency.scala

@@ -150,6 +155,22 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( } } + def newShuffleMergeState(): Unit = { +_shuffleMergeEnabled = canShuffleMergeBeEnabled()

Review comment: Isn't this a constant value? Why do we need to reset it every time?

## File path: core/src/main/scala/org/apache/spark/Dependency.scala

@@ -124,6 +122,13 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]( */ private[this] var _shuffleMergedFinalized: Boolean = false + /** + * shuffleMergeId is used to uniquely identify a indeterminate stage attempt of a shuffle Id.

Review comment: How about:
```suggestion
   * shuffleMergeId is used to uniquely identify a merging process of an indeterminate stage attempt.
```

## File path: common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java

@@ -206,12 +206,15 @@ public long sendRpc(ByteBuffer message, RpcResponseCallback callback) { * * @param appId applicationId. * @param shuffleId shuffle id. + * @param shuffleMergeId shuffleMergeId is used to uniquely identify a indeterminate stage + * attempt of a shuffle Id.

Review comment: nit: "... identify a merging process of an indeterminate stage attempt."

## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockStoreClient.java

@@ -185,6 +188,8 @@ public void finalizeShuffleMerge( * @param host the host of the remote node. * @param port the port of the remote node. * @param shuffleId shuffle id. + * @param shuffleMergeId shuffleMergeId is used to uniquely identify a indeterminate stage + * attempt of a shuffle Id.

Review comment: ditto.
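These comments concern how `shuffleMergeId` distinguishes merge attempts of an indeterminate stage. The server-side logic in `RemoteBlockPushResolver` (quoted at the top of this thread) keeps the highest `shuffleMergeId` seen so far and rejects pushes that carry an older one. A minimal standalone Java sketch of that decision follows; the `classify` helper and its string results are illustrative stand-ins for the PR's actual control flow, and `UNDEFINED_SHUFFLE_MERGE_ID` mirrors the constant in the diff:

```java
import java.util.Map;

public class ShuffleMergeIdCheck {
    static final int UNDEFINED_SHUFFLE_MERGE_ID = -1;

    // Decide whether a push for `shuffleMergeId` targets the currently tracked
    // merge, is stale (a higher id was already seen), or starts a new attempt.
    static String classify(Map<Integer, ?> mergePartitions, int shuffleMergeId) {
        if (mergePartitions.containsKey(shuffleMergeId)) {
            return "current";
        }
        int latest = mergePartitions.keySet().stream()
            .mapToInt(x -> x).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
        if (latest > shuffleMergeId) {
            // A higher shuffleMergeId was already seen: reject the push as stale.
            return "stale";
        }
        // Higher than anything seen: a new indeterminate stage attempt begins.
        return "newAttempt";
    }

    public static void main(String[] args) {
        Map<Integer, String> seen = Map.of(0, "partitions-v0", 2, "partitions-v2");
        System.out.println(classify(seen, 2)); // current: this merge id is registered
        System.out.println(classify(seen, 1)); // stale: merge id 2 was already seen
        System.out.println(classify(seen, 3)); // newAttempt: higher than anything seen
    }
}
```

An empty map classifies any incoming id as `newAttempt`, which matches the PR's behavior of initializing metadata on the first push for a shuffle.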
## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java ## @@ -125,63 +125,87 @@ private AbstractFetchShuffleBlocks createFetchShuffleBlocksOrChunksMsg( String execId, String[] blockIds) { if (blockIds[0].startsWith(SHUFFLE_CHUNK_PREFIX)) { - return createFetchShuffleMsgAndBuildBlockIds(appId, execId, blockIds, true); + return createFetchShuffleChunksMsg(appId, execId, blockIds); } else { - return createFetchShuffleMsgAndBuildBlockIds(appId, execId, blockIds, false); + return createFetchShuffleBlocksMsg(appId, execId, blockIds); } } - /** - * Create FetchShuffleBlocks/FetchShuffleBlockChunks message and rebuild internal blockIds by - * analyzing the passed in blockIds. - */ - private AbstractFetchShuffleBlocks createFetchShuffleMsgAndBuildBlockIds( + private AbstractFetchShuffleBlocks createFetchShuffleBlocksMsg( String appId, String execId, - String[] blockIds, - boolean areMergedChunks) { + String[] blockIds) { String[] firstBlock = splitBlockId(blockIds[0]); int shuffleId = Integer.parseInt(firstBlock[1]); boolean batchFetchEnabled = firstBlock.length == 5; - -// In case of FetchShuffleBlocks, primaryId is mapId. For FetchShuffleBlockChunks, primaryId -// is reduceId. 
-LinkedHashMap primaryIdToBlocksInfo = new LinkedHashMap<>(); +Map mapIdToBlocksInfo = new LinkedHashMap<>(); for (String blockId : blockIds) { String[] blockIdParts = splitBlockId(blockId); if (Integer.parseInt(blockIdParts[1]) != shuffleId) { -throw new IllegalArgumentException("Expected shuffleId=" + shuffleId + - ", got:" + blockId); - } - Number primaryId; - if (!areMergedChunks) { -primaryId = Long.parseLong(blockIdParts[2]); - } else { -primaryId = Integer.parseInt(blockIdParts[2]); +throw new IllegalArgumentException("Expected shuffleId=" + shuffleId + ", got:" + blockId); } - BlocksInfo blocksInfoByPrimaryId = primaryIdToBlocksInfo.computeIfAbsent(primaryId, -id -> new BlocksInfo()); - blocksInfoByPrimaryId.blockIds.add(blockId); - // If blockId is a regular shuffle block, then blockIdParts[3] = reduceId. If blockId is a - // shuffleChunk block, then blockIdParts[3] = chunkId - blocksInfoByPrimaryId.ids.add(Integer.parseInt(blockIdParts[3])); + + long mapId = Long.parseLong(blockIdParts[2]); + BlocksInfo blocksInfoByMapId = mapIdToBlocksInfo.computeIfAbsent(mapId, + id -> new BlocksInfo()); + blocksInfoByMapId.blockIds.add(blockId); + blocksInfoByMapId.ids.add(Integer.parseInt(blockIdParts[3])); + if (batchFetchEnabled) { -// It comes here only if the blockId is a regular shuffle block not a shuffleChunk block. // When we read continuous shuffle
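The refactor above splits block-id parsing into per-kind methods; the core of `createFetchShuffleBlocksMsg` is grouping ids of the form `shuffle_<shuffleId>_<mapId>_<reduceId>` by map id while validating that every id belongs to the same shuffle. A rough standalone sketch of that grouping (simplified: no batch-fetch handling, and the PR's `BlocksInfo` is reduced here to a plain list of reduce ids):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BlockIdGrouping {
    // Group shuffle block ids of the form "shuffle_<shuffleId>_<mapId>_<reduceId>"
    // by map id, rejecting ids from a different shuffle (as the PR does).
    static Map<Long, List<Integer>> groupByMapId(String[] blockIds) {
        String[] firstBlock = blockIds[0].split("_");
        int shuffleId = Integer.parseInt(firstBlock[1]);
        // LinkedHashMap preserves the order blocks were requested in.
        Map<Long, List<Integer>> mapIdToReduceIds = new LinkedHashMap<>();
        for (String blockId : blockIds) {
            String[] parts = blockId.split("_");
            if (Integer.parseInt(parts[1]) != shuffleId) {
                throw new IllegalArgumentException(
                    "Expected shuffleId=" + shuffleId + ", got:" + blockId);
            }
            long mapId = Long.parseLong(parts[2]);
            mapIdToReduceIds.computeIfAbsent(mapId, id -> new ArrayList<>())
                .add(Integer.parseInt(parts[3]));
        }
        return mapIdToReduceIds;
    }

    public static void main(String[] args) {
        String[] ids = {"shuffle_5_1_0", "shuffle_5_1_2", "shuffle_5_7_0"};
        System.out.println(groupByMapId(ids)); // {1=[0, 2], 7=[0]}
    }
}
```

For merged chunks (`shuffleChunk_` ids) the same grouping applies with an `int` reduce id as the key and chunk ids as the values, which is why the PR can keep the two code paths nearly symmetric.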
[GitHub] [spark] sunchao commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
sunchao commented on pull request #33350: URL: https://github.com/apache/spark/pull/33350#issuecomment-887204038 Sorry. Let me check the failed tests.
[GitHub] [spark] gengliangwang commented on a change in pull request #33308: [SPARK-35918][AVRO] Unify schema mismatch handling for read/write and enhance error messages
gengliangwang commented on a change in pull request #33308: URL: https://github.com/apache/spark/pull/33308#discussion_r677112782 ## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSerdeSuite.scala ## @@ -86,18 +86,18 @@ class AvroSerdeSuite extends SparkFunSuite { // deserialize should have no issues when 'bar' is nullable but fail when it is nonnull Deserializer.create(CATALYST_STRUCT, avro, BY_NAME) assertFailedConversionMessage(avro, Deserializer, BY_NAME, - "Cannot find non-nullable field 'foo.bar' (at position 0) in Avro schema.", + "Cannot find field 'foo.bar' in Avro schema", nonnullCatalyst) assertFailedConversionMessage(avro, Deserializer, BY_POSITION, - "Cannot find non-nullable field at position 1 (field 'foo.baz') in Avro schema.", + "Cannot find field at position 1 of field 'foo' from Avro schema", Review comment: Shall we mention by-position match?
[GitHub] [spark] cloud-fan commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
cloud-fan commented on pull request #33350: URL: https://github.com/apache/spark/pull/33350#issuecomment-887203356 hmm, they are also failing in master: https://github.com/apache/spark/pull/33498
[GitHub] [spark] gengliangwang commented on a change in pull request #33308: [SPARK-35918][AVRO] Unify schema mismatch handling for read/write and enhance error messages
gengliangwang commented on a change in pull request #33308: URL: https://github.com/apache/spark/pull/33308#discussion_r677112492 ## File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSerdeSuite.scala ## @@ -86,18 +86,18 @@ class AvroSerdeSuite extends SparkFunSuite { // deserialize should have no issues when 'bar' is nullable but fail when it is nonnull Deserializer.create(CATALYST_STRUCT, avro, BY_NAME) assertFailedConversionMessage(avro, Deserializer, BY_NAME, - "Cannot find non-nullable field 'foo.bar' (at position 0) in Avro schema.", + "Cannot find field 'foo.bar' in Avro schema", Review comment: Shall we mention "by-name" match here?
```
Cannot find field 'foo.bar' in Avro schema with by-name match
```
[GitHub] [spark] SparkQA commented on pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe
SparkQA commented on pull request #32332: URL: https://github.com/apache/spark/pull/32332#issuecomment-887202286 **[Test build #141683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141683/testReport)** for PR 32332 at commit [`8ddf243`](https://github.com/apache/spark/commit/8ddf24309623be15ae6e4a5a9ea52dcfe8069792).
[GitHub] [spark] darcy-shen commented on a change in pull request #32332: [SPARK-35211][PYTHON] verify inferred schema for _create_dataframe
darcy-shen commented on a change in pull request #32332: URL: https://github.com/apache/spark/pull/32332#discussion_r677110817 ## File path: python/pyspark/sql/session.py ## @@ -697,12 +712,19 @@ def prepare(obj): verify_func(obj) return obj, else: +no_need_to_prepare = True prepare = lambda obj: obj -if isinstance(data, RDD): -rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio) +if isinstance(data, RDD) and no_need_to_prepare: +rdd, schema = self._createFromRDD( +data, schema, verifySchema, samplingRatio) +elif isinstance(data, RDD) and (not no_need_to_prepare): +rdd, schema = self._createFromRDD( +data.map(prepare), schema, verifySchema, samplingRatio) Review comment: Refactored, thanks!
[GitHub] [spark] SparkQA removed a comment on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
SparkQA removed a comment on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887139023 **[Test build #141674 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141674/testReport)** for PR 33526 at commit [`21ef1a1`](https://github.com/apache/spark/commit/21ef1a1c9e39bcb04cb4f180c80536edbf8f50be).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
AmplabJenkins removed a comment on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887200904 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141674/
[GitHub] [spark] AmplabJenkins commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
AmplabJenkins commented on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887200904 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141674/
[GitHub] [spark] SparkQA commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
SparkQA commented on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887200656 **[Test build #141674 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141674/testReport)** for PR 33526 at commit [`21ef1a1`](https://github.com/apache/spark/commit/21ef1a1c9e39bcb04cb4f180c80536edbf8f50be).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] cloud-fan commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
cloud-fan commented on pull request #33350: URL: https://github.com/apache/spark/pull/33350#issuecomment-887200130 seems they are failing in 3.2 branch
[GitHub] [spark] SparkQA commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
SparkQA commented on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887199134 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46193/
[GitHub] [spark] SparkQA commented on pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
SparkQA commented on pull request #33034: URL: https://github.com/apache/spark/pull/33034#issuecomment-887197367 **[Test build #141682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141682/testReport)** for PR 33034 at commit [`11af40d`](https://github.com/apache/spark/commit/11af40d9ea3550aa1014f155dbc8cdc201b06145).
[GitHub] [spark] SparkQA commented on pull request #33413: [SPARK-36175][SQL] Support TimestampNTZ in Avro data source
SparkQA commented on pull request #33413: URL: https://github.com/apache/spark/pull/33413#issuecomment-887197252 **[Test build #141681 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141681/testReport)** for PR 33413 at commit [`fdecefc`](https://github.com/apache/spark/commit/fdecefc1c4db1803d2e9683fae3b4d2172904563).
[GitHub] [spark] SparkQA commented on pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field
SparkQA commented on pull request #33531: URL: https://github.com/apache/spark/pull/33531#issuecomment-887197148 **[Test build #141680 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141680/testReport)** for PR 33531 at commit [`d79bdbf`](https://github.com/apache/spark/commit/d79bdbfc4e666be736f0c2080c39848db77634c5).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
AmplabJenkins removed a comment on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887196180 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46191/
[GitHub] [spark] AmplabJenkins commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
AmplabJenkins commented on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887196180 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46191/
[GitHub] [spark] SparkQA commented on pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too
SparkQA commented on pull request #33441: URL: https://github.com/apache/spark/pull/33441#issuecomment-887195858 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46194/
[GitHub] [spark] AngersZhuuuu commented on pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field
AngersZhuuuu commented on pull request #33531: URL: https://github.com/apache/spark/pull/33531#issuecomment-887191536 ping @cloud-fan @dongjoon-hyun @HyukjinKwon
[GitHub] [spark] AngersZhuuuu opened a new pull request #33531: [SPARK-36312][SQL] parquetWriterSupport.setSchema should check inner field
AngersZhuuuu opened a new pull request #33531: URL: https://github.com/apache/spark/pull/33531

### What changes were proposed in this pull request?
The last PR only added the inner field check for Hive DDL; this PR adds the check for the Parquet data source write API.

### Why are the changes needed?
Failed earlier

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT. Without the fix it failed as:
``` [info] - SPARK-36312: ParquetWriteSupport should check inner field *** FAILED *** (8 seconds, 29 milliseconds) [info] Expected exception org.apache.spark.sql.AnalysisException to be thrown, but org.apache.spark.SparkException was thrown (HiveDDLSuite.scala:3035) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563) [info] at org.scalatest.Assertions.intercept(Assertions.scala:756) [info] at org.scalatest.Assertions.intercept$(Assertions.scala:746) [info] at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563) [info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396(HiveDDLSuite.scala:3035) [info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396$adapted(HiveDDLSuite.scala:3034) [info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69) [info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66) [info] at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:34) [info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$395(HiveDDLSuite.scala:3034) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) [info] at
org.apache.spark.sql.test.SQLTestUtilsBase.withView(SQLTestUtils.scala:316) [info] at org.apache.spark.sql.test.SQLTestUtilsBase.withView$(SQLTestUtils.scala:314) [info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.withView(HiveDDLSuite.scala:396) [info] at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$394(HiveDDLSuite.scala:3032) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) [info] at org.scalatest.Suite.run(Suite.scala:1112) [info] at org.scalatest.Suite.run$(Suite.scala:1094) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563) [info] at
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too
AngersZhuuuu commented on a change in pull request #33441: URL: https://github.com/apache/spark/pull/33441#discussion_r677094314 ## File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ## @@ -153,6 +154,27 @@ private[sql] class AvroFileFormat extends FileFormat } override def supportDataType(dataType: DataType): Boolean = AvroUtils.supportsDataType(dataType) + + override def supportFieldName(name: String): Unit = { Review comment: How about current?
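The `supportFieldName` override under discussion mirrors Avro's own `Schema.validateName` rule: a name must be non-empty, start with a letter or underscore, and contain only letters, digits, and underscores. A hedged standalone Java sketch of that rule follows; this is a reimplementation for illustration, not the Avro library code, and like the quoted PR snippet it uses `Character.isLetter`, which accepts Unicode letters and is therefore slightly broader than Avro's ASCII `[A-Za-z]` grammar:

```java
public class AvroNameCheck {
    // Returns true if `name` looks like a valid Avro record/field name:
    // non-empty, starts with a letter or '_', rest are letters, digits, or '_'.
    static boolean isValidAvroName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        char first = name.charAt(0);
        if (!Character.isLetter(first) && first != '_') {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != '_') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidAvroName("_id"));                  // true
        // The expression alias from the error message below: rejected
        // for its illegal initial character '('.
        System.out.println(isValidAvroName("(IF((ID = 1), 1, 0))")); // false
    }
}
```

Checking names this way at DDL/write time lets Spark raise an `AnalysisException` up front instead of surfacing Avro's `SchemaParseException` from deep inside the writer, which is the motivation for the PR.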
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too
AngersZhuuuu commented on a change in pull request #33441: URL: https://github.com/apache/spark/pull/33441#discussion_r677094233 ## File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ## @@ -153,6 +154,27 @@ private[sql] class AvroFileFormat extends FileFormat } override def supportDataType(dataType: DataType): Boolean = AvroUtils.supportsDataType(dataType) + + override def supportFieldName(name: String): Unit = { +val length = name.length +if (length == 0) { + throw QueryCompilationErrors.columnNameContainsInvalidCharactersError(name) +} else { + val first = name.charAt(0) + if (!Character.isLetter(first) && first != '_') { Review comment: > do you have some reference doc to prove this is indeed a limitation of avro? or can you run some local tests like `df.write.format("avro").save(path)`? For `df.write.format(source).save(path)` Avro error message ``` [info] org.apache.avro.SchemaParseException: Illegal initial character: (IF((ID = 1), 1, 0)) [info] at org.apache.avro.Schema.validateName(Schema.java:1562) [info] at org.apache.avro.Schema.access$400(Schema.java:91) [info] at org.apache.avro.Schema$Field.<init>(Schema.java:546) [info] at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2240) [info] at org.apache.avro.SchemaBuilder$FieldBuilder.completeField(SchemaBuilder.java:2236) [info] at org.apache.avro.SchemaBuilder$FieldBuilder.access$5100(SchemaBuilder.java:2150) [info] at org.apache.avro.SchemaBuilder$GenericDefault.noDefault(SchemaBuilder.java:2539) [info] at org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$1(SchemaConverters.scala:194) [info] at scala.collection.Iterator.foreach(Iterator.scala:943) [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) [info] at scala.collection.IterableLike.foreach(IterableLike.scala:74) [info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73) [info] at
org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) [info] at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:191) [info] at org.apache.spark.sql.avro.AvroUtils$.$anonfun$prepareWrite$1(AvroUtils.scala:98) [info] at scala.Option.getOrElse(Option.scala:189) [info] at org.apache.spark.sql.avro.AvroUtils$.prepareWrite(AvroUtils.scala:97) [info] at org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:74) [info] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:142) [info] at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) [info] at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) [info] at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) [info] at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) [info] at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) [info] at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) [info] at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) [info] at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) [info] at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) ``` parquet error message ``` [info] - SPARK-33865: Hive DDL with avro should check col name *** FAILED *** (4 seconds, 554 milliseconds) [info] org.apache.spark.sql.AnalysisException: Column name "(IF((ID = 1), 1, 0))" contains invalid character(s). Please use alias to rename it. 
[info] at org.apache.spark.sql.errors.QueryCompilationErrors$.columnNameContainsInvalidCharactersError(QueryCompilationErrors.scala:2105) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:590) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.$anonfun$setSchema$2(ParquetWriteSupport.scala:485) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.$anonfun$setSchema$2$adapted(ParquetWriteSupport.scala:485) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:485) [info] at
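The `SchemaParseException` above comes from Avro's field-name rule: a name must start with a letter or `_`, and every following character must be a letter, digit, or `_`. A standalone sketch of that rule (re-implemented here for illustration; the real check lives in `org.apache.avro.Schema.validateName`, and the class name below is ours):

```java
// Sketch of the naming rule Avro's Schema.validateName enforces:
// first character must be a letter or '_', the rest letters, digits, or '_'.
// Standalone re-implementation for illustration only.
class AvroNameCheck {
    static boolean isValidAvroName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        char first = name.charAt(0);
        if (!Character.isLetter(first) && first != '_') {
            return false; // Avro: "Illegal initial character"
        }
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != '_') {
                return false; // Avro: "Illegal character in" name
            }
        }
        return true;
    }
}
```

A generated column alias such as `(IF((ID = 1), 1, 0))` fails on its very first character, which is why the write above throws before any data is written.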
[GitHub] [spark] beliefer commented on a change in pull request #33317: [SPARK-36095][CORE] Grouping exception in core/rdd
beliefer commented on a change in pull request #33317: URL: https://github.com/apache/spark/pull/33317#discussion_r677093918 ## File path: core/src/main/scala/org/apache/spark/errors/SparkCoreErrors.scala ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.errors + +import java.io.IOException + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.storage.{BlockId, RDDBlockId} + +/** + * Object for grouping error messages from (most) exceptions thrown during query execution. 
+ */ +private[spark] object SparkCoreErrors { + def rddBlockNotFoundError(blockId: BlockId, id: Int): Throwable = { +new Exception(s"Could not compute split, block $blockId of RDD $id not found") + } + + def blockHaveBeenRemovedError(string: String): Throwable = { +new SparkException(s"Attempted to use $string after its blocks have been removed!") + } + + def histogramOnEmptyRDDOrContainingInfinityOrNaNError(): Throwable = { +new UnsupportedOperationException( + "Histogram on either an empty RDD or RDD containing +/-infinity or NaN") + } + + def emptyRDDError(): Throwable = { +new UnsupportedOperationException("empty RDD") + } + + def pathNotSupportedError(path: String): Throwable = { +new IOException(s"Path: ${path} is a directory, which is not supported by the " + + s"record reader when `mapreduce.input.fileinputformat.input.dir.recursive` is false.") + } + + def checkpointRDDBlockIdNotFoundError(rddBlockId: RDDBlockId): Throwable = { +new SparkException(s"Checkpoint block $rddBlockId not found! Either the executor " + + s"that originally checkpointed this partition is no longer alive, or the original RDD is " + + s"unpersisted. 
If this problem persists, you may consider using `rdd.checkpoint()` " + + s"instead, which is slower than local checkpointing but more fault-tolerant.") + } + + def endOfStreamError(): Throwable = { +new java.util.NoSuchElementException("End of stream") + } + + def cannotUseMapSideCombiningWithArrayKeyError(): Throwable = { +new SparkException("Cannot use map-side combining with array keys.") + } + + def hashPartitionerCannotPartitionArrayKeyError(): Throwable = { +new SparkException("HashPartitioner cannot partition array keys.") + } + + def reduceByKeyLocallyNotSupportArrayKeysError(): Throwable = { +new SparkException("reduceByKeyLocally() does not support array keys") + } + + def noSuchElementException(): Throwable = { +new NoSuchElementException() + } + + def rddLacksSparkContextError(): Throwable = { +new SparkException("This RDD lacks a SparkContext. It could happen in the following cases: " + + "\n(1) RDD transformations and actions are NOT invoked by the driver, but inside of other " + + "transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid " + + "because the values transformation and count action cannot be performed inside of the " + + "rdd1.map transformation. For more information, see SPARK-5063.\n(2) When a Spark " + + "Streaming job recovers from checkpoint, this exception will be hit if a reference to " + + "an RDD not defined by the streaming job is used in DStream operations. For more " + + "information, See SPARK-13758.") Review comment: > can we split paragraph like """ ... """.stripMargin.replaceAll("\n", " ") + "\n" + """ ... """.stripMargin.replaceAll("\n", " ") Seems good! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
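The `SparkCoreErrors` object reviewed above follows a simple pattern: one factory groups every exception constructor, so messages live in a single place and call sites stay one-liners (`throw SparkCoreErrors.xxxError(...)`). A miniature of that pattern in Java (class and method names here are illustrative, not Spark's):

```java
// Miniature of the error-grouping pattern in SparkCoreErrors: a single
// factory class builds every exception so the message text is centralized.
// Names are illustrative only.
class CoreErrors {
    private CoreErrors() {}

    static RuntimeException emptyRddError() {
        return new UnsupportedOperationException("empty RDD");
    }

    static RuntimeException blockNotFoundError(String blockId, int rddId) {
        return new RuntimeException(
            "Could not compute split, block " + blockId + " of RDD " + rddId + " not found");
    }
}
```

A call site then just does `throw CoreErrors.blockNotFoundError(blockId.toString(), id);`, which is what the PR migrates the scattered `new SparkException(...)` sites toward.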
[GitHub] [spark] SparkQA commented on pull request #33526: [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
SparkQA commented on pull request #33526: URL: https://github.com/apache/spark/pull/33526#issuecomment-887182859 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46191/
[GitHub] [spark] beliefer commented on a change in pull request #33317: [SPARK-36095][CORE] Grouping exception in core/rdd
beliefer commented on a change in pull request #33317: URL: https://github.com/apache/spark/pull/33317#discussion_r677092283 ## File path: core/src/main/scala/org/apache/spark/errors/SparkCoreErrors.scala ## @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.errors + +import java.io.IOException + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.storage.{BlockId, RDDBlockId} + +/** + * Object for grouping error messages from (most) exceptions thrown during query execution. 
+ */ +private[spark] object SparkCoreErrors { + def rddBlockNotFoundError(blockId: BlockId, id: Int): Throwable = { +new Exception(s"Could not compute split, block $blockId of RDD $id not found") + } + + def blockHaveBeenRemovedError(string: String): Throwable = { +new SparkException(s"Attempted to use $string after its blocks have been removed!") + } + + def histogramOnEmptyRDDOrContainingInfinityOrNaNError(): Throwable = { +new UnsupportedOperationException( + "Histogram on either an empty RDD or RDD containing +/-infinity or NaN") + } + + def emptyRDDError(): Throwable = { +new UnsupportedOperationException("empty RDD") + } + + def pathNotSupportedError(path: String): Throwable = { +new IOException(s"Path: ${path} is a directory, which is not supported by the " + + "record reader when `mapreduce.input.fileinputformat.input.dir.recursive` is false.") + } + + def checkpointRDDBlockIdNotFoundError(rddBlockId: RDDBlockId): Throwable = { +new SparkException( + s""" + |Checkpoint block $rddBlockId not found! Either the executor + |that originally checkpointed this partition is no longer alive, or the original RDD is + |unpersisted. If this problem persists, you may consider using `rdd.checkpoint()` + |instead, which is slower than local checkpointing but more fault-tolerant. 
+ """.stripMargin.replaceAll("\n", " ")) + } + + def endOfStreamError(): Throwable = { +new java.util.NoSuchElementException("End of stream") + } + + def cannotUseMapSideCombiningWithArrayKeyError(): Throwable = { +new SparkException("Cannot use map-side combining with array keys.") + } + + def hashPartitionerCannotPartitionArrayKeyError(): Throwable = { +new SparkException("HashPartitioner cannot partition array keys.") + } + + def reduceByKeyLocallyNotSupportArrayKeysError(): Throwable = { +new SparkException("reduceByKeyLocally() does not support array keys") + } + + def noSuchElementException(): Throwable = { +new NoSuchElementException() + } + + def rddLacksSparkContextError(): Throwable = { +new SparkException("This RDD lacks a SparkContext. It could happen in the following cases: " + Review comment: Yeah. I got it. I made a mistake.
[GitHub] [spark] SparkQA commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
SparkQA commented on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887180729 **[Test build #141679 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141679/testReport)** for PR 31517 at commit [`7b360a7`](https://github.com/apache/spark/commit/7b360a7d577ea379db3487d451b0c7a744d1dc02).
[GitHub] [spark] SparkQA commented on pull request #33441: [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too
SparkQA commented on pull request #33441: URL: https://github.com/apache/spark/pull/33441#issuecomment-887180308 **[Test build #141678 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141678/testReport)** for PR 33441 at commit [`5b99ab6`](https://github.com/apache/spark/commit/5b99ab6df314578bf3ba923e89f24c2765cb089b).
[GitHub] [spark] HyukjinKwon closed pull request #33519: [SPARK-36288][DOCS][PYTHON] Update API usage on pyspark pandas documents
HyukjinKwon closed pull request #33519: URL: https://github.com/apache/spark/pull/33519
[GitHub] [spark] HyukjinKwon commented on pull request #33519: [SPARK-36288][DOCS][PYTHON] Update API usage on pyspark pandas documents
HyukjinKwon commented on pull request #33519: URL: https://github.com/apache/spark/pull/33519#issuecomment-887179955 Merged to master and branch-3.2.
[GitHub] [spark] LuciferYang commented on a change in pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
LuciferYang commented on a change in pull request #31517: URL: https://github.com/apache/spark/pull/31517#discussion_r677090585 ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala ## @@ -136,11 +136,14 @@ private[spark] object BlockManagerId { * The max cache size is hardcoded to 10000, since the size of a BlockManagerId * object is about 48B, the total memory cost should be below 1MB which is feasible. */ - val blockManagerIdCache = CacheBuilder.newBuilder() -.maximumSize(10000) -.build(new CacheLoader[BlockManagerId, BlockManagerId]() { - override def load(id: BlockManagerId) = id -}) + val blockManagerIdCache = { Review comment: 7b360a7 revert this change
[GitHub] [spark] LuciferYang commented on a change in pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
LuciferYang commented on a change in pull request #31517: URL: https://github.com/apache/spark/pull/31517#discussion_r677090543 ## File path: core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala ## @@ -84,16 +84,18 @@ private[spark] class ReliableCheckpointRDD[T: ClassTag]( } // Cache of preferred locations of checkpointed files. - @transient private[spark] lazy val cachedPreferredLocations = CacheBuilder.newBuilder() -.expireAfterWrite( - SparkEnv.get.conf.get(CACHE_CHECKPOINT_PREFERRED_LOCS_EXPIRE_TIME).get, - TimeUnit.MINUTES) -.build( - new CacheLoader[Partition, Seq[String]]() { -override def load(split: Partition): Seq[String] = { - getPartitionBlockLocations(split) -} - }) + @transient private[spark] lazy val cachedPreferredLocations = { +val builder = Caffeine.newBuilder() + .expireAfterWrite( +SparkEnv.get.conf.get(CACHE_CHECKPOINT_PREFERRED_LOCS_EXPIRE_TIME).get, +TimeUnit.MINUTES) +val loader = new CacheLoader[Partition, Seq[String]]() { + override def load(split: Partition): Seq[String] = { +getPartitionBlockLocations(split) + } +} +builder.build[Partition, Seq[String]](loader) Review comment: 7b360a7 revert this change
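Both Guava's `CacheBuilder` and Caffeine being swapped here expose the same loading-cache pattern: a cache built from a loader function, where `get()` computes and stores the value on a miss. The pattern can be sketched without either library (illustration only; a real cache adds eviction and expiry, which this sketch omits):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Minimal sketch of the loading-cache pattern that Guava's CacheBuilder and
// Caffeine both provide: get() computes via the loader on a miss, then caches.
// Illustration only -- no eviction, expiry, or size bound.
class LoadingCacheSketch<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    LoadingCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // computeIfAbsent runs the loader at most once per missing key.
        return map.computeIfAbsent(key, loader);
    }

    long size() {
        return map.size();
    }
}
```

The migration pain discussed in this thread is not the pattern itself but Scala/Java generic inference at the `build(...)` call site, which is why the diff spells out `build[Partition, Seq[String]](loader)` explicitly.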
[GitHub] [spark] venkata91 commented on a change in pull request #33034: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
venkata91 commented on a change in pull request #33034: URL: https://github.com/apache/spark/pull/33034#discussion_r677090442 ## File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java

```
@@ -135,51 +150,87 @@ protected AppShuffleInfo validateAndGetAppShuffleInfo(String appId) {
   private AppShufflePartitionInfo getOrCreateAppShufflePartitionInfo(
       AppShuffleInfo appShuffleInfo,
       int shuffleId,
+      int shuffleMergeId,
       int reduceId) {
-    File dataFile = appShuffleInfo.getMergedShuffleDataFile(shuffleId, reduceId);
-    ConcurrentMap<Integer, Map<Integer, AppShufflePartitionInfo>> partitions =
+    ConcurrentMap<Integer, Map<Integer, Map<Integer, AppShufflePartitionInfo>>> partitions =
       appShuffleInfo.partitions;
-    Map<Integer, AppShufflePartitionInfo> shufflePartitions =
-      partitions.compute(shuffleId, (id, map) -> {
-        if (map == null) {
-          // If this partition is already finalized then the partitions map will not contain the
-          // shuffleId but the data file would exist. In that case the block is considered late.
-          if (dataFile.exists()) {
+    Map<Integer, Map<Integer, AppShufflePartitionInfo>> shuffleMergePartitions =
+      partitions.compute(shuffleId, (id, shuffleMergePartitionsMap) -> {
+        if (shuffleMergePartitionsMap == null) {
+          logger.info("Creating a new attempt for shuffle blocks push request for"
+            + " shuffle {} with shuffleMergeId {} for application {}_{}", shuffleId,
+            shuffleMergeId, appShuffleInfo.appId, appShuffleInfo.attemptId);
+          Map<Integer, Map<Integer, AppShufflePartitionInfo>> newShuffleMergePartitions
+            = new ConcurrentHashMap<>();
+          Map<Integer, AppShufflePartitionInfo> newPartitionsMap = new ConcurrentHashMap<>();
+          newShuffleMergePartitions.put(shuffleMergeId, newPartitionsMap);
+          return newShuffleMergePartitions;
+        } else if (shuffleMergePartitionsMap.containsKey(shuffleMergeId)) {
+          return shuffleMergePartitionsMap;
+        } else {
+          int latestShuffleMergeId = shuffleMergePartitionsMap.keySet().stream()
+            .mapToInt(v -> v).max().orElse(UNDEFINED_SHUFFLE_MERGE_ID);
+          if (latestShuffleMergeId > shuffleMergeId) {
+            logger.info("Rejecting shuffle blocks push request for shuffle {} with"
+              + " shuffleMergeId {} for application {}_{} as a higher shuffleMergeId"
+              + " {} request is already seen", shuffleId, shuffleMergeId,
+              appShuffleInfo.appId, appShuffleInfo.attemptId, latestShuffleMergeId);
+            // Reject the request as we have already seen a higher shuffleMergeId than the
+            // current incoming one
             return null;
+          } else {
+            // Higher shuffleMergeId seen for the shuffle ID meaning new stage attempt is being
+            // run for the shuffle ID. Close and clean up old shuffleMergeId files,
+            // happens in the non-deterministic stage retries
+            logger.info("Creating a new attempt for shuffle blocks push request for"
+              + " shuffle {} with shuffleMergeId {} for application {}_{} since it is"
+              + " higher than the latest shuffleMergeId {} already seen", shuffleId,
+              shuffleMergeId, appShuffleInfo.appId, appShuffleInfo.attemptId,
+              latestShuffleMergeId);
+            if (null != shuffleMergePartitionsMap.get(latestShuffleMergeId)) {
+              Map<Integer, AppShufflePartitionInfo> shufflePartitions =
+                shuffleMergePartitionsMap.get(latestShuffleMergeId);
+              mergedShuffleCleaner.execute(() ->
+                closeAndDeletePartitionFiles(shufflePartitions));
+            }
+            shuffleMergePartitionsMap.put(latestShuffleMergeId, STALE_SHUFFLE_PARTITIONS);
+            Map<Integer, AppShufflePartitionInfo> newPartitionsMap = new ConcurrentHashMap<>();
+            shuffleMergePartitionsMap.put(shuffleMergeId, newPartitionsMap);
+            return shuffleMergePartitionsMap;
           }
-          return new ConcurrentHashMap<>();
-        } else {
-          return map;
         }
       });
-    if (shufflePartitions == null) {
+
+    Map<Integer, AppShufflePartitionInfo> shufflePartitions =
+      shuffleMergePartitions.get(shuffleMergeId);
+    if (shufflePartitions == FINALIZED_SHUFFLE_PARTITIONS
+        || shufflePartitions == STALE_SHUFFLE_PARTITIONS) {
+      // It only gets here when shufflePartitions is either FINALIZED_SHUFFLE_PARTITIONS or
+      // STALE_SHUFFLE_PARTITIONS. This happens in 2 cases:
+      // 1. Incoming block request is for an older shuffleMergeId of a shuffle (i.e. already
+      //    higher shuffle sequence Id blocks are being merged for this shuffle Id).
+      // 2. Shuffle for the current shuffleMergeId is already finalized.
       return null;
     }
+    File dataFile = appShuffleInfo.getMergedShuffleDataFile(shuffleId, shuffleMergeId, reduceId);
     return shufflePartitions.computeIfAbsent(reduceId, key -> {
-      // It only gets here when the key is not present in the map. This could either
-      // be the first time the merge manager receives a pushed block for a given application
```
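The control flow in the diff above boils down to a merge-id policy: for each shuffle, only the highest `shuffleMergeId` seen so far is active; pushes carrying an older id are rejected, and a newer id supersedes (and invalidates) the older state. That policy can be sketched independently of Spark (class and method names below are made up for the example, not the PR's API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the merge-id policy above: per shuffle, track the
// highest shuffleMergeId seen; accept a push only if it carries that id.
// An older id is rejected; a newer id atomically becomes the new latest.
// Names are hypothetical, for illustration only.
class MergeIdGate {
    private final Map<Integer, Integer> latestMergeId = new ConcurrentHashMap<>();

    /** Returns true if a push for (shuffleId, shuffleMergeId) should be accepted. */
    boolean tryAccept(int shuffleId, int shuffleMergeId) {
        // merge() atomically keeps the max id per shuffle.
        int latest = latestMergeId.merge(shuffleId, shuffleMergeId, Math::max);
        // Accept only when this push carries the (possibly just-updated) latest id.
        return latest == shuffleMergeId;
    }
}
```

In the real code the "supersede" step additionally schedules cleanup of the old attempt's partition files and marks the old entry `STALE_SHUFFLE_PARTITIONS`, which this sketch omits.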
[GitHub] [spark] LuciferYang commented on a change in pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
LuciferYang commented on a change in pull request #31517: URL: https://github.com/apache/spark/pull/31517#discussion_r677087470 ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala ## @@ -136,11 +136,14 @@ private[spark] object BlockManagerId { * The max cache size is hardcoded to 10000, since the size of a BlockManagerId * object is about 48B, the total memory cost should be below 1MB which is feasible. */ - val blockManagerIdCache = CacheBuilder.newBuilder() -.maximumSize(10000) -.build(new CacheLoader[BlockManagerId, BlockManagerId]() { - override def load(id: BlockManagerId) = id -}) + val blockManagerIdCache = { Review comment: Similar to `getPartitionBlockLocations`, it cannot be compiled with `.build[BlockManagerId, BlockManagerId](identity)`
[GitHub] [spark] dgd-contributor commented on a change in pull request #33317: [SPARK-36095][CORE] Grouping exception in core/rdd
dgd-contributor commented on a change in pull request #33317: URL: https://github.com/apache/spark/pull/33317#discussion_r677089979 ## File path: core/src/main/scala/org/apache/spark/errors/SparkCoreErrors.scala ## @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.errors + +import java.io.IOException + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.storage.{BlockId, RDDBlockId} + +/** + * Object for grouping error messages from (most) exceptions thrown during query execution. 
+ */ +private[spark] object SparkCoreErrors { + def rddBlockNotFoundError(blockId: BlockId, id: Int): Throwable = { +new Exception(s"Could not compute split, block $blockId of RDD $id not found") + } + + def blockHaveBeenRemovedError(string: String): Throwable = { +new SparkException(s"Attempted to use $string after its blocks have been removed!") + } + + def histogramOnEmptyRDDOrContainingInfinityOrNaNError(): Throwable = { +new UnsupportedOperationException( + "Histogram on either an empty RDD or RDD containing +/-infinity or NaN") + } + + def emptyRDDError(): Throwable = { +new UnsupportedOperationException("empty RDD") + } + + def pathNotSupportedError(path: String): Throwable = { +new IOException(s"Path: ${path} is a directory, which is not supported by the " + + "record reader when `mapreduce.input.fileinputformat.input.dir.recursive` is false.") + } + + def checkpointRDDBlockIdNotFoundError(rddBlockId: RDDBlockId): Throwable = { +new SparkException( + s""" + |Checkpoint block $rddBlockId not found! Either the executor + |that originally checkpointed this partition is no longer alive, or the original RDD is + |unpersisted. If this problem persists, you may consider using `rdd.checkpoint()` + |instead, which is slower than local checkpointing but more fault-tolerant. 
+ """.stripMargin.replaceAll("\n", " ")) + } + + def endOfStreamError(): Throwable = { +new java.util.NoSuchElementException("End of stream") + } + + def cannotUseMapSideCombiningWithArrayKeyError(): Throwable = { +new SparkException("Cannot use map-side combining with array keys.") + } + + def hashPartitionerCannotPartitionArrayKeyError(): Throwable = { +new SparkException("HashPartitioner cannot partition array keys.") + } + + def reduceByKeyLocallyNotSupportArrayKeysError(): Throwable = { +new SparkException("reduceByKeyLocally() does not support array keys") + } + + def noSuchElementException(): Throwable = { +new NoSuchElementException() + } + + def rddLacksSparkContextError(): Throwable = { +new SparkException("This RDD lacks a SparkContext. It could happen in the following cases: " + Review comment: if we do this, the message will end up spanning multiple lines, as stripMargin only removes the "|" character
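The point in the comment above is that `stripMargin` only strips the leading `|` markers; the embedded newlines survive, which is why the Scala code chains `.replaceAll("\n", " ")` to fold the literal back into a single-line message. A rough Java analogue of that fold (class and method names are ours, for illustration):

```java
// Rough analogue of the Scala `"""...""".stripMargin.replaceAll("\n", " ")`
// idiom discussed above: keep the source literal readable across lines,
// then collapse it to one line at runtime. Names are illustrative.
class MessageJoin {
    static String joinLines(String multiLine) {
        // strip() drops leading/trailing whitespace; the regex folds each
        // interior line break (plus surrounding indentation) into one space.
        return multiLine.strip().replaceAll("\\s*\n\\s*", " ");
    }
}
```

Without the fold, the exception message would print with the original line breaks, exactly the multi-line result the reviewer is warning about.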
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
AmplabJenkins removed a comment on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887178881 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46192/
[GitHub] [spark] AmplabJenkins commented on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
AmplabJenkins commented on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887178881 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46192/
[GitHub] [spark] SparkQA removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
SparkQA removed a comment on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887177451 **[Test build #141677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141677/testReport)** for PR 31517 at commit [`9cd9c35`](https://github.com/apache/spark/commit/9cd9c35872094b0f60f5175dc85494c45cde10d8).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31517: [SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache
AmplabJenkins removed a comment on pull request #31517: URL: https://github.com/apache/spark/pull/31517#issuecomment-887178755 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141677/