Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran merged PR #6716: URL: https://github.com/apache/hadoop/pull/6716

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2102258570

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:--------|
| | _ Prechecks _ | | | |
| +1 :green_heart: | dupname | 0m 05s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 05s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 05s | | detect-secrets was not available. |
| +0 :ok: | shellcheck | 0m 05s | | Shellcheck was not available. |
| +0 :ok: | shelldocs | 0m 05s | | Shelldocs was not available. |
| +0 :ok: | spotbugs | 0m 01s | | spotbugs executables are not available. |
| +0 :ok: | markdownlint | 0m 01s | | markdownlint was not available. |
| +0 :ok: | xmllint | 0m 01s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 00s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 00s | | The patch appears to include 11 new or modified test files. |
| | _ trunk Compile Tests _ | | | |
| +0 :ok: | mvndep | 2m 35s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 87m 02s | | trunk passed |
| +1 :green_heart: | compile | 38m 10s | | trunk passed |
| +1 :green_heart: | checkstyle | 6m 01s | | trunk passed |
| +1 :green_heart: | mvnsite | 16m 14s | | trunk passed |
| +1 :green_heart: | javadoc | 14m 05s | | trunk passed |
| +1 :green_heart: | shadedclient | 165m 47s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +0 :ok: | mvndep | 2m 13s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 11m 25s | | the patch passed |
| +1 :green_heart: | compile | 36m 11s | | the patch passed |
| +1 :green_heart: | javac | 36m 11s | | the patch passed |
| +1 :green_heart: | blanks | 0m 01s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 5m 43s | | the patch passed |
| +1 :green_heart: | mvnsite | 16m 13s | | the patch passed |
| +1 :green_heart: | javadoc | 13m 47s | | the patch passed |
| +1 :green_heart: | shadedclient | 174m 45s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| +1 :green_heart: | asflicense | 6m 07s | | The patch does not generate ASF License warnings. |
| | | 532m 21s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
| Optional Tests | dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint |
| uname | MINGW64_NT-10.0-17763 54377fa890b8 3.4.10-87d57229.x86_64 2024-02-14 20:17 UTC x86_64 Msys |
| Build tool | maven |
| Personality | /c/hadoop/dev-support/bin/hadoop.sh |
| git revision | trunk / 68dff782a05bdaaf1fcc1e4999e6fd404f07796b |
| Default Java | Azul Systems, Inc.-1.8.0_332-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/4/testReport/ |
| modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project hadoop-tools/hadoop-azure U: . |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/4/console |
| versions | git=2.44.0.windows.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2098947031

@saxenapranav what do you think of the patch now?
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
saxenapranav commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1590568348

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:

@@ -582,19 +611,111 @@ protected final Path directoryMustExist(
    * Save a task manifest or summary. This will be done by
    * writing to a temp path and then renaming.
    * If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
    * @param manifestData the manifest/success file
    * @param tempPath temp path for the initial save
    * @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @return the manifest saved.
+   * @throws IOException failure to rename after retries.
    */
   @SuppressWarnings("unchecked")
-  protected final void save(T manifestData,
+  protected final T save(
+      final T manifestData,
       final Path tempPath,
       final Path finalPath) throws IOException {
-    LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, finalPath);
-    trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-        operations.save(manifestData, tempPath, true));
-    renameFile(tempPath, finalPath);
+    return saveManifest(() -> manifestData, tempPath, finalPath, OP_SAVE_TASK_MANIFEST);
   }
+
+  /**
+   * Generate and save a task manifest or summary file.
+   * This will be done by writing to a temp path and then renaming.
+   *
+   * If the destination path exists: Delete it before the rename.
+   *
+   * This will retry so that a rename failure from abfs load or IO errors
+   * such as delete or save failure will not fail the task.
+   *
+   * The {@code manifestSource} supplier is invoked to get the manifest data
+   * on every attempt.
+   * This permits statistics to be updated, including those of failures.
+   * @param manifestSource supplier of the manifest/success file
+   * @param tempPath temp path for the initial save
+   * @param finalPath final path for rename.
+   * @param statistic statistic to use for timing
+   * @return the manifest saved.
+   * @throws IOException failure to save/delete/rename after retries.
+   */
+  @SuppressWarnings("unchecked")
+  protected final T saveManifest(
+      final Supplier<T> manifestSource,
+      final Path tempPath,
+      final Path finalPath,
+      String statistic) throws IOException {
+
+    AtomicInteger retryCount = new AtomicInteger(0);
+    RetryPolicy retryPolicy = retryUpToMaximumCountWithProportionalSleep(
+        getStageConfig().getManifestSaveAttempts(),
+        SAVE_SLEEP_INTERVAL,
+        TimeUnit.MILLISECONDS);
+
+    // loop until returning a value or raising an exception
+    while (true) {
+      try {
+        T manifestData = requireNonNull(manifestSource.get());
+        trackDurationOfInvocation(getIOStatistics(), statistic, () -> {
+          LOG.info("{}: save manifest to {} then rename as {}; retry count={}",
+              getName(), tempPath, finalPath, retryCount);
+
+          // delete temp path.
+          // even though this is written with overwrite=true, this extra recursive
+          // delete also handles a directory being there.
+          deleteRecursive(tempPath, OP_DELETE);
+
+          // save the temp file, overwriting any which remains from an earlier attempt
+          operations.save(manifestData, tempPath, true);
+
+          // delete the destination in case it exists either from a failed previous
+          // attempt or from a concurrent task commit.
+          delete(finalPath, true, OP_DELETE);
+
+          // rename temp to final
+          renameFile(tempPath, finalPath);

Review Comment: This is better: if the recovery also fails, we will not make additional HEAD calls for `escalateRenameFailure`. This looks good!
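The retry flow in the quoted `saveManifest()` hunk can be sketched in isolation. This is a minimal, hypothetical model, not Hadoop API: `FlakySink` stands in for the filesystem save-plus-rename, and a plain attempt counter replaces the real `RetryPolicy`. The key behaviours it mirrors are that the manifest supplier is re-invoked on every attempt, and the last failure escalates once the attempt limit is reached.

```java
import java.util.function.Supplier;

public class SaveRetrySketch {

  /** Hypothetical sink which fails the first N attempts with a transient error. */
  static class FlakySink {
    private int failuresLeft;
    FlakySink(int failures) { this.failuresLeft = failures; }
    void saveAndRename(String data) {
      if (failuresLeft-- > 0) {
        throw new RuntimeException("transient rename failure");
      }
    }
  }

  /**
   * Loop until the save succeeds or the attempt limit is reached.
   * The supplier is re-invoked on every attempt, as in the patch,
   * so per-attempt statistics stay accurate.
   * @return number of attempts used
   */
  static int saveWithRetries(Supplier<String> manifestSource,
      FlakySink sink, int maxAttempts) {
    int attempts = 0;
    while (true) {
      attempts++;
      try {
        sink.saveAndRename(manifestSource.get());
        return attempts;
      } catch (RuntimeException e) {
        if (attempts >= maxAttempts) {
          throw e;  // escalate once retries are exhausted
        }
        // the real code sleeps here, per its proportional retry policy
      }
    }
  }

  public static void main(String[] args) {
    // two transient failures, then success on the third attempt
    System.out.println("attempts=" + saveWithRetries(() -> "manifest", new FlakySink(2), 5));
  }
}
```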
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2086367792

I've now moved to commitFile() to rename the task manifest, after doing a getFileStatus() call first, which means its IO cost is the same as a rename with recovery enabled. It does let us see what happened, which we log at WARN.
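The ordering described above can be sketched as follows. The `Fs` interface and `commitFile` signature here are hypothetical stand-ins, not the real ManifestStoreOperations API: one status probe before the rename costs the same extra call as rename-with-recovery, but leaves something concrete to log when the rename reports failure.

```java
public class CommitRenameSketch {

  /** Hypothetical minimal view of the two filesystem calls involved. */
  interface Fs {
    String getFileStatus(String path);       // an etag-like token for the file
    boolean rename(String src, String dst);  // false == rename reported failure
  }

  /**
   * Probe the source first, then rename; if the rename fails,
   * log what we knew about the source at WARN.
   */
  static boolean commitFile(Fs fs, String src, String dst) {
    String status = fs.getFileStatus(src);   // one HEAD-equivalent call
    boolean renamed = fs.rename(src, dst);
    if (!renamed) {
      System.err.println("WARN: rename of " + src
          + " (" + status + ") to " + dst + " failed");
    }
    return renamed;
  }

  public static void main(String[] args) {
    Fs fs = new Fs() {
      public String getFileStatus(String p) { return "etag-1"; }
      public boolean rename(String s, String d) { return true; }
    };
    System.out.println(commitFile(fs, "/tmp/manifest.tmp", "/job/manifest.json"));
  }
}
```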
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1584877464

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/manifest_committer.md:

@@ -523,7 +509,7 @@ And optional settings for debugging/performance analysis
```
spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory
-spark.hadoop.fs.azure.io.rate.limit 1
+spark.hadoop.fs.azure.io.rate.limit 1000
```

Review Comment: Good question. The default allocation for the entire cluster is 10K; this ensures that a single commit uses at most 10% of it during renames. In contrast, S3 throttles by shard, so only operations down the same directory tree were impacted, and if they were only reading data, they were not throttled at all.
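A minimal spark-defaults fragment matching the settings quoted above (the values are the ones under discussion in this thread; tune the rate limit to your store's capacity):

```properties
# route abfs output commits through the manifest committer
spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs  org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory
# cap this job's rename IOPS at 1000, about 10% of the default 10K account allocation
spark.hadoop.fs.azure.io.rate.limit  1000
```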
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1584841337

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java: @@ -582,19 +611,111 @@ [same saveManifest hunk as quoted above]

Review Comment: You mean use commitFile() after creating a file entry, so pushing more of the recovery down? We could do that. We won't have the etag of the created file, though.
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2074254789

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:--------|
| | _ Prechecks _ | | | |
| +1 :green_heart: | dupname | 0m 05s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 05s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 05s | | detect-secrets was not available. |
| +0 :ok: | shellcheck | 0m 05s | | Shellcheck was not available. |
| +0 :ok: | shelldocs | 0m 05s | | Shelldocs was not available. |
| +0 :ok: | spotbugs | 0m 01s | | spotbugs executables are not available. |
| +0 :ok: | markdownlint | 0m 01s | | markdownlint was not available. |
| +0 :ok: | xmllint | 0m 01s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 00s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 00s | | The patch appears to include 10 new or modified test files. |
| | _ trunk Compile Tests _ | | | |
| +0 :ok: | mvndep | 2m 13s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 89m 13s | | trunk passed |
| +1 :green_heart: | compile | 39m 10s | | trunk passed |
| +1 :green_heart: | checkstyle | 6m 15s | | trunk passed |
| +1 :green_heart: | mvnsite | 16m 30s | | trunk passed |
| +1 :green_heart: | javadoc | 14m 00s | | trunk passed |
| +1 :green_heart: | shadedclient | 170m 25s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +0 :ok: | mvndep | 2m 13s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 11m 14s | | the patch passed |
| +1 :green_heart: | compile | 37m 21s | | the patch passed |
| +1 :green_heart: | javac | 37m 21s | | the patch passed |
| +1 :green_heart: | blanks | 0m 01s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 5m 47s | | the patch passed |
| +1 :green_heart: | mvnsite | 16m 05s | | the patch passed |
| +1 :green_heart: | javadoc | 13m 42s | | the patch passed |
| +1 :green_heart: | shadedclient | 178m 33s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| +1 :green_heart: | asflicense | 5m 29s | | The patch does not generate ASF License warnings. |
| | | 543m 45s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
| Optional Tests | dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint |
| uname | MINGW64_NT-10.0-17763 b033f3aa81e5 3.4.10-87d57229.x86_64 2024-02-14 20:17 UTC x86_64 Msys |
| Build tool | maven |
| Personality | /c/hadoop/dev-support/bin/hadoop.sh |
| git revision | trunk / 2b38434c46b494f7689acc125a52935d6cb17870 |
| Default Java | Azul Systems, Inc.-1.8.0_332-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/1/testReport/ |
| modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project hadoop-tools/hadoop-azure U: . |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/1/console |
| versions | git=2.44.0.windows.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
saxenapranav commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1577303182

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java:

@@ -142,64 +154,104 @@ protected Result executeStage(
     }
     Outcome outcome = null;
-    IOException exception;
+    IOException exception = null;
+    boolean baseDirDeleted = false;

     // to delete.
     LOG.info("{}: Deleting job directory {}", getName(), baseDir);
+    final long directoryCount = args.directoryCount;
+    if (directoryCount > 0) {
+      // log the expected directory count, which drives duration in GCS
+      // and may cause timeouts on azure if the count is too high for a
+      // timely permissions tree scan.
+      LOG.info("{}: Expected directory count: {}", getName(), directoryCount);
+    }
+    progress();
+    // check and maybe execute parallel delete of task attempt dirs.
     if (args.deleteTaskAttemptDirsInParallel) {
-      // Attempt to do a parallel delete of task attempt dirs;
-      // don't overreact if a delete fails, but stop trying
-      // to delete the others, and fall back to deleting the
-      // job dir.
-      Path taskSubDir
-          = getStageConfig().getJobAttemptTaskSubDir();
-      try (DurationInfo info = new DurationInfo(LOG,
-          "parallel deletion of task attempts in %s",
-          taskSubDir)) {
-        RemoteIterator<FileStatus> dirs =
-            RemoteIterators.filteringRemoteIterator(
-                listStatusIterator(taskSubDir),
-                FileStatus::isDirectory);
-        TaskPool.foreach(dirs)
-            .executeWith(getIOProcessors())
-            .stopOnFailure()
-            .suppressExceptions(false)
-            .run(this::rmTaskAttemptDir);
-        getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
-
-        if (getLastDeleteException() != null) {
-          // one of the task attempts failed.
-          throw getLastDeleteException();
+
+      if (args.parallelDeleteAttemptBaseDeleteFirst) {
+        // attempt to delete the base dir first.
+        // This can reduce ABFS delete load but may time out
+        // (which the fallback to parallel delete will handle).
+        // on GCS it is slow.
+        try (DurationInfo info = new DurationInfo(LOG, true,
+            "Initial delete of %s", baseDir)) {
+          exception = deleteOneDir(baseDir);
+          if (exception == null) {
+            // success: record this as the outcome,
+            outcome = Outcome.DELETED;
+            // and will skip the parallel delete
+            baseDirDeleted = true;
+          } else {
+            // failure: log and continue
+            LOG.warn("{}: Exception on initial attempt at deleting base dir {}"
+                + " with directory count {}. Falling back to parallel delete",
+                getName(), baseDir, directoryCount, exception);
+          }
+        }
+      }
+      if (!baseDirDeleted) {
+        // no base delete attempted or it failed.
+        // Attempt to do a parallel delete of task attempt dirs;
+        // don't overreact if a delete fails, but stop trying
+        // to delete the others, and fall back to deleting the
+        // job dir.
+        Path taskSubDir
+            = getStageConfig().getJobAttemptTaskSubDir();
+        try (DurationInfo info = new DurationInfo(LOG, true,
+            "parallel deletion of task attempts in %s",
+            taskSubDir)) {
+          RemoteIterator<FileStatus> dirs =
+              RemoteIterators.filteringRemoteIterator(
+                  listStatusIterator(taskSubDir),
+                  FileStatus::isDirectory);
+          TaskPool.foreach(dirs)
+              .executeWith(getIOProcessors())
+              .stopOnFailure()
+              .suppressExceptions(false)
+              .run(this::rmTaskAttemptDir);
+          getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
+
+          if (getLastDeleteException() != null) {
+            // one of the task attempts failed.
+            throw getLastDeleteException();

Review Comment: This will also register an exception raised by `deleteOneDir` at line 183. So although there was no task failure, it would be recorded as a task failure in the logs.
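The cleanup flow under review can be sketched with a toy model. `Store`, `cleanup`, and the call counting are hypothetical, not the CleanupJobStage API: the point is the control flow, where one base-dir delete is attempted first and the parallel per-task-attempt delete runs only as the fallback.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class CleanupSketch {

  /** Hypothetical store where deleting the base dir may fail (e.g. a timeout). */
  static class Store {
    final boolean baseDeleteFails;
    final AtomicInteger deletes = new AtomicInteger();
    Store(boolean baseDeleteFails) { this.baseDeleteFails = baseDeleteFails; }
    boolean deleteBaseDir() {
      deletes.incrementAndGet();
      return !baseDeleteFails;   // false == delete failed; caller must fall back
    }
    void deleteTaskAttemptDir(String dir) { deletes.incrementAndGet(); }
  }

  /**
   * Mirrors the CleanupJobStage flow: optionally try one base-dir delete
   * first; only if that fails (or is disabled) delete the task attempt
   * dirs in parallel, then remove the base dir.
   * @return total number of delete calls issued
   */
  static int cleanup(Store store, boolean baseDeleteFirst, List<String> taskDirs) {
    if (baseDeleteFirst && store.deleteBaseDir()) {
      return store.deletes.get();  // fast path: one delete did the whole job
    }
    taskDirs.parallelStream().forEach(store::deleteTaskAttemptDir);
    store.deleteBaseDir();
    return store.deletes.get();
  }

  public static void main(String[] args) {
    // fast path: base delete succeeds, so only one delete call is issued
    System.out.println(cleanup(new Store(false), true, List.of("ta1", "ta2")));
  }
}
```

The trade-off visible here is the one discussed in the thread: the fast path is a single (possibly large) delete, while the fallback issues one call per task attempt dir plus the final base delete.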
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
saxenapranav commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1577262719

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java: @@ -582,19 +611,111 @@ [same saveManifest hunk as quoted above]

Review Comment: Got your point for the directory case. On the first point, I now understand that `executeRenamingOperation` calls `escalateRenameFailure` on fs.rename() failure, which raises a PathIOException. I was wondering whether, instead of calling `renameFile`, we could call `operations.renameFile()` directly and raise the exception from there. Reason being, `escalateRenameFailure` does a getFileStatus on both src and dst for logging; we could save two filesystem calls when we already know the renameFile for the saveManifest has failed. Would like to know your view, but I am good with this comment.
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2073739524

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:--------|
| +0 :ok: | reexec | 12m 17s | | Docker mode activated. |
| | _ Prechecks _ | | | |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 :ok: | shelldocs | 0m 1s | | Shelldocs was not available. |
| +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 10 new or modified test files. |
| | _ trunk Compile Tests _ | | | |
| +0 :ok: | mvndep | 15m 14s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 32m 12s | | trunk passed |
| +1 :green_heart: | compile | 17m 30s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | compile | 16m 28s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | checkstyle | 4m 23s | | trunk passed |
| +1 :green_heart: | mvnsite | 3m 35s | | trunk passed |
| +1 :green_heart: | javadoc | 3m 15s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 2m 56s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 6m 38s | | trunk passed |
| +1 :green_heart: | shadedclient | 34m 4s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +0 :ok: | mvndep | 0m 34s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 2m 5s | | the patch passed |
| +1 :green_heart: | compile | 16m 51s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javac | 16m 51s | | the patch passed |
| +1 :green_heart: | compile | 15m 57s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | javac | 15m 57s | | the patch passed |
| +1 :green_heart: | blanks | 0m 1s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 4m 23s | | the patch passed |
| +1 :green_heart: | mvnsite | 3m 25s | | the patch passed |
| +1 :green_heart: | shellcheck | 0m 24s | | No new issues. |
| +1 :green_heart: | javadoc | 3m 6s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 3m 1s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 7m 16s | | the patch passed |
| +1 :green_heart: | shadedclient | 34m 33s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| +1 :green_heart: | unit | 8m 27s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 :green_heart: | unit | 162m 7s | | hadoop-mapreduce-project in the patch passed. |
| +1 :green_heart: | unit | 2m 56s | | hadoop-azure in the patch passed. |
| +1 :green_heart: | asflicense | 1m 18s | | The patch does not generate ASF License warnings. |
| | | 422m 41s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/11/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
| Optional Tests | dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint |
| uname | Linux 8700e5c7281c 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 2b38434c46b494f7689acc125a52935d6cb17870 |
| Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/11/testReport/ |
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2072467711

One thing I'm considering here: making that "initial attempt at base dir delete" a numeric threshold rather than a switch.

* good: agile
* bad: harder to test, less consistent, harder to replicate

For now, leaving a simple switch.
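The trade-off in the comment above can be sketched as two alternative policy functions. Everything here — class and method names — is a hypothetical illustration of the design choice, not committer code:

```java
// Sketch of the two designs discussed: the simple on/off switch the PR
// ships versus the numeric-threshold alternative considered and deferred.
public class BaseDirDeletePolicy {

  /** Shipped design: a plain boolean switch for the initial base-dir delete. */
  public static boolean attemptBaseDirDeleteFirst(boolean switchEnabled) {
    return switchEnabled;
  }

  /** Considered design: attempt the base-dir delete only below a numeric
   *  directory-count threshold -- more adaptive, but harder to test,
   *  less consistent, and harder to replicate in bug reports. */
  public static boolean attemptBaseDirDeleteFirst(long dirCount, long threshold) {
    return dirCount < threshold;
  }
}
```

The boolean form keeps behaviour deterministic for a given configuration, which is exactly the "easier to test and replicate" argument made above.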
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1576225937

## hadoop-tools/hadoop-azure/pom.xml: ##

@@ -45,6 +45,8 @@
 7200
 10
 1000
+
+100

Review Comment: this is for maven test runs only. The s3a FS now defaults to 500, though we have to worry about TTL, stale connections etc. As abfs moves to httpclient, these things will surface there too.
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1576221197

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java: ##

@@ -582,19 +611,111 @@ protected final Path directoryMustExist(
    * Save a task manifest or summary. This will be done by
    * writing to a temp path and then renaming.
    * If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
    * @param manifestData the manifest/success file
    * @param tempPath temp path for the initial save
    * @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @return the manifest saved.
+   * @throws IOException failure to rename after retries.
    */
   @SuppressWarnings("unchecked")
-  protected final void save(T manifestData,
+  protected final T save(
+      final T manifestData,
       final Path tempPath,
       final Path finalPath) throws IOException {
-    LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, finalPath);
-    trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-        operations.save(manifestData, tempPath, true));
-    renameFile(tempPath, finalPath);
+    return saveManifest(() -> manifestData, tempPath, finalPath, OP_SAVE_TASK_MANIFEST);
+  }
+
+  /**
+   * Generate and save a task manifest or summary file.
+   * This is be done by writing to a temp path and then renaming.
+   *
+   * If the destination path exists: Delete it before the rename.
+   *
+   * This will retry so that a rename failure from abfs load or IO errors
+   * such as delete or save failure will not fail the task.
+   *
+   * The {@code manifestSource} supplier is invoked to get the manifest data
+   * on every attempt.
+   * This permits statistics to be updated, including those of failures.
+   * @param manifestSource supplier the manifest/success file
+   * @param tempPath temp path for the initial save
+   * @param finalPath final path for rename.
+   * @param statistic statistic to use for timing
+   * @return the manifest saved.
+   * @throws IOException failure to save/delete/rename after retries.
+   */
+  @SuppressWarnings("unchecked")
+  protected final T saveManifest(
+      final Supplier<T> manifestSource,
+      final Path tempPath,
+      final Path finalPath,
+      String statistic) throws IOException {
+
+    AtomicInteger retryCount = new AtomicInteger(0);
+    RetryPolicy retryPolicy = retryUpToMaximumCountWithProportionalSleep(
+        getStageConfig().getManifestSaveAttempts(),
+        SAVE_SLEEP_INTERVAL,
+        TimeUnit.MILLISECONDS);
+
+    // loop until returning a value or raising an exception
+    while (true) {
+      try {
+        T manifestData = requireNonNull(manifestSource.get());
+        trackDurationOfInvocation(getIOStatistics(), statistic, () -> {
+          LOG.info("{}: save manifest to {} then rename as {}'); retry count={}",
+              getName(), tempPath, finalPath, retryCount);
+
+          // delete temp path.
+          // even though this is written with overwrite=true, this extra recursive
+          // delete also handles a directory being there.
+          deleteRecursive(tempPath, OP_DELETE);
+
+          // save the temp file, overwriting any which remains from an earlier attempt
+          operations.save(manifestData, tempPath, true);
+
+          // delete the destination in case it exists either from a failed previous
+          // attempt or from a concurrent task commit.
+          delete(finalPath, true, OP_DELETE);
+
+          // rename temp to final
+          renameFile(tempPath, finalPath);

Review Comment:

1. renameFile javadocs say `throws PathIOException – if the rename() call returned false.`, so no need to check the result here.
2. directory deletion, maybe: but what is going to create a directory here? Nothing in the committer will, and if some other process is doing stuff in the job attempt dir you are doomed.
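The save/rename retry loop quoted in the diff above can be reduced to a small standalone sketch. This is a simplified illustration under assumed names (`RetryingSave`, `FsOps`): the real stage uses Hadoop's `Path`, a `RetryPolicy` with proportional sleeps, and IOStatistics tracking, all omitted here.

```java
import java.io.IOException;
import java.util.function.Supplier;

public class RetryingSave {

  /** Minimal stand-in for the committer's filesystem operations. */
  public interface FsOps {
    void deleteRecursive(String path) throws IOException;
    void save(String data, String path) throws IOException;
    void rename(String tempPath, String finalPath) throws IOException;
  }

  /**
   * Save-with-retry: regenerate the manifest, clear the temp path, write,
   * clear the destination, rename; retry on any IOException.
   */
  public static String saveManifest(Supplier<String> manifestSource,
      String tempPath, String finalPath,
      FsOps ops, int maxAttempts) throws IOException {
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      // the supplier is invoked on every attempt, as in the patch,
      // so statistics in the manifest can reflect earlier failures.
      String manifestData = manifestSource.get();
      try {
        ops.deleteRecursive(tempPath);    // clear any stale temp file/dir
        ops.save(manifestData, tempPath); // write the temp file
        ops.deleteRecursive(finalPath);   // clear failed/concurrent output
        ops.rename(tempPath, finalPath);  // promote temp to final
        return manifestData;              // success
      } catch (IOException e) {
        lastFailure = e;                  // remember; backoff sleep omitted
      }
    }
    throw lastFailure;
  }
}
```

The key property being illustrated is that a transient rename failure costs one extra loop iteration rather than failing the task commit.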
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
saxenapranav commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1574200780

## hadoop-tools/hadoop-azure/pom.xml: ##

@@ -45,6 +45,8 @@
 7200
 10
 1000
+
+100

Review Comment: Hope it's for test runs only.
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
saxenapranav commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1572484303

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java: ##

@@ -582,19 +611,111 @@ protected final Path directoryMustExist(
    * Save a task manifest or summary. This will be done by
    * writing to a temp path and then renaming.
    * If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
    * @param manifestData the manifest/success file
    * @param tempPath temp path for the initial save
    * @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @return the manifest saved.
+   * @throws IOException failure to rename after retries.
    */
   @SuppressWarnings("unchecked")
-  protected final void save(T manifestData,
+  protected final T save(
+      final T manifestData,
       final Path tempPath,
       final Path finalPath) throws IOException {
-    LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, finalPath);
-    trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-        operations.save(manifestData, tempPath, true));
-    renameFile(tempPath, finalPath);
+    return saveManifest(() -> manifestData, tempPath, finalPath, OP_SAVE_TASK_MANIFEST);
+  }
+
+  /**
+   * Generate and save a task manifest or summary file.
+   * This is be done by writing to a temp path and then renaming.
+   *
+   * If the destination path exists: Delete it before the rename.
+   *
+   * This will retry so that a rename failure from abfs load or IO errors
+   * such as delete or save failure will not fail the task.
+   *
+   * The {@code manifestSource} supplier is invoked to get the manifest data
+   * on every attempt.
+   * This permits statistics to be updated, including those of failures.
+   * @param manifestSource supplier the manifest/success file
+   * @param tempPath temp path for the initial save
+   * @param finalPath final path for rename.
+   * @param statistic statistic to use for timing
+   * @return the manifest saved.
+   * @throws IOException failure to save/delete/rename after retries.
+   */
+  @SuppressWarnings("unchecked")
+  protected final T saveManifest(
+      final Supplier<T> manifestSource,
+      final Path tempPath,
+      final Path finalPath,
+      String statistic) throws IOException {
+
+    AtomicInteger retryCount = new AtomicInteger(0);
+    RetryPolicy retryPolicy = retryUpToMaximumCountWithProportionalSleep(
+        getStageConfig().getManifestSaveAttempts(),
+        SAVE_SLEEP_INTERVAL,
+        TimeUnit.MILLISECONDS);
+
+    // loop until returning a value or raising an exception
+    while (true) {
+      try {
+        T manifestData = requireNonNull(manifestSource.get());
+        trackDurationOfInvocation(getIOStatistics(), statistic, () -> {
+          LOG.info("{}: save manifest to {} then rename as {}'); retry count={}",
+              getName(), tempPath, finalPath, retryCount);
+
+          // delete temp path.
+          // even though this is written with overwrite=true, this extra recursive
+          // delete also handles a directory being there.
+          deleteRecursive(tempPath, OP_DELETE);
+
+          // save the temp file, overwriting any which remains from an earlier attempt
+          operations.save(manifestData, tempPath, true);
+
+          // delete the destination in case it exists either from a failed previous
+          // attempt or from a concurrent task commit.
+          delete(finalPath, true, OP_DELETE);
+
+          // rename temp to final
+          renameFile(tempPath, finalPath);

Review Comment: There can be a parallel process which might create a directory in between line 680 and line 683; should we check, after line 683, whether finalPath is a file or not?
## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java: ## @@ -582,19 +611,111 @@ protected final Path directoryMustExist( * Save a task manifest or summary. This will be done by * writing to a temp path and then renaming. * If the destination path exists: Delete it. + * This will retry so that a rename failure from abfs load or IO errors + * will not fail the task. * @param manifestData the manifest/success file * @param tempPath temp path for the initial save * @param finalPath final path for rename. - * @throws IOException failure to load/parse + * @return the manifest saved. + * @throws IOException failure to rename after retries. */ @SuppressWarnings("unchecked") - protected final void save(T manifestData, + protected final T save( + final
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2067365776

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 34s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | shelldocs | 0m 0s | | Shelldocs was not available. |
| +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 10 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 15m 7s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 32m 9s | | trunk passed |
| +1 :green_heart: | compile | 17m 23s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | compile | 16m 11s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | checkstyle | 4m 22s | | trunk passed |
| +1 :green_heart: | mvnsite | 3m 31s | | trunk passed |
| +1 :green_heart: | javadoc | 3m 15s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 2m 58s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 6m 39s | | trunk passed |
| +1 :green_heart: | shadedclient | 34m 6s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 2m 7s | | the patch passed |
| +1 :green_heart: | compile | 16m 48s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javac | 16m 48s | | the patch passed |
| +1 :green_heart: | compile | 16m 14s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | javac | 16m 14s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 4m 17s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/10/artifact/out/results-checkstyle-root.txt) | root: The patch generated 2 new + 22 unchanged - 0 fixed = 24 total (was 22) |
| +1 :green_heart: | mvnsite | 3m 23s | | the patch passed |
| +1 :green_heart: | shellcheck | 0m 23s | | No new issues. |
| +1 :green_heart: | javadoc | 3m 7s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 3m 1s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 7m 18s | | the patch passed |
| +1 :green_heart: | shadedclient | 34m 19s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 7m 54s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 :green_heart: | unit | 161m 19s | | hadoop-mapreduce-project in the patch passed. |
| +1 :green_heart: | unit | 2m 44s | | hadoop-azure in the patch passed. |
| +1 :green_heart: | asflicense | 1m 17s | | The patch does not generate ASF License warnings. |
| | | | 408m 51s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/10/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
| Optional Tests | dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint |
| uname | Linux 1cd1cb0cbe32 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / cd40e7ff0138412044f977f5d4d079494c6d7dd0 |
| Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2065738026

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 34s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | shelldocs | 0m 0s | | Shelldocs was not available. |
| +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 10 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 57s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 32m 39s | | trunk passed |
| +1 :green_heart: | compile | 17m 27s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | compile | 16m 20s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | checkstyle | 4m 19s | | trunk passed |
| +1 :green_heart: | mvnsite | 3m 33s | | trunk passed |
| +1 :green_heart: | javadoc | 3m 15s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 3m 1s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 6m 39s | | trunk passed |
| +1 :green_heart: | shadedclient | 34m 11s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 1m 58s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 2m 16s | | the patch passed |
| +1 :green_heart: | compile | 16m 49s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javac | 16m 49s | | the patch passed |
| +1 :green_heart: | compile | 16m 15s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | javac | 16m 15s | | the patch passed |
| -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/9/artifact/out/blanks-eol.txt) | The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| -0 :warning: | checkstyle | 4m 17s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/9/artifact/out/results-checkstyle-root.txt) | root: The patch generated 10 new + 22 unchanged - 0 fixed = 32 total (was 22) |
| +1 :green_heart: | mvnsite | 3m 24s | | the patch passed |
| +1 :green_heart: | shellcheck | 0m 23s | | No new issues. |
| +1 :green_heart: | javadoc | 3m 9s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 3m 2s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 7m 19s | | the patch passed |
| +1 :green_heart: | shadedclient | 34m 28s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 7m 59s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 :green_heart: | unit | 160m 38s | | hadoop-mapreduce-project in the patch passed. |
| +1 :green_heart: | unit | 2m 43s | | hadoop-azure in the patch passed. |
| +1 :green_heart: | asflicense | 1m 14s | | The patch does not generate ASF License warnings. |
| | | | 410m 46s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/9/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
| Optional Tests | dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint |
| uname | Linux 2d173658385e 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / abed2feab2c31e845bf95bc4868c2cfd0
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2064541193

@snvijaya we actually know the total number of subdirs for the deletion! It is propagated via the manifests: each TA manifest includes the number of dirs as an IOStatistic, and the aggregate summary adds these all up. The number of paths under the job dir is that number (the counter committer_task_directory_count) plus those of any failed task attempts.

Which means we could actually have a threshold for how many subdirectories will trigger an automatic switch to parallel delete. For now, I'm just going to pass this down and log it immediately before the cleanup kicks off, so if there are problems we will get the diagnostics adjacent to the error.

Note that your details on retry timings imply that on a MapReduce job (rather than a Spark one) the progress() callback will not take place, so there's a risk that the job will actually time out. I don't think that's an issue in MR job actions, the way it is in task-side actions where a heartbeat back to the MapReduce AM is required.
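The counting idea in the comment above reduces to a one-line decision: compare the aggregated directory count (plus an allowance for failed task attempts) against a threshold. The sketch below is purely illustrative; the `committer_task_directory_count` counter is real, but the class, method, and the idea of an automatic switch are the hypothetical design being discussed, not shipped code.

```java
public class CleanupPlanner {

  /**
   * Decide whether cleanup should go straight to parallel task-attempt
   * deletion, based on the directory count aggregated from the task
   * manifests (committer_task_directory_count) plus any directories
   * left by failed task attempts.
   */
  public static boolean useParallelDelete(long taskDirectoryCount,
      long failedAttemptDirs, long threshold) {
    // total dirs under the job dir = counted dirs + failed-attempt leftovers
    return taskDirectoryCount + failedAttemptDirs > threshold;
  }
}
```

Because the count is already in the aggregated manifest data, this check would cost nothing extra at cleanup time; the hard part, as noted above, is testing and replicating threshold-dependent behaviour.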
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2062530498

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 32s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 7 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 43s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 32m 23s | | trunk passed |
| +1 :green_heart: | compile | 17m 35s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | compile | 16m 7s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | checkstyle | 4m 19s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 55s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 34s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 1m 25s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 2m 53s | | trunk passed |
| +1 :green_heart: | shadedclient | 34m 12s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 0s | | the patch passed |
| +1 :green_heart: | compile | 16m 47s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javac | 16m 47s | | the patch passed |
| +1 :green_heart: | compile | 16m 25s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | javac | 16m 25s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 4m 22s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/8/artifact/out/results-checkstyle-root.txt) | root: The patch generated 2 new + 22 unchanged - 0 fixed = 24 total (was 22) |
| +1 :green_heart: | mvnsite | 1m 54s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 3m 15s | | the patch passed |
| +1 :green_heart: | shadedclient | 34m 26s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 7m 50s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 :green_heart: | unit | 2m 30s | | hadoop-azure in the patch passed. |
| +1 :green_heart: | asflicense | 1m 3s | | The patch does not generate ASF License warnings. |
| | | | 228m 23s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/8/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6716 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint xmllint |
| uname | Linux 2eeac626505b 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 3e5e1e61d99b9a784f1f324f5875792da9b49406 |
| Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/8/testReport/ |
| Max. process+thread count | 1368 (vs. ulimit of 5500) |
| modules | C
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569270966

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java: ##

@@ -142,64 +154,93 @@ protected Result executeStage(
     }
     Outcome outcome = null;
-    IOException exception;
+    IOException exception = null;
+    boolean baseDirDeleted = false;
     // to delete.
     LOG.info("{}: Deleting job directory {}", getName(), baseDir);
     if (args.deleteTaskAttemptDirsInParallel) {
-      // Attempt to do a parallel delete of task attempt dirs;
-      // don't overreact if a delete fails, but stop trying
-      // to delete the others, and fall back to deleting the
-      // job dir.
-      Path taskSubDir
-          = getStageConfig().getJobAttemptTaskSubDir();
-      try (DurationInfo info = new DurationInfo(LOG,
-          "parallel deletion of task attempts in %s",
-          taskSubDir)) {
-        RemoteIterator<FileStatus> dirs =
-            RemoteIterators.filteringRemoteIterator(
-                listStatusIterator(taskSubDir),
-                FileStatus::isDirectory);
-        TaskPool.foreach(dirs)
-            .executeWith(getIOProcessors())
-            .stopOnFailure()
-            .suppressExceptions(false)
-            .run(this::rmTaskAttemptDir);
-        getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
-
-        if (getLastDeleteException() != null) {
-          // one of the task attempts failed.
-          throw getLastDeleteException();
+
+      // parallel delete of task attempt dirs.
+
+      if (args.parallelDeleteAttemptBaseDeleteFirst) {
+        // attempt to delete the base dir first.
+        // This can reduce ABFS delete load but may time out
+        // (which the fallback to parallel delete will handle).
+        // on GCS it is slow.
+        try (DurationInfo info = new DurationInfo(LOG, true,
+            "Initial delete of %s", baseDir)) {
+          exception = deleteOneDir(baseDir);
+          if (exception == null) {
+            // success: record this as the outcome, which
+            // will skip the parallel delete.
+            outcome = Outcome.DELETED;
+            baseDirDeleted = true;
+          } else {
+            // failure: log and continue
+            LOG.warn("{}: Exception on initial attempt at deleting base dir {}\n"
+                + "attempting parallel delete",
+                getName(), baseDir, exception);
+          }
+        }
+      }
+      if (!baseDirDeleted) {
+        // no base delete attempted or it failed.
+        // Attempt to do a parallel delete of task attempt dirs;
+        // don't overreact if a delete fails, but stop trying
+        // to delete the others, and fall back to deleting the
+        // job dir.
+        Path taskSubDir
+            = getStageConfig().getJobAttemptTaskSubDir();
+        try (DurationInfo info = new DurationInfo(LOG, true,
+            "parallel deletion of task attempts in %s",
+            taskSubDir)) {
+          RemoteIterator<FileStatus> dirs =
+              RemoteIterators.filteringRemoteIterator(
+                  listStatusIterator(taskSubDir),
+                  FileStatus::isDirectory);
+          TaskPool.foreach(dirs)
+              .executeWith(getIOProcessors())
+              .stopOnFailure()
+              .suppressExceptions(false)
+              .run(this::rmTaskAttemptDir);
+          getIOStatistics().aggregate((retrieveIOStatistics(dirs)));
+
+          if (getLastDeleteException() != null) {
+            // one of the task attempts failed.
+            throw getLastDeleteException();

Review Comment: good q. rmTaskAttemptDir will not trigger a failure, just note and store the error. On L213 one of any such IOEs is rethrown, but that just gets to L222, which logs and then goes on to the final root dir attempt again, so it should be recovered from... though if the first delete timed out, then that root delete on L233 is probably doomed too...
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569264943 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java: ## @@ -142,64 +154,93 @@ protected Result executeStage( } Outcome outcome = null; -IOException exception; +IOException exception = null; +boolean baseDirDeleted = false; // to delete. LOG.info("{}: Deleting job directory {}", getName(), baseDir); if (args.deleteTaskAttemptDirsInParallel) { - // Attempt to do a parallel delete of task attempt dirs; - // don't overreact if a delete fails, but stop trying - // to delete the others, and fall back to deleting the - // job dir. - Path taskSubDir - = getStageConfig().getJobAttemptTaskSubDir(); - try (DurationInfo info = new DurationInfo(LOG, - "parallel deletion of task attempts in %s", - taskSubDir)) { -RemoteIterator dirs = -RemoteIterators.filteringRemoteIterator( -listStatusIterator(taskSubDir), -FileStatus::isDirectory); -TaskPool.foreach(dirs) -.executeWith(getIOProcessors()) -.stopOnFailure() -.suppressExceptions(false) -.run(this::rmTaskAttemptDir); -getIOStatistics().aggregate((retrieveIOStatistics(dirs))); - -if (getLastDeleteException() != null) { - // one of the task attempts failed. - throw getLastDeleteException(); + + // parallel delete of task attempt dirs. + + if (args.parallelDeleteAttemptBaseDeleteFirst) { +// attempt to delete the base dir first. +// This can reduce ABFS delete load but may time out +// (which the fallback to parallel delete will handle). +// on GCS it is slow. +try (DurationInfo info = new DurationInfo(LOG, true, +"Initial delete of %s", baseDir)) { + exception = deleteOneDir(baseDir); + if (exception == null) { +// success: record this as the outcome, which +// will skip the parallel delete. 
+outcome = Outcome.DELETED; +baseDirDeleted = true; + } else { +// failure: log and continue +LOG.warn("{}: Exception on initial attempt at deleting base dir {}\n" ++ "attempting parallel delete", +getName(), baseDir, exception); + } +} + } + if (!baseDirDeleted) {
Review Comment: note that the next stage will fail immediately on the list() call, which is caught as a FNFE and treated as success
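That recovery path — a list() of an already-deleted directory raising FileNotFoundException, which cleanup treats as success — can be sketched like this (illustrative names, no Hadoop dependencies):

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

// Sketch: if the base dir was already deleted, listing it fails with
// FileNotFoundException, and the cleanup stage treats that as success
// because there is nothing left to delete.
class FnfeAsSuccess {
  // stand-in for listStatusIterator(); `exists` simulates store state
  static List<String> list(String dir, boolean exists) throws IOException {
    if (!exists) {
      throw new FileNotFoundException(dir);
    }
    return List.of(dir + "/task_00", dir + "/task_01");
  }

  static boolean cleanup(String dir, boolean exists) throws IOException {
    try {
      int deleted = 0;
      for (String child : list(dir, exists)) {
        deleted++; // a real implementation would delete each child here
      }
      return deleted >= 0;
    } catch (FileNotFoundException e) {
      // base dir already gone: nothing to clean up, report success
      return true;
    }
  }

  public static void main(String[] args) throws IOException {
    if (!cleanup("job", false)) {
      throw new AssertionError("FNFE should be treated as success");
    }
    if (!cleanup("job", true)) {
      throw new AssertionError("normal cleanup should succeed");
    }
  }
}
```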
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569262867 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java: ## @@ -142,64 +154,93 @@ protected Result executeStage( } Outcome outcome = null; -IOException exception; +IOException exception = null; +boolean baseDirDeleted = false; // to delete. LOG.info("{}: Deleting job directory {}", getName(), baseDir); if (args.deleteTaskAttemptDirsInParallel) { - // Attempt to do a parallel delete of task attempt dirs; - // don't overreact if a delete fails, but stop trying - // to delete the others, and fall back to deleting the - // job dir. - Path taskSubDir - = getStageConfig().getJobAttemptTaskSubDir(); - try (DurationInfo info = new DurationInfo(LOG, - "parallel deletion of task attempts in %s", - taskSubDir)) { -RemoteIterator dirs = -RemoteIterators.filteringRemoteIterator( -listStatusIterator(taskSubDir), -FileStatus::isDirectory); -TaskPool.foreach(dirs) -.executeWith(getIOProcessors()) -.stopOnFailure() -.suppressExceptions(false) -.run(this::rmTaskAttemptDir); -getIOStatistics().aggregate((retrieveIOStatistics(dirs))); - -if (getLastDeleteException() != null) { - // one of the task attempts failed. - throw getLastDeleteException(); + + // parallel delete of task attempt dirs. + + if (args.parallelDeleteAttemptBaseDeleteFirst) { +// attempt to delete the base dir first. +// This can reduce ABFS delete load but may time out +// (which the fallback to parallel delete will handle). +// on GCS it is slow. +try (DurationInfo info = new DurationInfo(LOG, true, +"Initial delete of %s", baseDir)) { + exception = deleteOneDir(baseDir); + if (exception == null) { +// success: record this as the outcome, which +// will skip the parallel delete. 
+outcome = Outcome.DELETED; +baseDirDeleted = true; + } else { +// failure: log and continue +LOG.warn("{}: Exception on initial attempt at deleting base dir {}\n" ++ "attempting parallel delete", +getName(), baseDir, exception); + } +} + } + if (!baseDirDeleted) {
Review Comment: it gets set on L180; will comment
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569233047 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java: ## @@ -142,64 +154,93 @@ protected Result executeStage( } Outcome outcome = null; -IOException exception; +IOException exception = null; +boolean baseDirDeleted = false; // to delete. LOG.info("{}: Deleting job directory {}", getName(), baseDir); if (args.deleteTaskAttemptDirsInParallel) { - // Attempt to do a parallel delete of task attempt dirs; - // don't overreact if a delete fails, but stop trying - // to delete the others, and fall back to deleting the - // job dir. - Path taskSubDir - = getStageConfig().getJobAttemptTaskSubDir(); - try (DurationInfo info = new DurationInfo(LOG, - "parallel deletion of task attempts in %s", - taskSubDir)) { -RemoteIterator dirs = -RemoteIterators.filteringRemoteIterator( -listStatusIterator(taskSubDir), -FileStatus::isDirectory); -TaskPool.foreach(dirs) -.executeWith(getIOProcessors()) -.stopOnFailure() -.suppressExceptions(false) -.run(this::rmTaskAttemptDir); -getIOStatistics().aggregate((retrieveIOStatistics(dirs))); - -if (getLastDeleteException() != null) { - // one of the task attempts failed. - throw getLastDeleteException(); + + // parallel delete of task attempt dirs. + + if (args.parallelDeleteAttemptBaseDeleteFirst) { +// attempt to delete the base dir first. +// This can reduce ABFS delete load but may time out +// (which the fallback to parallel delete will handle). +// on GCS it is slow. +try (DurationInfo info = new DurationInfo(LOG, true, +"Initial delete of %s", baseDir)) { + exception = deleteOneDir(baseDir); + if (exception == null) { Review Comment: ooh, so it's going to be quite a long time to fall back. I'm going to make the option default to false for now. 
> AzureManifestCommitterFactory could probably set this config to 0 before FileSystem.get() call happens.

It'll come from the cache, so we don't want to set it for everything else, but a low MAX_RETRIES_RECURSIVE_DELETE might make sense everywhere. Something to consider later.
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569225918 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java: ## @@ -582,19 +607,64 @@ protected final Path directoryMustExist( * Save a task manifest or summary. This will be done by * writing to a temp path and then renaming. * If the destination path exists: Delete it. + * This will retry so that a rename failure from abfs load or IO errors + * will not fail the task. * @param manifestData the manifest/success file * @param tempPath temp path for the initial save * @param finalPath final path for rename. - * @throws IOException failure to load/parse + * @throws IOException failure to rename after retries. */ @SuppressWarnings("unchecked") protected final void save(T manifestData, final Path tempPath, final Path finalPath) throws IOException { -LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, finalPath); -trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () -> -operations.save(manifestData, tempPath, true)); -renameFile(tempPath, finalPath); + +int retryCount = 0; +RetryPolicy retryPolicy = retryUpToMaximumCountWithProportionalSleep( +getStageConfig().getManifestSaveAttempts(), +SAVE_SLEEP_INTERVAL, +TimeUnit.MILLISECONDS); +boolean success = false; +while (!success) { + try { +LOG.info("{}: save manifest to {} then rename as {}'); retry count={}", +getName(), tempPath, finalPath, retryCount); + +trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () -> +operations.save(manifestData, tempPath, true)); Review Comment: so renameFile() has always deleted the destination because we need to do that to cope with failures of a previous/concurrent task attempt. Whoever commits last wins. To make this clearer I'm pulling up more of the code into this method and adding comments. 
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569228824 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java: ## @@ -582,19 +605,46 @@ protected final Path directoryMustExist( * Save a task manifest or summary. This will be done by * writing to a temp path and then renaming. * If the destination path exists: Delete it. + * This will retry so that a rename failure from abfs load or IO errors + * will not fail the task. * @param manifestData the manifest/success file * @param tempPath temp path for the initial save * @param finalPath final path for rename. - * @throws IOException failure to load/parse + * @throws IOException failure to rename after retries. */ @SuppressWarnings("unchecked") protected final void save(T manifestData, final Path tempPath, final Path finalPath) throws IOException { -LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, finalPath); -trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () -> -operations.save(manifestData, tempPath, true)); -renameFile(tempPath, finalPath); +boolean success = false; +int failures = 0; +while (!success) { + try { +LOG.trace("{}: attempt {} save('{}, {}, {}')", +getName(), failures, manifestData, tempPath, finalPath); + +trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () -> +operations.save(manifestData, tempPath, true)); +renameFile(tempPath, finalPath);
Review Comment: any error raised during rename triggers fallback of:
* catch IOE
* save temp file again
* delete dest path
* rename temp path to final path

This is attempted a configurable number of times, with a sleep in between. No attempt to be clever about which IOEs are unrecoverable (permissions etc), just catch, log, retry.
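A minimal sketch of that catch/log/retry loop (a simplification of the committer's save(): the attempt limit, sleep interval, and helper names here are illustrative, and Hadoop's RetryPolicy machinery is left out):

```java
import java.io.IOException;

// Sketch of the save fallback: on any IOException from save/delete/rename,
// sleep and retry the whole sequence, up to a fixed number of attempts.
// Deliberately no classification of unrecoverable IOExceptions.
class SaveWithRetries {
  static final int MAX_ATTEMPTS = 5;
  static final long SLEEP_MILLIS = 10;

  interface IOAction {
    void run() throws IOException;
  }

  /** @return number of failed attempts before success */
  static int save(IOAction saveTemp, IOAction deleteDest, IOAction rename)
      throws IOException, InterruptedException {
    int failures = 0;
    while (true) {
      try {
        saveTemp.run();    // write manifest to the temp path
        deleteDest.run();  // whoever commits last wins: clear destination
        rename.run();      // move the temp file into place
        return failures;
      } catch (IOException e) {
        failures++;
        if (failures >= MAX_ATTEMPTS) {
          throw e;  // out of retries: fail the task attempt
        }
        Thread.sleep(SLEEP_MILLIS * failures);  // proportional backoff
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] renameFailuresLeft = {2};
    int failures = save(
        () -> { },  // save always succeeds in this demo
        () -> { },  // delete always succeeds
        () -> {     // rename fails twice, then succeeds
          if (renameFailuresLeft[0]-- > 0) {
            throw new IOException("simulated rename failure");
          }
        });
    if (failures != 2) {
      throw new AssertionError("expected 2 failed attempts, got " + failures);
    }
  }
}
```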
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569210996 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java: ## @@ -445,9 +448,29 @@ protected Boolean delete( final boolean recursive, final String statistic) throws IOException { -return trackDuration(getIOStatistics(), statistic, () -> { - return operations.delete(path, recursive); -}); +if (recursive) {
Review Comment: ok, deleteDir will also delete a file. let me highlight that. I'd done this delete dir/file split to support different capacity requests, without that it is a bit over-complex. it does let us collect different statistics though, which may be useful
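The dir/file split discussed here can be sketched as two entry points over the same underlying store delete, each counting its own statistic (names and statistic keys are illustrative; the point is that deleteDir issues a recursive delete, which also removes a plain file at the same path):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: route directory and file deletes through separate methods so
// each collects its own statistic, while both end in the same store call.
class DeleteStatsSplit {
  final Map<String, Integer> statistics = new HashMap<>();

  // stand-in for FileSystem.delete(path, recursive)
  boolean storeDelete(String path, boolean recursive) {
    return true;
  }

  boolean deleteDir(String path) {
    statistics.merge("op_delete_dir", 1, Integer::sum);
    // recursive delete: removes a directory tree, and will also
    // delete a plain file at the same path
    return storeDelete(path, true);
  }

  boolean deleteFile(String path) {
    statistics.merge("op_delete", 1, Integer::sum);
    return storeDelete(path, false);
  }

  public static void main(String[] args) {
    DeleteStatsSplit ops = new DeleteStatsSplit();
    ops.deleteDir("/job/task-attempts");
    ops.deleteDir("/job");
    ops.deleteFile("/job/_SUCCESS");
    if (ops.statistics.get("op_delete_dir") != 2
        || ops.statistics.get("op_delete") != 1) {
      throw new AssertionError("statistics not split as expected");
    }
  }
}
```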
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569203962 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/ManifestCommitterConstants.java: ## @@ -143,6 +145,20 @@ public final class ManifestCommitterConstants { */ public static final boolean OPT_CLEANUP_PARALLEL_DELETE_DIRS_DEFAULT = true; + /** + * Should parallel cleanup try to delete teh base first? + * Best for azure as it skips the task attempt deletions unless + * the toplevel delete fails. + * Value: {@value}. + */ + public static final String OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST = + OPT_PREFIX + "cleanup.parallel.delete.base.first"; + + /** + * Default value of option {@link #OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST}: {@value}. + */ + public static final boolean OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST_DEFAULT = true;
Review Comment: really don't know here. In the docs I try to cover this
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569201237 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/impl/ManifestStoreOperations.java: ## @@ -97,6 +97,36 @@ public boolean isFile(Path path) throws IOException { public abstract boolean delete(Path path, boolean recursive) throws IOException; + /** + * Forward to {@code delete(Path, true)} + * unless overridden. + * + * If it returns without an error: there is no file at + * the end of the path. + * @param path path + * @return outcome + * @throws IOException failure. + */ + public boolean deleteFile(Path path) + throws IOException { +return delete(path, false);
Review Comment: no it won't
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569200611 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/impl/ManifestStoreOperations.java: ## @@ -97,6 +97,36 @@ public boolean isFile(Path path) throws IOException { public abstract boolean delete(Path path, boolean recursive) throws IOException; + /** + * Forward to {@code delete(Path, true)} + * unless overridden. + * + * If it returns without an error: there is no file at + * the end of the path. + * @param path path + * @return outcome + * @throws IOException failure. + */ + public boolean deleteFile(Path path) + throws IOException { +return delete(path, false); + } + + /** + * Acquire the delete capacity then call {@code FileSystem#delete(Path, true)}
Review Comment: aah, I'd cut that. Leaving delete capacity out of this PR, as it'd need rate limiting in abfs *and* guessing about the size/depth of the directory. We could actually add that in future task manifests, leaving only the homework of aggregating it.
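Although delete capacity was cut from this PR, the "acquire capacity then delete" idea it alluded to can be sketched with a simple semaphore-style limiter. This is entirely hypothetical: the real design would need store-side rate limiting in ABFS and a directory-size estimate carried in the task manifests.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: acquire permits proportional to the estimated
// number of entries being deleted before issuing the recursive delete,
// so bulk deletes cannot exceed a fixed capacity budget.
class DeleteCapacityLimiter {
  private final Semaphore capacity;

  DeleteCapacityLimiter(int permits) {
    this.capacity = new Semaphore(permits);
  }

  boolean deleteDir(String path, int estimatedEntries)
      throws InterruptedException {
    // block (bounded) until enough capacity is available
    if (!capacity.tryAcquire(estimatedEntries, 1, TimeUnit.SECONDS)) {
      return false;  // no capacity: caller may retry or fall back
    }
    try {
      // stand-in for FileSystem.delete(path, true)
      return true;
    } finally {
      capacity.release(estimatedEntries);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    DeleteCapacityLimiter limiter = new DeleteCapacityLimiter(100);
    if (!limiter.deleteDir("/job", 40)) {
      throw new AssertionError("delete within capacity should proceed");
    }
    if (limiter.deleteDir("/job-huge", 500)) {
      throw new AssertionError("delete beyond capacity should time out");
    }
  }
}
```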
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
snvijaya commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1567404776 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java: ## @@ -142,64 +154,93 @@ protected Result executeStage( } Outcome outcome = null; -IOException exception; +IOException exception = null; +boolean baseDirDeleted = false; // to delete. LOG.info("{}: Deleting job directory {}", getName(), baseDir); if (args.deleteTaskAttemptDirsInParallel) { - // Attempt to do a parallel delete of task attempt dirs; - // don't overreact if a delete fails, but stop trying - // to delete the others, and fall back to deleting the - // job dir. - Path taskSubDir - = getStageConfig().getJobAttemptTaskSubDir(); - try (DurationInfo info = new DurationInfo(LOG, - "parallel deletion of task attempts in %s", - taskSubDir)) { -RemoteIterator dirs = -RemoteIterators.filteringRemoteIterator( -listStatusIterator(taskSubDir), -FileStatus::isDirectory); -TaskPool.foreach(dirs) -.executeWith(getIOProcessors()) -.stopOnFailure() -.suppressExceptions(false) -.run(this::rmTaskAttemptDir); -getIOStatistics().aggregate((retrieveIOStatistics(dirs))); - -if (getLastDeleteException() != null) { - // one of the task attempts failed. - throw getLastDeleteException(); + + // parallel delete of task attempt dirs. + + if (args.parallelDeleteAttemptBaseDeleteFirst) { +// attempt to delete the base dir first. +// This can reduce ABFS delete load but may time out +// (which the fallback to parallel delete will handle). +// on GCS it is slow. 
+try (DurationInfo info = new DurationInfo(LOG, true, +"Initial delete of %s", baseDir)) { + exception = deleteOneDir(baseDir); + if (exception == null) {
Review Comment: As you added in this logic, when the directory tree is very large and is accessed over OAuth authentication, Azure could fail the baseDir delete due to exhaustive ACL permission checks. But this delete will enter the retry loop because the failure was a request timeout; in this scenario all the retries might fail too, and it can take a while to report failure given the backoff and retry attempts per AZURE_MAX_IO_RETRIES (default value 30). The default max retry count is 30 today to ensure that 5-10 minute network/service transient failures do not lead to failures of long-running workloads. If this logic of attempting the baseDir delete before falling back to parallel deletes is optimal only for Azure, we could look for ways to fail fast for recursive delete. Would this work: add a new config MAX_RETRIES_RECURSIVE_DELETE which by default is the same as AZURE_MAX_IO_RETRIES in the ABFS driver. AzureManifestCommitterFactory could probably set this config to 0 before the FileSystem.get() call happens. If this sounds OK, we can look into the changes needed in AbfsClient, AbfsRestOperation and ExponentialRetry to make the MAX_RETRIES_RECURSIVE_DELETE config effective.
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
anmolanmol1234 commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1567132937 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/ManifestCommitterConstants.java: ## @@ -143,6 +145,20 @@ public final class ManifestCommitterConstants { */ public static final boolean OPT_CLEANUP_PARALLEL_DELETE_DIRS_DEFAULT = true; + /** + * Should parallel cleanup try to delete teh base first? Review Comment: typo: the ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java: ## @@ -142,64 +154,93 @@ protected Result executeStage( } Outcome outcome = null; -IOException exception; +IOException exception = null; +boolean baseDirDeleted = false; // to delete. LOG.info("{}: Deleting job directory {}", getName(), baseDir); if (args.deleteTaskAttemptDirsInParallel) { - // Attempt to do a parallel delete of task attempt dirs; - // don't overreact if a delete fails, but stop trying - // to delete the others, and fall back to deleting the - // job dir. - Path taskSubDir - = getStageConfig().getJobAttemptTaskSubDir(); - try (DurationInfo info = new DurationInfo(LOG, - "parallel deletion of task attempts in %s", - taskSubDir)) { -RemoteIterator dirs = -RemoteIterators.filteringRemoteIterator( -listStatusIterator(taskSubDir), -FileStatus::isDirectory); -TaskPool.foreach(dirs) -.executeWith(getIOProcessors()) -.stopOnFailure() -.suppressExceptions(false) -.run(this::rmTaskAttemptDir); -getIOStatistics().aggregate((retrieveIOStatistics(dirs))); - -if (getLastDeleteException() != null) { - // one of the task attempts failed. - throw getLastDeleteException(); + + // parallel delete of task attempt dirs. + + if (args.parallelDeleteAttemptBaseDeleteFirst) { +// attempt to delete the base dir first. 
+// This can reduce ABFS delete load but may time out +// (which the fallback to parallel delete will handle). +// on GCS it is slow. +try (DurationInfo info = new DurationInfo(LOG, true, +"Initial delete of %s", baseDir)) { + exception = deleteOneDir(baseDir); + if (exception == null) { +// success: record this as the outcome, which +// will skip the parallel delete. +outcome = Outcome.DELETED; +baseDirDeleted = true; + } else { +// failure: log and continue +LOG.warn("{}: Exception on initial attempt at deleting base dir {}\n" ++ "attempting parallel delete", +getName(), baseDir, exception); + } +} + } + if (!baseDirDeleted) { Review Comment: Do we need to change the state of this variable here because if not it will go to the next if block ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/CleanupJobStage.java: ## @@ -142,64 +154,93 @@ protected Result executeStage( } Outcome outcome = null; -IOException exception; +IOException exception = null; +boolean baseDirDeleted = false; // to delete. LOG.info("{}: Deleting job directory {}", getName(), baseDir); if (args.deleteTaskAttemptDirsInParallel) { - // Attempt to do a parallel delete of task attempt dirs; - // don't overreact if a delete fails, but stop trying - // to delete the others, and fall back to deleting the - // job dir. - Path taskSubDir - = getStageConfig().getJobAttemptTaskSubDir(); - try (DurationInfo info = new DurationInfo(LOG, - "parallel deletion of task attempts in %s", - taskSubDir)) { -RemoteIterator dirs = -RemoteIterators.filteringRemoteIterator( -listStatusIterator(taskSubDir), -FileStatus::isDirectory); -TaskPool.foreach(dirs) -.executeWith(getIOProcessors()) -.stopOnFailure() -.suppressExceptions(false) -.run(this::rmTaskAttemptDir); -getIOStatistics().aggregate((retrieveIOStatistics(dirs))); - -if (getLastDeleteException() != null) { - // one of the task attempts failed. 
- throw getLastDeleteException(); + + // parallel delete of task attempt dirs. + + if (args.parallelDeleteAttemptBaseDeleteFirst) { +// attempt to delete the base dir first. +// This can reduce ABFS delete load but may time out +
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2058026632 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 54s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. | | +0 :ok: | xmllint | 0m 1s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 7 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 58s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 32m 39s | | trunk passed | | +1 :green_heart: | compile | 17m 26s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 16m 8s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 4m 23s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 57s | | trunk passed | | +1 :green_heart: | javadoc | 1m 39s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 32s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 2m 54s | | trunk passed | | +1 :green_heart: | shadedclient | 34m 10s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 34s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 1s | | the patch passed | | +1 :green_heart: | compile | 16m 43s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 16m 43s | | the patch passed | | +1 :green_heart: | compile | 16m 16s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 16m 16s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 4m 17s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/7/artifact/out/results-checkstyle-root.txt) | root: The patch generated 1 new + 22 unchanged - 0 fixed = 23 total (was 22) | | +1 :green_heart: | mvnsite | 1m 52s | | the patch passed | | +1 :green_heart: | javadoc | 1m 35s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 29s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 19s | | the patch passed | | +1 :green_heart: | shadedclient | 34m 20s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 7m 55s | | hadoop-mapreduce-client-core in the patch passed. | | +1 :green_heart: | unit | 2m 40s | | hadoop-azure in the patch passed. | | +1 :green_heart: | asflicense | 1m 4s | | The patch does not generate ASF License warnings. 
| | | | 229m 15s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/7/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6716 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint xmllint | | uname | Linux 34da63c6214c 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 9193085fea42ef84e475ed84303fc39c199ffd67 | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/7/testReport/ | | Max. process+thread count | 1585 (vs. ulimit of 5500) | | modules | C
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
mukund-thakur commented on code in PR #6716: URL: https://github.com/apache/hadoop/pull/6716#discussion_r1566388267 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java: ## @@ -582,19 +605,46 @@ protected final Path directoryMustExist( * Save a task manifest or summary. This will be done by * writing to a temp path and then renaming. * If the destination path exists: Delete it. + * This will retry so that a rename failure from abfs load or IO errors + * will not fail the task. * @param manifestData the manifest/success file * @param tempPath temp path for the initial save * @param finalPath final path for rename. - * @throws IOException failure to load/parse + * @throws IOException failure to rename after retries. */ @SuppressWarnings("unchecked") protected final void save(T manifestData, final Path tempPath, final Path finalPath) throws IOException { -LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, finalPath); -trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () -> -operations.save(manifestData, tempPath, true)); -renameFile(tempPath, finalPath); +boolean success = false; +int failures = 0; +while (!success) { + try { +LOG.trace("{}: attempt {} save('{}, {}, {}')", +getName(), failures, manifestData, tempPath, finalPath); + +trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () -> +operations.save(manifestData, tempPath, true)); +renameFile(tempPath, finalPath); Review Comment: rename may fail in second retry with SourcePathNotFound exception but it actually succeeded during the first try only. How are we recovering from that? 
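One hedged way to address this concern (illustrative only, and not what the patch itself does — the committer re-saves the temp file before each rename attempt instead): after a failed rename, probe whether the destination now exists, since the first attempt may have completed server-side before the error surfaced.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Sketch: a rename retry can hit "source not found" because the first
// attempt actually succeeded. Probing the destination distinguishes
// "already done" from a genuine failure. The in-memory file set is a
// stand-in for a real filesystem.
class RenameRecovery {
  final Set<String> files = new HashSet<>();

  void rename(String src, String dest) throws IOException {
    if (!files.remove(src)) {
      throw new IOException("source not found: " + src);
    }
    files.add(dest);
  }

  boolean renameWithProbe(String src, String dest) throws IOException {
    try {
      rename(src, dest);
      return true;
    } catch (IOException e) {
      if (files.contains(dest)) {
        // an earlier attempt moved the file: treat as recovered
        return true;
      }
      throw e;
    }
  }

  public static void main(String[] args) throws IOException {
    RenameRecovery fs = new RenameRecovery();
    fs.files.add("manifest.tmp");
    // first attempt succeeds
    if (!fs.renameWithProbe("manifest.tmp", "manifest.json")) {
      throw new AssertionError("first rename should succeed");
    }
    // "retry" after the source is gone: recovered via the probe
    if (!fs.renameWithProbe("manifest.tmp", "manifest.json")) {
      throw new AssertionError("retry should be recovered via the probe");
    }
  }
}
```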
## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java: ## @@ -445,9 +448,29 @@ protected Boolean delete( final boolean recursive, final String statistic) throws IOException { -return trackDuration(getIOStatistics(), statistic, () -> { - return operations.delete(path, recursive); -}); +if (recursive) { Review Comment: unable to understand this. How is a recursive flag determining that it is a dir or a file? ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/impl/ManifestStoreOperations.java: ## @@ -97,6 +97,36 @@ public boolean isFile(Path path) throws IOException { public abstract boolean delete(Path path, boolean recursive) throws IOException; + /** + * Forward to {@code delete(Path, true)} + * unless overridden. + * + * If it returns without an error: there is no file at + * the end of the path. + * @param path path + * @return outcome + * @throws IOException failure. + */ + public boolean deleteFile(Path path) + throws IOException { +return delete(path, false); + } + + /** + * Acquire the delete capacity then call {@code FileSystem#delete(Path, true)} Review Comment: don't see the ratelimiter acquiring the deletecapacity. ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/ManifestCommitterConstants.java: ## @@ -143,6 +145,20 @@ public final class ManifestCommitterConstants { */ public static final boolean OPT_CLEANUP_PARALLEL_DELETE_DIRS_DEFAULT = true; + /** + * Should parallel cleanup try to delete teh base first? + * Best for azure as it skips the task attempt deletions unless + * the toplevel delete fails. + * Value: {@value}. 
+   */
+  public static final String OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST =
+      OPT_PREFIX + "cleanup.parallel.delete.base.first";
+
+  /**
+   * Default value of option {@link #OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST}: {@value}.
+   */
+  public static final boolean OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST_DEFAULT = true;

Review Comment: As it is bad for GCS, shouldn't the default be false?

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/manifest_committer.md:

@@ -505,7 +491,7 @@ The core set of Azure-optimized options becomes
 spark.hadoop.fs.azure.io.rate.limit
-  1
+  1000

Review Comment: I guess you reduced this for testing?

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/manifest_committer.md:

@@ -523,7 +509,7 @@ And optional settings for debugging/performance analysis
 ```
 spark.hadoop.mapreduce.outputcommitter.factory.sc
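Following the GCS concern raised above, a deployment could opt out of base-first cleanup explicitly rather than rely on the default. Assuming `OPT_PREFIX` resolves to `mapreduce.manifest.committer.` (an assumption to verify against `ManifestCommitterConstants`), a hypothetical `spark-defaults.conf` fragment would be:

```
# Assumed full key: OPT_PREFIX + "cleanup.parallel.delete.base.first".
# Disable base-first delete for stores (such as GCS) where a top-level
# directory delete is not a cheap single operation.
spark.hadoop.mapreduce.manifest.committer.cleanup.parallel.delete.base.first  false
# Keep parallel deletion of the task attempt directories enabled.
spark.hadoop.mapreduce.manifest.committer.cleanup.parallel.delete  true
```

This keeps the Azure-friendly default in place while letting GCS jobs choose the per-directory cleanup path.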
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2052490254 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 31s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 7 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 2s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 32m 11s | | trunk passed | | +1 :green_heart: | compile | 17m 40s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 16m 21s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 4m 21s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 56s | | trunk passed | | +1 :green_heart: | javadoc | 1m 38s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 35s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 2m 55s | | trunk passed | | +1 :green_heart: | shadedclient | 34m 34s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 3s | | the patch passed | | +1 :green_heart: | compile | 16m 50s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 16m 50s | | the patch passed | | +1 :green_heart: | compile | 16m 28s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 16m 28s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 4m 14s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 48s | | the patch passed | | +1 :green_heart: | javadoc | 1m 35s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 33s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 18s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 30s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 7m 52s | | hadoop-mapreduce-client-core in the patch passed. | | +1 :green_heart: | unit | 2m 33s | | hadoop-azure in the patch passed. | | +1 :green_heart: | asflicense | 1m 4s | | The patch does not generate ASF License warnings. 
| | | | 230m 36s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/5/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6716 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint | | uname | Linux 412988d12e6f 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / a3117cf5bf7ba3a0cc4ffa0f3b1912241fa13c05 | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/5/testReport/ | | Max. process+thread count | 1592 (vs. ulimit of 5500) | | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-azure U: . | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/5/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Po
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2052380306 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 19s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 7 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 19s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 19m 45s | | trunk passed | | +1 :green_heart: | compile | 8m 54s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 8m 10s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 2m 4s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 11s | | trunk passed | | +1 :green_heart: | javadoc | 1m 7s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 55s | | trunk passed | | +1 :green_heart: | shadedclient | 20m 34s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 23s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 41s | | the patch passed | | +1 :green_heart: | compile | 8m 34s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 8m 34s | | the patch passed | | +1 :green_heart: | compile | 8m 11s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 8m 11s | | the patch passed | | +1 :green_heart: | blanks | 0m 1s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 2m 1s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 10s | | the patch passed | | +1 :green_heart: | javadoc | 1m 2s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 2s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 2m 2s | | the patch passed | | +1 :green_heart: | shadedclient | 20m 49s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 5m 44s | | hadoop-mapreduce-client-core in the patch passed. | | +1 :green_heart: | unit | 2m 2s | | hadoop-azure in the patch passed. | | +1 :green_heart: | asflicense | 0m 43s | | The patch does not generate ASF License warnings. 
| | | | 138m 37s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/6/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6716 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint | | uname | Linux b03fe1ba958a 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 16e1be4a39e8415ecb45ddd3c43c70c9a35f01b0 | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/6/testReport/ | | Max. process+thread count | 1530 (vs. ulimit of 5500) | | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-azure U: . | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/6/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Po
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2050651681 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 32s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +0 :ok: | markdownlint | 0m 1s | | markdownlint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 7 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 46s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 32m 21s | | trunk passed | | +1 :green_heart: | compile | 17m 28s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 16m 22s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 4m 22s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 54s | | trunk passed | | +1 :green_heart: | javadoc | 1m 39s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 27s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 2m 53s | | trunk passed | | +1 :green_heart: | shadedclient | 34m 9s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 0s | | the patch passed | | +1 :green_heart: | compile | 16m 47s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 16m 47s | | the patch passed | | +1 :green_heart: | compile | 16m 8s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 16m 8s | | the patch passed | | -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/3/artifact/out/blanks-eol.txt) | The patch has 3 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply | | +1 :green_heart: | checkstyle | 4m 19s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 55s | | the patch passed | | +1 :green_heart: | javadoc | 1m 36s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 32s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 17s | | the patch passed | | +1 :green_heart: | shadedclient | 34m 43s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 7m 47s | | hadoop-mapreduce-client-core in the patch passed. | | +1 :green_heart: | unit | 2m 32s | | hadoop-azure in the patch passed. | | +1 :green_heart: | asflicense | 1m 5s | | The patch does not generate ASF License warnings. 
| | | | 228m 34s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6716 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint | | uname | Linux 5470963506ac 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 3eb8b8742375fbdf056f9241b9f3c847cde5034e | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/3/testReport/ | | Max. process+thread count | 1367 (vs. ulimit of 5500) | | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hado
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2050579864 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 6m 46s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 7 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 30s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 19m 58s | | trunk passed | | +1 :green_heart: | compile | 8m 55s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 8m 15s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 2m 8s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 13s | | trunk passed | | +1 :green_heart: | javadoc | 1m 6s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 2s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 54s | | trunk passed | | +1 :green_heart: | shadedclient | 20m 35s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 22s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 40s | | the patch passed | | +1 :green_heart: | compile | 8m 30s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 8m 30s | | the patch passed | | +1 :green_heart: | compile | 8m 10s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 8m 10s | | the patch passed | | -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/4/artifact/out/blanks-eol.txt) | The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply | | +1 :green_heart: | checkstyle | 2m 1s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 12s | | the patch passed | | +1 :green_heart: | javadoc | 0m 57s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 0s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 2m 5s | | the patch passed | | +1 :green_heart: | shadedclient | 20m 44s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 5m 46s | | hadoop-mapreduce-client-core in the patch passed. | | +1 :green_heart: | unit | 2m 4s | | hadoop-azure in the patch passed. | | +1 :green_heart: | asflicense | 0m 42s | | The patch does not generate ASF License warnings. 
| | | | 145m 11s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/4/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6716 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint | | uname | Linux 8e0cc7ab2749 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 0ed84a98c5e630590f376a445eb28c27c2e8e780 | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/4/testReport/ | | Max. process+thread count | 1620 (vs. ulimit of 5500) | | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hado
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2050326218 Reviews invited from @mukund-thakur @anmolanmol1234 @anujmodi2021 @HarshitGupta11 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
hadoop-yetus commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2046034751 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 12m 27s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 3 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 45m 9s | | trunk passed | | +1 :green_heart: | compile | 0m 42s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 0m 39s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 0m 43s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 47s | | trunk passed | | +1 :green_heart: | javadoc | 0m 35s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 29s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 25s | | trunk passed | | +1 :green_heart: | shadedclient | 33m 50s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 32s | | the patch passed | | +1 :green_heart: | compile | 0m 32s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 0m 32s | | the patch passed | | +1 :green_heart: | compile | 0m 31s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 0m 30s | | the patch passed | | -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/1/artifact/out/blanks-eol.txt) | The patch has 2 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply | | +1 :green_heart: | checkstyle | 0m 31s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 35s | | the patch passed | | +1 :green_heart: | javadoc | 0m 20s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 20s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 22s | | the patch passed | | +1 :green_heart: | shadedclient | 33m 49s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 7m 41s | [/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/1/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt) | hadoop-mapreduce-client-core in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 147m 25s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.mapreduce.lib.output.committer.manifest.TestCreateOutputDirectoriesStage | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6716 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 27c1a33d07a2 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 4dc5a605e08c88f8529ce7d117ff6033a4e6594c | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/1/testReport/ | | Max. process+thread count | 1601 (vs. ulimit of 5500) | | modules | C: hadoop
Re: [PR] MAPREDUCE-7474. Improve Manifest committer resilience [hadoop]
steveloughran commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2045879335 testing: azure cardiff -Dparallel-tests=abfs -DtestsThreadCount=8. Lots of tests failed for me, but I've now got an account with a very low IO threshold. We need to look at those failures in general.