RushabhK opened a new pull request, #9993: URL: https://github.com/apache/incubator-gluten/pull/9993
## What changes were proposed in this pull request?

This PR fixes the failed-task file-write issue with the Manifest committer: https://github.com/apache/incubator-gluten/issues/9801

Before this change, when creating the new task attempt temporary path for the Manifest committer, the file path creation defaulted to the base write path (the `.spark-staging` directory). Sample log:

`1749656453822 25/06/11 15:40:53 [Executor task launch worker for task 206.0 in stage 2.0 (TID 1801)] ERROR VeloxColumnarWriteFilesRDD: Velox staging write path: gs://<some_path>/.spark-staging-a0487498-59b2-4317-a70f-b72f303e3bfb`

This led to all the base files being copied to the target location as part of the commit task.

(Fixes: #GLUTEN-9801)

With this fix, I upgrade the Hadoop client version from `3.3.4` to `3.3.6`, which has `ManifestCommitter` support. I then handle the `ManifestCommitter` case in the new task attempt temporary path creation to get the work path, similar to the `FileOutputCommitter`. Sample log after the fix:

`Velox staging write path: gs://<some_path>/.spark-staging-c5287b54-5545-47b5-908a-584d09787d71/_temporary/f12b689f-8508-4422-b7cd-aa79864e6428/00/tasks/attempt_202506131552457352180433024741873_0001_m_000147_148`

## How was this patch tested?

1. I took the Gluten build with these changes and built a new Spark image from it.
2. I have a Spark job that writes Parquet with 300 tasks, configured with 8 cores per executor.
3. While it is writing from the 300 tasks, I kill 5 of the executors (40 failed tasks); the job retries the tasks and then finishes.
4. I then read the Parquet files back and run a `df.count()` to materialize the read. With this fix, I no longer hit the invalid Parquet exception while reading the files, and my data matches Vanilla Spark's run exactly. I have tested this over multiple runs to validate the fix.
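The work-path dispatch described above can be sketched as follows. This is a minimal, self-contained Java sketch, not the actual Gluten change: `WritePathResolver`, `resolveWritePath`, and the stub committer classes are hypothetical stand-ins for the real Hadoop `FileOutputCommitter`/`ManifestCommitter` classes, which both expose a per-task work path under the job's temporary directory.

```java
// Hypothetical sketch of the work-path resolution; the real fix dispatches
// on the actual Hadoop committer classes, stubbed out here so the example
// is self-contained.

interface OutputCommitter {}

// Stand-in for Hadoop's FileOutputCommitter.
class FileOutputCommitter implements OutputCommitter {
    private final String workPath;
    FileOutputCommitter(String workPath) { this.workPath = workPath; }
    String getWorkPath() { return workPath; }
}

// Stand-in for Hadoop's ManifestCommitter (available from hadoop 3.3.6).
class ManifestCommitter implements OutputCommitter {
    private final String workPath;
    ManifestCommitter(String workPath) { this.workPath = workPath; }
    String getWorkPath() { return workPath; }
}

public class WritePathResolver {
    /**
     * Resolve the staging write path for a task attempt. Before the fix,
     * only FileOutputCommitter was handled, so ManifestCommitter jobs fell
     * through to the base .spark-staging directory; with the fix, both
     * committers yield their task-attempt work path.
     */
    static String resolveWritePath(OutputCommitter committer, String basePath) {
        if (committer instanceof FileOutputCommitter) {
            return ((FileOutputCommitter) committer).getWorkPath();
        }
        if (committer instanceof ManifestCommitter) {
            return ((ManifestCommitter) committer).getWorkPath();
        }
        // Fallback: the pre-fix behavior for any unhandled committer.
        return basePath;
    }
}
```

With a `ManifestCommitter`, the resolved path is now the task-attempt temporary path under `_temporary/.../tasks/attempt_...` instead of the base staging directory, so a failed task's leftover files are no longer swept into the commit.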
I also added logs for the Velox staging write path: https://github.com/RushabhK/incubator-gluten/blob/v1.3.0-fixes/backends-velox/src/main/scala/org/apache/spark/sql/execution/VeloxColumnarWriteFilesExec.scala#L213 The write path has now changed from the earlier base `.spark-staging` directory to a temporary path inside the `.spark-staging` directory, as shown in the sample logs above.
