Re: [PR] fix: Potential data lost without flush on writeSortedFileNative [datafusion-comet]

via GitHub Fri, 17 Oct 2025 07:06:04 -0700


mbutrovich commented on code in PR #2578:
URL: https://github.com/apache/datafusion-comet/pull/2578#discussion_r2432627173



##########
native/core/src/execution/shuffle/row.rs:
##########
@@ -839,6 +839,7 @@ pub fn process_sorted_row_partition(
             .open(&output_path)?;
 
         output_data.write_all(&frozen)?;
+        output_data.flush()?;

Review Comment:
   Does a partially written file read back successfully? Otherwise I don't 
think flushing each batch, resulting in an unrecoverable, partially-written 
file makes sense. If it's not readable, `flush()` should be outside of the loop 
after all batches have been written. If it is readable, I think `flush()` where 
you added it makes sense. Also it's misleading to say that this avoids data 
loss, since `flush()` does not do an `fsync()`.  I'm not proposing that we add 
an `fsync()`. I do not know what sort of durability guarantees Spark typically 
provides for its shuffle files when writing out batches.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: Potential data lost without flush on writeSortedFileNative [datafusion-comet]

Reply via email to