Re: [PR] fix: create file for empty stream [datafusion]

2025-07-08 Thread via GitHub


alamb commented on PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#issuecomment-3050189370

   - Reverted in https://github.com/apache/datafusion/pull/16682


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: create file for empty stream [datafusion]

2025-07-05 Thread via GitHub


brunal commented on code in PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#discussion_r2187061361


##
datafusion/datasource/src/file_sink_config.rs:
##
@@ -77,13 +79,34 @@ pub trait FileSink: DataSink {
 .runtime_env()
 .object_store(&config.object_store_url)?;
 let (demux_task, file_stream_rx) = start_demuxer_task(config, data, 
context);
-self.spawn_writer_tasks_and_join(
-context,
-demux_task,
-file_stream_rx,
-object_store,
-)
-.await
+let mut num_rows = self
+.spawn_writer_tasks_and_join(
+context,
+demux_task,
+file_stream_rx,
+Arc::clone(&object_store),
+)
+.await?;
+if num_rows == 0 {
+// If no rows were written, then no files are output either.

Review Comment:
   Indeed, I write one empty RecordBatch. Here is a short repro of the issue: 
https://github.com/brunal/datafusion/commit/aed5316583ecae7901d9596039ca4bfe7cd48811.
   
   I don't think num_rows should be used to determine whether a file was 
created.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: create file for empty stream [datafusion]

2025-07-04 Thread via GitHub


chenkovsky commented on code in PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#discussion_r2185571139


##
datafusion/datasource/src/file_sink_config.rs:
##
@@ -77,13 +79,34 @@ pub trait FileSink: DataSink {
 .runtime_env()
 .object_store(&config.object_store_url)?;
 let (demux_task, file_stream_rx) = start_demuxer_task(config, data, 
context);
-self.spawn_writer_tasks_and_join(
-context,
-demux_task,
-file_stream_rx,
-object_store,
-)
-.await
+let mut num_rows = self
+.spawn_writer_tasks_and_join(
+context,
+demux_task,
+file_stream_rx,
+Arc::clone(&object_store),
+)
+.await?;
+if num_rows == 0 {
+// If no rows were written, then no files are output either.

Review Comment:
   @alamb I think we need to define 'empty' in datafusion clearly. currently 
it's vec![] 
https://github.com/apache/datafusion/blob/a7151730a6c65a62793d293557ca15e975fcb701/datafusion/physical-plan/src/empty.rs#L70
 , I guess brunal uses `vec![empty_record_batch]`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: create file for empty stream [datafusion]

2025-07-04 Thread via GitHub


chenkovsky commented on code in PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#discussion_r2185358781


##
datafusion/datasource/src/file_sink_config.rs:
##
@@ -77,13 +79,34 @@ pub trait FileSink: DataSink {
 .runtime_env()
 .object_store(&config.object_store_url)?;
 let (demux_task, file_stream_rx) = start_demuxer_task(config, data, 
context);
-self.spawn_writer_tasks_and_join(
-context,
-demux_task,
-file_stream_rx,
-object_store,
-)
-.await
+let mut num_rows = self
+.spawn_writer_tasks_and_join(
+context,
+demux_task,
+file_stream_rx,
+Arc::clone(&object_store),
+)
+.await?;
+if num_rows == 0 {
+// If no rows were written, then no files are output either.

Review Comment:
   Hi, I didn't reproduce you problem. could you please share your test. I will 
check it ASAP



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: create file for empty stream [datafusion]

2025-07-04 Thread via GitHub


brunal commented on PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#issuecomment-3036100543

   How can i send a rollback of this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: create file for empty stream [datafusion]

2025-07-04 Thread via GitHub


brunal commented on PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#issuecomment-3036097695

   User facing breakage: one cannot explicitly write an empty recordbatch that 
has a schema anymore.
   
   The tests in the PR don't have a schema so they don't reveal the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: create file for empty stream [datafusion]

2025-07-04 Thread via GitHub


brunal commented on code in PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#discussion_r2185262228


##
datafusion/datasource/src/file_sink_config.rs:
##
@@ -77,13 +79,34 @@ pub trait FileSink: DataSink {
 .runtime_env()
 .object_store(&config.object_store_url)?;
 let (demux_task, file_stream_rx) = start_demuxer_task(config, data, 
context);
-self.spawn_writer_tasks_and_join(
-context,
-demux_task,
-file_stream_rx,
-object_store,
-)
-.await
+let mut num_rows = self
+.spawn_writer_tasks_and_join(
+context,
+demux_task,
+file_stream_rx,
+Arc::clone(&object_store),
+)
+.await?;
+if num_rows == 0 {
+// If no rows were written, then no files are output either.

Review Comment:
   You say now row => no file was created.
   
   But then you say write an empty recordbatch => ensure a file gets created.
   
   Except an empty recordbatch has no rows (at least when written to a parquet 
file).
   
   Your 2 sentences don't make sense together.
   
   In practice, this PR caused a regression: we cannot write empty recordbatch 
to parquet anymore,  as the code here tries to write it a second time, and we 
get an error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: create file for empty stream [datafusion]

2025-06-18 Thread via GitHub


alamb merged PR #16342:
URL: https://github.com/apache/datafusion/pull/16342


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: create file for empty stream [datafusion]

2025-06-10 Thread via GitHub


alamb commented on code in PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#discussion_r2138987979


##
datafusion/core/src/datasource/file_format/csv.rs:
##
@@ -795,6 +796,25 @@ mod tests {
 Ok(())
 }
 
+#[tokio::test]
+async fn test_csv_write_empty_file() -> Result<()> {

Review Comment:
   Can you please also add a test of what happens if we are writing into a 
directory (aka a partitioned file)? 
   
   Like a test that targets a path of `/foo`
   
   In this case, I think I would expect that no files are created when the data 
stream is empty



##
datafusion/datasource/src/file_sink_config.rs:
##
@@ -77,13 +79,32 @@ pub trait FileSink: DataSink {
 .runtime_env()
 .object_store(&config.object_store_url)?;
 let (demux_task, file_stream_rx) = start_demuxer_task(config, data, 
context);
-self.spawn_writer_tasks_and_join(
-context,
-demux_task,
-file_stream_rx,
-object_store,
-)
-.await
+let mut num_rows = self
+.spawn_writer_tasks_and_join(
+context,
+demux_task,
+file_stream_rx,
+Arc::clone(&object_store),
+)
+.await?;
+if num_rows == 0 {

Review Comment:
   I think it would help here to add some comments with context about what it 
is doing. For example, something like
   
   ```suggestion
   // If no rows were written, then no files are output either. In this 
case 
   // send an empty record batch through to ensure the output file is 
generated
   if num_rows == 0 {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: create file for empty stream [datafusion]

2025-06-09 Thread via GitHub


mmooyyii commented on PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#issuecomment-2957479062

   Maybe add same test for write_parquet and write_json?  I think they should 
have same behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]