Re: [PR] fix: create file for empty stream [datafusion]
alamb commented on PR #16342: URL: https://github.com/apache/datafusion/pull/16342#issuecomment-3050189370 - Reverted in https://github.com/apache/datafusion/pull/16682 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: create file for empty stream [datafusion]
brunal commented on code in PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#discussion_r2187061361
##
datafusion/datasource/src/file_sink_config.rs:
##
@@ -77,13 +79,34 @@ pub trait FileSink: DataSink {
.runtime_env()
.object_store(&config.object_store_url)?;
let (demux_task, file_stream_rx) = start_demuxer_task(config, data,
context);
-self.spawn_writer_tasks_and_join(
-context,
-demux_task,
-file_stream_rx,
-object_store,
-)
-.await
+let mut num_rows = self
+.spawn_writer_tasks_and_join(
+context,
+demux_task,
+file_stream_rx,
+Arc::clone(&object_store),
+)
+.await?;
+if num_rows == 0 {
+// If no rows were written, then no files are output either.
Review Comment:
Indeed, I write one empty RecordBatch. Here is a short repro of the issue:
https://github.com/brunal/datafusion/commit/aed5316583ecae7901d9596039ca4bfe7cd48811.
I don't think num_rows should be used to determine whether a file was
created.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: create file for empty stream [datafusion]
chenkovsky commented on code in PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#discussion_r2185571139
##
datafusion/datasource/src/file_sink_config.rs:
##
@@ -77,13 +79,34 @@ pub trait FileSink: DataSink {
.runtime_env()
.object_store(&config.object_store_url)?;
let (demux_task, file_stream_rx) = start_demuxer_task(config, data,
context);
-self.spawn_writer_tasks_and_join(
-context,
-demux_task,
-file_stream_rx,
-object_store,
-)
-.await
+let mut num_rows = self
+.spawn_writer_tasks_and_join(
+context,
+demux_task,
+file_stream_rx,
+Arc::clone(&object_store),
+)
+.await?;
+if num_rows == 0 {
+// If no rows were written, then no files are output either.
Review Comment:
@alamb I think we need to define 'empty' in datafusion clearly. currently
it's vec![]
https://github.com/apache/datafusion/blob/a7151730a6c65a62793d293557ca15e975fcb701/datafusion/physical-plan/src/empty.rs#L70
, I guess brunal uses `vec![empty_record_batch]`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: create file for empty stream [datafusion]
chenkovsky commented on code in PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#discussion_r2185358781
##
datafusion/datasource/src/file_sink_config.rs:
##
@@ -77,13 +79,34 @@ pub trait FileSink: DataSink {
.runtime_env()
.object_store(&config.object_store_url)?;
let (demux_task, file_stream_rx) = start_demuxer_task(config, data,
context);
-self.spawn_writer_tasks_and_join(
-context,
-demux_task,
-file_stream_rx,
-object_store,
-)
-.await
+let mut num_rows = self
+.spawn_writer_tasks_and_join(
+context,
+demux_task,
+file_stream_rx,
+Arc::clone(&object_store),
+)
+.await?;
+if num_rows == 0 {
+// If no rows were written, then no files are output either.
Review Comment:
Hi, I didn't reproduce you problem. could you please share your test. I will
check it ASAP
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: create file for empty stream [datafusion]
brunal commented on PR #16342: URL: https://github.com/apache/datafusion/pull/16342#issuecomment-3036100543 How can i send a rollback of this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: create file for empty stream [datafusion]
brunal commented on PR #16342: URL: https://github.com/apache/datafusion/pull/16342#issuecomment-3036097695 User facing breakage: one cannot explicitly write an empty recordbatch that has a schema anymore. The tests in the PR don't have a schema so they don't reveal the issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: create file for empty stream [datafusion]
brunal commented on code in PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#discussion_r2185262228
##
datafusion/datasource/src/file_sink_config.rs:
##
@@ -77,13 +79,34 @@ pub trait FileSink: DataSink {
.runtime_env()
.object_store(&config.object_store_url)?;
let (demux_task, file_stream_rx) = start_demuxer_task(config, data,
context);
-self.spawn_writer_tasks_and_join(
-context,
-demux_task,
-file_stream_rx,
-object_store,
-)
-.await
+let mut num_rows = self
+.spawn_writer_tasks_and_join(
+context,
+demux_task,
+file_stream_rx,
+Arc::clone(&object_store),
+)
+.await?;
+if num_rows == 0 {
+// If no rows were written, then no files are output either.
Review Comment:
You say now row => no file was created.
But then you say write an empty recordbatch => ensure a file gets created.
Except an empty recordbatch has no rows (at least when written to a parquet
file).
Your 2 sentences don't make sense together.
In practice, this PR caused a regression: we cannot write empty recordbatch
to parquet anymore, as the code here tries to write it a second time, and we
get an error.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: create file for empty stream [datafusion]
alamb merged PR #16342: URL: https://github.com/apache/datafusion/pull/16342 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: create file for empty stream [datafusion]
alamb commented on code in PR #16342:
URL: https://github.com/apache/datafusion/pull/16342#discussion_r2138987979
##
datafusion/core/src/datasource/file_format/csv.rs:
##
@@ -795,6 +796,25 @@ mod tests {
Ok(())
}
+#[tokio::test]
+async fn test_csv_write_empty_file() -> Result<()> {
Review Comment:
Can you please also add a test of what happens if we are writing into a
directory (aka a partitioned file)?
Like a test that targets a path of `/foo`
In this case, I think I would expect that no files are created when the data
stream is empty
##
datafusion/datasource/src/file_sink_config.rs:
##
@@ -77,13 +79,32 @@ pub trait FileSink: DataSink {
.runtime_env()
.object_store(&config.object_store_url)?;
let (demux_task, file_stream_rx) = start_demuxer_task(config, data,
context);
-self.spawn_writer_tasks_and_join(
-context,
-demux_task,
-file_stream_rx,
-object_store,
-)
-.await
+let mut num_rows = self
+.spawn_writer_tasks_and_join(
+context,
+demux_task,
+file_stream_rx,
+Arc::clone(&object_store),
+)
+.await?;
+if num_rows == 0 {
Review Comment:
I think it would help here to add some comments with context about what it
is doing. For example, something like
```suggestion
// If no rows were written, then no files are output either. In this
case
// send an empty record batch through to ensure the output file is
generated
if num_rows == 0 {
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: create file for empty stream [datafusion]
mmooyyii commented on PR #16342: URL: https://github.com/apache/datafusion/pull/16342#issuecomment-2957479062 Maybe add same test for write_parquet and write_json? I think they should have same behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
