alamb commented on code in PR #7244:
URL: https://github.com/apache/arrow-datafusion/pull/7244#discussion_r1292754610
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -543,6 +575,318 @@ async fn fetch_statistics(
Ok(statistics)
}
+/// Implements [`DataSink`] for writing to a parquet file.
+struct ParquetSink {
+    /// Config options for writing data
+    config: FileSinkConfig,
+}
+
+impl Debug for ParquetSink {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        f.debug_struct("ParquetSink").finish()
+    }
+}
+
+impl DisplayAs for ParquetSink {
+    fn fmt_as(&self, t: DisplayFormatType, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        match t {
+            DisplayFormatType::Default | DisplayFormatType::Verbose => {
+                write!(
+                    f,
+                    "ParquetSink(writer_mode={:?}, file_groups=",
+                    self.config.writer_mode
+                )?;
+                FileGroupDisplay(&self.config.file_groups).fmt_as(t, f)?;
+                write!(f, ")")
+            }
+        }
+    }
+}
+
+/// Parses datafusion.execution.parquet.encoding String to a parquet::basic::Encoding
+fn parse_encoding_string(str_setting: &str) -> Result<parquet::basic::Encoding> {
+    let str_setting_lower: &str = &str_setting.to_lowercase();
+    match str_setting_lower {
+        "plain" => Ok(parquet::basic::Encoding::PLAIN),
Review Comment:
👍 this code could also probably be upstreamed to arrow-rs -- I'll file a
ticket to suggest it
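For reference, a hedged sketch of what such an upstreamed helper might look like in the parquet crate. The `Encoding` variants are real, but the function name, the selection of match arms, and the `String` error type are illustrative assumptions, not actual arrow-rs API:
```rust
use parquet::basic::Encoding;

/// Illustrative stand-in for an upstreamed, case-insensitive encoding
/// parser; not the actual arrow-rs API.
fn parse_encoding(s: &str) -> Result<Encoding, String> {
    match s.to_lowercase().as_str() {
        "plain" => Ok(Encoding::PLAIN),
        "rle" => Ok(Encoding::RLE),
        "delta_binary_packed" => Ok(Encoding::DELTA_BINARY_PACKED),
        "delta_length_byte_array" => Ok(Encoding::DELTA_LENGTH_BYTE_ARRAY),
        "delta_byte_array" => Ok(Encoding::DELTA_BYTE_ARRAY),
        "rle_dictionary" => Ok(Encoding::RLE_DICTIONARY),
        "byte_stream_split" => Ok(Encoding::BYTE_STREAM_SPLIT),
        other => Err(format!("unknown or unsupported parquet encoding: {other}")),
    }
}
```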
##########
datafusion/core/tests/sql/expr.rs:
##########
@@ -448,8 +448,10 @@ async fn test_substring_expr() -> Result<()> {
Ok(())
}
+/// String expression tests, split into two batches
Review Comment:
I am not concerned -- your comment is good enough.
What we have found is that this large-stack-frame problem only affects debug
builds (I think because the rust compiler isn't reusing stack space, so that
debugger output is better) -- release builds have much smaller stack requirements.
In this case we could probably make an individual test function for each
expression, rather than a few functions that each test more than one
expression -- see the sketch below.
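A minimal sketch of that split (the `eval` helper and the test names are made up for illustration): one `#[test]` per expression keeps each debug-build stack frame small, instead of one function whose frame holds the locals of every assertion at once.
```rust
/// Illustrative stand-in for the real query-evaluation helper.
fn eval(expr: &str) -> String {
    format!("evaluated: {expr}")
}

// Each expression gets its own test function, and therefore its own
// (small) stack frame in debug builds.
#[test]
fn test_substring_basic() {
    assert_eq!(eval("substr('alphabet', 3)"), "evaluated: substr('alphabet', 3)");
}

#[test]
fn test_substring_with_length() {
    assert_eq!(eval("substr('alphabet', 3, 2)"), "evaluated: substr('alphabet', 3, 2)");
}
```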
##########
datafusion/core/src/datasource/file_format/options.rs:
##########
@@ -250,6 +266,18 @@ impl<'a> ParquetReadOptions<'a> {
        self.table_partition_cols = table_partition_cols;
        self
    }
+
+    /// Configure the file's known sort order, if any
+    pub fn file_sort_order(mut self, file_sort_order: Vec<Vec<Expr>>) -> Self {
Review Comment:
I tested out the use case in
https://github.com/apache/arrow-datafusion/issues/7036 and I sadly think there
is a little bit more to go:
```
❯ create external table foo stored as parquet location '/tmp/foo' with order (time);
Error during planning: Provide a schema before specifying the order while creating a table.
❯ create external table foo(cpu varchar, host1 varchar, time timestamp) stored as parquet location '/tmp/foo' with order (time);
Error during planning: Column definitions can not be specified for PARQUET files.
```
But now that you have added the underlying feature implementation, I think
making it happen will be a small matter of plumbing.
I updated the PR description to say "part of
https://github.com/apache/arrow-datafusion/issues/7036"
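In the meantime, the new builder method from this diff can declare the order through the Rust API. A minimal sketch, assuming parquet files under `/tmp/foo` that are sorted by `time` ascending; `register_parquet`, `col`, and `Expr::sort` are existing DataFusion APIs, while `file_sort_order` is the method added in this PR:
```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Declare that the files are already sorted by `time` ascending
    // (nulls last), so the scan can expose that ordering to the planner.
    let options = ParquetReadOptions::default()
        .file_sort_order(vec![vec![col("time").sort(true, false)]]);
    ctx.register_parquet("foo", "/tmp/foo", options).await?;
    ctx.sql("SELECT * FROM foo ORDER BY time").await?.show().await?;
    Ok(())
}
```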
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]