alamb commented on code in PR #7244:
URL: https://github.com/apache/arrow-datafusion/pull/7244#discussion_r1292754610
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -543,6 +575,318 @@ async fn fetch_statistics(
Ok(statistics)
}
+/// Implements [`DataSink`] for writing to a parquet file.
+struct ParquetSink {
+    /// Config options for writing data
+    config: FileSinkConfig,
+}
+
+impl Debug for ParquetSink {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        f.debug_struct("ParquetSink").finish()
+    }
+}
+
+impl DisplayAs for ParquetSink {
+    fn fmt_as(&self, t: DisplayFormatType, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        match t {
+            DisplayFormatType::Default | DisplayFormatType::Verbose => {
+                write!(
+                    f,
+                    "ParquetSink(writer_mode={:?}, file_groups=",
+                    self.config.writer_mode
+                )?;
+                FileGroupDisplay(&self.config.file_groups).fmt_as(t, f)?;
+                write!(f, ")")
+            }
+        }
+    }
+}
+
+/// Parses datafusion.execution.parquet.encoding String to a parquet::basic::Encoding
+fn parse_encoding_string(str_setting: &str) -> Result<parquet::basic::Encoding> {
+    let str_setting_lower: &str = &str_setting.to_lowercase();
+    match str_setting_lower {
+        "plain" => Ok(parquet::basic::Encoding::PLAIN),
Review Comment:
👍 this code could also probably be upstreamed to arrow-rs -- I'll file a
ticket to suggest it
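For reference, a hedged sketch of what such an upstreamed helper might look like in the parquet crate. The `Encoding` variants are real, but the function name, the selection of match arms, and the `String` error type are illustrative assumptions, not actual arrow-rs API:
```rust
use parquet::basic::Encoding;

/// Illustrative stand-in for an upstreamed, case-insensitive encoding
/// parser; not the actual arrow-rs API.
fn parse_encoding(s: &str) -> Result<Encoding, String> {
    match s.to_lowercase().as_str() {
        "plain" => Ok(Encoding::PLAIN),
        "rle" => Ok(Encoding::RLE),
        "delta_binary_packed" => Ok(Encoding::DELTA_BINARY_PACKED),
        "delta_length_byte_array" => Ok(Encoding::DELTA_LENGTH_BYTE_ARRAY),
        "delta_byte_array" => Ok(Encoding::DELTA_BYTE_ARRAY),
        "rle_dictionary" => Ok(Encoding::RLE_DICTIONARY),
        "byte_stream_split" => Ok(Encoding::BYTE_STREAM_SPLIT),
        other => Err(format!("unknown or unsupported parquet encoding: {other}")),
    }
}
```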
##########
datafusion/core/tests/sql/expr.rs:
##########
@@ -448,8 +448,10 @@ async fn test_substring_expr() -> Result<()> {
Ok(())
}
+/// String expression tests, split into two batches
Review Comment:
I am not concerned -- your comment is good enough.
What we have found is that this large-stack-frame problem only affects debug
builds (I think because the rust compiler isn't reusing stack space, so that
debugger output is better) -- release builds have much smaller stack requirements.
In this case we could probably make an individual test function for each
expression, rather than a few functions that each test more than one
expression -- see the sketch below.
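A minimal sketch of that split (the `eval` helper and the test names are made up for illustration): one `#[test]` per expression keeps each debug-build stack frame small, instead of one function whose frame holds the locals of every assertion at once.
```rust
/// Illustrative stand-in for the real query-evaluation helper.
fn eval(expr: &str) -> String {
    format!("evaluated: {expr}")
}

// Each expression gets its own test function, and therefore its own
// (small) stack frame in debug builds.
#[test]
fn test_substring_basic() {
    assert_eq!(eval("substr('alphabet', 3)"), "evaluated: substr('alphabet', 3)");
}

#[test]
fn test_substring_with_length() {
    assert_eq!(eval("substr('alphabet', 3, 2)"), "evaluated: substr('alphabet', 3, 2)");
}
```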
##########
datafusion/core/src/datasource/file_format/options.rs:
##########
@@ -250,6 +266,18 @@ impl<'a> ParquetReadOptions<'a> {
        self.table_partition_cols = table_partition_cols;
        self
    }
+
+    /// Configure the file's known sort order, if any
+    pub fn file_sort_order(mut self, file_sort_order: Vec<Vec<Expr>>) -> Self {
Review Comment:
I tested out the use case in
https://github.com/apache/arrow-datafusion/issues/7036 and I sadly think there
is a little bit more to go:
```
❯ create external table foo stored as parquet location '/tmp/foo' with order (time);
Error during planning: Provide a schema before specifying the order while creating a table.
❯ create external table foo(cpu varchar, host1 varchar, time timestamp) stored as parquet location '/tmp/foo' with order (time);
Error during planning: Column definitions can not be specified for PARQUET files.
```
But now that you have added the underlying feature implementation, I think
making it happen will be a small matter of plumbing.
I updated the PR description to say "part of
https://github.com/apache/arrow-datafusion/issues/7036"
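In the meantime, the new builder method from this diff can declare the order through the Rust API. A minimal sketch, assuming parquet files under `/tmp/foo` that are sorted by `time` ascending; `register_parquet`, `col`, and `Expr::sort` are existing DataFusion APIs, while `file_sort_order` is the method added in this PR:
```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Declare that the files are already sorted by `time` ascending
    // (nulls last), so the scan can expose that ordering to the planner.
    let options = ParquetReadOptions::default()
        .file_sort_order(vec![vec![col("time").sort(true, false)]]);
    ctx.register_parquet("foo", "/tmp/foo", options).await?;
    ctx.sql("SELECT * FROM foo ORDER BY time").await?.show().await?;
    Ok(())
}
```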
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]