Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-08-08 Thread via GitHub


github-actions[bot] closed pull request #16166: Set Formatted TableOptions Enum
URL: https://github.com/apache/datafusion/pull/16166


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-07-31 Thread via GitHub


github-actions[bot] commented on PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#issuecomment-3141928497

   Thank you for your contribution. Unfortunately, this pull request is stale 
because it has been open 60 days with no activity. Please remove the stale 
label or comment or this will be closed in 7 days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-05-30 Thread via GitHub


alamb commented on code in PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#discussion_r2115541241


##
datafusion/common/src/config.rs:
##
@@ -1566,8 +1517,68 @@ impl TableOptions {
 /// # Parameters
 ///
 /// * `format`: The file format to use (e.g., CSV, Parquet).
-pub fn set_config_format(&mut self, format: ConfigFileType) {
-self.current_format = Some(format);
+pub fn get_table_format_options(&self, format: ConfigFileType) -> 
TableFormatOptions {
+match format {
+ConfigFileType::CSV => {
+let options = self.csv.clone();
+TableFormatOptions::Csv {
+options,
+extensions: self.extensions.clone(),
+}
+}
+#[cfg(feature = "parquet")]
+ConfigFileType::PARQUET => {
+let options = self.parquet.clone();
+TableFormatOptions::Parquet {
+options,
+extensions: self.extensions.clone(),
+}
+}
+ConfigFileType::JSON => {
+let options = self.json.clone();
+TableFormatOptions::Json {
+options,
+extensions: self.extensions.clone(),
+}
+}
+}
+}
+
+/// Initializes a new `TableOptions` from a hash map of string settings.
+///
+/// # Parameters
+///
+/// * `settings`: A hash map where each key-value pair represents a 
configuration setting.
+///
+/// # Returns
+///
+/// A result containing the new `TableOptions` instance or an error if any 
setting could not be applied.
+pub fn from_string_hash_map(settings: &HashMap) -> 
Result {
+let mut ret = Self::default();
+for (k, v) in settings {
+ret.set(k, v)?;
+}
+
+Ok(ret)
+}
+
+/// Modifies the current `TableOptions` instance with settings from a hash 
map.
+///
+/// # Parameters
+///
+/// * `settings`: A hash map where each key-value pair represents a 
configuration setting.
+///
+/// # Returns
+///
+/// A result indicating success or failure in applying the settings.
+pub fn alter_with_string_hash_map(

Review Comment:
   maybe `update` would be clearer as `alter` might imply some change other 
than updating the map



##
datafusion/common/src/config.rs:
##
@@ -1612,42 +1623,241 @@ impl TableOptions {
 };
 e.0.set(key, value)
 }
+}
 
-/// Initializes a new `TableOptions` from a hash map of string settings.
+#[derive(Debug, Clone)]
+#[allow(clippy::large_enum_variant)]
+pub enum TableFormatOptions {
+Csv {
+options: CsvOptions,
+extensions: Extensions,
+},
+#[cfg(feature = "parquet")]
+Parquet {
+options: TableParquetOptions,
+extensions: Extensions,
+},
+Json {
+options: JsonOptions,
+extensions: Extensions,
+},
+}
+
+impl ConfigField for TableFormatOptions {
+/// Visits configuration settings for the current file format, or all 
formats if none is selected.
+///
+/// This method adapts the behavior based on whether a file format is 
currently selected in `current_format`.
+/// If a format is selected, it visits only the settings relevant to that 
format. Otherwise,
+/// it visits all available format settings.
+fn visit(&self, v: &mut V, _key_prefix: &str, _description: 
&'static str) {
+match self {
+#[cfg(feature = "parquet")]
+TableFormatOptions::Parquet { options, .. } => {
+options.visit(v, "format", "");
+}
+TableFormatOptions::Csv { options, .. } => {
+options.visit(v, "format", "");
+}
+TableFormatOptions::Json { options, .. } => {
+options.visit(v, "format", "");
+}
+}
+}
+
+/// Sets a configuration value for a specific key within `TableOptions`.
+///
+/// This method delegates setting configuration values to the specific 
file format configurations,
+/// based on the current format selected. If no format is selected, it 
returns an error.
 ///
 /// # Parameters
 ///
-/// * `settings`: A hash map where each key-value pair represents a 
configuration setting.
+/// * `key`: The configuration key specifying which setting to adjust, 
prefixed with the format (e.g., "format.delimiter")
+///   for CSV format.
+/// * `value`: The value to set for the specified configuration key.
 ///
 /// # Returns
 ///
-/// A result containing the new `TableOptions` instance or an error if any 
setting could not be applied.
-pub fn from_string_hash_map(settings: &HashMap) -> 
Result {
-let mut ret = Self::default();
-for (k, v) in settings {
-ret.set(k, v)?;
+///

Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-05-30 Thread via GitHub


alamb commented on PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#issuecomment-2921854046

   
   > > However, I didn't see any tests or examples that show a different 
behavior after this PR.
   > > In my initial quick read through it I would say it has made the APIs 
harder to use (as now there are format specific options on `TableOptions` AND a 
new `TableFormatOptions`). Is there any way to consolidate the number of 
structures? If not, I think we should clearly document the difference between 
the two structures
   > 
   > The confusion we are trying to address is that the TableOptions struct is 
currently being used in two different ways:
   > 
   > 1. As a sub-configuration field in SessionState, similar to other 
configuration fields.
   > 2. For directly configuring the specific file currently being worked on, 
typically by cloning it from the SessionState and overriding fields as needed.,
   
   Yes I agree this is confusing
   
   > In other words, it seems like we actually need two distinct objects for 
these two different use cases:
   > 
   > * One for storing global TableOptions configuration across the session, 
applicable to all file types.
   > * Another for holding format-specific options derived from the global 
configuration, intended for the currently active file.
   >   Another key point is that once you set a specific file format, you don’t 
need the configuration fields for the others in the existing TableOptions. So 
for the second use case, the appropriate design might be an enum rather than a 
struct.
   > 
   > In summary, this PR is more of a code cleanup than a functional change. It 
aims to clarify the separation of concerns and improve maintainability, in my 
opinion.
   
   I see. Thank you.
   
   I think we could largely resolve the confusion with some better 
documentation (perhaps documentation that explained the differences clearly so 
that when someone stumbled on the very similarly named `TableOptions` and 
`TableFormatOptions` there would be some context to help them understand
   
   Here are some other ideas:
   
   
   ## Different Names
   
   Use names that make the difference between the structs clearer. 
   
   Perhaps something like the following:
   * `DefaultTableOptions` for the structure on `ConfigOptions` (currently 
called `TableOptions`)  
   * `ResolvedTableOptions` or `SpecificTableOptions` for configuring a file 
specific option (called `TableFormatOptions` in this PR)
   
   
   ## Use the inner structures directly
   Once we know the format, could we potentially then change the code to use 
the `ParquetOptions` or `CsvOptions` directly?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-05-29 Thread via GitHub


berkaysynnada commented on code in PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#discussion_r2113297575


##
datafusion/datasource/src/file_format.rs:
##
@@ -120,7 +121,26 @@ pub trait FileFormatFactory: Sync + Send + GetExt + 
fmt::Debug {
 &self,
 state: &dyn Session,
 format_options: &HashMap,
-) -> Result>;
+) -> Result> {
+let options = match self.options() {
+(None, file_type) => {
+let mut table_options = state.file_table_options(file_type);
+table_options.alter_with_string_hash_map(format_options)?;
+table_options.output_format()
+}
+(Some(options), _) => {
+let mut table_options = 
TableFormatOptions::from_output_format(options);
+table_options.alter_with_string_hash_map(format_options)?;
+table_options.output_format()
+}
+};
+
+Ok(self.default_from_output_format(options))
+}
+
+fn default_from_output_format(&self, options: OutputFormat) -> Arc;
+
+fn options(&self) -> (Option, ConfigFileType);

Review Comment:
   Yeah, this confuses me now too. Having ConfigFileType and OutputFormat at 
the same time is also a bit weird. I'll take a second look at this part



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-05-28 Thread via GitHub


berkaysynnada commented on code in PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#discussion_r2113294832


##
datafusion/datasource/src/file_format.rs:
##
@@ -120,7 +121,26 @@ pub trait FileFormatFactory: Sync + Send + GetExt + 
fmt::Debug {
 &self,
 state: &dyn Session,
 format_options: &HashMap,
-) -> Result>;
+) -> Result> {
+let options = match self.options() {

Review Comment:
   They should implement `default_from_output_format` and `options`. This 
change is not directly related to the actual changes of the PR, but all 
create() methods show a similar pattern, so I remember we thought like why not 
avoid duplication



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-05-28 Thread via GitHub


berkaysynnada commented on code in PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#discussion_r2113286434


##
datafusion/common/src/config.rs:
##
@@ -1612,42 +1623,241 @@ impl TableOptions {
 };
 e.0.set(key, value)
 }
+}
 
-/// Initializes a new `TableOptions` from a hash map of string settings.
+#[derive(Debug, Clone)]
+#[allow(clippy::large_enum_variant)]
+pub enum TableFormatOptions {

Review Comment:
   I will resolve these if I can relieve your concerns :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-05-28 Thread via GitHub


berkaysynnada commented on PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#issuecomment-2918419765

   > Thank you for this contribution @berkaysynnada and @mertak-synnada
   > 
   > I am a little confused about the new structure and exactly what problem is 
being solved with this change.
   > 
   > The PR description says
   > 
   > > In the current implementation, the TableOptions struct stores 
information for CSV, JSON, and Parquet file types, and its behavior changes 
depending on whether the file format is set before or after.
   > 
   > However, I didn't see any tests or examples that show a different behavior 
after this PR.
   > 
   > In my initial quick read through it I would say it has made the APIs 
harder to use (as now there are format specific options on `TableOptions` AND a 
new `TableFormatOptions`). Is there any way to consolidate the number of 
structures? If not, I think we should clearly document the difference between 
the two structures
   
   The confusion we are trying to address is that the TableOptions struct is 
currently being used in two different ways:
   1) As a sub-configuration field in SessionState, similar to other 
configuration fields.
   2) For directly configuring the specific file currently being worked on, 
typically by cloning it from the SessionState and overriding fields as needed.,
   
   In other words, it seems like we actually need two distinct objects for 
these two different use cases: 
   - One for storing global TableOptions configuration across the session, 
applicable to all file types.
   - Another for holding format-specific options derived from the global 
configuration, intended for the currently active file.
   Another key point is that once you set a specific file format, you don’t 
need the configuration fields for the others in the existing TableOptions. So 
for the second use case, the appropriate design might be an enum rather than a 
struct.
   
   In summary, this PR is more of a code cleanup than a functional change. It 
aims to clarify the separation of concerns and improve maintainability, in my 
opinion.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-05-28 Thread via GitHub


alamb commented on code in PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#discussion_r2112474913


##
datafusion/datasource/src/file_format.rs:
##
@@ -120,7 +121,26 @@ pub trait FileFormatFactory: Sync + Send + GetExt + 
fmt::Debug {
 &self,
 state: &dyn Session,
 format_options: &HashMap,
-) -> Result>;
+) -> Result> {
+let options = match self.options() {
+(None, file_type) => {
+let mut table_options = state.file_table_options(file_type);
+table_options.alter_with_string_hash_map(format_options)?;
+table_options.output_format()
+}
+(Some(options), _) => {
+let mut table_options = 
TableFormatOptions::from_output_format(options);
+table_options.alter_with_string_hash_map(format_options)?;
+table_options.output_format()
+}
+};
+
+Ok(self.default_from_output_format(options))
+}
+
+fn default_from_output_format(&self, options: OutputFormat) -> Arc;

Review Comment:
   Can we please document these new public APIs?



##
datafusion/common/src/config.rs:
##
@@ -1612,42 +1623,241 @@ impl TableOptions {
 };
 e.0.set(key, value)
 }
+}
 
-/// Initializes a new `TableOptions` from a hash map of string settings.
+#[derive(Debug, Clone)]
+#[allow(clippy::large_enum_variant)]
+pub enum TableFormatOptions {

Review Comment:
   Since this is a new pub struct, can we please add documentation explaining
   1.  what it is used for
   2. How it is different than `TableOptions`?
   
   



##
datafusion/datasource/src/file_format.rs:
##
@@ -120,7 +121,26 @@ pub trait FileFormatFactory: Sync + Send + GetExt + 
fmt::Debug {
 &self,
 state: &dyn Session,
 format_options: &HashMap,
-) -> Result>;
+) -> Result> {
+let options = match self.options() {

Review Comment:
   since this method is part of a trait, I think it can be overridden by other 
implementations. Is that the intention? 
   
   In other words, do you expect anyone implementing FileFormatFactory to 
implement `create()` or should they now implement `default_from_output_format` 
and `options`?



##
datafusion/common/src/config.rs:
##
@@ -1612,42 +1623,241 @@ impl TableOptions {
 };
 e.0.set(key, value)
 }
+}
 
-/// Initializes a new `TableOptions` from a hash map of string settings.
+#[derive(Debug, Clone)]
+#[allow(clippy::large_enum_variant)]
+pub enum TableFormatOptions {
+Csv {
+options: CsvOptions,
+extensions: Extensions,
+},
+#[cfg(feature = "parquet")]
+Parquet {
+options: TableParquetOptions,
+extensions: Extensions,
+},
+Json {
+options: JsonOptions,
+extensions: Extensions,
+},
+}
+
+impl ConfigField for TableFormatOptions {
+/// Visits configuration settings for the current file format, or all 
formats if none is selected.
+///
+/// This method adapts the behavior based on whether a file format is 
currently selected in `current_format`.
+/// If a format is selected, it visits only the settings relevant to that 
format. Otherwise,
+/// it visits all available format settings.
+fn visit(&self, v: &mut V, _key_prefix: &str, _description: 
&'static str) {
+match self {
+#[cfg(feature = "parquet")]
+TableFormatOptions::Parquet { options, .. } => {
+options.visit(v, "format", "");
+}
+TableFormatOptions::Csv { options, .. } => {
+options.visit(v, "format", "");
+}
+TableFormatOptions::Json { options, .. } => {
+options.visit(v, "format", "");
+}
+}
+}
+
+/// Sets a configuration value for a specific key within `TableOptions`.

Review Comment:
   this is in `TableFormatOptions` I think 



##
datafusion/common/src/config.rs:
##
@@ -1612,42 +1623,241 @@ impl TableOptions {
 };
 e.0.set(key, value)
 }
+}
 
-/// Initializes a new `TableOptions` from a hash map of string settings.
+#[derive(Debug, Clone)]
+#[allow(clippy::large_enum_variant)]
+pub enum TableFormatOptions {
+Csv {
+options: CsvOptions,
+extensions: Extensions,
+},
+#[cfg(feature = "parquet")]
+Parquet {
+options: TableParquetOptions,
+extensions: Extensions,
+},
+Json {
+options: JsonOptions,
+extensions: Extensions,
+},
+}
+
+impl ConfigField for TableFormatOptions {
+/// Visits configuration settings for the current file format, or all 
formats if none is selected.
+///
+/// This method adapts the behavior based on whether a file format is 
currently selected in `current_format`.
+/// If a format is selected, it visits only the settings relevan

[PR] Set Formatted TableOptions Enum [datafusion]

2025-05-23 Thread via GitHub


mertak-synnada opened a new pull request, #16166:
URL: https://github.com/apache/datafusion/pull/16166

   ## Which issue does this PR close?
   
   
   
   - Closes #.
   In the current implementation, the TableOptions struct stores information 
for CSV, JSON, and Parquet file types, and its behavior changes depending on 
whether the file format is set before or after. This PR introduces a new struct 
for file-type-specific cases, while keeping TableOptions as the default state, 
since it is also used in the session before a file type is specified.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]