Re: [PR] Set Formatted TableOptions Enum [datafusion]
github-actions[bot] closed pull request #16166: Set Formatted TableOptions Enum URL: https://github.com/apache/datafusion/pull/16166 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Set Formatted TableOptions Enum [datafusion]
github-actions[bot] commented on PR #16166: URL: https://github.com/apache/datafusion/pull/16166#issuecomment-3141928497 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Set Formatted TableOptions Enum [datafusion]
alamb commented on code in PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#discussion_r2115541241
##
datafusion/common/src/config.rs:
##
@@ -1566,8 +1517,68 @@ impl TableOptions {
/// # Parameters
///
/// * `format`: The file format to use (e.g., CSV, Parquet).
-pub fn set_config_format(&mut self, format: ConfigFileType) {
-self.current_format = Some(format);
+pub fn get_table_format_options(&self, format: ConfigFileType) ->
TableFormatOptions {
+match format {
+ConfigFileType::CSV => {
+let options = self.csv.clone();
+TableFormatOptions::Csv {
+options,
+extensions: self.extensions.clone(),
+}
+}
+#[cfg(feature = "parquet")]
+ConfigFileType::PARQUET => {
+let options = self.parquet.clone();
+TableFormatOptions::Parquet {
+options,
+extensions: self.extensions.clone(),
+}
+}
+ConfigFileType::JSON => {
+let options = self.json.clone();
+TableFormatOptions::Json {
+options,
+extensions: self.extensions.clone(),
+}
+}
+}
+}
+
+/// Initializes a new `TableOptions` from a hash map of string settings.
+///
+/// # Parameters
+///
+/// * `settings`: A hash map where each key-value pair represents a
configuration setting.
+///
+/// # Returns
+///
+/// A result containing the new `TableOptions` instance or an error if any
setting could not be applied.
+pub fn from_string_hash_map(settings: &HashMap) ->
Result {
+let mut ret = Self::default();
+for (k, v) in settings {
+ret.set(k, v)?;
+}
+
+Ok(ret)
+}
+
+/// Modifies the current `TableOptions` instance with settings from a hash
map.
+///
+/// # Parameters
+///
+/// * `settings`: A hash map where each key-value pair represents a
configuration setting.
+///
+/// # Returns
+///
+/// A result indicating success or failure in applying the settings.
+pub fn alter_with_string_hash_map(
Review Comment:
maybe `update` would be clearer as `alter` might imply some change other
than updating the map
##
datafusion/common/src/config.rs:
##
@@ -1612,42 +1623,241 @@ impl TableOptions {
};
e.0.set(key, value)
}
+}
-/// Initializes a new `TableOptions` from a hash map of string settings.
+#[derive(Debug, Clone)]
+#[allow(clippy::large_enum_variant)]
+pub enum TableFormatOptions {
+Csv {
+options: CsvOptions,
+extensions: Extensions,
+},
+#[cfg(feature = "parquet")]
+Parquet {
+options: TableParquetOptions,
+extensions: Extensions,
+},
+Json {
+options: JsonOptions,
+extensions: Extensions,
+},
+}
+
+impl ConfigField for TableFormatOptions {
+/// Visits configuration settings for the current file format, or all
formats if none is selected.
+///
+/// This method adapts the behavior based on whether a file format is
currently selected in `current_format`.
+/// If a format is selected, it visits only the settings relevant to that
format. Otherwise,
+/// it visits all available format settings.
+fn visit(&self, v: &mut V, _key_prefix: &str, _description:
&'static str) {
+match self {
+#[cfg(feature = "parquet")]
+TableFormatOptions::Parquet { options, .. } => {
+options.visit(v, "format", "");
+}
+TableFormatOptions::Csv { options, .. } => {
+options.visit(v, "format", "");
+}
+TableFormatOptions::Json { options, .. } => {
+options.visit(v, "format", "");
+}
+}
+}
+
+/// Sets a configuration value for a specific key within `TableOptions`.
+///
+/// This method delegates setting configuration values to the specific
file format configurations,
+/// based on the current format selected. If no format is selected, it
returns an error.
///
/// # Parameters
///
-/// * `settings`: A hash map where each key-value pair represents a
configuration setting.
+/// * `key`: The configuration key specifying which setting to adjust,
prefixed with the format (e.g., "format.delimiter")
+/// for CSV format.
+/// * `value`: The value to set for the specified configuration key.
///
/// # Returns
///
-/// A result containing the new `TableOptions` instance or an error if any
setting could not be applied.
-pub fn from_string_hash_map(settings: &HashMap) ->
Result {
-let mut ret = Self::default();
-for (k, v) in settings {
-ret.set(k, v)?;
+///
Re: [PR] Set Formatted TableOptions Enum [datafusion]
alamb commented on PR #16166: URL: https://github.com/apache/datafusion/pull/16166#issuecomment-2921854046 > > However, I didn't see any tests or examples that show a different behavior after this PR. > > In my initial quick read through it I would say it has made the APIs harder to use (as now there are format specific options on `TableOptions` AND a new `TableFormatOptions`). Is there any way to consolidate the number of structures? If not, I think we should clearly document the difference between the two structures > > The confusion we are trying to address is that the TableOptions struct is currently being used in two different ways: > > 1. As a sub-configuration field in SessionState, similar to other configuration fields. > 2. For directly configuring the specific file currently being worked on, typically by cloning it from the SessionState and overriding fields as needed., Yes I agree this is confusing > In other words, it seems like we actually need two distinct objects for these two different use cases: > > * One for storing global TableOptions configuration across the session, applicable to all file types. > * Another for holding format-specific options derived from the global configuration, intended for the currently active file. > Another key point is that once you set a specific file format, you don’t need the configuration fields for the others in the existing TableOptions. So for the second use case, the appropriate design might be an enum rather than a struct. > > In summary, this PR is more of a code cleanup than a functional change. It aims to clarify the separation of concerns and improve maintainability, in my opinion. I see. Thank you. I think we could largely resolve the confusion with some better documentation (perhaps documentation that explained the differences clearly so that when someone stumbled on the very similarly named `TableOptions` and `TableFormatOptions` there would be some context to help them understand Here are some other ideas: ## Different Names Use names that make the difference between the structs clearer. Perhaps something like the following: * `DefaultTableOptions` for the structure on `ConfigOptions` (currently called `TableOptions`) * `ResolvedTableOptions` or `SpecificTableOptions` for configuring a file specific option (called `TableFormatOptions` in this PR) ## Use the inner structures directly Once we know the format, could we potentially then change the code to use the `ParquetOptions` or `CsvOptions` directly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Set Formatted TableOptions Enum [datafusion]
berkaysynnada commented on code in PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#discussion_r2113297575
##
datafusion/datasource/src/file_format.rs:
##
@@ -120,7 +121,26 @@ pub trait FileFormatFactory: Sync + Send + GetExt +
fmt::Debug {
&self,
state: &dyn Session,
format_options: &HashMap,
-) -> Result>;
+) -> Result> {
+let options = match self.options() {
+(None, file_type) => {
+let mut table_options = state.file_table_options(file_type);
+table_options.alter_with_string_hash_map(format_options)?;
+table_options.output_format()
+}
+(Some(options), _) => {
+let mut table_options =
TableFormatOptions::from_output_format(options);
+table_options.alter_with_string_hash_map(format_options)?;
+table_options.output_format()
+}
+};
+
+Ok(self.default_from_output_format(options))
+}
+
+fn default_from_output_format(&self, options: OutputFormat) -> Arc;
+
+fn options(&self) -> (Option, ConfigFileType);
Review Comment:
Yeah, this confuses me now too. Having ConfigFileType and OutputFormat at
the same time is also a bit weird. I'll take a second look at this part
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Set Formatted TableOptions Enum [datafusion]
berkaysynnada commented on code in PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#discussion_r2113294832
##
datafusion/datasource/src/file_format.rs:
##
@@ -120,7 +121,26 @@ pub trait FileFormatFactory: Sync + Send + GetExt +
fmt::Debug {
&self,
state: &dyn Session,
format_options: &HashMap,
-) -> Result>;
+) -> Result> {
+let options = match self.options() {
Review Comment:
They should implement `default_from_output_format` and `options`. This
change is not directly related to the actual changes of the PR, but all
create() methods show a similar pattern, so I remember we thought like why not
avoid duplication
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Set Formatted TableOptions Enum [datafusion]
berkaysynnada commented on code in PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#discussion_r2113286434
##
datafusion/common/src/config.rs:
##
@@ -1612,42 +1623,241 @@ impl TableOptions {
};
e.0.set(key, value)
}
+}
-/// Initializes a new `TableOptions` from a hash map of string settings.
+#[derive(Debug, Clone)]
+#[allow(clippy::large_enum_variant)]
+pub enum TableFormatOptions {
Review Comment:
I will resolve these if I can relieve your concerns :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Set Formatted TableOptions Enum [datafusion]
berkaysynnada commented on PR #16166: URL: https://github.com/apache/datafusion/pull/16166#issuecomment-2918419765 > Thank you for this contribution @berkaysynnada and @mertak-synnada > > I am a little confused about the new structure and exactly what problem is being solved with this change. > > The PR description says > > > In the current implementation, the TableOptions struct stores information for CSV, JSON, and Parquet file types, and its behavior changes depending on whether the file format is set before or after. > > However, I didn't see any tests or examples that show a different behavior after this PR. > > In my initial quick read through it I would say it has made the APIs harder to use (as now there are format specific options on `TableOptions` AND a new `TableFormatOptions`). Is there any way to consolidate the number of structures? If not, I think we should clearly document the difference between the two structures The confusion we are trying to address is that the TableOptions struct is currently being used in two different ways: 1) As a sub-configuration field in SessionState, similar to other configuration fields. 2) For directly configuring the specific file currently being worked on, typically by cloning it from the SessionState and overriding fields as needed., In other words, it seems like we actually need two distinct objects for these two different use cases: - One for storing global TableOptions configuration across the session, applicable to all file types. - Another for holding format-specific options derived from the global configuration, intended for the currently active file. Another key point is that once you set a specific file format, you don’t need the configuration fields for the others in the existing TableOptions. So for the second use case, the appropriate design might be an enum rather than a struct. In summary, this PR is more of a code cleanup than a functional change. It aims to clarify the separation of concerns and improve maintainability, in my opinion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Set Formatted TableOptions Enum [datafusion]
alamb commented on code in PR #16166:
URL: https://github.com/apache/datafusion/pull/16166#discussion_r2112474913
##
datafusion/datasource/src/file_format.rs:
##
@@ -120,7 +121,26 @@ pub trait FileFormatFactory: Sync + Send + GetExt +
fmt::Debug {
&self,
state: &dyn Session,
format_options: &HashMap,
-) -> Result>;
+) -> Result> {
+let options = match self.options() {
+(None, file_type) => {
+let mut table_options = state.file_table_options(file_type);
+table_options.alter_with_string_hash_map(format_options)?;
+table_options.output_format()
+}
+(Some(options), _) => {
+let mut table_options =
TableFormatOptions::from_output_format(options);
+table_options.alter_with_string_hash_map(format_options)?;
+table_options.output_format()
+}
+};
+
+Ok(self.default_from_output_format(options))
+}
+
+fn default_from_output_format(&self, options: OutputFormat) -> Arc;
Review Comment:
Can we please document these new public APIs?
##
datafusion/common/src/config.rs:
##
@@ -1612,42 +1623,241 @@ impl TableOptions {
};
e.0.set(key, value)
}
+}
-/// Initializes a new `TableOptions` from a hash map of string settings.
+#[derive(Debug, Clone)]
+#[allow(clippy::large_enum_variant)]
+pub enum TableFormatOptions {
Review Comment:
Since this is a new pub struct, can we please add documentation explaining
1. what it is used for
2. How it is different than `TableOptions`?
##
datafusion/datasource/src/file_format.rs:
##
@@ -120,7 +121,26 @@ pub trait FileFormatFactory: Sync + Send + GetExt +
fmt::Debug {
&self,
state: &dyn Session,
format_options: &HashMap,
-) -> Result>;
+) -> Result> {
+let options = match self.options() {
Review Comment:
since this method is part of a trait, I think it can be overridden by other
implementations. Is that the intention?
In other words, do you expect anyone implementing FileFormatFactory to
implement `create()` or should they now implement `default_from_output_format`
and `options`?
##
datafusion/common/src/config.rs:
##
@@ -1612,42 +1623,241 @@ impl TableOptions {
};
e.0.set(key, value)
}
+}
-/// Initializes a new `TableOptions` from a hash map of string settings.
+#[derive(Debug, Clone)]
+#[allow(clippy::large_enum_variant)]
+pub enum TableFormatOptions {
+Csv {
+options: CsvOptions,
+extensions: Extensions,
+},
+#[cfg(feature = "parquet")]
+Parquet {
+options: TableParquetOptions,
+extensions: Extensions,
+},
+Json {
+options: JsonOptions,
+extensions: Extensions,
+},
+}
+
+impl ConfigField for TableFormatOptions {
+/// Visits configuration settings for the current file format, or all
formats if none is selected.
+///
+/// This method adapts the behavior based on whether a file format is
currently selected in `current_format`.
+/// If a format is selected, it visits only the settings relevant to that
format. Otherwise,
+/// it visits all available format settings.
+fn visit(&self, v: &mut V, _key_prefix: &str, _description:
&'static str) {
+match self {
+#[cfg(feature = "parquet")]
+TableFormatOptions::Parquet { options, .. } => {
+options.visit(v, "format", "");
+}
+TableFormatOptions::Csv { options, .. } => {
+options.visit(v, "format", "");
+}
+TableFormatOptions::Json { options, .. } => {
+options.visit(v, "format", "");
+}
+}
+}
+
+/// Sets a configuration value for a specific key within `TableOptions`.
Review Comment:
this is in `TableFormatOptions` I think
##
datafusion/common/src/config.rs:
##
@@ -1612,42 +1623,241 @@ impl TableOptions {
};
e.0.set(key, value)
}
+}
-/// Initializes a new `TableOptions` from a hash map of string settings.
+#[derive(Debug, Clone)]
+#[allow(clippy::large_enum_variant)]
+pub enum TableFormatOptions {
+Csv {
+options: CsvOptions,
+extensions: Extensions,
+},
+#[cfg(feature = "parquet")]
+Parquet {
+options: TableParquetOptions,
+extensions: Extensions,
+},
+Json {
+options: JsonOptions,
+extensions: Extensions,
+},
+}
+
+impl ConfigField for TableFormatOptions {
+/// Visits configuration settings for the current file format, or all
formats if none is selected.
+///
+/// This method adapts the behavior based on whether a file format is
currently selected in `current_format`.
+/// If a format is selected, it visits only the settings relevan
[PR] Set Formatted TableOptions Enum [datafusion]
mertak-synnada opened a new pull request, #16166: URL: https://github.com/apache/datafusion/pull/16166 ## Which issue does this PR close? - Closes #. In the current implementation, the TableOptions struct stores information for CSV, JSON, and Parquet file types, and its behavior changes depending on whether the file format is set before or after. This PR introduces a new struct for file-type-specific cases, while keeping TableOptions as the default state, since it is also used in the session before a file type is specified. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
