Re: [PR] Minor: Csv Options Clean-up [datafusion]
berkaysynnada commented on PR #10650: URL: https://github.com/apache/datafusion/pull/10650#issuecomment-2133239672 After talking with @ozankabak, we decided to undo the changes related to header options. There's also another place where `SessionState` is used in the `FileFormat` API: https://github.com/apache/datafusion/blob/ae107548bdccbb356b75f99edde3ce296532396e/datafusion/core/src/datasource/file_format/parquet.rs#L192 Suggesting that people should extract `SessionState` values during `TableProvider` creation might not be safe. I'm not sure if removing `SessionState` from this API is the right move, but it definitely needs more thought. Rushing to clean up the code could lead to losing functionality. We should discuss these in a different issue. For now, this PR only makes a small fix to `CastExpr`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Csv Options Clean-up [datafusion]
berkaysynnada commented on PR #10650: URL: https://github.com/apache/datafusion/pull/10650#issuecomment-2132854613 > The upside of the approach in this PR is that it simplifies the code across multiple files (e.g. proto), but the fallback behavior no longer becomes automatic for all custom table providers as far as I can tell -- @berkaysynnada please correct me if I'm wrong. This explanation summarizes it very well. Currently, the extraction of catalog information is encapsulated in the `create()` method of `TableProviderFactory`. It seems tidier to have this extraction centralized in one spot (or we could move it to another recommended location if there is any) rather than having it scattered. However, as @ozankabak mentioned, a custom `TableProvider` should also handle this. I can add an explanation for this in the docstring of the trait method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Csv Options Clean-up [datafusion]
berkaysynnada commented on code in PR #10650: URL: https://github.com/apache/datafusion/pull/10650#discussion_r1615535832 ## datafusion/core/src/datasource/file_format/csv.rs: ## @@ -301,13 +296,7 @@ impl CsvFormat { while let Some(chunk) = stream.next().await.transpose()? { let format = arrow::csv::reader::Format::default() -.with_header( -first_chunk -&& self -.options -.has_header - .unwrap_or(state.config_options().catalog.has_header), -) +.with_header(first_chunk && self.options.has_header) Review Comment: Thanks @metegenez for the review :) I have tried to extract the option from the catalog in the most inclusive way possible. This step seems to be when we create a `TableProvider`. Another alternative would be to consult the catalog wherever we read the header from a `FileFormat`. However, doing this everywhere makes the code cluttered (in many places, there isn't even a catalog object available). Therefore, I think that if we check this once during the execution, where `FileFormat` is first introduced—that is, at the part where `TableProvider` is created—we won't need to check it again. Are there any use cases for creating and using `FileFormat` independently of any `TableProvider`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Csv Options Clean-up [datafusion]
alamb commented on PR #10650: URL: https://github.com/apache/datafusion/pull/10650#issuecomment-2131235103 > Maybe we can do the following: Instead of making every TableProviderFactory responsible for implementing this behavior in create, we can add some general mechanism at the trait level (maybe via a default method implementation?) to make this fallback automatic for all TableProviderFactory implementations unless overridden. @alamb, what do you think? Any ideas come to mind? Thank you for the explanation -- what you describe makes sense -- maybe we can update the documentation to add some of this context and that would be good enough. I haven't thought through all the nuances of general fallbacks, but given what you say perhaps it is a corner case we can worry about if/when somone actually hits it. From my perspective the most important thing is for the behavior to be clearly documented. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Csv Options Clean-up [datafusion]
ozankabak commented on PR #10650: URL: https://github.com/apache/datafusion/pull/10650#issuecomment-2131118238 Maybe I can help clarify things a little bit. The purpose of the PR is not to change the current fallback behavior, just simplify its implementation. AFAICT @berkaysynnada is moving the fallback logic to `TableProviderFactory`'s `create` (from `FileFormat`'s `create_physical_plan`), which enables him to simplify the rest of the code. I see that he implemented this for `ListingTableFactory`. The proto change is a consequence of the above -- it can be made independently AFAICT. The upside of the approach in this PR is that it simplifies the code across multiple files (e.g. proto), but the fallback behavior no longer becomes automatic for all custom table providers as far as I can tell -- @berkaysynnada please correct me if I'm wrong. Maybe we can do the following: Instead of making every `TableProviderFactory` responsible for implementing this behavior in `create`, we can add some general mechanism at the trait level (maybe via a default method implementation?) to make this fallback automatic for all `TableProviderFactory` implementations unless overridden. @alamb, what do you think? Any ideas come to mind? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Csv Options Clean-up [datafusion]
ozankabak commented on code in PR #10650: URL: https://github.com/apache/datafusion/pull/10650#discussion_r1614446423 ## datafusion/physical-expr/src/expressions/cast.rs: ## @@ -170,7 +170,8 @@ impl PhysicalExpr for CastExpr { let target_type = _type; let unbounded = Interval::make_unbounded(target_type)?; -if source_datatype.is_numeric() && target_type.is_numeric() +if (source_datatype.is_numeric() || source_datatype == Boolean) Review Comment: @berkaysynnada let's add a test for this if there is none -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Csv Options Clean-up [datafusion]
alamb commented on code in PR #10650: URL: https://github.com/apache/datafusion/pull/10650#discussion_r1614042615 ## datafusion/functions/Cargo.toml: ## @@ -80,7 +80,7 @@ itertools = { workspace = true } log = { workspace = true } md-5 = { version = "^0.10.0", optional = true } rand = { workspace = true } -regex = { worksapce = true, optional = true } +regex = { workspace = true, optional = true } Review Comment: Thank you -- that is a nice fix -- it would be easier to merge as a separate PR ## datafusion/common/src/config.rs: ## @@ -1555,7 +1555,7 @@ config_namespace! { /// Specifies whether there is a CSV header (i.e. the first line /// consists of is column names). The value `None` indicates that /// the configuration should be consulted. -pub has_header: Option, default = None + pub has_header: bool, default = false Review Comment: I don't understand the implication of this change and there are no tests that cover it. Can you please at least update the documentation to reflect the new behavior (it can no longer be None) Also there appears to be some strange indenting going on ## datafusion/physical-expr/src/expressions/cast.rs: ## @@ -170,7 +170,8 @@ impl PhysicalExpr for CastExpr { let target_type = _type; let unbounded = Interval::make_unbounded(target_type)?; -if source_datatype.is_numeric() && target_type.is_numeric() +if (source_datatype.is_numeric() || source_datatype == Boolean) Review Comment: I don't see any test coverage for this change ## datafusion/proto/proto/datafusion.proto: ## @@ -1106,7 +1106,7 @@ message CsvWriterOptions { // Options controlling CSV format message CsvOptions { - bytes has_header = 1; // Indicates if the CSV has a header row + bool has_header = 1; // Indicates if the CSV has a header row Review Comment: that is certainly nicer -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Csv Options Clean-up [datafusion]
metegenez commented on code in PR #10650: URL: https://github.com/apache/datafusion/pull/10650#discussion_r1613754503 ## datafusion/core/src/datasource/file_format/csv.rs: ## @@ -301,13 +296,7 @@ impl CsvFormat { while let Some(chunk) = stream.next().await.transpose()? { let format = arrow::csv::reader::Format::default() -.with_header( -first_chunk -&& self -.options -.has_header - .unwrap_or(state.config_options().catalog.has_header), -) +.with_header(first_chunk && self.options.has_header) Review Comment: What if there is a catalog option on header? This might not be tested in the code since the SQL tests wouldn't cover here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Minor: Csv Options Clean-up [datafusion]
berkaysynnada opened a new pull request, #10650: URL: https://github.com/apache/datafusion/pull/10650 ## Which issue does this PR close? Closes #. ## Rationale for this change When CSV header option is not specified from the options clause, it is set from the session config. This behavior is now more understandable, and the Option wrap has been removed. Additionally, there is a minor fix in the `CastExpr` ordering. Boolean to numeric cast needs to preserve ordering. There is also a typo in the functions cargo.toml, which is fixed. ## What changes are included in this PR? * Removed Option wrap when CSV header options are not specified * Fixed `CastExpr` ordering for Boolean to numeric cast * Corrected a typo in functions cargo.toml ## Are these changes tested? with existing tests ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org