Re: [PR] Minor: Csv Options Clean-up [datafusion]

2024-05-27 Thread via GitHub


berkaysynnada commented on PR #10650:
URL: https://github.com/apache/datafusion/pull/10650#issuecomment-2133239672

   After talking with @ozankabak, we decided to undo the changes related to 
header options. 
   
   There's also another place where `SessionState` is used in the `FileFormat` 
API: 
https://github.com/apache/datafusion/blob/ae107548bdccbb356b75f99edde3ce296532396e/datafusion/core/src/datasource/file_format/parquet.rs#L192
 Suggesting that people should extract `SessionState` values during 
`TableProvider` creation might not be safe. I'm not sure if removing 
`SessionState` from this API is the right move, but it definitely needs more 
thought. Rushing to clean up the code could lead to losing functionality. We 
should discuss these in a different issue.
   
   For now, this PR only makes a small fix to `CastExpr`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Csv Options Clean-up [datafusion]

2024-05-27 Thread via GitHub


berkaysynnada commented on PR #10650:
URL: https://github.com/apache/datafusion/pull/10650#issuecomment-2132854613

   > The upside of the approach in this PR is that it simplifies the code 
across multiple files (e.g. proto), but the fallback behavior no longer becomes 
automatic for all custom table providers as far as I can tell -- @berkaysynnada 
please correct me if I'm wrong.
   
   This explanation summarizes it very well. Currently, the extraction of 
catalog information is encapsulated in the `create()` method of 
`TableProviderFactory`. It seems tidier to have this extraction centralized in 
one spot (or we could move it to another recommended location if there is any) 
rather than having it scattered. However, as @ozankabak mentioned, a custom 
`TableProvider` should also handle this. I can add an explanation for this in 
the docstring of the trait method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Csv Options Clean-up [datafusion]

2024-05-27 Thread via GitHub


berkaysynnada commented on code in PR #10650:
URL: https://github.com/apache/datafusion/pull/10650#discussion_r1615535832


##
datafusion/core/src/datasource/file_format/csv.rs:
##
@@ -301,13 +296,7 @@ impl CsvFormat {
 
 while let Some(chunk) = stream.next().await.transpose()? {
 let format = arrow::csv::reader::Format::default()
-.with_header(
-first_chunk
-&& self
-.options
-.has_header
-
.unwrap_or(state.config_options().catalog.has_header),
-)
+.with_header(first_chunk && self.options.has_header)

Review Comment:
   Thanks @metegenez for the review :) I have tried to extract the option from 
the catalog in the most inclusive way possible. This step seems to be when we 
create a `TableProvider`.
   
   Another alternative would be to consult the catalog wherever we read the 
header from a `FileFormat`. However, doing this everywhere makes the code 
cluttered (in many places, there isn't even a catalog object available). 
Therefore, I think that if we check this once during the execution, where 
`FileFormat` is first introduced—that is, at the part where `TableProvider` is 
created—we won't need to check it again. Are there any use cases for creating 
and using `FileFormat` independently of any `TableProvider`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Csv Options Clean-up [datafusion]

2024-05-25 Thread via GitHub


alamb commented on PR #10650:
URL: https://github.com/apache/datafusion/pull/10650#issuecomment-2131235103

   > Maybe we can do the following: Instead of making every 
TableProviderFactory responsible for implementing this behavior in create, we 
can add some general mechanism at the trait level (maybe via a default method 
implementation?) to make this fallback automatic for all TableProviderFactory 
implementations unless overridden. @alamb, what do you think? Any ideas come to 
mind?
   
   Thank you for the explanation -- what you describe makes sense -- maybe we 
can update the documentation to add some of this context and that would be good 
enough. I haven't thought through all the nuances of general fallbacks, but 
given what you say perhaps it is a corner case we can worry about if/when 
somone actually hits it.
   
   From my perspective the most important thing is for the behavior to be 
clearly documented. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Csv Options Clean-up [datafusion]

2024-05-25 Thread via GitHub


ozankabak commented on PR #10650:
URL: https://github.com/apache/datafusion/pull/10650#issuecomment-2131118238

   Maybe I can help clarify things a little bit. The purpose of the PR is not 
to change the current fallback behavior, just simplify its implementation. 
AFAICT @berkaysynnada is moving the fallback logic to `TableProviderFactory`'s 
`create` (from `FileFormat`'s `create_physical_plan`), which enables him to 
simplify the rest of the code. I see that he implemented this for 
`ListingTableFactory`. The proto change is a consequence of the above -- it can 
be made independently AFAICT.
   
   The upside of the approach in this PR is that it simplifies the code across 
multiple files (e.g. proto), but the fallback behavior no longer becomes 
automatic for all custom table providers as far as I can tell -- @berkaysynnada 
 please correct me if I'm wrong.
   
   Maybe we can do the following: Instead of making every 
`TableProviderFactory` responsible for implementing this behavior in `create`, 
we can add some general mechanism at the trait level (maybe via a default 
method implementation?) to make this fallback automatic for all 
`TableProviderFactory` implementations unless overridden. @alamb, what do you 
think? Any ideas come to mind?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Csv Options Clean-up [datafusion]

2024-05-25 Thread via GitHub


ozankabak commented on code in PR #10650:
URL: https://github.com/apache/datafusion/pull/10650#discussion_r1614446423


##
datafusion/physical-expr/src/expressions/cast.rs:
##
@@ -170,7 +170,8 @@ impl PhysicalExpr for CastExpr {
 let target_type = _type;
 
 let unbounded = Interval::make_unbounded(target_type)?;
-if source_datatype.is_numeric() && target_type.is_numeric()
+if (source_datatype.is_numeric() || source_datatype == Boolean)

Review Comment:
   @berkaysynnada let's add a test for this if there is none



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Csv Options Clean-up [datafusion]

2024-05-24 Thread via GitHub


alamb commented on code in PR #10650:
URL: https://github.com/apache/datafusion/pull/10650#discussion_r1614042615


##
datafusion/functions/Cargo.toml:
##
@@ -80,7 +80,7 @@ itertools = { workspace = true }
 log = { workspace = true }
 md-5 = { version = "^0.10.0", optional = true }
 rand = { workspace = true }
-regex = { worksapce = true, optional = true }
+regex = { workspace = true, optional = true }

Review Comment:
   Thank you -- that is a nice fix -- it would be easier to merge as a separate 
PR



##
datafusion/common/src/config.rs:
##
@@ -1555,7 +1555,7 @@ config_namespace! {
 /// Specifies whether there is a CSV header (i.e. the first line
 /// consists of is column names). The value `None` indicates that
 /// the configuration should be consulted.
-pub has_header: Option, default = None
+  pub has_header: bool, default = false

Review Comment:
   I don't understand the implication of this change and there are no tests 
that cover it.
   
   Can you please at least update the documentation to reflect the new behavior 
(it can no longer be None)
   
   Also there appears to be some strange indenting going on



##
datafusion/physical-expr/src/expressions/cast.rs:
##
@@ -170,7 +170,8 @@ impl PhysicalExpr for CastExpr {
 let target_type = _type;
 
 let unbounded = Interval::make_unbounded(target_type)?;
-if source_datatype.is_numeric() && target_type.is_numeric()
+if (source_datatype.is_numeric() || source_datatype == Boolean)

Review Comment:
   I don't see any test coverage for this change



##
datafusion/proto/proto/datafusion.proto:
##
@@ -1106,7 +1106,7 @@ message CsvWriterOptions {
 
 // Options controlling CSV format
 message CsvOptions {
-  bytes has_header = 1; // Indicates if the CSV has a header row
+  bool has_header = 1; // Indicates if the CSV has a header row

Review Comment:
   that is certainly nicer



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Csv Options Clean-up [datafusion]

2024-05-24 Thread via GitHub


metegenez commented on code in PR #10650:
URL: https://github.com/apache/datafusion/pull/10650#discussion_r1613754503


##
datafusion/core/src/datasource/file_format/csv.rs:
##
@@ -301,13 +296,7 @@ impl CsvFormat {
 
 while let Some(chunk) = stream.next().await.transpose()? {
 let format = arrow::csv::reader::Format::default()
-.with_header(
-first_chunk
-&& self
-.options
-.has_header
-
.unwrap_or(state.config_options().catalog.has_header),
-)
+.with_header(first_chunk && self.options.has_header)

Review Comment:
   What if there is a catalog option on header? This might not be tested in the 
code since the SQL tests wouldn't cover here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Minor: Csv Options Clean-up [datafusion]

2024-05-24 Thread via GitHub


berkaysynnada opened a new pull request, #10650:
URL: https://github.com/apache/datafusion/pull/10650

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   When CSV header option is not specified from the options clause, it is set 
from the session config. This behavior is now more understandable, and the 
Option wrap has been removed.
   
   Additionally, there is a minor fix in the `CastExpr` ordering. Boolean to 
numeric cast needs to preserve ordering.
   
   There is also a typo in the functions cargo.toml, which is fixed.
   
   ## What changes are included in this PR?
   
   
   
   * Removed Option wrap when CSV header options are not specified
   * Fixed `CastExpr` ordering for Boolean to numeric cast
   * Corrected a typo in functions cargo.toml
   
   ## Are these changes tested?
   
   
   
   with existing tests
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org