Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-08-22 Thread via GitHub


github-actions[bot] closed pull request #16332: Add support for glob string in 
datafusion-cli query
URL: https://github.com/apache/datafusion/pull/16332


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-08-13 Thread via GitHub


github-actions[bot] commented on PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#issuecomment-3186505480

   Thank you for your contribution. Unfortunately, this pull request is stale 
because it has been open 60 days with no activity. Please remove the stale 
label or comment or this will be closed in 7 days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-13 Thread via GitHub


alamb commented on PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#issuecomment-2970544930

   > rather than a function for each file type (read_parquet, read_csv, etc). 
You are correct that the latter is more common so will for this.
   
   I think the reason that DuckDB et al use a function for each file type is 
that it simplifies option handling (there are many options that are better 
suited for parquet that are not csv)
   
   That being said, adding a function like `read_file(..)` or `read_data(...)` 
that handled all file types might be a reasonable thing to do in 
`datafusion-cli` as then you could probably reuse most of the ListingTable  code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-10 Thread via GitHub


a-agmon commented on PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#issuecomment-2961294401

   @alamb  - thank you very much for the generous comments. I appreciate it. 
   Re naming - I completely agree. Was just wondering whether its better to 
introduce one function that infer the file type (like `read()` or `glob()`) 
rather than a function for each file type (`read_parquet`, `read_csv`, etc). 
You are correct that the latter is more common so will for this. 
   Re the other comments - will review and handle. Thanks.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-10 Thread via GitHub


alamb commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2139195148


##
datafusion-cli/src/functions.rs:
##
@@ -460,3 +473,92 @@ impl TableFunctionImpl for ParquetMetadataFunc {
 Ok(Arc::new(parquet_metadata))
 }
 }
+
+/// A table function that allows users to query files using glob patterns
+/// for example: SELECT * FROM glob('path/to/*/file.parquet')
+pub struct GlobFunc {
+// we need the ctx here to get the schema from the listing table later
+ctx: SessionContext,
+}
+
+impl std::fmt::Debug for GlobFunc {
+fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+f.debug_struct("GlobFunc")
+.field("ctx", &"")
+.finish()
+}
+}
+
+impl GlobFunc {
+/// Create a new GlobFunc
+pub fn new(ctx: SessionContext) -> Self {
+Self { ctx }
+}
+}
+
+fn as_utf8_literal<'a>(expr: &'a Expr, arg_name: &str) -> Result<&'a str> {
+match expr {
+Expr::Literal(ScalarValue::Utf8(Some(s)), _) => Ok(s),

Review Comment:
   Minor: Maybe thus could use he `try_as_str` function (which would also 
handle other literal types) 
https://docs.rs/datafusion/latest/datafusion/scalar/enum.ScalarValue.html#method.try_as_str
   
   



##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM 
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM 
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;

Review Comment:
   I think another usecase that @robtandy  had was "a list of multiple files" 
-- like is there some way to select exactly two files? Something like
   
   ```sql
   glob('[../datafusion/core/tests/data/cars.csv',  
'../datafusion/core/tests/data/trucks.csv', ])
   ```
   
   Perhaps 🤔 



##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM 
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM 
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format 
specification
+SELECT COUNT(*) AS json_count FROM 
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM 
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM 
glob('../datafusion/core/tests/data/exa*.csv'); 

Review Comment:
   Another possibility would be to intercept the `CREATE EXTERNAL TABLE` 
command  in `datafusion-cli` itself
   
   For example, simliarly to how it peeks here:
   
   
https://github.com/apache/datafusion/blob/1d61f31d121632ca27c77c472cae7d604e9aa9d7/datafusion-cli/src/exec.rs#L357-L367
   
   We could implement a special handler in datafusion-cli rather than use the 
default one in SessionContext: 
https://github.com/apache/datafusion/blob/1d61f31d121632ca27c77c472cae7d604e9aa9d7/datafusion/core/src/execution/context/mod.rs#L669-L672
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub


a-agmon commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134804642


##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM 
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM 
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format 
specification
+SELECT COUNT(*) AS json_count FROM 
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM 
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM 
glob('../datafusion/core/tests/data/exa*.csv'); 

Review Comment:
   We can use the current model, but I think it will require touching some core 
modules within datafusion.  
   
   re the second question, I think that supporting multiple file types should 
not be supported.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub


a-agmon commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134812154


##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM 
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM 
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format 
specification
+SELECT COUNT(*) AS json_count FROM 
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM 
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM 
glob('../datafusion/core/tests/data/exa*.csv'); 

Review Comment:
   By the way, the current implementation supports the following - but just for 
local files (as it uses `::parse()`) 
   ```
   CREATE EXTERNAL TABLE logs 
   STORED AS CSV 
   LOCATION '/data/*_small.csv';
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub


a-agmon commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134812154


##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM 
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM 
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format 
specification
+SELECT COUNT(*) AS json_count FROM 
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM 
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM 
glob('../datafusion/core/tests/data/exa*.csv'); 

Review Comment:
   this also currently supported - but just for local files
   ```
   CREATE EXTERNAL TABLE logs 
   STORED AS CSV 
   LOCATION '/data/*_small.csv';
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub


a-agmon commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134804642


##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM 
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM 
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format 
specification
+SELECT COUNT(*) AS json_count FROM 
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM 
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM 
glob('../datafusion/core/tests/data/exa*.csv'); 

Review Comment:
   I think we can use the current model, but I think it will require touching 
some datafusion core modules. 
   
   re the second question, I think that supporting multiple file types should 
not be supported.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub


comphead commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134795185


##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM 
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM 
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format 
specification
+SELECT COUNT(*) AS json_count FROM 
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM 
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM 
glob('../datafusion/core/tests/data/exa*.csv'); 

Review Comment:
   should we introduce a new function? can we reuse current model? 
   
   what should be the behavior if there are mixed CSV/JSON/Parquet files in the 
folder?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub


comphead commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134794430


##
datafusion-cli/src/functions.rs:
##
@@ -460,3 +473,94 @@ impl TableFunctionImpl for ParquetMetadataFunc {
 Ok(Arc::new(parquet_metadata))
 }
 }
+
+/// A table function that allows users to query files using glob patterns
+/// for example: SELECT * FROM glob('path/to/*/file.parquet')
+pub struct GlobFunc {
+// we need the ctx here to get the schema from the listing table later
+ctx: SessionContext,
+}
+
+impl std::fmt::Debug for GlobFunc {
+fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+f.debug_struct("GlobFunc")
+.field("ctx", &"")
+.finish()
+}
+}
+
+impl GlobFunc {
+/// Create a new GlobFunc
+pub fn new(ctx: SessionContext) -> Self {
+Self { ctx }
+}
+}
+
+fn as_utf8_literal<'a>(expr: &'a Expr, arg_name: &str) -> Result<&'a str> {
+match expr {
+Expr::Literal(ScalarValue::Utf8(Some(s)), _) => Ok(s),
+Expr::Column(Column { name, .. }) => Ok(name),
+_ => plan_err!("glob() requires a string literal for the '{arg_name}' 
argument"),
+}
+}
+
+impl TableFunctionImpl for GlobFunc {
+fn call(&self, exprs: &[Expr]) -> Result> {
+// Parse arguments
+if exprs.is_empty() {
+return plan_err!("glob() requires a glob pattern");
+}
+let pattern = as_utf8_literal(&exprs[0], "pattern")?;
+let format = if exprs.len() > 1 {
+Some(as_utf8_literal(&exprs[1], "format")?)
+} else {
+None
+};
+
+// Create ListingTableUrl - distinguish between URLs with schemes and 
local paths
+let url = if pattern.contains("://") && pattern.contains(['*', '?', 
'[']) {
+// URL with scheme and glob - split manually to avoid URL encoding 
of glob chars
+let glob_pos = pattern.find(['*', '?', '[']).unwrap(); // we 
already checked it exists
+let split_pos = pattern[..glob_pos].rfind('/').unwrap() + 1; // 
find last '/' before glob
+let (base_path, glob_part) = pattern.split_at(split_pos);
+
+let base_url = Url::parse(&format!("{}/", 
base_path.trim_end_matches('/')))
+.map_err(|e| {
+DataFusionError::Plan(format!("Invalid base URL: {}", e))
+})?;
+let glob = Pattern::new(glob_part).map_err(|e| {
+DataFusionError::Plan(format!("Invalid glob pattern: {}", e))

Review Comment:
   ```suggestion
   plan_datafusion_err!("Invalid glob pattern: {}", e))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub


comphead commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134794387


##
datafusion-cli/src/functions.rs:
##
@@ -460,3 +473,94 @@ impl TableFunctionImpl for ParquetMetadataFunc {
 Ok(Arc::new(parquet_metadata))
 }
 }
+
+/// A table function that allows users to query files using glob patterns
+/// for example: SELECT * FROM glob('path/to/*/file.parquet')
+pub struct GlobFunc {
+// we need the ctx here to get the schema from the listing table later
+ctx: SessionContext,
+}
+
+impl std::fmt::Debug for GlobFunc {
+fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+f.debug_struct("GlobFunc")
+.field("ctx", &"")
+.finish()
+}
+}
+
+impl GlobFunc {
+/// Create a new GlobFunc
+pub fn new(ctx: SessionContext) -> Self {
+Self { ctx }
+}
+}
+
+fn as_utf8_literal<'a>(expr: &'a Expr, arg_name: &str) -> Result<&'a str> {
+match expr {
+Expr::Literal(ScalarValue::Utf8(Some(s)), _) => Ok(s),
+Expr::Column(Column { name, .. }) => Ok(name),
+_ => plan_err!("glob() requires a string literal for the '{arg_name}' 
argument"),
+}
+}
+
+impl TableFunctionImpl for GlobFunc {
+fn call(&self, exprs: &[Expr]) -> Result> {
+// Parse arguments
+if exprs.is_empty() {
+return plan_err!("glob() requires a glob pattern");
+}
+let pattern = as_utf8_literal(&exprs[0], "pattern")?;
+let format = if exprs.len() > 1 {
+Some(as_utf8_literal(&exprs[1], "format")?)
+} else {
+None
+};
+
+// Create ListingTableUrl - distinguish between URLs with schemes and 
local paths
+let url = if pattern.contains("://") && pattern.contains(['*', '?', 
'[']) {
+// URL with scheme and glob - split manually to avoid URL encoding 
of glob chars
+let glob_pos = pattern.find(['*', '?', '[']).unwrap(); // we 
already checked it exists
+let split_pos = pattern[..glob_pos].rfind('/').unwrap() + 1; // 
find last '/' before glob
+let (base_path, glob_part) = pattern.split_at(split_pos);
+
+let base_url = Url::parse(&format!("{}/", 
base_path.trim_end_matches('/')))
+.map_err(|e| {
+DataFusionError::Plan(format!("Invalid base URL: {}", e))

Review Comment:
   plan_df_error!
   ```suggestion
   plan_datafusion_err!("Invalid base URL: {}", e))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]