Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
github-actions[bot] closed pull request #16332: Add support for glob string in datafusion-cli query URL: https://github.com/apache/datafusion/pull/16332 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
github-actions[bot] commented on PR #16332: URL: https://github.com/apache/datafusion/pull/16332#issuecomment-3186505480 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
alamb commented on PR #16332: URL: https://github.com/apache/datafusion/pull/16332#issuecomment-2970544930 > rather than a function for each file type (read_parquet, read_csv, etc). You are correct that the latter is more common so will for this. I think the reason that DuckDB et al use a function for each file type is that it simplifies option handling (there are many options that are better suited for parquet that are not csv) That being said, adding a function like `read_file(..)` or `read_data(...)` that handled all file types might be a reasonable thing to do in `datafusion-cli` as then you could probably reuse most of the ListingTable code -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
a-agmon commented on PR #16332: URL: https://github.com/apache/datafusion/pull/16332#issuecomment-2961294401 @alamb - thank you very much for the generous comments. I appreciate it. Re naming - I completely agree. Was just wondering whether its better to introduce one function that infer the file type (like `read()` or `glob()`) rather than a function for each file type (`read_parquet`, `read_csv`, etc). You are correct that the latter is more common so will for this. Re the other comments - will review and handle. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
alamb commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2139195148
##
datafusion-cli/src/functions.rs:
##
@@ -460,3 +473,92 @@ impl TableFunctionImpl for ParquetMetadataFunc {
Ok(Arc::new(parquet_metadata))
}
}
+
+/// A table function that allows users to query files using glob patterns
+/// for example: SELECT * FROM glob('path/to/*/file.parquet')
+pub struct GlobFunc {
+// we need the ctx here to get the schema from the listing table later
+ctx: SessionContext,
+}
+
+impl std::fmt::Debug for GlobFunc {
+fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+f.debug_struct("GlobFunc")
+.field("ctx", &"")
+.finish()
+}
+}
+
+impl GlobFunc {
+/// Create a new GlobFunc
+pub fn new(ctx: SessionContext) -> Self {
+Self { ctx }
+}
+}
+
+fn as_utf8_literal<'a>(expr: &'a Expr, arg_name: &str) -> Result<&'a str> {
+match expr {
+Expr::Literal(ScalarValue::Utf8(Some(s)), _) => Ok(s),
Review Comment:
Minor: Maybe thus could use he `try_as_str` function (which would also
handle other literal types)
https://docs.rs/datafusion/latest/datafusion/scalar/enum.ScalarValue.html#method.try_as_str
##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
Review Comment:
I think another usecase that @robtandy had was "a list of multiple files"
-- like is there some way to select exactly two files? Something like
```sql
glob('[../datafusion/core/tests/data/cars.csv',
'../datafusion/core/tests/data/trucks.csv', ])
```
Perhaps 🤔
##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format
specification
+SELECT COUNT(*) AS json_count FROM
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM
glob('../datafusion/core/tests/data/exa*.csv');
Review Comment:
Another possibility would be to intercept the `CREATE EXTERNAL TABLE`
command in `datafusion-cli` itself
For example, simliarly to how it peeks here:
https://github.com/apache/datafusion/blob/1d61f31d121632ca27c77c472cae7d604e9aa9d7/datafusion-cli/src/exec.rs#L357-L367
We could implement a special handler in datafusion-cli rather than use the
default one in SessionContext:
https://github.com/apache/datafusion/blob/1d61f31d121632ca27c77c472cae7d604e9aa9d7/datafusion/core/src/execution/context/mod.rs#L669-L672
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
a-agmon commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134804642
##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format
specification
+SELECT COUNT(*) AS json_count FROM
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM
glob('../datafusion/core/tests/data/exa*.csv');
Review Comment:
We can use the current model, but I think it will require touching some core
modules within datafusion.
re the second question, I think that supporting multiple file types should
not be supported.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
a-agmon commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134812154
##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format
specification
+SELECT COUNT(*) AS json_count FROM
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM
glob('../datafusion/core/tests/data/exa*.csv');
Review Comment:
By the way, the current implementation supports the following - but just for
local files (as it uses `::parse()`)
```
CREATE EXTERNAL TABLE logs
STORED AS CSV
LOCATION '/data/*_small.csv';
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
a-agmon commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134812154
##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format
specification
+SELECT COUNT(*) AS json_count FROM
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM
glob('../datafusion/core/tests/data/exa*.csv');
Review Comment:
this also currently supported - but just for local files
```
CREATE EXTERNAL TABLE logs
STORED AS CSV
LOCATION '/data/*_small.csv';
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
a-agmon commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134804642
##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format
specification
+SELECT COUNT(*) AS json_count FROM
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM
glob('../datafusion/core/tests/data/exa*.csv');
Review Comment:
I think we can use the current model, but I think it will require touching
some datafusion core modules.
re the second question, I think that supporting multiple file types should
not be supported.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
comphead commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134795185
##
datafusion-cli/tests/sql/integration/glob_test.sql:
##
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format
specification
+SELECT COUNT(*) AS json_count FROM
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM
glob('../datafusion/core/tests/data/exa*.csv');
Review Comment:
should we introduce a new function? can we reuse current model?
what should be the behavior if there are mixed CSV/JSON/Parquet files in the
folder?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
comphead commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134794430
##
datafusion-cli/src/functions.rs:
##
@@ -460,3 +473,94 @@ impl TableFunctionImpl for ParquetMetadataFunc {
Ok(Arc::new(parquet_metadata))
}
}
+
+/// A table function that allows users to query files using glob patterns
+/// for example: SELECT * FROM glob('path/to/*/file.parquet')
+pub struct GlobFunc {
+// we need the ctx here to get the schema from the listing table later
+ctx: SessionContext,
+}
+
+impl std::fmt::Debug for GlobFunc {
+fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+f.debug_struct("GlobFunc")
+.field("ctx", &"")
+.finish()
+}
+}
+
+impl GlobFunc {
+/// Create a new GlobFunc
+pub fn new(ctx: SessionContext) -> Self {
+Self { ctx }
+}
+}
+
+fn as_utf8_literal<'a>(expr: &'a Expr, arg_name: &str) -> Result<&'a str> {
+match expr {
+Expr::Literal(ScalarValue::Utf8(Some(s)), _) => Ok(s),
+Expr::Column(Column { name, .. }) => Ok(name),
+_ => plan_err!("glob() requires a string literal for the '{arg_name}'
argument"),
+}
+}
+
+impl TableFunctionImpl for GlobFunc {
+fn call(&self, exprs: &[Expr]) -> Result> {
+// Parse arguments
+if exprs.is_empty() {
+return plan_err!("glob() requires a glob pattern");
+}
+let pattern = as_utf8_literal(&exprs[0], "pattern")?;
+let format = if exprs.len() > 1 {
+Some(as_utf8_literal(&exprs[1], "format")?)
+} else {
+None
+};
+
+// Create ListingTableUrl - distinguish between URLs with schemes and
local paths
+let url = if pattern.contains("://") && pattern.contains(['*', '?',
'[']) {
+// URL with scheme and glob - split manually to avoid URL encoding
of glob chars
+let glob_pos = pattern.find(['*', '?', '[']).unwrap(); // we
already checked it exists
+let split_pos = pattern[..glob_pos].rfind('/').unwrap() + 1; //
find last '/' before glob
+let (base_path, glob_part) = pattern.split_at(split_pos);
+
+let base_url = Url::parse(&format!("{}/",
base_path.trim_end_matches('/')))
+.map_err(|e| {
+DataFusionError::Plan(format!("Invalid base URL: {}", e))
+})?;
+let glob = Pattern::new(glob_part).map_err(|e| {
+DataFusionError::Plan(format!("Invalid glob pattern: {}", e))
Review Comment:
```suggestion
plan_datafusion_err!("Invalid glob pattern: {}", e))
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add support for glob string in datafusion-cli query [datafusion]
comphead commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134794387
##
datafusion-cli/src/functions.rs:
##
@@ -460,3 +473,94 @@ impl TableFunctionImpl for ParquetMetadataFunc {
Ok(Arc::new(parquet_metadata))
}
}
+
+/// A table function that allows users to query files using glob patterns
+/// for example: SELECT * FROM glob('path/to/*/file.parquet')
+pub struct GlobFunc {
+// we need the ctx here to get the schema from the listing table later
+ctx: SessionContext,
+}
+
+impl std::fmt::Debug for GlobFunc {
+fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+f.debug_struct("GlobFunc")
+.field("ctx", &"")
+.finish()
+}
+}
+
+impl GlobFunc {
+/// Create a new GlobFunc
+pub fn new(ctx: SessionContext) -> Self {
+Self { ctx }
+}
+}
+
+fn as_utf8_literal<'a>(expr: &'a Expr, arg_name: &str) -> Result<&'a str> {
+match expr {
+Expr::Literal(ScalarValue::Utf8(Some(s)), _) => Ok(s),
+Expr::Column(Column { name, .. }) => Ok(name),
+_ => plan_err!("glob() requires a string literal for the '{arg_name}'
argument"),
+}
+}
+
+impl TableFunctionImpl for GlobFunc {
+fn call(&self, exprs: &[Expr]) -> Result> {
+// Parse arguments
+if exprs.is_empty() {
+return plan_err!("glob() requires a glob pattern");
+}
+let pattern = as_utf8_literal(&exprs[0], "pattern")?;
+let format = if exprs.len() > 1 {
+Some(as_utf8_literal(&exprs[1], "format")?)
+} else {
+None
+};
+
+// Create ListingTableUrl - distinguish between URLs with schemes and
local paths
+let url = if pattern.contains("://") && pattern.contains(['*', '?',
'[']) {
+// URL with scheme and glob - split manually to avoid URL encoding
of glob chars
+let glob_pos = pattern.find(['*', '?', '[']).unwrap(); // we
already checked it exists
+let split_pos = pattern[..glob_pos].rfind('/').unwrap() + 1; //
find last '/' before glob
+let (base_path, glob_part) = pattern.split_at(split_pos);
+
+let base_url = Url::parse(&format!("{}/",
base_path.trim_end_matches('/')))
+.map_err(|e| {
+DataFusionError::Plan(format!("Invalid base URL: {}", e))
Review Comment:
plan_df_error!
```suggestion
plan_datafusion_err!("Invalid base URL: {}", e))
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
