alamb commented on code in PR #16332:
URL: https://github.com/apache/datafusion/pull/16332#discussion_r2139195148
##########
datafusion-cli/src/functions.rs:
##########
@@ -460,3 +473,92 @@ impl TableFunctionImpl for ParquetMetadataFunc {
Ok(Arc::new(parquet_metadata))
}
}
+
+/// A table function that allows users to query files using glob patterns
+/// for example: SELECT * FROM glob('path/to/*/file.parquet')
+pub struct GlobFunc {
+ // we need the ctx here to get the schema from the listing table later
+ ctx: SessionContext,
+}
+
+impl std::fmt::Debug for GlobFunc {
+ fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+ f.debug_struct("GlobFunc")
+ .field("ctx", &"<SessionContext>")
+ .finish()
+ }
+}
+
+impl GlobFunc {
+ /// Create a new GlobFunc
+ pub fn new(ctx: SessionContext) -> Self {
+ Self { ctx }
+ }
+}
+
+fn as_utf8_literal<'a>(expr: &'a Expr, arg_name: &str) -> Result<&'a str> {
+ match expr {
+ Expr::Literal(ScalarValue::Utf8(Some(s)), _) => Ok(s),
Review Comment:
Minor: Maybe this could use the `try_as_str` function (which would also
handle other string literal types)
https://docs.rs/datafusion/latest/datafusion/scalar/enum.ScalarValue.html#method.try_as_str
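A rough sketch of what that could look like. To keep it runnable on its own, it uses a simplified stand-in `ScalarValue` enum rather than the real DataFusion type. It assumes that `try_as_str` returns `Option<Option<&str>>`: `None` for non-string values, `Some(None)` for a NULL string, and `Some(Some(s))` otherwise, which is how I read the linked docs. The helper name `as_str_literal` and the error messages are hypothetical.

```rust
// Simplified stand-in for datafusion's ScalarValue (assumption: only the
// variants needed for this illustration are modeled here).
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
    Int64(Option<i64>),
}

impl ScalarValue {
    // Stand-in mirroring the assumed shape of `try_as_str`:
    // None = not a string type, Some(None) = NULL string, Some(Some(s)) = value.
    fn try_as_str(&self) -> Option<Option<&str>> {
        match self {
            ScalarValue::Utf8(v)
            | ScalarValue::LargeUtf8(v)
            | ScalarValue::Utf8View(v) => Some(v.as_deref()),
            _ => None,
        }
    }
}

// Hypothetical replacement for the body of `as_utf8_literal`: accepts any
// string-typed literal instead of matching only on `Utf8`.
fn as_str_literal<'a>(value: &'a ScalarValue, arg_name: &str) -> Result<&'a str, String> {
    match value.try_as_str() {
        Some(Some(s)) => Ok(s),
        Some(None) => Err(format!("glob() argument '{}' must not be NULL", arg_name)),
        None => Err(format!(
            "glob() argument '{}' must be a string literal, got {:?}",
            arg_name, value
        )),
    }
}

fn main() {
    let path = ScalarValue::Utf8View(Some("data/exa*.csv".to_string()));
    let not_a_string = ScalarValue::Int64(Some(42));
    println!("{:?}", as_str_literal(&path, "path"));
    println!("{:?}", as_str_literal(&not_a_string, "path"));
}
```

In the real code the match would first unwrap `Expr::Literal(..)` to get at the `ScalarValue`; the point is only that `LargeUtf8` and `Utf8View` literals would be accepted without extra match arms.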
##########
datafusion-cli/tests/sql/integration/glob_test.sql:
##########
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
Review Comment:
I think another use case that @robtandy had was "a list of multiple files"
-- like is there some way to select exactly two files? Something like
```sql
glob(['../datafusion/core/tests/data/cars.csv',
'../datafusion/core/tests/data/trucks.csv'])
```
Perhaps 🤔
##########
datafusion-cli/tests/sql/integration/glob_test.sql:
##########
@@ -0,0 +1,15 @@
+-- Test glob function with files available in CI
+-- Test 1: Single CSV file - verify basic functionality
+SELECT COUNT(*) AS cars_count FROM
glob('../datafusion/core/tests/data/cars.csv');
+
+-- Test 2: Data aggregation from CSV file - verify actual data reading
+SELECT car, COUNT(*) as count FROM
glob('../datafusion/core/tests/data/cars.csv') GROUP BY car ORDER BY car;
+
+-- Test 3: JSON file with explicit format parameter - verify format
specification
+SELECT COUNT(*) AS json_count FROM
glob('../datafusion/core/tests/data/1.json', 'json');
+
+-- Test 4: Single specific CSV file - verify another CSV works
+SELECT COUNT(*) AS example_count FROM
glob('../datafusion/core/tests/data/example.csv');
+
+-- Test 5: Glob pattern with wildcard - test actual glob functionality
+SELECT COUNT(*) AS glob_pattern_count FROM
glob('../datafusion/core/tests/data/exa*.csv');
Review Comment:
Another possibility would be to intercept the `CREATE EXTERNAL TABLE`
command in `datafusion-cli` itself
For example, similarly to how it peeks here:
https://github.com/apache/datafusion/blob/1d61f31d121632ca27c77c472cae7d604e9aa9d7/datafusion-cli/src/exec.rs#L357-L367
We could implement a special handler in datafusion-cli rather than use the
default one in SessionContext:
https://github.com/apache/datafusion/blob/1d61f31d121632ca27c77c472cae7d604e9aa9d7/datafusion/core/src/execution/context/mod.rs#L669-L672
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]