Re: [I] `SessionContext::create_physical_expr` does not unwrap casts (and thus is not always optimal) [datafusion]

2025-03-04 Thread via GitHub


jayzhan211 commented on issue #14987:
URL: https://github.com/apache/datafusion/issues/14987#issuecomment-2696573839

   What do you mean by "simplify"? Is it `SimplifyExpressions` or other logic? If it is `SimplifyExpressions`, then logical plan optimization is involved. If not, and you go directly to `create_physical_expr`, then I think this is similar to the issue in Comet, where they don't have coerced types for `create_physical_expr` and need to do the coercion themselves.
   
   An alternative idea is to introduce another `create_physical_expr_from_non_resolved` that accepts inputs that are "not resolved" (so we do the coercion or type-related casting), and keep the current `create_physical_expr` for inputs that are "resolved", used by DataFusion or other projects that have DataFusion's logical plan optimization.
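   The coercion step such an entry point would have to perform can be sketched with toy types (none of the names below are DataFusion's actual API):

   ```rust
   // Toy model of the coercion an "unresolved" entry point would perform:
   // find a common type for both sides of a comparison and wrap the
   // narrower side in a cast.
   #[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
   enum Ty {
       Int32,
       Int64,
   }

   #[derive(Clone, Debug, PartialEq)]
   enum Expr {
       Column(&'static str, Ty),
       Literal(i64, Ty),
       Cast(Box<Expr>, Ty),
       Eq(Box<Expr>, Box<Expr>),
   }

   fn ty_of(e: &Expr) -> Ty {
       match e {
           Expr::Column(_, t) | Expr::Literal(_, t) | Expr::Cast(_, t) => *t,
           // Comparisons are boolean; this sketch only asks about operands.
           Expr::Eq(..) => unreachable!("operands only"),
       }
   }

   // Coerce both operands of an equality to the wider of their two types.
   fn coerce_eq(l: Expr, r: Expr) -> Expr {
       let common = ty_of(&l).max(ty_of(&r));
       let wrap = |e: Expr| {
           if ty_of(&e) == common {
               e
           } else {
               Expr::Cast(Box::new(e), common)
           }
       };
       Expr::Eq(Box::new(wrap(l)), Box::new(wrap(r)))
   }

   fn main() {
       // `x = 4` where `x` is Int32 and the literal is Int64:
       let e = coerce_eq(Expr::Column("x", Ty::Int32), Expr::Literal(4, Ty::Int64));
       // The Int32 column gets cast up to Int64; the literal is left alone.
       assert!(matches!(
           &e,
           Expr::Eq(l, _) if matches!(**l, Expr::Cast(_, Ty::Int64))
       ));
       println!("{e:?}");
   }
   ```

   A resolved input would arrive with these casts already inserted by the logical optimizer; an unresolved one would not, which is the gap the second entry point would fill.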


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] _repr_ and _html_repr_ show '... and additional rows' message [datafusion-python]

2025-03-04 Thread via GitHub


kosiew commented on code in PR #1041:
URL: 
https://github.com/apache/datafusion-python/pull/1041#discussion_r1978735197


##
src/dataframe.rs:
##
@@ -90,59 +90,108 @@ impl PyDataFrame {
     }
 
     fn __repr__(&self, py: Python) -> PyDataFusionResult<String> {
-        let df = self.df.as_ref().clone().limit(0, Some(10))?;
+        // Get 11 rows to check if there are more than 10
+        let df = self.df.as_ref().clone().limit(0, Some(11))?;
         let batches = wait_for_future(py, df.collect())?;
-        let batches_as_string = pretty::pretty_format_batches(&batches);
+        let num_rows = batches.iter().map(|batch| batch.num_rows()).sum::<usize>();
+
+        // Flatten batches into a single batch for the first 10 rows
+        let mut all_rows = Vec::new();
+        let mut total_rows = 0;
+
+        for batch in &batches {
+            let num_rows_to_take = if total_rows + batch.num_rows() > 10 {
+                10 - total_rows
+            } else {
+                batch.num_rows()
+            };
+
+            if num_rows_to_take > 0 {
+                let sliced_batch = batch.slice(0, num_rows_to_take);
+                all_rows.push(sliced_batch);
+                total_rows += num_rows_to_take;
+            }
+
+            if total_rows >= 10 {
+                break;
+            }
+        }
+
+        let batches_as_string = pretty::pretty_format_batches(&all_rows);
+
Review Comment:
   You can simplify `batches_as_string` to:
   
   ```rust
   // First get just the first 10 rows
   let preview_df = self.df.as_ref().clone().limit(0, Some(10))?;
   let preview_batches = wait_for_future(py, preview_df.collect())?;

   // Check if there are more rows by trying to get the 11th row
   let has_more_rows = {
       let check_df = self.df.as_ref().clone().limit(10, Some(1))?;
       let check_batch = wait_for_future(py, check_df.collect())?;
       !check_batch.is_empty()
   };

   let batches_as_string = pretty::pretty_format_batches(&preview_batches);
   ```
   This directly retrieves just the first 10 rows, eliminating the need for manual row tracking and slicing.






Re: [PR] _repr_ and _html_repr_ show '... and additional rows' message [datafusion-python]

2025-03-04 Thread via GitHub


kosiew commented on code in PR #1041:
URL: 
https://github.com/apache/datafusion-python/pull/1041#discussion_r1978864670


##
src/dataframe.rs:
##
@@ -90,59 +90,108 @@ impl PyDataFrame {
     }
 
     fn __repr__(&self, py: Python) -> PyDataFusionResult<String> {
-        let df = self.df.as_ref().clone().limit(0, Some(10))?;
+        // Get 11 rows to check if there are more than 10
+        let df = self.df.as_ref().clone().limit(0, Some(11))?;
         let batches = wait_for_future(py, df.collect())?;
-        let batches_as_string = pretty::pretty_format_batches(&batches);
+        let num_rows = batches.iter().map(|batch| batch.num_rows()).sum::<usize>();
+
+        // Flatten batches into a single batch for the first 10 rows
+        let mut all_rows = Vec::new();
+        let mut total_rows = 0;
+
+        for batch in &batches {
+            let num_rows_to_take = if total_rows + batch.num_rows() > 10 {
+                10 - total_rows
+            } else {
+                batch.num_rows()
+            };
+
+            if num_rows_to_take > 0 {
+                let sliced_batch = batch.slice(0, num_rows_to_take);
+                all_rows.push(sliced_batch);
+                total_rows += num_rows_to_take;
+            }
+
+            if total_rows >= 10 {
+                break;
+            }
+        }
+
+        let batches_as_string = pretty::pretty_format_batches(&all_rows);
+
         match batches_as_string {
-            Ok(batch) => Ok(format!("DataFrame()\n{batch}")),
+            Ok(batch) => {
+                if num_rows > 10 {
+                    Ok(format!("DataFrame()\n{batch}\n... and additional rows"))
+                } else {
+                    Ok(format!("DataFrame()\n{batch}"))
+                }
+            }
             Err(err) => Ok(format!("Error: {:?}", err.to_string())),
         }
     }
+
+
 
     fn _repr_html_(&self, py: Python) -> PyDataFusionResult<String> {
         let mut html_str = "<table border='1'>\n".to_string();
-
-        let df = self.df.as_ref().clone().limit(0, Some(10))?;
+
+        // Limit to the first 11 rows
+        let df = self.df.as_ref().clone().limit(0, Some(11))?;
         let batches = wait_for_future(py, df.collect())?;
-
+
+        // If there are no rows, close the table and return
         if batches.is_empty() {
            html_str.push_str("</table>\n");
            return Ok(html_str);
        }
-
+
+        // Get schema for headers
        let schema = batches[0].schema();
-
+
        let mut header = Vec::new();
        for field in schema.fields() {
-            header.push(format!("<th>{}</th>", field.name()));
+            header.push(format!("<th>{}</th>", field.name()));
        }
        let header_str = header.join("");
        html_str.push_str(&format!("<tr>{}</tr>\n", header_str));
-
-        for batch in batches {
+
+        // Flatten rows and format them as HTML
+        let mut total_rows = 0;
+        for batch in &batches {
+            total_rows += batch.num_rows();
            let formatters = batch
                .columns()
                .iter()
                .map(|c| ArrayFormatter::try_new(c.as_ref(), &FormatOptions::default()))
-                .map(|c| {
-                    c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string())))
-                })
+                .map(|c| c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string()))))
                .collect::<Result<Vec<_>, _>>()?;
-
-            for row in 0..batch.num_rows() {
+
+            let num_rows_to_render = if total_rows > 10 { 10 } else { batch.num_rows() };
+
+            for row in 0..num_rows_to_render {
                let mut cells = Vec::new();
                for formatter in &formatters {
                    cells.push(format!("<td>{}</td>", formatter.value(row)));
                }
                let row_str = cells.join("");
                html_str.push_str(&format!("<tr>{}</tr>\n", row_str));
            }
-        }
 
+            if total_rows >= 10 {
+                break;
+            }

Review Comment:
   How about simplifying to:
   ```rust
   let rows_remaining = 10 - total_rows;
   let rows_in_batch = batch.num_rows().min(rows_remaining);

   for row in 0..rows_in_batch {
       html_str.push_str("<tr>");
       for col in batch.columns() {
           let formatter =
               ArrayFormatter::try_new(col.as_ref(), &FormatOptions::default())?;
           html_str.push_str("<td>");
           html_str.push_str(&formatter.value(row).to_string());
           html_str.push_str("</td>");
       }
       html_str.push_str("</tr>\n");
   }

   total_rows += rows_in_batch;
   ```
   Reasons:

   1. More Accurate Row Limiting:

   Before: `total_rows` was updated before checking the row limit, which could result in rendering more than 10 rows.
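   The difference between the two limiting strategies can be shown with plain per-batch row counts (hypothetical helper names, not the PR's code):

   ```rust
   // Updating the running total *before* clamping can render more than the cap.
   fn rows_rendered_buggy(batch_sizes: &[usize], cap: usize) -> usize {
       let mut total = 0;
       let mut rendered = 0;
       for &n in batch_sizes {
           total += n; // updated before the check, as in the original diff
           let take = if total > cap { cap } else { n };
           rendered += take;
           if total >= cap {
               break;
           }
       }
       rendered
   }

   // Clamping each batch to the remaining budget first never exceeds the cap.
   fn rows_rendered_fixed(batch_sizes: &[usize], cap: usize) -> usize {
       let mut total = 0;
       let mut rendered = 0;
       for &n in batch_sizes {
           let take = n.min(cap - total);
           rendered += take;
           total += take;
           if total >= cap {
               break;
           }
       }
       rendered
   }

   fn main() {
       // Two batches of 7 and 8 rows against a 10-row cap:
       assert_eq!(rows_rendered_buggy(&[7, 8], 10), 17); // over-renders
       assert_eq!(rows_rendered_fixed(&[7, 8], 10), 10); // respects the cap
   }
   ```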

Re: [I] `SessionContext::create_physical_expr` does not unwrap casts (and thus is not always optimal) [datafusion]

2025-03-04 Thread via GitHub


ion-elgreco commented on issue #14987:
URL: https://github.com/apache/datafusion/issues/14987#issuecomment-2696634942

   @jayzhan211 its using the expression simplifier:
   
   ```rust
   let logical_filter = self.filter.map(|expr| {
       // Simplify the expression first
       let props = ExecutionProps::new();
       let simplify_context =
           SimplifyContext::new(&props).with_schema(df_schema.clone().into());
       let simplifier = ExprSimplifier::new(simplify_context).with_max_cycles(10);
       let simplified = simplifier.simplify(expr).unwrap();

       context
           .create_physical_expr(simplified, &df_schema)
           .unwrap()
   });
   ```





Re: [PR] add method SessionStateBuilder::new_with_defaults() [datafusion]

2025-03-04 Thread via GitHub


alamb commented on code in PR #14998:
URL: https://github.com/apache/datafusion/pull/14998#discussion_r1979267227


##
datafusion/core/src/execution/session_state.rs:
##
@@ -1109,6 +1109,22 @@ impl SessionStateBuilder {
 
.with_table_function_list(SessionStateDefaults::default_table_functions())
 }
 
+/// Returns a new [`SessionStateBuilder`] with default features.
+///
+/// This is equivalent to calling [`Self::new()`] followed by 
[`Self::with_default_features()`].
+///
+/// ```
+/// use datafusion::execution::session_state::SessionStateBuilder;
+///
+/// // Create a new SessionState with default features
+/// let session_state = SessionStateBuilder::new_with_defaults()
+/// .with_session_id("my_session".to_string())
+/// .build();
+/// ```
+pub fn new_with_defaults() -> Self {

Review Comment:
   To keep the name consistent (and make it harder to confuse this with 
`default()`) could we please name this `default_features`? Like:
   
   ```suggestion
   pub fn new_with_default_features() -> Self {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Update `SessionStateBuilder::with_default_features` does not replace existing features [datafusion]

2025-03-04 Thread via GitHub


alamb merged PR #14935:
URL: https://github.com/apache/datafusion/pull/14935





Re: [I] minor: `SessionStateBuilder::with_default_features` ergonomics [datafusion]

2025-03-04 Thread via GitHub


alamb closed issue #14899: minor: `SessionStateBuilder::with_default_features` 
ergonomics
URL: https://github.com/apache/datafusion/issues/14899





Re: [PR] add method SessionStateBuilder::new_with_defaults() [datafusion]

2025-03-04 Thread via GitHub


alamb commented on PR #14998:
URL: https://github.com/apache/datafusion/pull/14998#issuecomment-2697248930

   > Why don't we just implement Default trait?
   
   There is some more backstory here:
   - 
https://github.com/apache/datafusion/pull/14935#pullrequestreview-2654150748
   
   Basically the question is what "defaults" means -- is default an empty 
builder or does it have all the default functions, etc. 
   
   An alternate design might be to always create SessionStateBuilder with all 
default functions and have a `clear()` method or something to reset the state.
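   The two possible meanings of "default" can be sketched with a toy builder (not the real `SessionStateBuilder`):

   ```rust
   // Toy illustration of the naming question: `Default::default` naturally
   // reads as "empty", while the pre-loaded constructor deserves an explicit
   // name so the two are hard to confuse.
   #[derive(Default)]
   struct Builder {
       features: Vec<&'static str>,
   }

   impl Builder {
       fn new() -> Self {
           // Empty: no functions registered.
           Self::default()
       }

       // Explicitly named, so it cannot be mistaken for `Default::default`.
       // The feature names here are placeholders, not DataFusion's real list.
       fn new_with_default_features() -> Self {
           Self {
               features: vec!["aggregate_functions", "scalar_functions"],
           }
       }
   }

   fn main() {
       assert!(Builder::new().features.is_empty());
       assert_eq!(Builder::new_with_default_features().features.len(), 2);
   }
   ```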





Re: [PR] chore: Update `SessionStateBuilder::with_default_features` does not replace existing features [datafusion]

2025-03-04 Thread via GitHub


alamb commented on PR #14935:
URL: https://github.com/apache/datafusion/pull/14935#issuecomment-2697244168

   Thanks again @irenjj and @milenkovicm 
   
   It seems as if @shruti2522 already has a PR to implement the 
new_with_default_features
   - https://github.com/apache/datafusion/pull/14998





Re: [PR] Fix documentation warnings and error if anymore occur [datafusion]

2025-03-04 Thread via GitHub


AmosAidoo commented on PR #14952:
URL: https://github.com/apache/datafusion/pull/14952#issuecomment-2697309878

   @alamb my pleasure





[I] Implement `tree` explain for `FilterExec` [datafusion]

2025-03-04 Thread via GitHub


alamb opened a new issue, #15000:
URL: https://github.com/apache/datafusion/issues/15000

   ### Is your feature request related to a problem or challenge?
   
   - Part of https://github.com/apache/datafusion/issues/14914
   
   @irenjj  added a new `tree` explain mode in 
https://github.com/apache/datafusion/pull/14677. Now we need to add support for 
different types of operators. 
   
   ```
   set datafusion.explain.format = 'tree';
   create table foo(x int, y int) as values (1,2), (3,4);
   explain select * from foo where x = 4;
   +---++
   | plan_type | plan   |
   +---++
   | logical_plan  | Filter: foo.x = Int32(4)   |
   |   |   TableScan: foo projection=[x, y] |
   | physical_plan | ┌───┐  |
   |   | │CoalesceBatchesExec│  |
   |   | └─┬─┘  |
   |   | ┌─┴─┐  |
   |   | │ FilterExec│  |
   |   | └─┬─┘  |
   |   | ┌─┴─┐  |
   |   | │   DataSourceExec  │  |
   |   | │   │  |
   |   | │partition_sizes: [1]   │  |
   |   | │   partitions: 1   │  |
   |   | └───┘  |
   |   ||
   +---++
   ```
   
   
   
   ### Describe the solution you'd like
   
   Add `tree` format to the named ExecutionPlan
   
   Roughly speaking the process goes like:
   1. Add the relevant code to the operator 
   2. Add / update explain_tree.slt test to show the new code in action
   
   
   You can run the tests like
   ```shell
   cargo test --test sqllogictests -- explain_tree
   ```
   
   You can update the test like this:
   ```shell
   cargo test --test sqllogictests -- explain_tree --complete
   ```
   
   
   
   ### Describe alternatives you've considered
   
   Here is an example PR: TODO
   
   ### Additional context
   
   _No response_





Re: [PR] fix: Executor memory overhead overriding [datafusion-comet]

2025-03-04 Thread via GitHub


wForget commented on code in PR #1462:
URL: https://github.com/apache/datafusion-comet/pull/1462#discussion_r1979412638


##
spark/src/test/scala/org/apache/spark/CometPluginsSuite.scala:
##
@@ -143,3 +143,36 @@ class CometPluginsNonOverrideSuite extends CometTestBase {
 assert(execMemOverhead4 == "2G")
   }
 }
+
+class CometPluginsUnifiedModeOverrideSuite extends CometTestBase {
+  override protected def sparkConf: SparkConf = {
+val conf = new SparkConf()
+conf.set("spark.driver.memory", "1G")
+conf.set("spark.executor.memory", "1G")
+conf.set("spark.executor.memoryOverhead", "1G")
+conf.set("spark.plugins", "org.apache.spark.CometPlugin")
+conf.set("spark.comet.enabled", "true")
+conf.set("spark.memory.offHeap.enabled", "true")
+conf.set("spark.memory.offHeap.size", "2G")
+conf.set("spark.comet.exec.shuffle.enabled", "true")
+conf.set("spark.comet.exec.enabled", "true")
+conf.set("spark.comet.memory.overhead.factor", "0.5")
+conf
+  }
+
+  /** Since using unified memory, executor memory should not be overridden */

Review Comment:
   > That one should be fine, they passed with current changes to overriding 
spark executor memory, but fail without them.
   
   Indeed, I successfully ran this test locally, but this comment seems incorrect.






Re: [PR] Implement `tree` explain for FilterExec [datafusion]

2025-03-04 Thread via GitHub


irenjj commented on code in PR #15001:
URL: https://github.com/apache/datafusion/pull/15001#discussion_r1979377790


##
datafusion/sqllogictest/test_files/explain_tree.slt:
##
@@ -185,6 +188,32 @@ physical_plan
 26)│   DataSourceExec  ││   DataSourceExec  │
 27)└───┘└───┘
 
+# Long Filter (demonstrate what happens with wrapping)
+query TT
+explain SELECT int_col FROM table1
+WHERE string_col != 'foo' AND string_col != 'bar' AND string_col != 'a really 
long string constant'
+;
+
+logical_plan
+01)Projection: table1.int_col
+02)--Filter: table1.string_col != Utf8("foo") AND table1.string_col != 
Utf8("bar") AND table1.string_col != Utf8("a really long string constant")
+03)TableScan: table1 projection=[int_col, string_col], 
partial_filters=[table1.string_col != Utf8("foo"), table1.string_col != 
Utf8("bar"), table1.string_col != Utf8("a really long string constant")]
+physical_plan
+01)┌───┐
+02)│CoalesceBatchesExec│
+03)└─┬─┘
+04)┌─┴─┐
+05)│ FilterExec│
+06)│   │
+07)│ predicate:│
+08)│string_col@1 != foo AND ...│

Review Comment:
   This is a very nice case for `// TODO: check every line is less than 
MAX_LINE_RENDER_SIZE`. In DuckDB, we can get the following result:
   ```
   D explain select * from t1 where c != 'foo' and c != 'bar' and c != 'a 
really long string constant';
   
   ┌─┐
   │┌───┐│
   ││   Physical Plan   ││
   │└───┘│
   └─┘
   ┌───┐
   │ SEQ_SCAN  │
   │   │
   │ Table: t1 │
   │   Type: Sequential Scan   │
   │   Projections: c  │
   │   │
   │  Filters: │
   │ c!='foo' AND c!='bar' AND │
   │  c!='a really long string │
   │  constant'│
   │   │
   │  ~1 Rows  │
   └───┘
   ```
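   A greedy word wrap, as a sketch of what clamping each rendered line to a `MAX_LINE_RENDER_SIZE` (the constant named in that TODO) might look like before the predicate is boxed into the tree:

   ```rust
   // Minimal greedy word-wrap: break a long predicate string into lines of
   // at most `width` characters, splitting on whitespace.
   fn wrap(text: &str, width: usize) -> Vec<String> {
       let mut lines = vec![String::new()];
       for word in text.split_whitespace() {
           // Start a new line when the word would overflow the current one.
           let need_new_line = {
               let cur = lines.last().unwrap();
               !cur.is_empty() && cur.len() + 1 + word.len() > width
           };
           if need_new_line {
               lines.push(word.to_string());
           } else {
               let cur = lines.last_mut().unwrap();
               if !cur.is_empty() {
                   cur.push(' ');
               }
               cur.push_str(word);
           }
       }
       lines
   }

   fn main() {
       let pred = "c != 'foo' AND c != 'bar' AND c != 'a really long string constant'";
       let wrapped = wrap(pred, 27);
       assert!(wrapped.iter().all(|l| l.len() <= 27));
       for l in &wrapped {
           println!("{l}");
       }
   }
   ```

   A real renderer would also need to handle single tokens longer than the width (for example by truncating with `...`), which this sketch does not.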






Re: [I] Expose to `AccumulatorArgs` whether all the groups are sorted [datafusion]

2025-03-04 Thread via GitHub


alamb commented on issue #14991:
URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2697939824

   In the case where the data is already sorted on the group expressions, it 
should use a different grouping operator, specifically
   
https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/order/struct.GroupOrderingFull.html
   
   I don't think `GroupOrderingFull` even uses GroupsAccumulator (it just uses 
Accumulator I think)
   
   So in other words, I don't think passing the order to groups accumulator 
will help
   
   
   There is another version where the data is grouped on a prefix of the 
grouping keys: 
https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/order/struct.GroupOrderingPartial.html
   
   
   





Re: [PR] fix: remove code duplication in native_datafusion and native_iceberg_compat implementations [datafusion-comet]

2025-03-04 Thread via GitHub


parthchandra commented on PR #1443:
URL: 
https://github.com/apache/datafusion-comet/pull/1443#issuecomment-2698007418

   > Would it be difficult to split this PR? The title makes it seem like a 
minor refactor to the ParquetExec instantiation, but it's actually bringing in 
a lot of the object store-related logic.
   
   Let me do that to make it easier for you (it will take me a few days to get to it). @comphead, I'll also address your comments.





Re: [PR] feat: add read array support [datafusion-comet]

2025-03-04 Thread via GitHub


andygrove commented on PR #1456:
URL: 
https://github.com/apache/datafusion-comet/pull/1456#issuecomment-2698077082

   Some tests are failing due to 
https://github.com/apache/datafusion-comet/issues/1289
   
   I think the root cause is that we are trying to shuffle with arrays and 
Comet shuffle does not support arrays yet. We need to fall back to Spark for 
these shuffles.
   
   
   





Re: [I] Unsupported Arrow Vector for export: class org.apache.arrow.vector.complex.ListVector [datafusion-comet]

2025-03-04 Thread via GitHub


andygrove commented on issue #1289:
URL: 
https://github.com/apache/datafusion-comet/issues/1289#issuecomment-2698079511

   I think the root cause is that we are trying to shuffle with complex types 
and Comet shuffle does not support complex types yet. We need to fall back to 
Spark for these shuffles.





Re: [PR] feat: add read array support [datafusion-comet]

2025-03-04 Thread via GitHub


andygrove commented on PR #1456:
URL: 
https://github.com/apache/datafusion-comet/pull/1456#issuecomment-2698094548

   In `CometExecRule` we check to see if we support the partitioning types for 
the shuffle but do not check that we support the types of other columns.
   
   @comphead Do you want to update these checks as part of this PR and see if 
it resolves the issue?
   
   ```scala
   case class CometExecRule(session: SparkSession) extends Rule[SparkPlan] {
     private def applyCometShuffle(plan: SparkPlan): SparkPlan = {
       plan.transformUp {
         case s: ShuffleExchangeExec
             if isCometPlan(s.child) && isCometNativeShuffleMode(conf) &&
               QueryPlanSerde.supportPartitioning(s.child.output, s.outputPartitioning)._1 =>
           ...

         case s: ShuffleExchangeExec
             if (!s.child.supportsColumnar || isCometPlan(s.child)) && isCometJVMShuffleMode(conf) &&
               QueryPlanSerde.supportPartitioningTypes(s.child.output, s.outputPartitioning)._1 &&
               !isShuffleOperator(s.child) =>
           ...
   ```





Re: [PR] Add all missing table options to be handled in any order [datafusion-sqlparser-rs]

2025-03-04 Thread via GitHub


iffyio commented on PR #1747:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1747#issuecomment-2697471150

   Marking the PR as draft in the meantime as it is not currently pending 
review, @benrsatori feel free to undraft and ping when ready!





Re: [PR] add manual trigger for extended tests in pull requests [datafusion]

2025-03-04 Thread via GitHub


alamb commented on code in PR #14331:
URL: https://github.com/apache/datafusion/pull/14331#discussion_r1979614246


##
.github/workflows/extended.yml:
##
@@ -33,16 +33,46 @@ on:
   push:
     branches:
       - main
+  issue_comment:
+    types: [created]
+
+permissions:
+  pull-requests: write
 
 jobs:
+  # Check issue comment and notify that extended tests are running
+  check_issue_comment:
+    name: Check issue comment
+    runs-on: ubuntu-latest
+    if: github.event.issue.pull_request && github.event.comment.body == 'run extended tests'
+    steps:
+      - uses: actions/github-script@v7
+        with:
+          github-token: ${{secrets.GITHUB_TOKEN}}
+          script: |
+            github.rest.issues.createComment({
+              issue_number: context.issue.number,
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              body: "Running extended tests..."
+            })
+
   # Check crate compiles and base cargo check passes
   linux-build-lib:
     name: linux build test
     runs-on: ubuntu-latest
     container:
       image: amd64/rust
+    if: |
+      github.event_name == 'push' ||
+      (github.event_name == 'issue_comment' && github.event.issue.pull_request && github.event.comment.body == 'run extended tests')
     steps:
       - uses: actions/checkout@v4
+        with:
+          # Check out the pull request branch if triggered by a comment
+          ref: ${{ github.event_name == 'issue_comment' && github.event.issue.pull_request.head.ref || github.ref }}

Review Comment:
   Thanks @danila-b 🙏 
   
   🤔  
   - https://github.com/danila-b/datafusion/pull/3 doesn't seem to have run the 
extended tests
   
   https://github.com/danila-b/datafusion/pull/3#issuecomment-2697904032






Re: [I] [EPIC] Complete `SQL EXPLAIN` Tree Rendering [datafusion]

2025-03-04 Thread via GitHub


alamb commented on issue #14914:
URL: https://github.com/apache/datafusion/issues/14914#issuecomment-2697451179

   I made a first PR to show how to add tree explain
   - https://github.com/apache/datafusion/pull/15001
   
   Once we get that one in I plan to file a bunch of other tickets for the 
remaining `ExplainPlan` nodes





[PR] Snowflake: Support dollar quoted comment of table, view and field [datafusion-sqlparser-rs]

2025-03-04 Thread via GitHub


7phs opened a new pull request, #1755:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1755

   Snowflake 
[supports](https://docs.snowflake.com/en/sql-reference/literals-table) 
dollar-quoted comments as string literals when creating tables, views, and 
their fields.
   
   For example:
   ```
   CREATE OR REPLACE TEMPORARY VIEW foo.bar.baz (
   "COL_1" COMMENT $$comment 1$$
   ) COMMENT = $$view comment$$ AS (
   SELECT 1
   )
   ```
   
   This pull request adds support for this type of comment.
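   A toy scanner for the untagged `$$...$$` form (not sqlparser-rs's actual tokenizer) shows the shape of the literal being parsed:

   ```rust
   // Scan a dollar-quoted string literal at the start of `input`.
   // Returns the literal body and the unconsumed remainder, or None if the
   // input does not start with a complete `$$...$$` literal. Tagged forms
   // like `$tag$...$tag$` are out of scope for this sketch.
   fn parse_dollar_quoted(input: &str) -> Option<(&str, &str)> {
       let rest = input.strip_prefix("$$")?;
       let end = rest.find("$$")?;
       Some((&rest[..end], &rest[end + 2..]))
   }

   fn main() {
       let (body, rest) = parse_dollar_quoted("$$view comment$$ AS (SELECT 1)").unwrap();
       assert_eq!(body, "view comment");
       assert_eq!(rest, " AS (SELECT 1)");
       println!("comment literal: {body}");
   }
   ```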





Re: [PR] add method SessionStateBuilder::new_with_defaults() [datafusion]

2025-03-04 Thread via GitHub


shruti2522 commented on code in PR #14998:
URL: https://github.com/apache/datafusion/pull/14998#discussion_r1979327789


##
datafusion/core/src/execution/session_state.rs:
##
@@ -1109,6 +1109,22 @@ impl SessionStateBuilder {
 
.with_table_function_list(SessionStateDefaults::default_table_functions())
 }
 
+/// Returns a new [`SessionStateBuilder`] with default features.
+///
+/// This is equivalent to calling [`Self::new()`] followed by 
[`Self::with_default_features()`].
+///
+/// ```
+/// use datafusion::execution::session_state::SessionStateBuilder;
+///
+/// // Create a new SessionState with default features
+/// let session_state = SessionStateBuilder::new_with_defaults()
+/// .with_session_id("my_session".to_string())
+/// .build();
+/// ```
+pub fn new_with_defaults() -> Self {

Review Comment:
   done @alamb 






[PR] Implement `tree` explain for FilterExec [datafusion]

2025-03-04 Thread via GitHub


alamb opened a new pull request, #15001:
URL: https://github.com/apache/datafusion/pull/15001

   ## Which issue does this PR close?
   
   
   
   - Closes https://github.com/apache/datafusion/issues/15000
   
   ## Rationale for this change
   
   Let's have nice explain plans! 
   
   I wanted an example of how to make a tree explain plan so that I can file a 
bunch of follow-on tickets for 
https://github.com/apache/datafusion/issues/14914 to share the work
   
   ## What changes are included in this PR?
   
   1. Update comments about `ExplainFormat`
   2. Implement for `FilterExec`
   3. Update tests
   
   ## Are these changes tested?
   Yes
   
   
   ## Are there any user-facing changes?
   tree explain is better
   
   
   
   





Re: [PR] Make `create_ordering` pub and add doc for it [datafusion]

2025-03-04 Thread via GitHub


xudong963 merged PR #14996:
URL: https://github.com/apache/datafusion/pull/14996





Re: [PR] fix: Executor memory overhead overriding [datafusion-comet]

2025-03-04 Thread via GitHub


LukMRVC commented on PR #1462:
URL: 
https://github.com/apache/datafusion-comet/pull/1462#issuecomment-2697480797

   That one should be fine; the tests passed with the current changes to 
overriding the Spark executor memory, but fail without them.





Re: [PR] Implement `tree` explain for FilterExec [datafusion]

2025-03-04 Thread via GitHub


irenjj commented on code in PR #15001:
URL: https://github.com/apache/datafusion/pull/15001#discussion_r1979381479


##
datafusion/sqllogictest/test_files/explain_tree.slt:
##
@@ -185,6 +188,32 @@ physical_plan
 26)│   DataSourceExec  ││   DataSourceExec  │
 27)└───┘└───┘
 
+# Long Filter (demonstrate what happens with wrapping)
+query TT
+explain SELECT int_col FROM table1
+WHERE string_col != 'foo' AND string_col != 'bar' AND string_col != 'a really 
long string constant'
+;
+
+logical_plan
+01)Projection: table1.int_col
+02)--Filter: table1.string_col != Utf8("foo") AND table1.string_col != 
Utf8("bar") AND table1.string_col != Utf8("a really long string constant")
+03)TableScan: table1 projection=[int_col, string_col], 
partial_filters=[table1.string_col != Utf8("foo"), table1.string_col != 
Utf8("bar"), table1.string_col != Utf8("a really long string constant")]
+physical_plan
+01)┌───┐
+02)│CoalesceBatchesExec│
+03)└─┬─┘
+04)┌─┴─┐
+05)│ FilterExec│
+06)│   │
+07)│ predicate:│
+08)│string_col@1 != foo AND ...│

Review Comment:
   I will try to make it the same as DuckDB later.






Re: [I] Implement Nicer / DuckDB style explain plans [datafusion]

2025-03-04 Thread via GitHub


alamb closed issue #9371: Implement Nicer / DuckDB style explain plans
URL: https://github.com/apache/datafusion/issues/9371





Re: [PR] feat: Add `tree` / pretty explain mode [datafusion]

2025-03-04 Thread via GitHub


alamb merged PR #14677:
URL: https://github.com/apache/datafusion/pull/14677





Re: [I] `SessionContext::create_physical_expr` does not unwrap casts (and thus is not always optimal) [datafusion]

2025-03-04 Thread via GitHub


jayzhan211 commented on issue #14987:
URL: https://github.com/apache/datafusion/issues/14987#issuecomment-2697401042

   I see. We can try moving `UnwrapCastInComparison` into the `Simplifier`'s 
large pattern-matching rule.
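
   To illustrate the rewrite under discussion, here is a hedged sketch of what 
unwrapping a cast in a comparison means. The `Expr` enum and `unwrap_cast` 
function below are invented for this example and are not DataFusion's actual 
API:

```rust
// Illustrative-only sketch of "unwrap cast in comparison": rewrite
// `CAST(int32_col AS Int64) = Int64(10)` into `int32_col = Int32(10)` when
// the literal fits the narrower type, so the per-row cast disappears.
// This enum is made up for the sketch; it is NOT DataFusion's `Expr`.

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Int32Col(String),       // a column of Int32 type
    Int32Lit(i32),          // an Int32 literal
    Int64Lit(i64),          // an Int64 literal
    CastToInt64(Box<Expr>), // CAST(inner AS Int64)
    Eq(Box<Expr>, Box<Expr>),
}

fn unwrap_cast(expr: Expr) -> Expr {
    if let Expr::Eq(lhs, rhs) = &expr {
        if let (Expr::CastToInt64(inner), Expr::Int64Lit(v)) =
            (lhs.as_ref(), rhs.as_ref())
        {
            if matches!(inner.as_ref(), Expr::Int32Col(_)) {
                // Only safe when the literal is representable as Int32.
                if let Ok(small) = i32::try_from(*v) {
                    return Expr::Eq(
                        inner.clone(),
                        Box::new(Expr::Int32Lit(small)),
                    );
                }
            }
        }
    }
    expr // not a rewritable shape: leave unchanged
}

fn main() {
    let e = Expr::Eq(
        Box::new(Expr::CastToInt64(Box::new(Expr::Int32Col("a".into())))),
        Box::new(Expr::Int64Lit(10)),
    );
    assert_eq!(
        unwrap_cast(e),
        Expr::Eq(
            Box::new(Expr::Int32Col("a".into())),
            Box::new(Expr::Int32Lit(10)),
        )
    );
    // A literal that overflows Int32 must not be unwrapped.
    let big = Expr::Eq(
        Box::new(Expr::CastToInt64(Box::new(Expr::Int32Col("a".into())))),
        Box::new(Expr::Int64Lit(i64::MAX)),
    );
    assert_eq!(unwrap_cast(big.clone()), big);
}
```

   The point of the rewrite is that the cast then applies once to the constant 
at planning time, rather than to every row at execution time.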





Re: [I] Release DataFusion `46.0.0` [datafusion]

2025-03-04 Thread via GitHub


alamb commented on issue #14123:
URL: https://github.com/apache/datafusion/issues/14123#issuecomment-2697426444

   We have merged the fix here
   - https://github.com/apache/datafusion/pull/14990
   
   To simplify voting, I suggest we backport that fix to the `branch-46` line 
and make a second release candidate (RC2)
   
   Is this something you can do @xudong963 ?





Re: [PR] feat: Add `tree` / pretty explain mode [datafusion]

2025-03-04 Thread via GitHub


alamb commented on PR #14677:
URL: https://github.com/apache/datafusion/pull/14677#issuecomment-2697397568

   🚀  thank you again so much @irenjj -- this is so cool. A great step towards 
better explain plans. 
   
   I am working on enhancing the follow-on ticket you filed in 
https://github.com/apache/datafusion/issues/14914 ❤️ 





Re: [PR] Refactor SortPushdown using the standard top-down visitor and using `EquivalenceProperties` [datafusion]

2025-03-04 Thread via GitHub


alamb commented on PR #14821:
URL: https://github.com/apache/datafusion/pull/14821#issuecomment-2698190011

   @berkaysynnada @wiedld and I had a brief meeting. The outcomes from my 
perspective are:
   1. @berkaysynnada plans to review this PR and test it on the synnada fork to 
ensure it still works
   2. In parallel, @wiedld and I will work on increasing the test coverage of 
EnforceDistribution to ensure it covers our InfluxDB IOx plans





Re: [PR] fix: Executor memory overhead overriding [datafusion-comet]

2025-03-04 Thread via GitHub


LukMRVC commented on code in PR #1462:
URL: https://github.com/apache/datafusion-comet/pull/1462#discussion_r1979775668


##
spark/src/test/scala/org/apache/spark/CometPluginsSuite.scala:
##
@@ -143,3 +143,36 @@ class CometPluginsNonOverrideSuite extends CometTestBase {
 assert(execMemOverhead4 == "2G")
   }
 }
+
+class CometPluginsUnifiedModeOverrideSuite extends CometTestBase {
+  override protected def sparkConf: SparkConf = {
+val conf = new SparkConf()
+conf.set("spark.driver.memory", "1G")
+conf.set("spark.executor.memory", "1G")
+conf.set("spark.executor.memoryOverhead", "1G")
+conf.set("spark.plugins", "org.apache.spark.CometPlugin")
+conf.set("spark.comet.enabled", "true")
+conf.set("spark.memory.offHeap.enabled", "true")
+conf.set("spark.memory.offHeap.size", "2G")
+conf.set("spark.comet.exec.shuffle.enabled", "true")
+conf.set("spark.comet.exec.enabled", "true")
+conf.set("spark.comet.memory.overhead.factor", "0.5")
+conf
+  }
+
+  /** Since using unified memory, executor memory should not be overridden */

Review Comment:
   Yes, you are right. I overlooked it. I changed the comment.






Re: [I] Expose to `AccumulatorArgs` whether all the groups are sorted [datafusion]

2025-03-04 Thread via GitHub


rluvaton commented on issue #14991:
URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2698202576

   Actually it uses GroupAccumulator even if it is fully sorted.
   
   you can see by adding breakpoint to 
https://github.com/apache/datafusion/blob/ac79ef3442e65f6197c7234da9fad964895b9101/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/prim_op.rs#L118
   
   and run the following `slt`:
   ```slt
   
   statement ok
   CREATE TABLE test_table (
   col_i32 INT,
   col_u32 INT UNSIGNED
   ) as VALUES
   ( NULL,NULL),
   ( -2147483648, 0),
   ( -2147483648, 0),
   ( 100, 100),
   ( 2147483647,  4294967295),
   ( NULL,NULL),
   ( -2147483648, 0),
   ( -2147483648, 0),
   ( 100, 100),
   ( 2147483646,  4294967294),
   ( 2147483647,  4294967295 )
   
   
   query II
   select col_i32, sum(col_u32) sum_col_u32 from (select * from test_table 
order by col_i32 limit 10) group by col_i32
   ----
   2147483647 8589934590
   -2147483648 0
   100 200
   2147483646 4294967294
   NULL NULL
   
   ```
   
   you will see that even though `InputOrderMode` is `Sorted` the 
`GroupAccumulator` is still used.
   
   (I think we should be using the group accumulator for sorted or partially 
sorted data, to avoid combining all the scalars from the `Accumulator`s into an 
array.)
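
   As an aside, a minimal sketch of why sorted input matters for aggregation. 
The function below is invented for illustration and is not DataFusion's 
`GroupsAccumulator` API:

```rust
// Illustrative sketch only: this is not DataFusion's GroupsAccumulator API.
// When input rows arrive already sorted on the group key, a sum per group
// can be produced by a single pass that emits a group whenever the key
// changes: no hash table, and no merging of per-group scalar states.

fn sum_sorted(rows: &[(i32, i64)]) -> Vec<(i32, i64)> {
    let mut out = Vec::new();
    if rows.is_empty() {
        return out;
    }
    let (mut key, mut acc) = rows[0];
    for &(k, v) in &rows[1..] {
        if k == key {
            acc += v; // same group: keep accumulating
        } else {
            out.push((key, acc)); // key changed: this group is complete
            key = k;
            acc = v;
        }
    }
    out.push((key, acc)); // flush the final group
    out
}

fn main() {
    // Input sorted on the group key, as InputOrderMode::Sorted guarantees.
    let rows = [(-1, 5), (-1, 5), (0, 1), (2, 10)];
    assert_eq!(sum_sorted(&rows), vec![(-1, 10), (0, 1), (2, 10)]);
}
```

   Emitting groups eagerly like this also bounds memory by the size of one 
group rather than the number of distinct keys.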
   





Re: [I] Epic: Statistics improvements [datafusion]

2025-03-04 Thread via GitHub


alamb commented on issue #8227:
URL: https://github.com/apache/datafusion/issues/8227#issuecomment-2698552710

   @clflushopt  I don't really know as I am not driving the statistics rework 
myself and thus don't have much visibility into what is planned 
   
   The only remaining thing I know of that comes to mind is
   - https://github.com/apache/datafusion/issues/8229
   
   but that is more of an API exercise rather than algorithms. 
   
   Another thing that comes to mind is to look at / extend the existing 
https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/src/intervals/cp_solver.rs
 code / show how it is useful
   
   I was playing around with duckdb this morning. Somehow it uses its 
constraint solver to simplify away redundant filters. For example, it correctly 
deduces `x > 1 AND x > 2 AND x > 3 AND x < 6 AND x < 7 AND x < 8` is the same 
as `x > 3 AND x < 6`:
   
   ```sql
   create table foo(x int);
   D insert into foo values (1);
   D insert into foo values (5);
   
   D explain SELECT * from foo where x > 1 AND x > 2 AND x > 3 AND x < 6 AND x 
< 7 AND x < 8;
   
   ┌─┐
   │┌───┐│
   ││   Physical Plan   ││
   │└───┘│
   └─┘
   ┌───┐
   │ SEQ_SCAN  │
   │   │
   │ Table: foo│
   │   Type: Sequential Scan   │
   │   Projections: x  │
   │Filters: x>3 AND x<6   │
   │   │
   │  ~1 Rows  │
   └───┘
   ```
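
   For illustration, the bound-tightening behind that simplification can be 
sketched in a few lines. The `Cmp` type and `simplify` function here are 
invented for the sketch; this is neither DuckDB's nor DataFusion's code:

```rust
// Illustrative sketch of bound tightening for a conjunction of comparisons
// on a single column: only the largest lower bound and the smallest upper
// bound need to survive.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Cmp {
    Gt(i64), // x > v
    Lt(i64), // x < v
}

/// Collapse a conjunction of `x > v` / `x < v` filters to at most one
/// lower bound and one upper bound.
fn simplify(filters: &[Cmp]) -> Vec<Cmp> {
    let mut lower: Option<i64> = None; // tightest `x > v` seen so far
    let mut upper: Option<i64> = None; // tightest `x < v` seen so far
    for f in filters {
        match *f {
            Cmp::Gt(v) => lower = Some(lower.map_or(v, |l| l.max(v))),
            Cmp::Lt(v) => upper = Some(upper.map_or(v, |u| u.min(v))),
        }
    }
    let mut out = Vec::new();
    if let Some(l) = lower {
        out.push(Cmp::Gt(l));
    }
    if let Some(u) = upper {
        out.push(Cmp::Lt(u));
    }
    out
}

fn main() {
    // x > 1 AND x > 2 AND x > 3 AND x < 6 AND x < 7 AND x < 8
    let filters = [
        Cmp::Gt(1), Cmp::Gt(2), Cmp::Gt(3),
        Cmp::Lt(6), Cmp::Lt(7), Cmp::Lt(8),
    ];
    // collapses to x > 3 AND x < 6
    assert_eq!(simplify(&filters), vec![Cmp::Gt(3), Cmp::Lt(6)]);
}
```

   A real interval/constraint solver generalizes this to inclusive bounds, 
multiple columns, and contradiction detection (empty ranges).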





Re: [I] Substrait plan read relation baseSchema does not include the struct with type information [datafusion]

2025-03-04 Thread via GitHub


amoeba commented on issue #12244:
URL: https://github.com/apache/datafusion/issues/12244#issuecomment-2699036156

   The behavior was modified in #12245 and the original issue looks addressed 
but the plans I'm getting don't validate. DataFusion currently hardcodes struct 
nullability to `NULLABILITY_UNSPECIFIED`, see 
https://github.com/apache/datafusion/blob/dd0fd889ea603f929accb99002e2f99280823f5c/datafusion/substrait/src/logical_plan/producer.rs#L1042
   
   The `baseSchema` I get with the latest datafusion looks like this,
   
   ```json
   "baseSchema": {
 "names": [
   "species",
   "island",
   "bill_length_mm",
   "bill_depth_mm",
   "body_mass_g",
   "sex",
   "year"
 ],
 "struct": {
   "types": [
 {
   "string": {
 "nullability": "NULLABILITY_NULLABLE"
   }
 },
 {
   "string": {
 "nullability": "NULLABILITY_NULLABLE"
   }
 },
 {
   "fp64": {
 "nullability": "NULLABILITY_NULLABLE"
   }
 },
 {
   "fp64": {
 "nullability": "NULLABILITY_NULLABLE"
   }
 },
 {
   "i32": {
 "nullability": "NULLABILITY_NULLABLE"
   }
 },
 {
   "string": {
 "nullability": "NULLABILITY_NULLABLE"
   }
 },
 {
   "i32": {
 "nullability": "NULLABILITY_NULLABLE"
   }
 }
   ]
 }
   },
   ```
   
   The above implicitly sets the nullability on the baseSchema to 
`NULLABILITY_UNSPECIFIED` which, when validated as part of a larger plan, 
errors out with:
   
   ```
   Error (code 0002):
 at 
plan.relations[0].rel_type.input.rel_type.input.rel_type.base_schema.struct.nullability:
 illegal value: nullability information is required in this context (code 
0002)
   ```
   
   I can get the plan to validate if I set nullability to 
`NULLABILITY_REQUIRED`,
   
   ```
   "baseSchema": {
 "names": [...],
 "struct": {
   "types": [...],
   "nullability": "NULLABILITY_REQUIRED"
 }
   },
   ```
   
   Is the validator right and should DataFusion change its behavior here?





Re: [PR] Minor: Add identation to EnforceDistribution test plans. [datafusion]

2025-03-04 Thread via GitHub


alamb commented on code in PR #15007:
URL: https://github.com/apache/datafusion/pull/15007#discussion_r1980306194


##
datafusion/core/tests/physical_optimizer/enforce_distribution.rs:
##
@@ -572,33 +573,33 @@ fn multi_hash_joins() -> Result<()> {
 // Should include 3 RepartitionExecs
 JoinType::Inner | JoinType::Left | JoinType::LeftSemi | 
JoinType::LeftAnti | JoinType::LeftMark => vec![
 top_join_plan.as_str(),
-join_plan.as_str(),
-"RepartitionExec: partitioning=Hash([a@0], 10), 
input_partitions=10",
-"RepartitionExec: partitioning=RoundRobinBatch(10), 
input_partitions=1",
-"DataSourceExec: file_groups={1 group: [[x]]}, 
projection=[a, b, c, d, e], file_type=parquet",
-"RepartitionExec: partitioning=Hash([b1@1], 10), 
input_partitions=10",
-"RepartitionExec: partitioning=RoundRobinBatch(10), 
input_partitions=1",
-"ProjectionExec: expr=[a@0 as a1, b@1 as b1, c@2 as 
c1, d@3 as d1, e@4 as e1]",
-"DataSourceExec: file_groups={1 group: [[x]]}, 
projection=[a, b, c, d, e], file_type=parquet",
-"RepartitionExec: partitioning=Hash([c@2], 10), 
input_partitions=10",
-"RepartitionExec: partitioning=RoundRobinBatch(10), 
input_partitions=1",
-"DataSourceExec: file_groups={1 group: [[x]]}, 
projection=[a, b, c, d, e], file_type=parquet",
+&join_plan_indent2,

Review Comment:
   ❤️ 






Re: [I] Some aggregates silently ignore `IGNORE NULLS` and `ORDER BY` on arguments [datafusion]

2025-03-04 Thread via GitHub


vbarua commented on issue #9924:
URL: https://github.com/apache/datafusion/issues/9924#issuecomment-2698960345

   I opened up an issue that's potentially related to this when it comes to 
supporting IGNORE NULLS
   https://github.com/apache/datafusion/issues/15006





Re: [I] Substrait plan read relation baseSchema does not include the struct with type information [datafusion]

2025-03-04 Thread via GitHub


Blizzara commented on issue #12244:
URL: https://github.com/apache/datafusion/issues/12244#issuecomment-2699060966

   I didn't see anywhere on substrait.io where it's explicitly stated, but the 
example there (https://substrait.io/tutorial/sql_to_substrait/#types-and-schemas) 
does use NULLABILITY_REQUIRED. I don't think it makes any difference in 
reality, given it's the fields inside that struct type that matter, not the 
root struct itself.
   
   If the validator is happier with NULLABILITY_REQUIRED, that change is fine 
by me.





Re: [PR] feat: Upgrade to DataFusion 46.0.0-rc2 [datafusion-comet]

2025-03-04 Thread via GitHub


andygrove commented on PR #1423:
URL: 
https://github.com/apache/datafusion-comet/pull/1423#issuecomment-2699132289

   @kazuyukitanimura I'd like to go ahead and merge this now to give us a few 
days to test with D 46 before the final release. WDYT?





[PR] Do not double alias Exprs [datafusion]

2025-03-04 Thread via GitHub


alamb opened a new pull request, #15008:
URL: https://github.com/apache/datafusion/pull/15008

   ## Which issue does this PR close?
   
   
   
   - Closes https://github.com/apache/datafusion/issues/14895
   
   ## Rationale for this change
   
   Nested `Expr::Alias` is confusing for display and for dealing with code
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [I] Substrait plan read relation baseSchema does not include the struct with type information [datafusion]

2025-03-04 Thread via GitHub


amoeba commented on issue #12244:
URL: https://github.com/apache/datafusion/issues/12244#issuecomment-2699149322

   Thanks @Blizzara. I asked for some clarification on the Substrait Slack so 
we can be sure it's the right change; once I'm sure, I'll file a PR and we can 
close this issue out. Hopefully we can also update the Substrait docs.





Re: [PR] Implement `tree` explain for FilterExec [datafusion]

2025-03-04 Thread via GitHub


alamb commented on code in PR #15001:
URL: https://github.com/apache/datafusion/pull/15001#discussion_r1979586990


##
datafusion/sqllogictest/test_files/explain_tree.slt:
##
@@ -185,6 +188,32 @@ physical_plan
 26)│   DataSourceExec  ││   DataSourceExec  │
 27)└───┘└───┘
 
+# Long Filter (demonstrate what happens with wrapping)
+query TT
+explain SELECT int_col FROM table1
+WHERE string_col != 'foo' AND string_col != 'bar' AND string_col != 'a really 
long string constant'
+;
+
+logical_plan
+01)Projection: table1.int_col
+02)--Filter: table1.string_col != Utf8("foo") AND table1.string_col != 
Utf8("bar") AND table1.string_col != Utf8("a really long string constant")
+03)TableScan: table1 projection=[int_col, string_col], 
partial_filters=[table1.string_col != Utf8("foo"), table1.string_col != 
Utf8("bar"), table1.string_col != Utf8("a really long string constant")]
+physical_plan
+01)┌───┐
+02)│CoalesceBatchesExec│
+03)└─┬─┘
+04)┌─┴─┐
+05)│ FilterExec│
+06)│   │
+07)│ predicate:│
+08)│string_col@1 != foo AND ...│

Review Comment:
   Awesome -- thank you. Maybe we can file a ticket to track the idea?






Re: [I] Weekly Plan (Andrew Lamb) March 3, 2025 [datafusion]

2025-03-04 Thread via GitHub


alamb commented on issue #14978:
URL: https://github.com/apache/datafusion/issues/14978#issuecomment-2697852067

   
   DataFusion: Bugs/UX/Performance
   - [ ] https://github.com/apache/datafusion/pull/14331
   - [ ] https://github.com/apache/datafusion/pull/14935
   - [ ] https://github.com/apache/datafusion/pull/14952
   - [ ] https://github.com/apache/datafusion/pull/14995
   - [ ] https://github.com/apache/datafusion/pull/14994
   
   
   DataFusion: New Features/ Functions
   - [ ] https://github.com/apache/datafusion/pull/14951
   - [ ] https://github.com/apache/datafusion/pull/14954
   - [ ] https://github.com/apache/datafusion/pull/14413
   - [ ] https://github.com/apache/datafusion/pull/14595
   - [ ] https://github.com/apache/datafusion/pull/14196
   - [ ] https://github.com/apache/datafusion/pull/14687
   - [ ] https://github.com/apache/datafusion/pull/14412
   - [ ] https://github.com/apache/datafusion/pull/14837
   
   arrow-rs
   - [ ] https://github.com/apache/arrow-rs/pull/
   - [ ] https://github.com/apache/arrow-rs/pull/6965
   - [ ] https://github.com/apache/arrow-rs/pull/6637
   - [ ] https://github.com/apache/arrow-rs/pull/7177
   - [ ] https://github.com/apache/arrow-rs/pull/7179
   - [ ] https://github.com/apache/arrow-rs/pull/7191
   
   
   Stuck PRs that need help pushing to completion
   - [ ] https://github.com/apache/datafusion/pull/14194 
   - [ ] https://github.com/apache/datafusion/pull/14222
   - [ ] https://github.com/apache/datafusion/pull/14362 / 
https://github.com/apache/datafusion/pull/14057 





Re: [I] Expr simplifier doesn't simplify exprs that are same if you swap lhs with rhs regardless of cycles [datafusion]

2025-03-04 Thread via GitHub


alamb commented on issue #14943:
URL: https://github.com/apache/datafusion/issues/14943#issuecomment-2697865461

   @jayzhan211 has a PR to fix this here:
   - https://github.com/apache/datafusion/pull/14994





Re: [I] Expr simplifier doesn't simplify exprs that are same if you swap lhs with rhs regardless of cycles [datafusion]

2025-03-04 Thread via GitHub


alamb commented on issue #14943:
URL: https://github.com/apache/datafusion/issues/14943#issuecomment-2697880588

   > We simplify A = B and B = A in common_sub_expression_eliminate but not 
simplify_expressions
   
   I wonder if this is another example of code that would be beneficial to pull 
into simplify_expressions 🤔 





Re: [PR] Simplify Between expression to Eq [datafusion]

2025-03-04 Thread via GitHub


alamb commented on code in PR #14994:
URL: https://github.com/apache/datafusion/pull/14994#discussion_r1979605743


##
datafusion/sqllogictest/test_files/simplify_expr.slt:
##
@@ -0,0 +1,34 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+

Review Comment:
   I like the idea of using slt for simplification testing






Re: [PR] add manual trigger for extended tests in pull requests [datafusion]

2025-03-04 Thread via GitHub


alamb commented on code in PR #14331:
URL: https://github.com/apache/datafusion/pull/14331#discussion_r1979614246


##
.github/workflows/extended.yml:
##
@@ -33,16 +33,46 @@ on:
   push:
 branches:
   - main
+  issue_comment:
+types: [created]
+
+permissions:
+  pull-requests: write
 
 jobs:
+  # Check issue comment and notify that extended tests are running
+  check_issue_comment:
+name: Check issue comment
+runs-on: ubuntu-latest
+if: github.event.issue.pull_request && github.event.comment.body == 'run 
extended tests'
+steps:
+  - uses: actions/github-script@v7
+with:
+  github-token: ${{secrets.GITHUB_TOKEN}}
+  script: |
+github.rest.issues.createComment({
+  issue_number: context.issue.number,
+  owner: context.repo.owner,
+  repo: context.repo.repo,
+  body: "Running extended tests..."
+})
+
   # Check crate compiles and base cargo check passes
   linux-build-lib:
 name: linux build test
 runs-on: ubuntu-latest
 container:
   image: amd64/rust
+if: |
+  github.event_name == 'push' ||
+  (github.event_name == 'issue_comment' && github.event.issue.pull_request 
&& github.event.comment.body == 'run extended tests')
 steps:
   - uses: actions/checkout@v4
+with:
+  # Check out the pull request branch if triggered by a comment
+  ref: ${{ github.event_name == 'issue_comment' && 
github.event.issue.pull_request.head.ref || github.ref }}

Review Comment:
   🤔  
   - https://github.com/danila-b/datafusion/pull/3 doesn't seem to have run the 
extended tests
   
   https://github.com/danila-b/datafusion/pull/3#issuecomment-2697904032






Re: [I] Expr simplifier doesn't simplify exprs that are same if you swap lhs with rhs regardless of cycles [datafusion]

2025-03-04 Thread via GitHub


ion-elgreco commented on issue #14943:
URL: https://github.com/apache/datafusion/issues/14943#issuecomment-2698150878

   > [@jayzhan211](https://github.com/jayzhan211) has a PR to fix this here:
   > 
   > * [Simplify Between expression to Eq 
#14994](https://github.com/apache/datafusion/pull/14994)
   
   If the fix gets included in DF46, then we can give it a spin to see if it 
fully resolves the issue





Re: [I] [EPIC] Complete `SQL EXPLAIN` Tree Rendering [datafusion]

2025-03-04 Thread via GitHub


milenkovicm commented on issue #14914:
URL: https://github.com/apache/datafusion/issues/14914#issuecomment-2698392966

   This looks great! Would it be possible to extend this with SVG output? It 
would be great for the Ballista UI.





[I] [Question] Skip Partial stage aggregation stage for sorted input [datafusion]

2025-03-04 Thread via GitHub


rluvaton opened a new issue, #15002:
URL: https://github.com/apache/datafusion/issues/15002

   If the input is sorted by the group keys, can't we just skip the partial 
aggregation stage?
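
For intuition, a sketch (not DataFusion's actual operator) of why this is plausible: when rows arrive sorted by the group key, a single streaming pass can finalize each group the moment the key changes, so a separate partial pre-aggregation stage adds little.

```rust
// Streaming SUM aggregation over (key, value) rows already sorted by key.
// Each group is emitted as soon as the key changes -- no hash table and
// no separate partial/final stages are required.
fn sorted_group_sum(rows: &[(i32, i64)]) -> Vec<(i32, i64)> {
    let mut out = Vec::new();
    let mut cur: Option<(i32, i64)> = None;
    for &(k, v) in rows {
        match cur {
            // Same key as the running group: accumulate.
            Some((ck, sum)) if ck == k => cur = Some((ck, sum + v)),
            // Key changed: the previous group is complete, emit it.
            Some(done) => {
                out.push(done);
                cur = Some((k, v));
            }
            None => cur = Some((k, v)),
        }
    }
    if let Some(done) = cur {
        out.push(done);
    }
    out
}

fn main() {
    let rows = [(1, 10), (1, 5), (2, 7), (3, 1), (3, 2)];
    assert_eq!(sorted_group_sum(&rows), vec![(1, 15), (2, 7), (3, 3)]);
}
```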





Re: [PR] Fix: to_char Function Now Correctly Handles DATE Values in DataFusion [datafusion]

2025-03-04 Thread via GitHub


alamb commented on code in PR #14970:
URL: https://github.com/apache/datafusion/pull/14970#discussion_r1979911036


##
datafusion/functions/src/datetime/to_char.rs:
##
@@ -234,15 +226,21 @@ fn _to_char_scalar(
 };
 
 let formatter = ArrayFormatter::try_new(array.as_ref(), &format_options)?;
-let formatted: Result, ArrowError> = (0..array.len())
-.map(|i| formatter.value(i).try_to_string())
+let formatted: Result>, ArrowError> = (0..array.len())

Review Comment:
   👍 






Re: [I] Improve EnforceDistribution testings. [datafusion]

2025-03-04 Thread via GitHub


wiedld commented on issue #15003:
URL: https://github.com/apache/datafusion/issues/15003#issuecomment-2698352844

   take
   





Re: [PR] Scrollable python notebook table rendering [datafusion-python]

2025-03-04 Thread via GitHub


kevinjqliu commented on code in PR #1036:
URL: 
https://github.com/apache/datafusion-python/pull/1036#discussion_r1979772203


##
src/dataframe.rs:
##
@@ -100,46 +106,153 @@ impl PyDataFrame {
 }
 
 fn _repr_html_(&self, py: Python) -> PyDataFusionResult {
-let mut html_str = "\n".to_string();
-
-let df = self.df.as_ref().clone().limit(0, Some(10))?;
-let batches = wait_for_future(py, df.collect())?;
-
+let (batches, mut has_more) =
+wait_for_future(py, 
get_first_few_record_batches(self.df.as_ref().clone()))?;
+let Some(batches) = batches else {
+return Ok("No data to display".to_string());
+};
 if batches.is_empty() {
-html_str.push_str("\n");
-return Ok(html_str);
+// This should not be reached, but do it for safety since we index 
into the vector below
+return Ok("No data to display".to_string());
 }
 
+let table_uuid = uuid::Uuid::new_v4().to_string();
+
+let mut html_str = "
+
+.expandable-container {
+display: inline-block;
+max-width: 200px;
+}
+.expandable {
+white-space: nowrap;
+overflow: hidden;
+text-overflow: ellipsis;
+display: block;
+}
+.full-text {
+display: none;
+white-space: normal;
+}
+.expand-btn {
+cursor: pointer;
+color: blue;
+text-decoration: underline;
+border: none;
+background: none;
+font-size: inherit;
+display: block;
+margin-top: 5px;
+}
+
+
+
+
+\n".to_string();
+
 let schema = batches[0].schema();
 
 let mut header = Vec::new();
 for field in schema.fields() {
-header.push(format!("{}", field.name()));
+header.push(format!("{}", field.name()));
 }
 let header_str = header.join("");
-html_str.push_str(&format!("{}\n", header_str));
+html_str.push_str(&format!("{}\n", 
header_str));
+
+let batch_formatters = batches
+.iter()
+.map(|batch| {
+batch
+.columns()
+.iter()
+.map(|c| ArrayFormatter::try_new(c.as_ref(), 
&FormatOptions::default()))
+.map(|c| {
+c.map_err(|e| PyValueError::new_err(format!("Error: 
{:?}", e.to_string(
+})
+.collect::, _>>()
+})
+.collect::, _>>()?;
+
+let total_memory: usize = batches
+.iter()
+.map(|batch| batch.get_array_memory_size())
+.sum();
+let rows_per_batch = batches.iter().map(|batch| batch.num_rows());
+let total_rows = rows_per_batch.clone().sum();
+
+// let (total_memory, total_rows) = batches.iter().fold((0, 0), |acc, 
batch| {
+// (acc.0 + batch.get_array_memory_size(), acc.1 + 
batch.num_rows())
+// });

Review Comment:
   nit remove commented out code
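
For what it's worth, the commented-out fold and the live iterator code compute the same pair of totals; a standalone sketch of the fold variant (plain tuples stand in for `RecordBatch`, names hypothetical):

```rust
// Each tuple stands in for a RecordBatch: (array_memory_size, num_rows).
fn totals(batches: &[(usize, usize)]) -> (usize, usize) {
    // Single-pass fold, equivalent to summing each field in two
    // separate passes over the batches.
    batches
        .iter()
        .fold((0, 0), |acc, b| (acc.0 + b.0, acc.1 + b.1))
}

fn main() {
    let batches = [(100, 3), (50, 2)];
    assert_eq!(totals(&batches), (150, 5));
}
```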






Re: [PR] _repr_ and _html_repr_ show '... and additional rows' message [datafusion-python]

2025-03-04 Thread via GitHub


kevinjqliu commented on PR #1041:
URL: 
https://github.com/apache/datafusion-python/pull/1041#issuecomment-2698204265

   btw #1036 also changes `_repr_html_` 





Re: [PR] add manual trigger for extended tests in pull requests [datafusion]

2025-03-04 Thread via GitHub


danila-b commented on code in PR #14331:
URL: https://github.com/apache/datafusion/pull/14331#discussion_r1979865173


##
.github/workflows/extended.yml:
##
@@ -33,16 +33,46 @@ on:
   push:
 branches:
   - main
+  issue_comment:
+types: [created]
+
+permissions:
+  pull-requests: write
 
 jobs:
+  # Check issue comment and notify that extended tests are running
+  check_issue_comment:
+name: Check issue comment
+runs-on: ubuntu-latest
+if: github.event.issue.pull_request && github.event.comment.body == 'run 
extended tests'
+steps:
+  - uses: actions/github-script@v7
+with:
+  github-token: ${{secrets.GITHUB_TOKEN}}
+  script: |
+github.rest.issues.createComment({
+  issue_number: context.issue.number,
+  owner: context.repo.owner,
+  repo: context.repo.repo,
+  body: "Running extended tests..."
+})
+
   # Check crate compiles and base cargo check passes
   linux-build-lib:
 name: linux build test
 runs-on: ubuntu-latest
 container:
   image: amd64/rust
+if: |
+  github.event_name == 'push' ||
+  (github.event_name == 'issue_comment' && github.event.issue.pull_request 
&& github.event.comment.body == 'run extended tests')
 steps:
   - uses: actions/checkout@v4
+with:
+  # Check out the pull request branch if triggered by a comment
+  ref: ${{ github.event_name == 'issue_comment' && 
github.event.issue.pull_request.head.ref || github.ref }}

Review Comment:
   Apparently it did run them; it's just not visible in the GitHub UI. I have 
just double-checked it with a slightly different implementation on the same branch. 
   
   Unfortunately, it seems that the `issue_comment` event trigger is associated 
with the default branch, and hence it does not add anything to the checks or UI of 
the PR itself. I can look over the weekend for a better way to launch a check 
associated with the PR, but off the top of my head, there are a couple of options:
   
   1. Use `workflow_dispatch` event and trigger extended tests for the branch 
from the GitHub actions UI tab instead of doing so by comment on the PR
   2. Use [3rd party 
action](https://github.com/myrotvorets/set-commit-status-action) to set check 
statuses when the run is triggered by a comment
   3. Maybe [slash command 3rd party 
action](https://github.com/marketplace/actions/slash-command-dispatch) can also 
work for triggering commands a bit better since it uses dispatch event under 
the hood as well






Re: [PR] [WIP] chore: Add detailed error for sum::coerce_type [datafusion]

2025-03-04 Thread via GitHub


dentiny commented on PR #14710:
URL: https://github.com/apache/datafusion/pull/14710#issuecomment-2698883962

   I'm really sorry for the delay; I got stuck with something else. I will get 
back to the PR by EoW.





Re: [I] Release sqlparser-rs version `0.55.0` [datafusion-sqlparser-rs]

2025-03-04 Thread via GitHub


comphead commented on issue #1671:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/issues/1671#issuecomment-2698903812

   +1
   Verified on M3 Mac
   Thanks Andrew
   27 February 2025, 09:37:51, From "Andrew Lamb" ***@***.***>:
   I made a release candidate. Here is the draft email. However, the 
dist.apache.org site appears to be down so I will try again later
   Hi,
   
   I would like to propose a release of Apache DataFusion sqlparser-rs version 
0.55.0.
   
   This release candidate is based on commit: 
ed416548dcfe4a73a3240bbf625fb9010a4925c8 [1]
   The proposed release tarball and signatures are hosted at [2].
   The changelog is located at [3].
   
   Please download, verify checksums and signatures, run the unit tests, and 
vote
   on the release. The vote will be open for at least 72 hours.
   
   Only votes from PMC members are binding, but all members of the community are
   encouraged to test the release and vote with "(non-binding)".
   
   The standard verification procedure is documented at 
https://github.com/apache/datafusion-sqlparser-rs/blob/main/dev/release/README.md#verifying-release-candidates.
   
   [ ] +1 Release this as Apache DataFusion sqlparser-rs 0.55.0
   [ ] +0
   [ ] -1 Do not release this as Apache DataFusion sqlparser-rs 0.55.0 
because...
   
   Here is my vote:
   
   +1
   
   [1]: 
https://github.com/apache/datafusion-sqlparser-rs/tree/ed416548dcfe4a73a3240bbf625fb9010a4925c8
   [2]: 
https://dist.apache.org/repos/dist/dev/datafusion/apache-datafusion-sqlparser-rs-0.55.0-rc1
   [3]: 
https://github.com/apache/datafusion-sqlparser-rs/blob/ed416548dcfe4a73a3240bbf625fb9010a4925c8/CHANGELOG.md
   -

Re: [I] Expose to `AccumulatorArgs` whether all the groups are sorted [datafusion]

2025-03-04 Thread via GitHub


alamb commented on issue #14991:
URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2698391616

   ```sql
   > CREATE TABLE test_table (
   col_i32 INT,
   col_u32 INT UNSIGNED
   ) as VALUES
   ( NULL,NULL),
   ( -2147483648, 0),
   ( -2147483648, 0),
   ( 100, 100),
   ( 2147483647,  4294967295),
   ( NULL,NULL),
   ( -2147483648, 0),
   ( -2147483648, 0),
   ( 100, 100),
   ( 2147483646,  4294967294),
   ( 2147483647,  4294967295 )
   
   ;
   0 row(s) fetched.
   Elapsed 0.010 seconds.
   
   > explain select col_i32, sum(col_u32) sum_col_u32 from (select * from 
test_table order by col_i32 limit 10) group by col_i32;
   
+---+---+
   | plan_type | plan   
   |
   
+---+---+
   | logical_plan  | Projection: test_table.col_i32, sum(test_table.col_u32) AS 
sum_col_u32|
   |   |   Aggregate: groupBy=[[test_table.col_i32]], 
aggr=[[sum(CAST(test_table.col_u32 AS UInt64))]] |
   |   | Sort: test_table.col_i32 ASC NULLS LAST, fetch=10  
   |
   |   |   TableScan: test_table projection=[col_i32, col_u32]  
   |
   | physical_plan | ProjectionExec: expr=[col_i32@0 as col_i32, 
sum(test_table.col_u32)@1 as sum_col_u32] |
   |   |   AggregateExec: mode=FinalPartitioned, gby=[col_i32@0 as 
col_i32], aggr=[sum(test_table.col_u32)], ordering_mode=Sorted  |
   |   | SortExec: expr=[col_i32@0 ASC NULLS LAST], 
preserve_partitioning=[true]   |
   |   |   CoalesceBatchesExec: target_batch_size=8192  
   |
   |   | RepartitionExec: partitioning=Hash([col_i32@0], 
16), input_partitions=16  |
   |   |   RepartitionExec: 
partitioning=RoundRobinBatch(16), input_partitions=1
   |
   |   | AggregateExec: mode=Partial, gby=[col_i32@0 as 
col_i32], aggr=[sum(test_table.col_u32)], ordering_mode=Sorted |
   |   |   SortExec: TopK(fetch=10), expr=[col_i32@0 
ASC NULLS LAST], preserve_partitioning=[false]|
   |   | DataSourceExec: partitions=1, 
partition_sizes=[1] |
   |   |
   |
   
+---+---+
   2 row(s) fetched.
   Elapsed 0.006 seconds.
   ```
   
   I guess I forgot that we did eventually unify the implementation 🎉 
   
   
   In that case then I think it would make sense to pass existing ordering 
information along to the aggregates 👍  





Re: [I] Table function supports non-literal args [datafusion]

2025-03-04 Thread via GitHub


Lordworms commented on issue #14958:
URL: https://github.com/apache/datafusion/issues/14958#issuecomment-2698453390

   take





Re: [I] Improved error for expand wildcard rule [datafusion]

2025-03-04 Thread via GitHub


Jiashu-Hu commented on issue #15004:
URL: https://github.com/apache/datafusion/issues/15004#issuecomment-2698791268

   take





Re: [PR] chore: Refactor CometScanRule to avoid duplication and improve fallback messages [datafusion-comet]

2025-03-04 Thread via GitHub


andygrove commented on PR #1474:
URL: 
https://github.com/apache/datafusion-comet/pull/1474#issuecomment-2698801229

   @mbutrovich, could you review this, since it affects the new experimental 
native scan support?





Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]

2025-03-04 Thread via GitHub


vbarua commented on code in PR #13511:
URL: https://github.com/apache/datafusion/pull/13511#discussion_r1979981659


##
datafusion/expr/src/expr.rs:
##
@@ -295,6 +295,8 @@ pub enum Expr {
 /// See also [`ExprFunctionExt`] to set these fields.
 ///
 /// [`ExprFunctionExt`]: crate::expr_fn::ExprFunctionExt
+///
+/// cf. `WITHIN GROUP` is converted to `ORDER BY` internally in 
`datafusion/sql/src/expr/function.rs`

Review Comment:
   minor/opinionated: I'm not sure if it's worth mentioning this at all here. 
`WITHIN GROUP` is effectively an `ORDER BY` specified differently. This only 
matters at the SQL layer, and you handle and explain it there already.



##
datafusion/sql/src/expr/function.rs:
##
@@ -349,15 +365,49 @@ impl SqlToRel<'_, S> {
 } else {
 // User defined aggregate functions (UDAF) have precedence in case 
it has the same name as a scalar built-in function
 if let Some(fm) = self.context_provider.get_aggregate_meta(&name) {
-let order_by = self.order_by_to_sort_expr(
-order_by,
-schema,
-planner_context,
-true,
-None,
-)?;
-let order_by = (!order_by.is_empty()).then_some(order_by);
-let args = self.function_args_to_expr(args, schema, 
planner_context)?;
+if fm.is_ordered_set_aggregate() && within_group.is_empty() {
+return plan_err!("WITHIN GROUP clause is required when 
calling ordered set aggregate function({})", fm.name());
+}
+
+if null_treatment.is_some() && 
!fm.supports_null_handling_clause() {
+return plan_err!(
+"[IGNORE | RESPECT] NULLS are not permitted for {}",
+fm.name()
+);
+}

Review Comment:
   Per my point about `[IGNORE | RESPECT] NULLS` being a property of _window 
functions_, I don't think we need this check here.



##
datafusion/sql/src/expr/function.rs:
##
@@ -177,8 +180,14 @@ impl FunctionArgs {
 }
 }
 
-if !within_group.is_empty() {
-return not_impl_err!("WITHIN GROUP is not supported yet: 
{within_group:?}");
+if !within_group.is_empty() && order_by.is_some() {
+return plan_err!("ORDER BY clause is only permitted in WITHIN 
GROUP clause when a WITHIN GROUP is used");
+}
+
+if within_group.len() > 1 {
+return not_impl_err!(
+"Multiple column ordering in WITHIN GROUP clause is not 
supported"

Review Comment:
   Minor wording suggestion
   ```
   Only a single ordering expression is permitted in a WITHIN GROUP clause
   ```
   which explicitly points users to what they should do, instead of telling 
them what they can't.
   



##
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##
@@ -51,29 +52,39 @@ create_func!(ApproxPercentileCont, 
approx_percentile_cont_udaf);
 
 /// Computes the approximate percentile continuous of a set of numbers
 pub fn approx_percentile_cont(
-expression: Expr,
+within_group: Sort,

Review Comment:
   minor/opinionated: I think `order_by` would be a clearer name for this, as 
the `WITHIN GROUP` is really just a wrapper around the `ORDER BY` clause.



##
datafusion/sql/src/expr/function.rs:
##
@@ -177,8 +180,14 @@ impl FunctionArgs {
 }
 }
 
-if !within_group.is_empty() {
-return not_impl_err!("WITHIN GROUP is not supported yet: 
{within_group:?}");
+if !within_group.is_empty() && order_by.is_some() {
+return plan_err!("ORDER BY clause is only permitted in WITHIN 
GROUP clause when a WITHIN GROUP is used");

Review Comment:
   minor: I just noticed that in the block above there is a check for duplicate 
order bys. I think it would be good to fold this into that check
   ```rust
   FunctionArgumentClause::OrderBy(oby) => {
   if order_by.is_some() { // can check for within group 
here
   return not_impl_err!("Calling {name}: Duplicated 
ORDER BY clause in function arguments");
   }
   order_by = Some(oby);
   }
   ```
   to consolidate the handling into one place. 
   



##
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##
@@ -131,6 +147,19 @@ impl ApproxPercentileCont {
 args: AccumulatorArgs,
 ) -> Result {
 let percentile = validate_input_percentile_expr(&args.exprs[1])?;
+
+let is_descending = args
+.ordering_req
+.first()
+.map(|sort_expr| sort_expr.options.descending)
+.unwrap_or(false);
+
+let percentile = if is_descending {
+1.0 - percentile
+} else {
+percentile
+};
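
The flipped-percentile trick above can be sanity-checked with a nearest-index sketch (toy `nth` helper, not the PR's actual t-digest code): the value sitting fraction `p` of the way through a descending ordering is the same value as the `1.0 - p` percentile of the ascending ordering.

```rust
// Nearest-index "percentile": the value sitting fraction `p` of the way
// through an already sorted slice (no interpolation).
fn nth(sorted: &[f64], p: f64) -> f64 {
    let idx = (p * (sorted.len() - 1) as f64).round() as usize;
    sorted[idx]
}

fn main() {
    let asc = [1.0, 2.0, 3.0, 4.0, 5.0];
    let desc = [5.0, 4.0, 3.0, 2.0, 1.0];
    // 25% of the way through the DESC ordering equals the ASC 75th
    // percentile, which is why an accumulator can keep an ascending
    // implementation and just flip `p` to `1.0 - p`.
    assert_eq!(nth(&desc, 0.25), nth(&asc, 1.0 - 0.25));
    assert_eq!(nth(&desc, 0.25), 4.0);
}
```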

Re: [PR] Refactor test suite in EnforceDistribution, to use standard test config. [datafusion]

2025-03-04 Thread via GitHub


wiedld commented on code in PR #15010:
URL: https://github.com/apache/datafusion/pull/15010#discussion_r1980473165


##
datafusion/core/tests/physical_optimizer/enforce_distribution.rs:
##
@@ -371,46 +371,91 @@ macro_rules! plans_matches_expected {
 }
 }
 
+fn test_suite_default_config_options() -> ConfigOptions {
+let mut config = ConfigOptions::new();
+
+// By default, will not repartition / resort data if it is already sorted.
+config.optimizer.prefer_existing_sort = false;
+
+// By default, will attempt to convert Union to Interleave.
+config.optimizer.prefer_existing_union = false;
+
+// By default, will not repartition file scans.
+config.optimizer.repartition_file_scans = false;
+config.optimizer.repartition_file_min_size = 1024;
+
+// By default, set query execution concurrency to 10.
+config.execution.target_partitions = 10;
+
+// Use a small batch size, to trigger RoundRobin in tests
+config.execution.batch_size = 1;
+
+config
+}
+
+/// How the optimizers are run.
+#[derive(PartialEq, Clone)]
+enum DoFirst {
+/// Runs: (EnforceDistribution, EnforceDistribution, EnforceSorting)
+Distribution,
+/// Runs: (EnforceSorting, EnforceDistribution, EnforceDistribution)
+Sorting,
+}
+
+#[derive(Clone)]
+struct TestConfig {
+config: ConfigOptions,
+optimizers_to_run: DoFirst,
+}
+
+impl TestConfig {
+fn new(optimizers_to_run: DoFirst) -> Self {
+Self {
+config: test_suite_default_config_options(),
+optimizers_to_run,
+}
+}
+
+/// If preferred, will not repartition / resort data if it is already 
sorted.
+fn with_prefer_existing_sort(mut self) -> Self {
+self.config.optimizer.prefer_existing_sort = true;
+self
+}
+
+/// If preferred, will not attempt to convert Union to Interleave.
+fn with_prefer_existing_union(mut self) -> Self {
+self.config.optimizer.prefer_existing_union = true;
+self
+}
+
+/// If preferred, will repartition file scans.
+/// Accepts a minimum file size to repartition.
+fn with_prefer_repartition_file_scans(mut self, file_min_size: usize) -> 
Self {
+self.config.optimizer.repartition_file_scans = true;
+self.config.optimizer.repartition_file_min_size = file_min_size;
+self
+}
+
+/// Set the preferred target partitions for query execution concurrency.
+fn with_query_execution_partitions(mut self, target_partitions: usize) -> 
Self {
+self.config.execution.target_partitions = target_partitions;
+self
+}
+}
+
 /// Runs the repartition optimizer and asserts the plan against the expected
 /// Arguments
 /// * `EXPECTED_LINES` - Expected output plan
 /// * `PLAN` - Input plan
-/// * `FIRST_ENFORCE_DIST` -
-/// true: (EnforceDistribution, EnforceDistribution,  EnforceSorting)
-/// false: else runs (EnforceSorting, EnforceDistribution, 
EnforceDistribution)
-/// * `PREFER_EXISTING_SORT` (optional) - if true, will not repartition / 
resort data if it is already sorted
-/// * `TARGET_PARTITIONS` (optional) - number of partitions to repartition to
-/// * `REPARTITION_FILE_SCANS` (optional) - if true, will repartition file 
scans
-/// * `REPARTITION_FILE_MIN_SIZE` (optional) - minimum file size to repartition
-/// * `PREFER_EXISTING_UNION` (optional) - if true, will not attempt to 
convert Union to Interleave

Review Comment:
   These are now replaced by the required `DoFirst` (selecting one of two 
possible optimizer run patterns), followed by the optional settings.






Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]

2025-03-04 Thread via GitHub


vbarua commented on PR #13511:
URL: https://github.com/apache/datafusion/pull/13511#issuecomment-2699344130

   Pinging @Dandandan for committer review as they filed the ticket this fix is 
for.





[I] Incorrect URLs in Cargo.toml files [datafusion-ballista]

2025-03-04 Thread via GitHub


andygrove opened a new issue, #1193:
URL: https://github.com/apache/datafusion-ballista/issues/1193

   **Describe the bug**
   We still use Arrow URLs in some places:
   
   ```
   homepage = "https://github.com/apache/arrow-ballista";
   repository = "https://github.com/apache/arrow-ballista";
   ```
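
   Presumably the fix is to point these at the datafusion-ballista repository 
(assumed correction; the issue does not spell out the replacement values):

   ```toml
   # Assumed corrected values for the datafusion-ballista Cargo.toml files
   homepage = "https://github.com/apache/datafusion-ballista"
   repository = "https://github.com/apache/datafusion-ballista"
   ```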
   
   **To Reproduce**
   
   **Expected behavior**
   
   **Additional context**
   





Re: [I] Support datatype cast for insert api same as insert into sql [datafusion]

2025-03-04 Thread via GitHub


zhuqi-lucas commented on issue #15015:
URL: https://github.com/apache/datafusion/issues/15015#issuecomment-2699799674

   take





[PR] Minor: cleanup unused code [datafusion]

2025-03-04 Thread via GitHub


qazxcdswe123 opened a new pull request, #15016:
URL: https://github.com/apache/datafusion/pull/15016

   ## Which issue does this PR close?
   
   
   
   - Closes nothing
   
   ## Rationale for this change
   
   
   
   Clean up unused dead code.
   
   ## What changes are included in this PR?
   Clean up unused dead code.
   
   
   
   ## Are these changes tested?
   yes
   
   
   ## Are there any user-facing changes?
   no
   
   
   
   





Re: [PR] BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE [datafusion]

2025-03-04 Thread via GitHub


zhuqi-lucas commented on code in PR #14922:
URL: https://github.com/apache/datafusion/pull/14922#discussion_r1980680580


##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -456,13 +456,16 @@ explain insert into table_without_values select c1 from 
aggregate_test_100 order
 
 logical_plan
 01)Dml: op=[Insert Into] table=[table_without_values]
-02)--Projection: aggregate_test_100.c1 AS c1
+02)--Projection: CAST(aggregate_test_100.c1 AS Utf8View) AS c1

Review Comment:
   Created a follow-up ticket for it; we should also support casting in the 
insert API, consistent with INSERT INTO SQL.
   https://github.com/apache/datafusion/issues/15015






Re: [PR] Bug: Fix multi-lines printing issue for datafusion-cli [datafusion]

2025-03-04 Thread via GitHub


zhuqi-lucas commented on code in PR #14954:
URL: https://github.com/apache/datafusion/pull/14954#discussion_r1980745507


##
datafusion-cli/tests/cli_integration.rs:
##
@@ -51,6 +51,163 @@ fn init() {
 ["--command", "show datafusion.execution.batch_size", "--format", "json", 
"-q", "-b", "1"],
 "[{\"name\":\"datafusion.execution.batch_size\",\"value\":\"1\"}]\n"
 )]
+

Review Comment:
   Added rich test cases in cli_integration; they help prevent regressions from 
datafusion-cli changes.






Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]

2025-03-04 Thread via GitHub


Garamda commented on code in PR #13511:
URL: https://github.com/apache/datafusion/pull/13511#discussion_r1980826096


##
datafusion/sql/src/expr/function.rs:
##
@@ -349,15 +365,49 @@ impl SqlToRel<'_, S> {
 } else {
 // User defined aggregate functions (UDAF) have precedence in case 
it has the same name as a scalar built-in function
 if let Some(fm) = self.context_provider.get_aggregate_meta(&name) {
-let order_by = self.order_by_to_sort_expr(
-order_by,
-schema,
-planner_context,
-true,
-None,
-)?;
-let order_by = (!order_by.is_empty()).then_some(order_by);
-let args = self.function_args_to_expr(args, schema, 
planner_context)?;
+if fm.is_ordered_set_aggregate() && within_group.is_empty() {
+return plan_err!("WITHIN GROUP clause is required when 
calling ordered set aggregate function({})", fm.name());
+}
+
+if null_treatment.is_some() && 
!fm.supports_null_handling_clause() {
+return plan_err!(
+"[IGNORE | RESPECT] NULLS are not permitted for {}",
+fm.name()
+);
+}

Review Comment:
   > This was smelling odd, so I dug a bit deeper. I think you've inadvertantly 
stumbled into something even weirder than you anticipated
   > ...
   > The RESPECT NULLS | IGNORE NULLS options is only a property of certain 
window functions, hence we shouldn't need to track it for aggregate functions.
   >
   > I'm going to file a ticket for the above.
   > ...
   > Per my point about [IGNORE | RESPECT] NULLS being a property of window 
functions, I don't think we need this check here.
   
   I understand and agree with your guidance.
   I will track what is decided in the issue you filed, and will remove the 
relevant code once that is determined.






Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]

2025-03-04 Thread via GitHub


Garamda commented on code in PR #13511:
URL: https://github.com/apache/datafusion/pull/13511#discussion_r1980817866


##
datafusion/sql/src/expr/function.rs:
##
@@ -349,15 +365,49 @@ impl SqlToRel<'_, S> {
 } else {
 // User defined aggregate functions (UDAF) have precedence in case 
it has the same name as a scalar built-in function
 if let Some(fm) = self.context_provider.get_aggregate_meta(&name) {
-let order_by = self.order_by_to_sort_expr(
-order_by,
-schema,
-planner_context,
-true,
-None,
-)?;
-let order_by = (!order_by.is_empty()).then_some(order_by);
-let args = self.function_args_to_expr(args, schema, 
planner_context)?;
+if fm.is_ordered_set_aggregate() && within_group.is_empty() {
+return plan_err!("WITHIN GROUP clause is required when 
calling ordered set aggregate function({})", fm.name());
+}
+
+if null_treatment.is_some() && 
!fm.supports_null_handling_clause() {
+return plan_err!(
+"[IGNORE | RESPECT] NULLS are not permitted for {}",
+fm.name()
+);
+}

Review Comment:
   > One big question we might need to answer before merging is if we need a 
migration strategy for this. Because we now require WITHIN GROUP for these 
functions, any users who have queries stored outside of DataFusion will 
experience breakages that they can't work around. If we want to provide a 
migration path, we may need to support having both forms of calling these 
functions, as in
   > 
   > SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name;
   > SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY 
column_name) FROM table_name;
   > 
   > for at least 1 release so folks can migrate their queries.
   
   This was one of my biggest concerns when I started to work on this feature.
   If the community decides on a migration strategy like that, then I will 
support both syntaxes.
   Also, I will file an issue to track the plan so that the current syntax can 
be removed as scheduled. (if I am authorized to do so)






Re: [I] Table function supports non-literal args [datafusion]

2025-03-04 Thread via GitHub


Lordworms commented on issue #14958:
URL: https://github.com/apache/datafusion/issues/14958#issuecomment-2700016967

   I have done some basic research on how Postgres deals with table functions 
that reference columns; for example, something like this would work in PostgreSQL:
   ```
   SELECT t.id, t.start_value, t.end_value, g.num
   FROM table_a t,
   LATERAL generate_series(t.start_value, t.end_value) AS g(num);
   ```
   So I think the first step is to implement support for LATERAL. I see PRs 
like https://github.com/apache/datafusion/pull/14595 that support basic LATERAL 
JOIN using subquery decorrelation,
   
   but I think we should also support some other LATERAL use cases like 
   ```
   > SELECT u.id, u.name, sub.age_plus_10
   FROM users u, LATERAL (SELECT u.age + 10 AS age_plus_10) sub;
   This feature is not implemented: Physical plan does not support logical 
expression OuterReferenceColumn(Int32, Column { relation: Some(Bare { table: 
"u" }), name: "age" })
   ```
   
   After we have support for LATERAL, we can push a table scan down into the 
table function scan, or pull the table function scan up so it can take input 
from a table scan.
   
   I'll probably start with supporting basic LATERAL.





Re: [I] Table function supports non-literal args [datafusion]

2025-03-04 Thread via GitHub


alamb commented on issue #14958:
URL: https://github.com/apache/datafusion/issues/14958#issuecomment-2699172543

   I think this one is likely pretty tricky -- it might be worth working on a 
design / writeup at first of how it would work / what the plans would look like





Re: [PR] feat: Add array reading support to native_datafusion scan [datafusion-comet]

2025-03-04 Thread via GitHub


andygrove closed pull request #1324: feat: Add array reading support to 
native_datafusion scan
URL: https://github.com/apache/datafusion-comet/pull/1324





[PR] Add dependency checks to verify-release-candidate script [datafusion]

2025-03-04 Thread via GitHub


waynexia opened a new pull request, #15009:
URL: https://github.com/apache/datafusion/pull/15009

   ## Which issue does this PR close?
   
   
   
   - Closes #.
   
   ## Rationale for this change
   
   
   
   Check all build dependencies are available before running the script.
   
   Background: I use a separate Docker container to verify the release 
candidate, and I always forget to install some of the build dependencies, which 
means I have to install them and restart from the very beginning. This change 
lists all the dependencies I'm aware of (or that need to be installed in a 
fresh Ubuntu image) and checks them before the build starts.
   
   ## What changes are included in this PR?
   
   A new function `check_dependencies` in 
`dev/release/verify-release-candidate.sh` that checks build dependencies in 
advance.
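
   The PR's actual implementation is not quoted in this thread; a generic 
sketch of this kind of pre-flight check (illustrative only, with `sh` standing 
in for the real list of required tools) might look like:

   ```shell
   #!/usr/bin/env bash
   # Illustrative sketch of a build-dependency pre-check; this is NOT the
   # actual check_dependencies function from the PR, whose body is not
   # quoted in this thread.
   check_dependencies() {
     local missing=0
     for cmd in "$@"; do
       if ! command -v "$cmd" >/dev/null 2>&1; then
         echo "Missing build dependency: $cmd" >&2
         missing=1
       fi
     done
     return "$missing"
   }

   # Fail fast before any build work starts ("sh" is a stand-in for the
   # real list of required tools).
   check_dependencies sh || {
     echo "Install the missing dependencies and re-run." >&2
     exit 1
   }
   ```

   Checking everything up front avoids the restart-from-scratch problem 
described above: the script reports all missing tools at once instead of 
failing partway through the build.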
   
   
   
   ## Are these changes tested?
   
   
   
   Yes, I use this version to verify 46.0.0 RC2 release
   
   ## Are there any user-facing changes?
   
   
   
   no 
   
   
   
   
   





Re: [I] Expr simplifier doesn't simplify exprs that are same if you swap lhs with rhs regardless of cycles [datafusion]

2025-03-04 Thread via GitHub


alamb commented on issue #14943:
URL: https://github.com/apache/datafusion/issues/14943#issuecomment-2698393138

   I think it will not be present in DataFusion 46 (we have the RC out for 
voting now)





Re: [PR] Simplify Between expression to Eq [datafusion]

2025-03-04 Thread via GitHub


jayzhan211 merged PR #14994:
URL: https://github.com/apache/datafusion/pull/14994





Re: [PR] Simplify Between expression to Eq [datafusion]

2025-03-04 Thread via GitHub


jayzhan211 commented on PR #14994:
URL: https://github.com/apache/datafusion/pull/14994#issuecomment-2699326547

   > I don't think it closes https://github.com/apache/datafusion/issues/14943
   
   I agree with that.





Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]

2025-03-04 Thread via GitHub


vbarua commented on code in PR #13511:
URL: https://github.com/apache/datafusion/pull/13511#discussion_r1980465883


##
datafusion/expr/src/udaf.rs:
##
@@ -845,6 +861,19 @@ pub trait AggregateUDFImpl: Debug + Send + Sync {
 ScalarValue::try_from(data_type)
 }
 
+/// If this function supports `[IGNORE NULLS | RESPECT NULLS]` clause, 
return true
+/// If the function does not, return false
+/// Otherwise return None (the default)
+fn supports_null_handling_clause(&self) -> Option {
+None
+}

Review Comment:
   Filed https://github.com/apache/datafusion/issues/15006






[PR] Refactor test suite in EnforceDistribution, to use standard test config. [datafusion]

2025-03-04 Thread via GitHub


wiedld opened a new pull request, #15010:
URL: https://github.com/apache/datafusion/pull/15010

   ## Which issue does this PR close?
   
   Part of #15003 
   
   ## Rationale for this change
   
   Make it easier to determine the differences in test configuration, for each 
test case.
   Also clearly defines what are the defaults used for the test suite.
   
   ## What changes are included in this PR?
   
   Defines `TestConfig`, which requires one of the `DoFirst` optimizer run 
configurations. Most test cases use this test config with the default settings.
   
   Provide a set of interfaces to define which configurations get tweaked.
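
   A minimal, self-contained sketch of the builder pattern described above 
(hypothetical stand-in fields; the real struct wraps DataFusion's 
`ConfigOptions` rather than these plain fields):

   ```rust
   // Hypothetical, simplified stand-in for the TestConfig builder in this PR.
   // Field names mirror the diff, but this is not the actual DataFusion type.
   #[derive(Clone, Copy, PartialEq, Debug)]
   enum DoFirst {
       /// Runs: (EnforceDistribution, EnforceDistribution, EnforceSorting)
       Distribution,
       /// Runs: (EnforceSorting, EnforceDistribution, EnforceDistribution)
       Sorting,
   }

   #[derive(Debug)]
   struct TestConfig {
       do_first: DoFirst,
       prefer_existing_sort: bool,
       target_partitions: usize,
   }

   impl TestConfig {
       fn new(do_first: DoFirst) -> Self {
           // Defaults mirror the suite-wide defaults set up in the PR.
           Self {
               do_first,
               prefer_existing_sort: false,
               target_partitions: 10,
           }
       }

       fn with_prefer_existing_sort(mut self) -> Self {
           self.prefer_existing_sort = true;
           self
       }

       fn with_query_execution_partitions(mut self, n: usize) -> Self {
           self.target_partitions = n;
           self
       }
   }

   fn main() {
       // Each test tweaks only the settings it cares about.
       let cfg = TestConfig::new(DoFirst::Sorting)
           .with_prefer_existing_sort()
           .with_query_execution_partitions(4);
       assert_eq!(cfg.do_first, DoFirst::Sorting);
       assert!(cfg.prefer_existing_sort);
       assert_eq!(cfg.target_partitions, 4);
   }
   ```

   This mirrors how `assert_optimized!` now takes a `&TestConfig` instead of a 
growing list of positional flags.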
   
   ## Are these changes tested?
   
   Yes
   
   ## Are there any user-facing changes?
   
   No
   





Re: [PR] Refactor test suite in EnforceDistribution, to use standard test config. [datafusion]

2025-03-04 Thread via GitHub


wiedld commented on code in PR #15010:
URL: https://github.com/apache/datafusion/pull/15010#discussion_r1980468330


##
datafusion/core/tests/physical_optimizer/enforce_distribution.rs:
##
@@ -371,46 +371,95 @@ macro_rules! plans_matches_expected {
 }
 }
 
+fn test_suite_default_config_options() -> ConfigOptions {
+let mut config = ConfigOptions::new();
+
+// By default, will not repartition / resort data if it is already sorted.
+config.optimizer.prefer_existing_sort = false;
+
+// By default, will attempt to convert Union to Interleave.
+config.optimizer.prefer_existing_union = false;
+
+// By default, will not repartition file scans.
+config.optimizer.repartition_file_scans = false;
+config.optimizer.repartition_file_min_size = 1024;
+
+// By default, set query execution concurrency to 10.
+config.execution.target_partitions = 10;

Review Comment:
   These match the defaults on main: 
https://github.com/apache/datafusion/blob/c61e7e597f80a3abd5015c996dc896a084112ab6/datafusion/core/tests/physical_optimizer/enforce_distribution.rs#L388






Re: [PR] Refactor test suite in EnforceDistribution, to use standard test config. [datafusion]

2025-03-04 Thread via GitHub


wiedld commented on code in PR #15010:
URL: https://github.com/apache/datafusion/pull/15010#discussion_r1980470202


##
datafusion/core/tests/physical_optimizer/enforce_distribution.rs:
##
@@ -371,46 +371,95 @@ macro_rules! plans_matches_expected {
 }
 }
 
+fn test_suite_default_config_options() -> ConfigOptions {
+let mut config = ConfigOptions::new();
+
+// By default, will not repartition / resort data if it is already sorted.
+config.optimizer.prefer_existing_sort = false;
+
+// By default, will attempt to convert Union to Interleave.
+config.optimizer.prefer_existing_union = false;
+
+// By default, will not repartition file scans.
+config.optimizer.repartition_file_scans = false;
+config.optimizer.repartition_file_min_size = 1024;
+
+// By default, set query execution concurrency to 10.
+config.execution.target_partitions = 10;
+
+// Use a small batch size, to trigger RoundRobin in tests
+config.execution.batch_size = 1;
+
+config
+}
+
+/// How the optimizers are run.
+#[derive(PartialEq, Clone)]
+enum DoFirst {
+/// Runs: (EnforceDistribution, EnforceDistribution, EnforceSorting)
+Distribution,
+/// Runs: (EnforceSorting, EnforceDistribution, EnforceDistribution)
+Sorting,
+}
+
+#[derive(Clone)]
+struct TestConfig {
+config: ConfigOptions,
+optimizers_to_run: DoFirst,
+}
+
+impl TestConfig {
+fn new(optimizers_to_run: DoFirst) -> Self {
+Self {
+config: test_suite_default_config_options(),
+optimizers_to_run,
+}
+}
+
+/// If preferred, will not repartition / resort data if it is already 
sorted.
+fn with_prefer_existing_sort(mut self) -> Self {
+self.config.optimizer.prefer_existing_sort = true;
+self
+}
+
+/// If preferred, will not attempt to convert Union to Interleave.
+fn with_prefer_existing_union(mut self) -> Self {
+self.config.optimizer.prefer_existing_union = true;
+self
+}
+
+/// If preferred, will repartition file scans.
+/// Accepts a minimum file size to repartition.
+fn with_prefer_repartition_file_scans(mut self, file_min_size: usize) -> 
Self {
+self.config.optimizer.repartition_file_scans = true;
+self.config.optimizer.repartition_file_min_size = file_min_size;
+self
+}
+
+/// Set the preferred target partitions for query execution concurrency.
+fn with_query_execution_partitions(mut self, target_partitions: usize) -> 
Self {
+self.config.execution.target_partitions = target_partitions;
+self
+}
+}
+
 /// Runs the repartition optimizer and asserts the plan against the expected
 /// Arguments
 /// * `EXPECTED_LINES` - Expected output plan
 /// * `PLAN` - Input plan
-/// * `FIRST_ENFORCE_DIST` -
-/// true: (EnforceDistribution, EnforceDistribution,  EnforceSorting)
-/// false: else runs (EnforceSorting, EnforceDistribution, 
EnforceDistribution)
-/// * `PREFER_EXISTING_SORT` (optional) - if true, will not repartition / 
resort data if it is already sorted
-/// * `TARGET_PARTITIONS` (optional) - number of partitions to repartition to
-/// * `REPARTITION_FILE_SCANS` (optional) - if true, will repartition file 
scans
-/// * `REPARTITION_FILE_MIN_SIZE` (optional) - minimum file size to repartition
-/// * `PREFER_EXISTING_UNION` (optional) - if true, will not attempt to 
convert Union to Interleave
+/// * `CONFIG` - [`TestConfig`]
 macro_rules! assert_optimized {
-($EXPECTED_LINES: expr, $PLAN: expr, $FIRST_ENFORCE_DIST: expr) => {
-assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, false, 
10, false, 1024, false);
-};
-
-($EXPECTED_LINES: expr, $PLAN: expr, $FIRST_ENFORCE_DIST: expr, 
$PREFER_EXISTING_SORT: expr) => {
-assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, 
$PREFER_EXISTING_SORT, 10, false, 1024, false);
-};
-
-($EXPECTED_LINES: expr, $PLAN: expr, $FIRST_ENFORCE_DIST: expr, 
$PREFER_EXISTING_SORT: expr, $PREFER_EXISTING_UNION: expr) => {
-assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, 
$PREFER_EXISTING_SORT, 10, false, 1024, $PREFER_EXISTING_UNION);
-};
-
-($EXPECTED_LINES: expr, $PLAN: expr, $FIRST_ENFORCE_DIST: expr, 
$PREFER_EXISTING_SORT: expr, $TARGET_PARTITIONS: expr, $REPARTITION_FILE_SCANS: 
expr, $REPARTITION_FILE_MIN_SIZE: expr) => {
-assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, 
$PREFER_EXISTING_SORT, $TARGET_PARTITIONS, $REPARTITION_FILE_SCANS, 
$REPARTITION_FILE_MIN_SIZE, false);
+($EXPECTED_LINES: expr, $PLAN: expr) => {
+assert_optimized!($EXPECTED_LINES, $PLAN, 
&TestConfig::new(DoFirst::Distribution));

Review Comment:
   Note: this option is technically not used, so I could remove it. It's 
basically a 1-to-1 replacement of the "default options" provided for this macro 
on main.





[I] Support datatype cast for insert api same as insert into sql [datafusion]

2025-03-04 Thread via GitHub


zhuqi-lucas opened a new issue, #15015:
URL: https://github.com/apache/datafusion/issues/15015

   ### Is your feature request related to a problem or challenge?
   
   Currently, INSERT INTO SQL automatically casts data types to be consistent 
with the table's data types; we should support data type casting in the insert 
API as well, so that the insert behaviour is consistent.
   
   
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_
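   The requested behaviour — casting incoming values to the table's declared column types before insert, as the SQL path already does — can be modelled with a minimal stdlib-only sketch. The `DataType`/`Value` enums and `cast`/`coerce_row` helpers below are made up for illustration; they are not DataFusion's actual insert API:

   ```rust
   // Toy model of insert-time coercion: cast each incoming value to the
   // table's column type, or fail if no cast exists. Illustrative only.
   #[derive(Debug, Clone, Copy, PartialEq)]
   enum DataType { Int32, Int64, Utf8 }

   #[derive(Debug, Clone, PartialEq)]
   enum Value { Int32(i32), Int64(i64), Utf8(String) }

   fn data_type(v: &Value) -> DataType {
       match v {
           Value::Int32(_) => DataType::Int32,
           Value::Int64(_) => DataType::Int64,
           Value::Utf8(_) => DataType::Utf8,
       }
   }

   // Cast a value to the target type; Int32 -> Int64 widens, mismatches error.
   fn cast(v: Value, target: DataType) -> Result<Value, String> {
       match (v, target) {
           (v2, t) if data_type(&v2) == t => Ok(v2),
           (Value::Int32(i), DataType::Int64) => Ok(Value::Int64(i as i64)),
           (v2, t) => Err(format!("cannot cast {:?} to {:?}", v2, t)),
       }
   }

   // Coerce a whole row against the table schema, column by column.
   fn coerce_row(row: Vec<Value>, schema: &[DataType]) -> Result<Vec<Value>, String> {
       row.into_iter().zip(schema).map(|(v, t)| cast(v, *t)).collect()
   }

   fn main() {
       let schema = [DataType::Int64, DataType::Utf8];
       let row = vec![Value::Int32(7), Value::Utf8("x".to_string())];
       let coerced = coerce_row(row, &schema).unwrap();
       assert_eq!(coerced[0], Value::Int64(7));
       println!("coerced row: {:?}", coerced);
   }
   ```

   The SQL `INSERT INTO` path effectively performs this step for the user; the issue asks for the programmatic insert API to do the same.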





Re: [I] Project Ideas for GSoC 2025 (Google Summer of Code) [datafusion]

2025-03-04 Thread via GitHub


oznur-synnada commented on issue #14478:
URL: https://github.com/apache/datafusion/issues/14478#issuecomment-2700032123

   Hi @mkarbo and @waynexia - we have been approved as a mentoring organization 
for GSoC 2025. I'm going to invite you to the GSoC portal as mentors so could 
you share your email addresses with me? You can reach me at 
oznur.ha...@synnada.ai 





Re: [PR] BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE [datafusion]

2025-03-04 Thread via GitHub


2010YOUY01 commented on PR #14922:
URL: https://github.com/apache/datafusion/pull/14922#issuecomment-2700038867

   > Thank you @2010YOUY01 for review, addressed comments in latest PR.
   
   LGTM, thank you.
   I don't know this code well so let's wait for others to approve it.





Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]

2025-03-04 Thread via GitHub


Garamda commented on code in PR #13511:
URL: https://github.com/apache/datafusion/pull/13511#discussion_r1980817133


##
datafusion/sql/src/expr/function.rs:
##
@@ -349,15 +365,49 @@ impl SqlToRel<'_, S> {
         } else {
             // User defined aggregate functions (UDAF) have precedence in case it has the same name as a scalar built-in function
             if let Some(fm) = self.context_provider.get_aggregate_meta(&name) {
-                let order_by = self.order_by_to_sort_expr(
-                    order_by,
-                    schema,
-                    planner_context,
-                    true,
-                    None,
-                )?;
-                let order_by = (!order_by.is_empty()).then_some(order_by);
-                let args = self.function_args_to_expr(args, schema, planner_context)?;
+                if fm.is_ordered_set_aggregate() && within_group.is_empty() {
+                    return plan_err!("WITHIN GROUP clause is required when calling ordered set aggregate function({})", fm.name());
+                }
+
+                if null_treatment.is_some() && !fm.supports_null_handling_clause() {
+                    return plan_err!(
+                        "[IGNORE | RESPECT] NULLS are not permitted for {}",
+                        fm.name()
+                    );
+                }

Review Comment:
   > Thanks for bearing with me as I reviewed this, it was a good chance for me to look at new parts of DataFusion.
   
   I appreciate your elaborate review again. 👍
   This PR has become much simpler, clearer, and better now.






Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]

2025-03-04 Thread via GitHub


Garamda commented on code in PR #13511:
URL: https://github.com/apache/datafusion/pull/13511#discussion_r1980829367


##
datafusion/sql/src/expr/function.rs:
##
@@ -349,15 +365,49 @@ impl SqlToRel<'_, S> {
         } else {
             // User defined aggregate functions (UDAF) have precedence in case it has the same name as a scalar built-in function
             if let Some(fm) = self.context_provider.get_aggregate_meta(&name) {
-                let order_by = self.order_by_to_sort_expr(
-                    order_by,
-                    schema,
-                    planner_context,
-                    true,
-                    None,
-                )?;
-                let order_by = (!order_by.is_empty()).then_some(order_by);
-                let args = self.function_args_to_expr(args, schema, planner_context)?;
+                if fm.is_ordered_set_aggregate() && within_group.is_empty() {
+                    return plan_err!("WITHIN GROUP clause is required when calling ordered set aggregate function({})", fm.name());
+                }
+
+                if null_treatment.is_some() && !fm.supports_null_handling_clause() {
+                    return plan_err!(
+                        "[IGNORE | RESPECT] NULLS are not permitted for {}",
+                        fm.name()
+                    );
+                }

Review Comment:
   cf) I have applied all reviews that you tagged 'minor', since I was also 
convinced. 






Re: [PR] feat: instrument spawned tasks with current tracing span when `tracing` feature is enabled [datafusion]

2025-03-04 Thread via GitHub


geoffreyclaude commented on PR #14547:
URL: https://github.com/apache/datafusion/pull/14547#issuecomment-2697021679

   @alamb: I've gone ahead and refactored to allow "injecting" the tracing 
behavior at runtime. As predicted, the code is a bit scary looking, especially 
due to the Box/Unbox dance needed to get static Sized closures.
   
   But it works (see `datafusion-examples/examples/tracing.rs`) without needing 
the `tracing` crate (optionally or not) inside `datafusion-common-runtime`!





Re: [PR] Fix doc logo [datafusion]

2025-03-04 Thread via GitHub


khushishukla2813 commented on code in PR #14989:
URL: https://github.com/apache/datafusion/pull/14989#discussion_r1979134788


##
datafusion/macros/src/macros.rs:
##
@@ -0,0 +1,10 @@
+#[macro_export]

Review Comment:
   > Maybe add a `proc-macros` crate, move the proc macros there and re-export 
them from the current macros crate?
   
   well ...I removed proc-macro = true so macro_rules! works from a normal library crate. Let me know if it needs any other changes!






[I] Bug: calling "with_new_exprs" on join after optimization unexpectedly fails [datafusion]

2025-03-04 Thread via GitHub


niebayes opened a new issue, #14999:
URL: https://github.com/apache/datafusion/issues/14999

   ### Describe the bug
   
   Before optimization, specifically the `ExtractEquijoinPredicate` rule, 
calling `with_new_exprs` succeeds.
   However, after optimized by `ExtractEquijoinPredicate`, calling 
`with_new_exprs` unexpectedly fails with the following error:
   ```
   "The front part expressions should be an binary equality expression, actual:t1.a"
   ```
   
   ### To Reproduce
   
   ```rust
   #[tokio::test]
   async fn test_join_with_new_exprs() -> Result<()> {
       // Creates a default session context.
       let session_context = SessionContext::new();

       // Registers two tables.
       session_context.register_table(
           "t1",
           Arc::new(EmptyTable::new(
               Schema::new(vec![Field::new("a", DataType::Int32, true)]).into(),
           )),
       )?;
       session_context.register_table(
           "t2",
           Arc::new(EmptyTable::new(
               Schema::new(vec![Field::new("a", DataType::Int32, true)]).into(),
           )),
       )?;

       // Creates a plan consisting of a join operator and extracts the join operator.
       let join = session_context
           .sql("select * from t1 join t2 on t1.a = t2.a")
           .await?
           .logical_plan()
           .inputs()
           .remove(0)
           .clone();
       assert!(matches!(join, LogicalPlan::Join(_)));

       let LogicalPlan::Join(before) = &join else {
           unreachable!()
       };
       // Should be: "[]".
       println!("on before opt: {:?}", before.on);
       // Should be: "Some(BinaryExpr(BinaryExpr { left: Column(Column { relation: Some(Bare { table: "t1" }), name: "a" }), op: Eq,
       //              right: Column(Column { relation: Some(Bare { table: "t2" }), name: "a" }) }))".
       println!("filter before opt: {:?}", before.filter);
       // Invoking `with_new_exprs` before optimization should succeed.
       let res = join.with_new_exprs(
           join.expressions(),
           join.inputs().into_iter().map(|x| x.clone()).collect(),
       );
       assert!(res.is_ok());

       print!("\n\n");

       // Optimizes join.
       let join = session_context.state().optimize(&join)?;
       let LogicalPlan::Join(after) = &join else {
           unreachable!()
       };
       // Should be: "[(Column(Column { relation: Some(Bare { table: "t1" }), name: "a" }), Column(Column { relation: Some(Bare { table: "t2" }), name: "a" }))]".
       println!("on after opt: {:?}", after.on);
       // Should be: "None".
       println!("filter after opt: {:?}", after.filter);
       // Invoking `with_new_exprs` after optimization unexpectedly fails.
       let res = join.with_new_exprs(
           join.expressions(),
           join.inputs().into_iter().map(|x| x.clone()).collect(),
       );
       assert!(res.is_err());
       assert_contains!(
           res.unwrap_err().to_string(),
           "The front part expressions should be an binary equality expression, actual:t1.a"
       );

       Ok(())
   }
   ```
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_
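   A minimal stdlib-only model of the failing round-trip (illustrative names only, not DataFusion's actual `expressions`/`with_new_exprs` implementation): `expressions()` flattens each `on` pair into two bare columns, while the rebuild step expects one equality expression per pair, so the round-trip only fails once the optimizer has moved the predicate into `on`:

   ```rust
   // Toy model of the mismatch: flatten emits bare columns per on-pair,
   // rebuild expects one `a = b` equality per pair. Illustrative only.
   #[derive(Debug, Clone, PartialEq)]
   enum Expr {
       Column(String),
       Eq(Box<Expr>, Box<Expr>),
   }

   struct Join {
       on: Vec<(Expr, Expr)>,
       filter: Option<Expr>,
   }

   impl Join {
       // Flatten on-pairs (as two separate columns each), then the filter.
       fn expressions(&self) -> Vec<Expr> {
           let mut out = Vec::new();
           for (l, r) in &self.on {
               out.push(l.clone());
               out.push(r.clone());
           }
           if let Some(f) = &self.filter {
               out.push(f.clone());
           }
           out
       }

       // Rebuild expecting one Eq expression per on-pair at the front:
       // this is the assumption the issue shows being violated.
       fn with_new_exprs(&self, exprs: Vec<Expr>) -> Result<Join, String> {
           let mut on = Vec::new();
           let mut iter = exprs.into_iter();
           for _ in 0..self.on.len() {
               match iter.next() {
                   Some(Expr::Eq(l, r)) => on.push((*l, *r)),
                   other => {
                       return Err(format!(
                           "The front part expressions should be an binary \
                            equality expression, actual:{:?}",
                           other
                       ))
                   }
               }
           }
           Ok(Join { on, filter: iter.next() })
       }
   }

   fn main() {
       let col = |n: &str| Expr::Column(n.to_string());
       // Before optimization: on = [], filter = Some(t1.a = t2.a) — round-trip ok.
       let before = Join {
           on: vec![],
           filter: Some(Expr::Eq(Box::new(col("t1.a")), Box::new(col("t2.a")))),
       };
       assert!(before.with_new_exprs(before.expressions()).is_ok());
       // After optimization: on = [(t1.a, t2.a)], filter = None — round-trip
       // fails, because expressions() emitted bare columns, not an Eq.
       let after = Join { on: vec![(col("t1.a"), col("t2.a"))], filter: None };
       assert!(after.with_new_exprs(after.expressions()).is_err());
   }
   ```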





Re: [PR] Bug: Fix multi-lines printing issue for datafusion-cli [datafusion]

2025-03-04 Thread via GitHub


zhuqi-lucas commented on PR #14954:
URL: https://github.com/apache/datafusion/pull/14954#issuecomment-2696829174

   I am continuing to polish the code, besides https://github.com/apache/datafusion/issues/14886 which will add the streaming state struct.





Re: [PR] Split out avro, parquet, json and csv into individual crates [datafusion]

2025-03-04 Thread via GitHub


AdamGS commented on code in PR #14951:
URL: https://github.com/apache/datafusion/pull/14951#discussion_r1979201859


##
datafusion/core/src/datasource/file_format/avro.rs:
##
@@ -15,163 +15,31 @@
 // specific language governing permissions and limitations
 // under the License.
 
-//! [`AvroFormat`] Apache Avro [`FileFormat`] abstractions
-
-use std::any::Any;
-use std::collections::HashMap;
-use std::fmt;
-use std::sync::Arc;
-
-use super::file_compression_type::FileCompressionType;
-use super::FileFormat;
-use super::FileFormatFactory;
-use crate::datasource::avro_to_arrow::read_avro_schema_from_reader;
-use crate::datasource::physical_plan::AvroSource;
-use crate::error::Result;
-use crate::physical_plan::ExecutionPlan;
-use crate::physical_plan::Statistics;
-
-use arrow::datatypes::Schema;
-use arrow::datatypes::SchemaRef;
-use async_trait::async_trait;
-use datafusion_catalog::Session;
-use datafusion_common::internal_err;
-use datafusion_common::parsers::CompressionTypeVariant;
-use datafusion_common::GetExt;
-use datafusion_common::DEFAULT_AVRO_EXTENSION;
-use datafusion_datasource::file::FileSource;
-use datafusion_datasource::file_scan_config::FileScanConfig;
-use datafusion_physical_expr::PhysicalExpr;
-use object_store::{GetResultPayload, ObjectMeta, ObjectStore};
-
-#[derive(Default)]
-/// Factory struct used to create [AvroFormat]
-pub struct AvroFormatFactory;
-
-impl AvroFormatFactory {
-    /// Creates an instance of [AvroFormatFactory]
-    pub fn new() -> Self {
-        Self {}
-    }
-}
-
-impl FileFormatFactory for AvroFormatFactory {
-    fn create(
-        &self,
-        _state: &dyn Session,
-        _format_options: &HashMap<String, String>,
-    ) -> Result<Arc<dyn FileFormat>> {
-        Ok(Arc::new(AvroFormat))
-    }
+//! Re-exports the [`datafusion_datasource_avro::file_format`] module, and contains tests for it.
 
-    fn default(&self) -> Arc<dyn FileFormat> {
-        Arc::new(AvroFormat)
-    }
-
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-}
-
-impl fmt::Debug for AvroFormatFactory {
-    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
-        f.debug_struct("AvroFormatFactory").finish()
-    }
-}
-
-impl GetExt for AvroFormatFactory {
-    fn get_ext(&self) -> String {
-        // Removes the dot, i.e. ".parquet" -> "parquet"
-        DEFAULT_AVRO_EXTENSION[1..].to_string()
-    }
-}
-
-/// Avro `FileFormat` implementation.
-#[derive(Default, Debug)]
-pub struct AvroFormat;
-
-#[async_trait]
-impl FileFormat for AvroFormat {
-    fn as_any(&self) -> &dyn Any {
-        self
-    }
-
-    fn get_ext(&self) -> String {
-        AvroFormatFactory::new().get_ext()
-    }
-
-    fn get_ext_with_compression(
-        &self,
-        file_compression_type: &FileCompressionType,
-    ) -> Result<String> {
-        let ext = self.get_ext();
-        match file_compression_type.get_variant() {
-            CompressionTypeVariant::UNCOMPRESSED => Ok(ext),
-            _ => internal_err!("Avro FileFormat does not support compression."),
-        }
-    }
-
-    async fn infer_schema(
-        &self,
-        _state: &dyn Session,
-        store: &Arc<dyn ObjectStore>,
-        objects: &[ObjectMeta],
-    ) -> Result<SchemaRef> {
-        let mut schemas = vec![];
-        for object in objects {
-            let r = store.as_ref().get(&object.location).await?;
-            let schema = match r.payload {
-                GetResultPayload::File(mut file, _) => {
-                    read_avro_schema_from_reader(&mut file)?
-                }
-                GetResultPayload::Stream(_) => {
-                    // TODO: Fetching entire file to get schema is potentially wasteful
-                    let data = r.bytes().await?;
-                    read_avro_schema_from_reader(&mut data.as_ref())?
-                }
-            };
-            schemas.push(schema);
-        }
-        let merged_schema = Schema::try_merge(schemas)?;
-        Ok(Arc::new(merged_schema))
-    }
-
-    async fn infer_stats(
-        &self,
-        _state: &dyn Session,
-        _store: &Arc<dyn ObjectStore>,
-        table_schema: SchemaRef,
-        _object: &ObjectMeta,
-    ) -> Result<Statistics> {
-        Ok(Statistics::new_unknown(&table_schema))
-    }
-
-    async fn create_physical_plan(
-        &self,
-        _state: &dyn Session,
-        conf: FileScanConfig,
-        _filters: Option<&Arc<dyn PhysicalExpr>>,
-    ) -> Result<Arc<dyn ExecutionPlan>> {
-        Ok(conf.with_source(self.file_source()).build())
-    }
-
-    fn file_source(&self) -> Arc<dyn FileSource> {
-        Arc::new(AvroSource::new())
-    }
-}
+pub use datafusion_datasource_avro::file_format::*;
 
 #[cfg(test)]
-#[cfg(feature = "avro")]
 mod tests {

Review Comment:
   Oh I didn't realize that was a thing. Glad to move everything there (in this 
PR or a different one).




Re: [I] Allow sorting to improve `FixedSizeBinary` filtering [datafusion]

2025-03-04 Thread via GitHub


samuelcolvin closed issue #11170: Allow sorting to improve `FixedSizeBinary` 
filtering
URL: https://github.com/apache/datafusion/issues/11170





Re: [I] Allow sorting to improve `FixedSizeBinary` filtering [datafusion]

2025-03-04 Thread via GitHub


samuelcolvin commented on issue #11170:
URL: https://github.com/apache/datafusion/issues/11170#issuecomment-2696717653

   Closed





Re: [PR] Split out avro, parquet, json and csv into individual crates [datafusion]

2025-03-04 Thread via GitHub


alamb commented on PR #14951:
URL: https://github.com/apache/datafusion/pull/14951#issuecomment-2697208596

   Thanks again @AdamGS  and @logan-keede  -- this is amazing progress. 
Something I have thought was important for the last t





Re: [PR] Split out avro, parquet, json and csv into individual crates [datafusion]

2025-03-04 Thread via GitHub


alamb merged PR #14951:
URL: https://github.com/apache/datafusion/pull/14951





Re: [PR] Split out avro, parquet, json and csv into individual crates [datafusion]

2025-03-04 Thread via GitHub


alamb commented on PR #14951:
URL: https://github.com/apache/datafusion/pull/14951#issuecomment-2697209311

   I am merging this one in so it doesn't accumulate conflicts




