Re: [I] `SessionContext::create_physical_expr` does not unwrap casts (and thus is not always optimal) [datafusion]
jayzhan211 commented on issue #14987: URL: https://github.com/apache/datafusion/issues/14987#issuecomment-2696573839

What do you mean by "simplify"? Is it `SimplifyExpressions` or some other logic? If it is `SimplifyExpressions`, then logical plan optimization is involved. If not, and you go directly to `create_physical_expr`, then I think this is similar to the issue in Comet, where they don't have coerced types for `create_physical_expr` and need to do the coercion themselves.

An alternative idea is to introduce another `create_physical_expr_from_non_resolved` that allows inputs that are "not resolved" (so we do the coercion or type-related casting there), and keep the current `create_physical_expr` for inputs that are "resolved", used by DataFusion or other projects that run DataFusion's logical plan optimization.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
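For illustration only, a minimal sketch of the API split being suggested above. The second function name and both signatures are hypothetical, not an existing DataFusion API; the point is only to show the resolved/unresolved distinction.

```rust
use std::sync::Arc;
use datafusion::common::DFSchema;
use datafusion::error::Result;
use datafusion::logical_expr::Expr;
use datafusion::physical_expr::PhysicalExpr;

/// Existing-style entry point (sketch): assumes `expr` is already "resolved",
/// i.e. type coercion has been applied by the logical optimizer.
fn create_physical_expr(_expr: Expr, _schema: &DFSchema) -> Result<Arc<dyn PhysicalExpr>> {
    unimplemented!("sketch only")
}

/// Hypothetical new entry point: accepts an "unresolved" expression and runs
/// coercion / type-related casting itself before lowering it.
fn create_physical_expr_from_non_resolved(
    expr: Expr,
    schema: &DFSchema,
) -> Result<Arc<dyn PhysicalExpr>> {
    // 1. coerce types / insert casts for `expr` against `schema`
    // 2. then delegate to the resolved path
    create_physical_expr(expr, schema)
}
```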
Re: [PR] _repr_ and _html_repr_ show '... and additional rows' message [datafusion-python]
kosiew commented on code in PR #1041: URL: https://github.com/apache/datafusion-python/pull/1041#discussion_r1978735197

## src/dataframe.rs: ##

@@ -90,59 +90,108 @@ impl PyDataFrame {
     }
 
     fn __repr__(&self, py: Python) -> PyDataFusionResult<String> {
-        let df = self.df.as_ref().clone().limit(0, Some(10))?;
+        // Get 11 rows to check if there are more than 10
+        let df = self.df.as_ref().clone().limit(0, Some(11))?;
         let batches = wait_for_future(py, df.collect())?;
-        let batches_as_string = pretty::pretty_format_batches(&batches);
+        let num_rows = batches.iter().map(|batch| batch.num_rows()).sum::<usize>();
+
+        // Flatten batches into a single batch for the first 10 rows
+        let mut all_rows = Vec::new();
+        let mut total_rows = 0;
+
+        for batch in &batches {
+            let num_rows_to_take = if total_rows + batch.num_rows() > 10 {
+                10 - total_rows
+            } else {
+                batch.num_rows()
+            };
+
+            if num_rows_to_take > 0 {
+                let sliced_batch = batch.slice(0, num_rows_to_take);
+                all_rows.push(sliced_batch);
+                total_rows += num_rows_to_take;
+            }
+
+            if total_rows >= 10 {
+                break;
+            }
+        }
+
+        let batches_as_string = pretty::pretty_format_batches(&all_rows);
+

Review Comment: You can simplify `batches_as_string` to:

```rust
// First get just the first 10 rows
let preview_df = self.df.as_ref().clone().limit(0, Some(10))?;
let preview_batches = wait_for_future(py, preview_df.collect())?;

// Check if there are more rows by trying to get the 11th row
let has_more_rows = {
    let check_df = self.df.as_ref().clone().limit(10, Some(1))?;
    let check_batch = wait_for_future(py, check_df.collect())?;
    !check_batch.is_empty()
};

let batches_as_string = pretty::pretty_format_batches(&preview_batches);
```

This directly retrieves just the first 10 rows, eliminating the need for manual row tracking and slicing.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] _repr_ and _html_repr_ show '... and additional rows' message [datafusion-python]
kosiew commented on code in PR #1041: URL: https://github.com/apache/datafusion-python/pull/1041#discussion_r1978864670

## src/dataframe.rs: ##

@@ -90,59 +90,108 @@ impl PyDataFrame {
     }
 
     fn __repr__(&self, py: Python) -> PyDataFusionResult<String> {
-        let df = self.df.as_ref().clone().limit(0, Some(10))?;
+        // Get 11 rows to check if there are more than 10
+        let df = self.df.as_ref().clone().limit(0, Some(11))?;
         let batches = wait_for_future(py, df.collect())?;
-        let batches_as_string = pretty::pretty_format_batches(&batches);
+        let num_rows = batches.iter().map(|batch| batch.num_rows()).sum::<usize>();
+
+        // Flatten batches into a single batch for the first 10 rows
+        let mut all_rows = Vec::new();
+        let mut total_rows = 0;
+
+        for batch in &batches {
+            let num_rows_to_take = if total_rows + batch.num_rows() > 10 {
+                10 - total_rows
+            } else {
+                batch.num_rows()
+            };
+
+            if num_rows_to_take > 0 {
+                let sliced_batch = batch.slice(0, num_rows_to_take);
+                all_rows.push(sliced_batch);
+                total_rows += num_rows_to_take;
+            }
+
+            if total_rows >= 10 {
+                break;
+            }
+        }
+
+        let batches_as_string = pretty::pretty_format_batches(&all_rows);
+
         match batches_as_string {
-            Ok(batch) => Ok(format!("DataFrame()\n{batch}")),
+            Ok(batch) => {
+                if num_rows > 10 {
+                    Ok(format!("DataFrame()\n{batch}\n... and additional rows"))
+                } else {
+                    Ok(format!("DataFrame()\n{batch}"))
+                }
+            }
             Err(err) => Ok(format!("Error: {:?}", err.to_string())),
         }
     }
+
+
     fn _repr_html_(&self, py: Python) -> PyDataFusionResult<String> {
         let mut html_str = "<table border='1'>\n".to_string();
-
-        let df = self.df.as_ref().clone().limit(0, Some(10))?;
+
+        // Limit to the first 11 rows
+        let df = self.df.as_ref().clone().limit(0, Some(11))?;
         let batches = wait_for_future(py, df.collect())?;
-
+
+        // If there are no rows, close the table and return
         if batches.is_empty() {
             html_str.push_str("</table>\n");
             return Ok(html_str);
         }
-
+
+        // Get schema for headers
         let schema = batches[0].schema();
-
+
         let mut header = Vec::new();
         for field in schema.fields() {
-            header.push(format!("<th>{}</th>", field.name()));
+            header.push(format!("<th>{}</th>", field.name()));
         }
         let header_str = header.join("");
         html_str.push_str(&format!("<tr>{}</tr>\n", header_str));
-
-        for batch in batches {
+
+        // Flatten rows and format them as HTML
+        let mut total_rows = 0;
+        for batch in &batches {
+            total_rows += batch.num_rows();
             let formatters = batch
                 .columns()
                 .iter()
                 .map(|c| ArrayFormatter::try_new(c.as_ref(), &FormatOptions::default()))
-                .map(|c| {
-                    c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string())))
-                })
+                .map(|c| c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string()))))
                 .collect::<Result<Vec<_>, _>>()?;
-
-            for row in 0..batch.num_rows() {
+
+            let num_rows_to_render = if total_rows > 10 { 10 } else { batch.num_rows() };
+
+            for row in 0..num_rows_to_render {
                 let mut cells = Vec::new();
                 for formatter in &formatters {
                     cells.push(format!("<td>{}</td>", formatter.value(row)));
                 }
                 let row_str = cells.join("");
                 html_str.push_str(&format!("<tr>{}</tr>\n", row_str));
             }
-        }
+            if total_rows >= 10 {
+                break;
+            }

Review Comment: How about simplifying to:

```rust
let rows_remaining = 10 - total_rows;
let rows_in_batch = batch.num_rows().min(rows_remaining);

for row in 0..rows_in_batch {
    html_str.push_str("<tr>");
    for col in batch.columns() {
        let formatter = ArrayFormatter::try_new(col.as_ref(), &FormatOptions::default())?;
        html_str.push_str("<td>");
        html_str.push_str(&formatter.value(row).to_string());
        html_str.push_str("</td>");
    }
    html_str.push_str("</tr>\n");
}
total_rows += rows_in_batch;
```

Reasons:
1. More Accurate Row Limiting: Before, total_rows was updated before checking the row limit, which could resul
Re: [I] `SessionContext::create_physical_expr` does not unwrap casts (and thus is not always optimal) [datafusion]
ion-elgreco commented on issue #14987: URL: https://github.com/apache/datafusion/issues/14987#issuecomment-2696634942

@jayzhan211 it's using the expression simplifier:

```
let logical_filter = self.filter.map(|expr| {
    // Simplify the expression first
    let props = ExecutionProps::new();
    let simplify_context =
        SimplifyContext::new(&props).with_schema(df_schema.clone().into());
    let simplifier = ExprSimplifier::new(simplify_context).with_max_cycles(10);
    let simplified = simplifier.simplify(expr).unwrap();

    context
        .create_physical_expr(simplified, &df_schema)
        .unwrap()
});
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] add method SessionStateBuilder::new_with_defaults() [datafusion]
alamb commented on code in PR #14998: URL: https://github.com/apache/datafusion/pull/14998#discussion_r1979267227 ## datafusion/core/src/execution/session_state.rs: ## @@ -1109,6 +1109,22 @@ impl SessionStateBuilder { .with_table_function_list(SessionStateDefaults::default_table_functions()) } +/// Returns a new [`SessionStateBuilder`] with default features. +/// +/// This is equivalent to calling [`Self::new()`] followed by [`Self::with_default_features()`]. +/// +/// ``` +/// use datafusion::execution::session_state::SessionStateBuilder; +/// +/// // Create a new SessionState with default features +/// let session_state = SessionStateBuilder::new_with_defaults() +/// .with_session_id("my_session".to_string()) +/// .build(); +/// ``` +pub fn new_with_defaults() -> Self { Review Comment: To keep the name consistent (and make it harder to confuse this with `default()`) could we please name this `default_features`? Like: ```suggestion pub fn new_with_default_features() -> Self { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Update `SessionStateBuilder::with_default_features` does not replace existing features [datafusion]
alamb merged PR #14935: URL: https://github.com/apache/datafusion/pull/14935 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] minor: `SessionStateBuilder::with_default_features` ergonomics [datafusion]
alamb closed issue #14899: minor: `SessionStateBuilder::with_default_features` ergonomics URL: https://github.com/apache/datafusion/issues/14899 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] add method SessionStateBuilder::new_with_defaults() [datafusion]
alamb commented on PR #14998: URL: https://github.com/apache/datafusion/pull/14998#issuecomment-2697248930 > Why don't we just implement Default trait? There is some more backstory here: - https://github.com/apache/datafusion/pull/14935#pullrequestreview-2654150748 Basically the question is what "defaults" means -- is default an empty builder or does it have all the default functions, etc. An alternate design might be to always create SessionStateBuilder with all default functions and have a `clear()` method or something to reset the state. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
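To make the trade-off in that comment concrete, here is a small hedged sketch of the two notions of "default" being weighed. The `with_default_features()` call exists today; the `clear()` reset mentioned above is hypothetical and is shown only as a commented-out idea.

```rust
use datafusion::execution::session_state::SessionStateBuilder;

fn main() {
    // Option A (current behaviour): `new()` is an empty builder, and callers
    // opt in to the default catalog/functions explicitly.
    let _empty = SessionStateBuilder::new().build();
    let _with_defaults = SessionStateBuilder::new().with_default_features().build();

    // Option B (alternate design floated above, sketch only): always start
    // from the default feature set and offer a hypothetical `clear()` to
    // reset the state.
    // let _reset = SessionStateBuilder::new().with_default_features().clear();
}
```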
Re: [PR] chore: Update `SessionStateBuilder::with_default_features` does not replace existing features [datafusion]
alamb commented on PR #14935: URL: https://github.com/apache/datafusion/pull/14935#issuecomment-2697244168 Thanks again @irenjj and @milenkovicm It seems as if @shruti2522 already has a PR to implement the new_with_default_features - https://github.com/apache/datafusion/pull/14998 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix documentation warnings and error if anymore occur [datafusion]
AmosAidoo commented on PR #14952: URL: https://github.com/apache/datafusion/pull/14952#issuecomment-2697309878 @alamb my pleasure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Implement `tree` explain for `FilterExec` [datafusion]
alamb opened a new issue, #15000: URL: https://github.com/apache/datafusion/issues/15000

### Is your feature request related to a problem or challenge?

- Part of https://github.com/apache/datafusion/issues/14914

@irenjj added a new `tree` explain mode in https://github.com/apache/datafusion/pull/14677. Now we need to add support for different types of operators.

```
set datafusion.explain.format = 'tree';
create table foo(x int, y int) as values (1,2), (3,4);
explain select * from foo where x = 4;

+---------------+------------------------------------+
| plan_type     | plan                               |
+---------------+------------------------------------+
| logical_plan  | Filter: foo.x = Int32(4)           |
|               |   TableScan: foo projection=[x, y] |
| physical_plan | ┌───────────────────────────┐      |
|               | │    CoalesceBatchesExec    │      |
|               | └─────────────┬─────────────┘      |
|               | ┌─────────────┴─────────────┐      |
|               | │        FilterExec         │      |
|               | └─────────────┬─────────────┘      |
|               | ┌─────────────┴─────────────┐      |
|               | │       DataSourceExec      │      |
|               | │                           │      |
|               | │    partition_sizes: [1]   │      |
|               | │       partitions: 1       │      |
|               | └───────────────────────────┘      |
|               |                                    |
+---------------+------------------------------------+
```

### Describe the solution you'd like

Add the `tree` format to the named ExecutionPlan. Roughly speaking, the process goes like:

1. Add the relevant code to the operator
2. Add / update the explain_tree.slt test to show the new code in action

You can run the tests like:

```shell
cargo test --test sqllogictests -- explain_tree
```

You can update the test like this:

```shell
cargo test --test sqllogictests -- explain_tree --complete
```

### Describe alternatives you've considered

Here is an example PR: TODO

### Additional context

_No response_

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
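As a rough illustration of step 1 ("add the relevant code to the operator"): an operator's `DisplayAs::fmt_as` implementation can branch on the tree format. This is a hedged sketch only; the `DisplayFormatType::TreeRender` variant name is an assumption based on the tree mode added in #14677, so verify it against the actual enum before copying.

```rust
use std::fmt;
use datafusion::physical_plan::DisplayFormatType;

// Sketch only: what a FilterExec-style operator might print for each format.
// In tree mode the framework draws the box and the operator name, so the
// operator emits just its interesting details, one per line.
fn fmt_as_sketch(
    t: DisplayFormatType,
    f: &mut fmt::Formatter<'_>,
    predicate: &str,
) -> fmt::Result {
    match t {
        // Assumed variant name for the new tree mode (see note above).
        DisplayFormatType::TreeRender => writeln!(f, "predicate: {predicate}"),
        // Default / verbose output keeps the existing single-line form.
        _ => write!(f, "FilterExec: {predicate}"),
    }
}
```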
Re: [PR] fix: Executor memory overhead overriding [datafusion-comet]
wForget commented on code in PR #1462: URL: https://github.com/apache/datafusion-comet/pull/1462#discussion_r1979412638 ## spark/src/test/scala/org/apache/spark/CometPluginsSuite.scala: ## @@ -143,3 +143,36 @@ class CometPluginsNonOverrideSuite extends CometTestBase { assert(execMemOverhead4 == "2G") } } + +class CometPluginsUnifiedModeOverrideSuite extends CometTestBase { + override protected def sparkConf: SparkConf = { +val conf = new SparkConf() +conf.set("spark.driver.memory", "1G") +conf.set("spark.executor.memory", "1G") +conf.set("spark.executor.memoryOverhead", "1G") +conf.set("spark.plugins", "org.apache.spark.CometPlugin") +conf.set("spark.comet.enabled", "true") +conf.set("spark.memory.offHeap.enabled", "true") +conf.set("spark.memory.offHeap.size", "2G") +conf.set("spark.comet.exec.shuffle.enabled", "true") +conf.set("spark.comet.exec.enabled", "true") +conf.set("spark.comet.memory.overhead.factor", "0.5") +conf + } + + /** Since using unified memory, executor memory should not be overridden */ Review Comment: > That one should be fine, they passed with current changes to overriding spark executor memory, but fail without them. Indeed, I successfully run this test locally, but this comment seems incorrect -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement `tree` explain for FilterExec [datafusion]
irenjj commented on code in PR #15001: URL: https://github.com/apache/datafusion/pull/15001#discussion_r1979377790 ## datafusion/sqllogictest/test_files/explain_tree.slt: ## @@ -185,6 +188,32 @@ physical_plan 26)│ DataSourceExec ││ DataSourceExec │ 27)└───┘└───┘ +# Long Filter (demonstrate what happens with wrapping) +query TT +explain SELECT int_col FROM table1 +WHERE string_col != 'foo' AND string_col != 'bar' AND string_col != 'a really long string constant' +; + +logical_plan +01)Projection: table1.int_col +02)--Filter: table1.string_col != Utf8("foo") AND table1.string_col != Utf8("bar") AND table1.string_col != Utf8("a really long string constant") +03)TableScan: table1 projection=[int_col, string_col], partial_filters=[table1.string_col != Utf8("foo"), table1.string_col != Utf8("bar"), table1.string_col != Utf8("a really long string constant")] +physical_plan +01)┌───┐ +02)│CoalesceBatchesExec│ +03)└─┬─┘ +04)┌─┴─┐ +05)│ FilterExec│ +06)│ │ +07)│ predicate:│ +08)│string_col@1 != foo AND ...│ Review Comment: This is a very nice case for `// TODO: check every line is less than MAX_LINE_RENDER_SIZE`. In DuckDB, we can get the following result: ``` D explain select * from t1 where c != 'foo' and c != 'bar' and c != 'a really long string constant'; ┌─┐ │┌───┐│ ││ Physical Plan ││ │└───┘│ └─┘ ┌───┐ │ SEQ_SCAN │ │ │ │ Table: t1 │ │ Type: Sequential Scan │ │ Projections: c │ │ │ │ Filters: │ │ c!='foo' AND c!='bar' AND │ │ c!='a really long string │ │ constant'│ │ │ │ ~1 Rows │ └───┘ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Expose to `AccumulatorArgs` whether all the groups are sorted [datafusion]
alamb commented on issue #14991: URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2697939824 In the case where the data is already sorted on the group expressions, it should use a different grouping operator, specifically https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/order/struct.GroupOrderingFull.html I don't think `GroupOrderingFull` even uses GroupsAccumulator (it just uses Accumulator I think) So in other words, I don't think passing the order to groups accumulator will help There is another version where the data is grouped on a prefix of the grouping keys: https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/order/struct.GroupOrderingPartial.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: remove code duplication in native_datafusion and native_iceberg_compat implementations [datafusion-comet]
parthchandra commented on PR #1443: URL: https://github.com/apache/datafusion-comet/pull/1443#issuecomment-2698007418 > Would it be difficult to split this PR? The title makes it seem like a minor refactor to the ParquetExec instantiation, but it's actually bringing in a lot of the object store-related logic. Let me do that to make it easier for you (will take me a few days to get to it). @comphead I'll also address your comments. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: add read array support [datafusion-comet]
andygrove commented on PR #1456: URL: https://github.com/apache/datafusion-comet/pull/1456#issuecomment-2698077082 Some tests are failing due to https://github.com/apache/datafusion-comet/issues/1289 I think the root cause is that we are trying to shuffle with arrays and Comet shuffle does not support arrays yet. We need to fall back to Spark for these shuffles. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Unsupported Arrow Vector for export: class org.apache.arrow.vector.complex.ListVector [datafusion-comet]
andygrove commented on issue #1289: URL: https://github.com/apache/datafusion-comet/issues/1289#issuecomment-2698079511 I think the root cause is that we are trying to shuffle with complex types and Comet shuffle does not support complex types yet. We need to fall back to Spark for these shuffles. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: add read array support [datafusion-comet]
andygrove commented on PR #1456: URL: https://github.com/apache/datafusion-comet/pull/1456#issuecomment-2698094548 In `CometExecRule` we check to see if we support the partitioning types for the shuffle but do not check that we support the types of other columns. @comphead Do you want to update these checks as part of this PR and see if it resolves the issue? ```scala case class CometExecRule(session: SparkSession) extends Rule[SparkPlan] { private def applyCometShuffle(plan: SparkPlan): SparkPlan = { plan.transformUp { case s: ShuffleExchangeExec if isCometPlan(s.child) && isCometNativeShuffleMode(conf) && QueryPlanSerde.supportPartitioning(s.child.output, s.outputPartitioning)._1 => ... case s: ShuffleExchangeExec if (!s.child.supportsColumnar || isCometPlan(s.child)) && isCometJVMShuffleMode( conf) && QueryPlanSerde.supportPartitioningTypes(s.child.output, s.outputPartitioning)._1 && !isShuffleOperator(s.child) => ... ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add all missing table options to be handled in any order [datafusion-sqlparser-rs]
iffyio commented on PR #1747: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1747#issuecomment-2697471150 Marking the PR as draft in the meantime as it is not currently pending review, @benrsatori feel free to undraft and ping when ready! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] add manual trigger for extended tests in pull requests [datafusion]
alamb commented on code in PR #14331: URL: https://github.com/apache/datafusion/pull/14331#discussion_r1979614246 ## .github/workflows/extended.yml: ## @@ -33,16 +33,46 @@ on: push: branches: - main + issue_comment: +types: [created] + +permissions: + pull-requests: write jobs: + # Check issue comment and notify that extended tests are running + check_issue_comment: +name: Check issue comment +runs-on: ubuntu-latest +if: github.event.issue.pull_request && github.event.comment.body == 'run extended tests' +steps: + - uses: actions/github-script@v7 +with: + github-token: ${{secrets.GITHUB_TOKEN}} + script: | +github.rest.issues.createComment({ + issue_number: context.issue.number, + owner: context.repo.owner, + repo: context.repo.repo, + body: "Running extended tests..." +}) + # Check crate compiles and base cargo check passes linux-build-lib: name: linux build test runs-on: ubuntu-latest container: image: amd64/rust +if: | + github.event_name == 'push' || + (github.event_name == 'issue_comment' && github.event.issue.pull_request && github.event.comment.body == 'run extended tests') steps: - uses: actions/checkout@v4 +with: + # Check out the pull request branch if triggered by a comment + ref: ${{ github.event_name == 'issue_comment' && github.event.issue.pull_request.head.ref || github.ref }} Review Comment: Thanks @danila-b 🙏 🤔 - https://github.com/danila-b/datafusion/pull/3 doesn't seem to have run the extended tests https://github.com/danila-b/datafusion/pull/3#issuecomment-2697904032 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [EPIC] Complete `SQL EXPLAIN` Tree Rendering [datafusion]
alamb commented on issue #14914: URL: https://github.com/apache/datafusion/issues/14914#issuecomment-2697451179 I made a first PR to show how to add tree explain - https://github.com/apache/datafusion/pull/15001 Once we get that one in I plan to file a bunch of other tickets for the remaining `ExplainPlan` nodes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Snowflake: Support dollar quoted comment of table, view and field [datafusion-sqlparser-rs]
7phs opened a new pull request, #1755: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1755

Snowflake [supports](https://docs.snowflake.com/en/sql-reference/literals-table) dollar-quoted comments as string literals when creating tables, views, and their fields. For example:

```
CREATE OR REPLACE TEMPORARY VIEW foo.bar.baz (
    "COL_1" COMMENT $$comment 1$$
)
COMMENT = $$view comment$$
AS (
    SELECT 1
)
```

This pull request adds support for this type of comment.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] add method SessionStateBuilder::new_with_defaults() [datafusion]
shruti2522 commented on code in PR #14998: URL: https://github.com/apache/datafusion/pull/14998#discussion_r1979327789 ## datafusion/core/src/execution/session_state.rs: ## @@ -1109,6 +1109,22 @@ impl SessionStateBuilder { .with_table_function_list(SessionStateDefaults::default_table_functions()) } +/// Returns a new [`SessionStateBuilder`] with default features. +/// +/// This is equivalent to calling [`Self::new()`] followed by [`Self::with_default_features()`]. +/// +/// ``` +/// use datafusion::execution::session_state::SessionStateBuilder; +/// +/// // Create a new SessionState with default features +/// let session_state = SessionStateBuilder::new_with_defaults() +/// .with_session_id("my_session".to_string()) +/// .build(); +/// ``` +pub fn new_with_defaults() -> Self { Review Comment: done @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Implement `tree` explain for FilterExec [datafusion]
alamb opened a new pull request, #15001: URL: https://github.com/apache/datafusion/pull/15001 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/15000 ## Rationale for this change Let's have nice explain plans! I wanted an example of how to make a tree explain plan so that I can file a bunch of follow on tickets for https://github.com/apache/datafusion/issues/14914 to share the work ## What changes are included in this PR? 1. Update comments about `ExplainFormat` 2. Implement for `FilterExec` 3. Update tests ## Are these changes tested? Yes ## Are there any user-facing changes? tree explain is better -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Make `create_ordering` pub and add doc for it [datafusion]
xudong963 merged PR #14996: URL: https://github.com/apache/datafusion/pull/14996 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Executor memory overhead overriding [datafusion-comet]
LukMRVC commented on PR #1462: URL: https://github.com/apache/datafusion-comet/pull/1462#issuecomment-2697480797 That one should be fine: those tests pass with the current changes to overriding the Spark executor memory, but fail without them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement `tree` explain for FilterExec [datafusion]
irenjj commented on code in PR #15001: URL: https://github.com/apache/datafusion/pull/15001#discussion_r1979381479 ## datafusion/sqllogictest/test_files/explain_tree.slt: ## @@ -185,6 +188,32 @@ physical_plan 26)│ DataSourceExec ││ DataSourceExec │ 27)└───┘└───┘ +# Long Filter (demonstrate what happens with wrapping) +query TT +explain SELECT int_col FROM table1 +WHERE string_col != 'foo' AND string_col != 'bar' AND string_col != 'a really long string constant' +; + +logical_plan +01)Projection: table1.int_col +02)--Filter: table1.string_col != Utf8("foo") AND table1.string_col != Utf8("bar") AND table1.string_col != Utf8("a really long string constant") +03)TableScan: table1 projection=[int_col, string_col], partial_filters=[table1.string_col != Utf8("foo"), table1.string_col != Utf8("bar"), table1.string_col != Utf8("a really long string constant")] +physical_plan +01)┌───┐ +02)│CoalesceBatchesExec│ +03)└─┬─┘ +04)┌─┴─┐ +05)│ FilterExec│ +06)│ │ +07)│ predicate:│ +08)│string_col@1 != foo AND ...│ Review Comment: I will try to make it the same as DuckDB later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Implement Nicer / DuckDB style explain plans [datafusion]
alamb closed issue #9371: Implement Nicer / DuckDB style explain plans URL: https://github.com/apache/datafusion/issues/9371 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Add `tree` / pretty explain mode [datafusion]
alamb merged PR #14677: URL: https://github.com/apache/datafusion/pull/14677 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] `SessionContext::create_physical_expr` does not unwrap casts (and thus is not always optimal) [datafusion]
jayzhan211 commented on issue #14987: URL: https://github.com/apache/datafusion/issues/14987#issuecomment-2697401042 I see. We can try moving `UnwrapCastInComparison` into the `Simplifier`'s large pattern-matching rule. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Release DataFusion `46.0.0` [datafusion]
alamb commented on issue #14123: URL: https://github.com/apache/datafusion/issues/14123#issuecomment-2697426444

We have merged the fix here:
- https://github.com/apache/datafusion/pull/14990

To simplify voting, I suggest we backport that fix to the `branch-46` line and make a second release candidate (RC2). Is this something you can do @xudong963 ?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Add `tree` / pretty explain mode [datafusion]
alamb commented on PR #14677: URL: https://github.com/apache/datafusion/pull/14677#issuecomment-2697397568 🚀 thank you again so much @irenjj -- this is so cool. A great step towards better explain plans. I am working on enhancing the follow on ticket you filed in https://github.com/apache/datafusion/issues/14914 ❤️ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Refactor SortPushdown using the standard top-down visitor and using `EquivalenceProperties` [datafusion]
alamb commented on PR #14821: URL: https://github.com/apache/datafusion/pull/14821#issuecomment-2698190011 @berkaysynnada @wiedld and I had a brief meeting. The outcomes from my perspective is: 1. @berkaysynnada plans to review this PR and test on the synnada fork to ensure it still work 2. In parallel, @wiedld and I will work on increasing the test coverage of EnforceDistribution to ensure it covers our InfluxDB Iox plans -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Executor memory overhead overriding [datafusion-comet]
LukMRVC commented on code in PR #1462: URL: https://github.com/apache/datafusion-comet/pull/1462#discussion_r1979775668 ## spark/src/test/scala/org/apache/spark/CometPluginsSuite.scala: ## @@ -143,3 +143,36 @@ class CometPluginsNonOverrideSuite extends CometTestBase { assert(execMemOverhead4 == "2G") } } + +class CometPluginsUnifiedModeOverrideSuite extends CometTestBase { + override protected def sparkConf: SparkConf = { +val conf = new SparkConf() +conf.set("spark.driver.memory", "1G") +conf.set("spark.executor.memory", "1G") +conf.set("spark.executor.memoryOverhead", "1G") +conf.set("spark.plugins", "org.apache.spark.CometPlugin") +conf.set("spark.comet.enabled", "true") +conf.set("spark.memory.offHeap.enabled", "true") +conf.set("spark.memory.offHeap.size", "2G") +conf.set("spark.comet.exec.shuffle.enabled", "true") +conf.set("spark.comet.exec.enabled", "true") +conf.set("spark.comet.memory.overhead.factor", "0.5") +conf + } + + /** Since using unified memory, executor memory should not be overridden */ Review Comment: Yes, you are right. I overlooked it. I changed the comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Expose to `AccumulatorArgs` whether all the groups are sorted [datafusion]
rluvaton commented on issue #14991: URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2698202576

Actually it uses GroupAccumulator even if the input is fully sorted. You can see this by adding a breakpoint at https://github.com/apache/datafusion/blob/ac79ef3442e65f6197c7234da9fad964895b9101/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/prim_op.rs#L118 and running the following `slt`:

```slt
statement ok
CREATE TABLE test_table (
  col_i32 INT,
  col_u32 INT UNSIGNED
) as VALUES
  ( NULL,        NULL),
  ( -2147483648, 0),
  ( -2147483648, 0),
  ( 100,         100),
  ( 2147483647,  4294967295),
  ( NULL,        NULL),
  ( -2147483648, 0),
  ( -2147483648, 0),
  ( 100,         100),
  ( 2147483646,  4294967294),
  ( 2147483647,  4294967295 )

query II
select col_i32, sum(col_u32) sum_col_u32
from (select * from test_table order by col_i32 limit 10)
group by col_i32
----
2147483647 8589934590
-2147483648 0
100 200
2147483646 4294967294
NULL NULL
```

You will see that even though `InputOrderMode` is `Sorted`, the `GroupAccumulator` is still used. (I think we should be using the group accumulator for sorted or partially sorted data to avoid combining all the scalars from `Accumulator`s into an array.)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Epic: Statistics improvements [datafusion]
alamb commented on issue #8227: URL: https://github.com/apache/datafusion/issues/8227#issuecomment-2698552710

@clflushopt I don't really know, as I am not driving the statistics rework myself and thus don't have much visibility into what is planned.

The only remaining thing I know of that comes to mind is:
- https://github.com/apache/datafusion/issues/8229

but that is more of an API exercise rather than algorithms.

Another thing that comes to mind is to look at / extend the existing https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/src/intervals/cp_solver.rs code and show how it is useful.

I was playing around with duckdb this morning. Somehow it uses its constraint solver to simplify away redundant filters. For example, it correctly deduces `x > 1 AND x > 2 AND x > 3 AND x < 6 AND x < 7 AND x < 8` is the same as `x>3 AND x<6`:

```sql
create table foo(x int);
D insert into foo values (1);
D insert into foo values (5);
D explain SELECT * from foo where x > 1 AND x > 2 AND x > 3 AND x < 6 AND x < 7 AND x < 8;
┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│          SEQ_SCAN         │
│                           │
│         Table: foo        │
│   Type: Sequential Scan   │
│      Projections: x       │
│    Filters: x>3 AND x<6   │
│                           │
│          ~1 Rows          │
└───────────────────────────┘
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
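As a toy illustration of what such an interval/constraint pass does (this is not DataFusion's `cp_solver`, just a self-contained sketch of the idea): collapse a conjunction of range predicates on one column into the tightest lower and upper bound.

```rust
// Illustrative only: `x > 1 AND x > 2 AND x > 3 AND x < 6 AND x < 7 AND x < 8`
// reduces to the tightest equivalent bounds `x > 3 AND x < 6`.
fn tighten_bounds(greater_than: &[i64], less_than: &[i64]) -> Option<(i64, i64)> {
    let lower = greater_than.iter().copied().max()?; // strictest lower bound
    let upper = less_than.iter().copied().min()?;    // strictest upper bound
    Some((lower, upper))
}

fn main() {
    let (lo, hi) = tighten_bounds(&[1, 2, 3], &[6, 7, 8]).unwrap();
    assert_eq!((lo, hi), (3, 6)); // equivalent to x > 3 AND x < 6
    println!("x > {lo} AND x < {hi}");
}
```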
Re: [I] Substrait plan read relation baseSchema does not include the struct with type information [datafusion]
amoeba commented on issue #12244: URL: https://github.com/apache/datafusion/issues/12244#issuecomment-2699036156 The behavior was modified in #12245 and the original issue looks addressed but the plans I'm getting don't validate. DataFusion currently hardcodes struct nullability to `NULLABILITY_UNSPECIFIED`, see https://github.com/apache/datafusion/blob/dd0fd889ea603f929accb99002e2f99280823f5c/datafusion/substrait/src/logical_plan/producer.rs#L1042 The `baseSchema` I get with the latest datafusion looks like this, ```json "baseSchema": { "names": [ "species", "island", "bill_length_mm", "bill_depth_mm", "body_mass_g", "sex", "year" ], "struct": { "types": [ { "string": { "nullability": "NULLABILITY_NULLABLE" } }, { "string": { "nullability": "NULLABILITY_NULLABLE" } }, { "fp64": { "nullability": "NULLABILITY_NULLABLE" } }, { "fp64": { "nullability": "NULLABILITY_NULLABLE" } }, { "i32": { "nullability": "NULLABILITY_NULLABLE" } }, { "string": { "nullability": "NULLABILITY_NULLABLE" } }, { "i32": { "nullability": "NULLABILITY_NULLABLE" } } ] } }, ``` The above implicitly sets the nullability on the baseSchema to `NULLABILITY_UNSPECIFIED` which, when validated as part of a larger plan, errors out with: ``` Error (code 0002): at plan.relations[0].rel_type.input.rel_type.input.rel_type.base_schema.struct.nullability: illegal value: nullability information is required in this context (code 0002) ``` I can get the plan to validate if I set nullability to `NULLABILITY_REQUIRED`, ``` "baseSchema": { "names": [...], "struct": { "types": [...], "nullability": "NULLABILITY_REQUIRED" } }, ``` Is the validator right and should DataFusion change its behavior here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Add identation to EnforceDistribution test plans. [datafusion]
alamb commented on code in PR #15007: URL: https://github.com/apache/datafusion/pull/15007#discussion_r1980306194 ## datafusion/core/tests/physical_optimizer/enforce_distribution.rs: ## @@ -572,33 +573,33 @@ fn multi_hash_joins() -> Result<()> { // Should include 3 RepartitionExecs JoinType::Inner | JoinType::Left | JoinType::LeftSemi | JoinType::LeftAnti | JoinType::LeftMark => vec![ top_join_plan.as_str(), -join_plan.as_str(), -"RepartitionExec: partitioning=Hash([a@0], 10), input_partitions=10", -"RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1", -"DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, c, d, e], file_type=parquet", -"RepartitionExec: partitioning=Hash([b1@1], 10), input_partitions=10", -"RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1", -"ProjectionExec: expr=[a@0 as a1, b@1 as b1, c@2 as c1, d@3 as d1, e@4 as e1]", -"DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, c, d, e], file_type=parquet", -"RepartitionExec: partitioning=Hash([c@2], 10), input_partitions=10", -"RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1", -"DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, c, d, e], file_type=parquet", +&join_plan_indent2, Review Comment: ❤️ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Some aggregates silently ignore `IGNORE NULLS` and `ORDER BY` on arguments [datafusion]
vbarua commented on issue #9924: URL: https://github.com/apache/datafusion/issues/9924#issuecomment-2698960345 I opened up an issue that's potentially related to this when it comes to supporting IGNORE NULLS https://github.com/apache/datafusion/issues/15006 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Substrait plan read relation baseSchema does not include the struct with type information [datafusion]
Blizzara commented on issue #12244: URL: https://github.com/apache/datafusion/issues/12244#issuecomment-2699060966 I didn't see anywhere on substrait.io where it's explicitly stated, but the example there (https://substrait.io/tutorial/sql_to_substrait/#types-and-schemas) does use NULLABILITY_REQUIRED. I don't think it makes any difference in practice, given it's the fields inside that struct type that matter, not the root struct itself. If the validator is happier with NULLABILITY_REQUIRED, I'm fine with changing it to that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Upgrade to DataFusion 46.0.0-rc2 [datafusion-comet]
andygrove commented on PR #1423: URL: https://github.com/apache/datafusion-comet/pull/1423#issuecomment-2699132289 @kazuyukitanimura I'd like to go ahead and merge this now to give us a few days to test with D 46 before the final release. WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Do not double alias Exprs [datafusion]
alamb opened a new pull request, #15008: URL: https://github.com/apache/datafusion/pull/15008 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/14895 ## Rationale for this change Nested `Expr::Alias` is confusing for display and for dealing with code ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Substrait plan read relation baseSchema does not include the struct with type information [datafusion]
amoeba commented on issue #12244: URL: https://github.com/apache/datafusion/issues/12244#issuecomment-2699149322 Thanks @Blizzara. I'll file a PR to change this soon and we can close this issue out. I asked for some clarification on the Substrait Slack so we can be sure it's the right change and I'll create a PR once I'm sure. Hopefully also update the Substrait docs too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement `tree` explain for FilterExec [datafusion]
alamb commented on code in PR #15001: URL: https://github.com/apache/datafusion/pull/15001#discussion_r1979586990 ## datafusion/sqllogictest/test_files/explain_tree.slt: ## @@ -185,6 +188,32 @@ physical_plan 26)│ DataSourceExec ││ DataSourceExec │ 27)└───┘└───┘ +# Long Filter (demonstrate what happens with wrapping) +query TT +explain SELECT int_col FROM table1 +WHERE string_col != 'foo' AND string_col != 'bar' AND string_col != 'a really long string constant' +; + +logical_plan +01)Projection: table1.int_col +02)--Filter: table1.string_col != Utf8("foo") AND table1.string_col != Utf8("bar") AND table1.string_col != Utf8("a really long string constant") +03)TableScan: table1 projection=[int_col, string_col], partial_filters=[table1.string_col != Utf8("foo"), table1.string_col != Utf8("bar"), table1.string_col != Utf8("a really long string constant")] +physical_plan +01)┌───┐ +02)│CoalesceBatchesExec│ +03)└─┬─┘ +04)┌─┴─┐ +05)│ FilterExec│ +06)│ │ +07)│ predicate:│ +08)│string_col@1 != foo AND ...│ Review Comment: Awesome -- thank you. Maybe we can file a ticket to track the idea? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Weekly Plan (Andrew Lamb) March 3, 2025 [datafusion]
alamb commented on issue #14978: URL: https://github.com/apache/datafusion/issues/14978#issuecomment-2697852067

DataFusion: Bugs/UX/Performance
- [ ] https://github.com/apache/datafusion/pull/14331
- [ ] https://github.com/apache/datafusion/pull/14935
- [ ] https://github.com/apache/datafusion/pull/14952
- [ ] https://github.com/apache/datafusion/pull/14995
- [ ] https://github.com/apache/datafusion/pull/14994

DataFusion: New Features / Functions
- [ ] https://github.com/apache/datafusion/pull/14951
- [ ] https://github.com/apache/datafusion/pull/14954
- [ ] https://github.com/apache/datafusion/pull/14413
- [ ] https://github.com/apache/datafusion/pull/14595
- [ ] https://github.com/apache/datafusion/pull/14196
- [ ] https://github.com/apache/datafusion/pull/14687
- [ ] https://github.com/apache/datafusion/pull/14412
- [ ] https://github.com/apache/datafusion/pull/14837

arrow-rs
- [ ] https://github.com/apache/arrow-rs/pull/
- [ ] https://github.com/apache/arrow-rs/pull/6965
- [ ] https://github.com/apache/arrow-rs/pull/6637
- [ ] https://github.com/apache/arrow-rs/pull/7177
- [ ] https://github.com/apache/arrow-rs/pull/7179
- [ ] https://github.com/apache/arrow-rs/pull/7191

Stuck PRs that need help pushing to completion
- [ ] https://github.com/apache/datafusion/pull/14194
- [ ] https://github.com/apache/datafusion/pull/14222
- [ ] https://github.com/apache/datafusion/pull/14362 / https://github.com/apache/datafusion/pull/14057

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Expr simplifier doesn't simplify exprs that are same if you swap lhs with rhs regardless of cycles [datafusion]
alamb commented on issue #14943: URL: https://github.com/apache/datafusion/issues/14943#issuecomment-2697865461 @jayzhan211 has a PR to fix this here: - https://github.com/apache/datafusion/pull/14994 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Expr simplifier doesn't simplify exprs that are same if you swap lhs with rhs regardless of cycles [datafusion]
alamb commented on issue #14943: URL: https://github.com/apache/datafusion/issues/14943#issuecomment-2697880588 > We simplify A = B and B = A in common_sub_expression_eliminate but not simplify_expressions I wonder if this is another example of code that would be beneficial to pull into simplify_expressions 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
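A minimal standalone sketch of the canonicalization idea being discussed, not DataFusion's actual `Expr` API (the tiny `Expr` enum and the `canonicalize` function below are illustrative only): if a symmetric comparison always orders its operands deterministically, `A = B` and `B = A` normalize to the same expression, so a simplifier or common-subexpression pass can treat them as duplicates.

```rust
// Illustrative-only expression type; DataFusion's real Expr is much richer.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Expr {
    Column(String),
    Literal(i64),
    Eq(Box<Expr>, Box<Expr>),
}

/// Recursively order the operands of `=` so equivalent forms compare equal.
fn canonicalize(expr: Expr) -> Expr {
    match expr {
        Expr::Eq(l, r) => {
            let (l, r) = (canonicalize(*l), canonicalize(*r));
            // `=` is symmetric, so pick a deterministic operand order.
            if l <= r {
                Expr::Eq(Box::new(l), Box::new(r))
            } else {
                Expr::Eq(Box::new(r), Box::new(l))
            }
        }
        other => other,
    }
}

fn main() {
    let a_eq_b = Expr::Eq(
        Box::new(Expr::Column("a".into())),
        Box::new(Expr::Column("b".into())),
    );
    let b_eq_a = Expr::Eq(
        Box::new(Expr::Column("b".into())),
        Box::new(Expr::Column("a".into())),
    );
    // Both spellings become the same expression after canonicalization.
    assert_eq!(canonicalize(a_eq_b), canonicalize(b_eq_a));
}
```

The same trick extends to other commutative operators (`AND`, `OR`, `+`, `*`), which is one reason folding it into a shared simplification pass is attractive.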
Re: [PR] Simplify Between expression to Eq [datafusion]
alamb commented on code in PR #14994: URL: https://github.com/apache/datafusion/pull/14994#discussion_r1979605743 ## datafusion/sqllogictest/test_files/simplify_expr.slt: ## @@ -0,0 +1,34 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + Review Comment: I like the idea of using slt for simplification testing -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] add manual trigger for extended tests in pull requests [datafusion]
alamb commented on code in PR #14331: URL: https://github.com/apache/datafusion/pull/14331#discussion_r1979614246 ## .github/workflows/extended.yml: ## @@ -33,16 +33,46 @@ on: push: branches: - main + issue_comment: +types: [created] + +permissions: + pull-requests: write jobs: + # Check issue comment and notify that extended tests are running + check_issue_comment: +name: Check issue comment +runs-on: ubuntu-latest +if: github.event.issue.pull_request && github.event.comment.body == 'run extended tests' +steps: + - uses: actions/github-script@v7 +with: + github-token: ${{secrets.GITHUB_TOKEN}} + script: | +github.rest.issues.createComment({ + issue_number: context.issue.number, + owner: context.repo.owner, + repo: context.repo.repo, + body: "Running extended tests..." +}) + # Check crate compiles and base cargo check passes linux-build-lib: name: linux build test runs-on: ubuntu-latest container: image: amd64/rust +if: | + github.event_name == 'push' || + (github.event_name == 'issue_comment' && github.event.issue.pull_request && github.event.comment.body == 'run extended tests') steps: - uses: actions/checkout@v4 +with: + # Check out the pull request branch if triggered by a comment + ref: ${{ github.event_name == 'issue_comment' && github.event.issue.pull_request.head.ref || github.ref }} Review Comment: 🤔 - https://github.com/danila-b/datafusion/pull/3 doesn't seem to have run the extended tests https://github.com/danila-b/datafusion/pull/3#issuecomment-2697904032 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Expr simplifier doesn't simplify exprs that are same if you swap lhs with rhs regardless of cycles [datafusion]
ion-elgreco commented on issue #14943: URL: https://github.com/apache/datafusion/issues/14943#issuecomment-2698150878 > [@jayzhan211](https://github.com/jayzhan211) has a PR to fix this here: > > * [Simplify Between expression to Eq #14994](https://github.com/apache/datafusion/pull/14994) If the fix gets included in DF46, then we can give it a spin to see if it fully resolves the issue -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [EPIC] Complete `SQL EXPLAIN` Tree Rendering [datafusion]
milenkovicm commented on issue #14914: URL: https://github.com/apache/datafusion/issues/14914#issuecomment-2698392966 This looks great! Would it be possible to extend this with SVG output? It would be great for the Ballista UI. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] [Question] Skip Partial stage aggregation stage for sorted input [datafusion]
rluvaton opened a new issue, #15002: URL: https://github.com/apache/datafusion/issues/15002 If the input is sorted by the group keys, can't we just skip the partial aggregation stage? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
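A standalone sketch of the idea behind the question, not DataFusion's aggregation code (the function name and the sum-only aggregate are illustrative assumptions): when rows arrive already ordered by the group key, a group is complete the moment the key changes, so a single streaming pass can finalize groups as it goes instead of building hash-table state in a partial stage.

```rust
// Streaming group-by over input that is already sorted by the group key.
// Once the key changes, the previous group can never receive more rows,
// so it is final and could be emitted downstream immediately.
fn sorted_group_sums(rows: &[(&str, i64)]) -> Vec<(String, i64)> {
    let mut out: Vec<(String, i64)> = Vec::new();
    for &(key, value) in rows {
        match out.last_mut() {
            // Still inside the current group: fold the value in.
            Some((k, sum)) if k.as_str() == key => *sum += value,
            // Key changed (or first row): start a new group.
            _ => out.push((key.to_string(), value)),
        }
    }
    out
}

fn main() {
    // Input sorted by the group key.
    let rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)];
    assert_eq!(
        sorted_group_sums(&rows),
        vec![
            ("a".to_string(), 3),
            ("b".to_string(), 5),
            ("c".to_string(), 7)
        ]
    );
}
```

In a real plan the answer also depends on how the data is partitioned across the partial and final stages, which is presumably part of what this issue needs to work out.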
Re: [PR] Fix: to_char Function Now Correctly Handles DATE Values in DataFusion [datafusion]
alamb commented on code in PR #14970: URL: https://github.com/apache/datafusion/pull/14970#discussion_r1979911036 ## datafusion/functions/src/datetime/to_char.rs: ## @@ -234,15 +226,21 @@ fn _to_char_scalar( }; let formatter = ArrayFormatter::try_new(array.as_ref(), &format_options)?; -let formatted: Result, ArrowError> = (0..array.len()) -.map(|i| formatter.value(i).try_to_string()) +let formatted: Result>, ArrowError> = (0..array.len()) Review Comment: 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Improve EnforceDistribution testings. [datafusion]
wiedld commented on issue #15003: URL: https://github.com/apache/datafusion/issues/15003#issuecomment-2698352844 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Scrollable python notebook table rendering [datafusion-python]
kevinjqliu commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r1979772203 ## src/dataframe.rs: ## @@ -100,46 +106,153 @@ impl PyDataFrame { } fn _repr_html_(&self, py: Python) -> PyDataFusionResult { -let mut html_str = "\n".to_string(); - -let df = self.df.as_ref().clone().limit(0, Some(10))?; -let batches = wait_for_future(py, df.collect())?; - +let (batches, mut has_more) = +wait_for_future(py, get_first_few_record_batches(self.df.as_ref().clone()))?; +let Some(batches) = batches else { +return Ok("No data to display".to_string()); +}; if batches.is_empty() { -html_str.push_str("\n"); -return Ok(html_str); +// This should not be reached, but do it for safety since we index into the vector below +return Ok("No data to display".to_string()); } +let table_uuid = uuid::Uuid::new_v4().to_string(); + +let mut html_str = " + +.expandable-container { +display: inline-block; +max-width: 200px; +} +.expandable { +white-space: nowrap; +overflow: hidden; +text-overflow: ellipsis; +display: block; +} +.full-text { +display: none; +white-space: normal; +} +.expand-btn { +cursor: pointer; +color: blue; +text-decoration: underline; +border: none; +background: none; +font-size: inherit; +display: block; +margin-top: 5px; +} + + + + +\n".to_string(); + let schema = batches[0].schema(); let mut header = Vec::new(); for field in schema.fields() { -header.push(format!("{}", field.name())); +header.push(format!("{}", field.name())); } let header_str = header.join(""); -html_str.push_str(&format!("{}\n", header_str)); +html_str.push_str(&format!("{}\n", header_str)); + +let batch_formatters = batches +.iter() +.map(|batch| { +batch +.columns() +.iter() +.map(|c| ArrayFormatter::try_new(c.as_ref(), &FormatOptions::default())) +.map(|c| { +c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string( +}) +.collect::, _>>() +}) +.collect::, _>>()?; + +let total_memory: usize = batches +.iter() +.map(|batch| batch.get_array_memory_size()) +.sum(); +let rows_per_batch = batches.iter().map(|batch| batch.num_rows()); +let total_rows = rows_per_batch.clone().sum(); + +// let (total_memory, total_rows) = batches.iter().fold((0, 0), |acc, batch| { +// (acc.0 + batch.get_array_memory_size(), acc.1 + batch.num_rows()) +// }); Review Comment: nit remove commented out code -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] _repr_ and _html_repr_ show '... and additional rows' message [datafusion-python]
kevinjqliu commented on PR #1041: URL: https://github.com/apache/datafusion-python/pull/1041#issuecomment-2698204265 btw #1036 also changes `_repr_html_` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] add manual trigger for extended tests in pull requests [datafusion]
danila-b commented on code in PR #14331: URL: https://github.com/apache/datafusion/pull/14331#discussion_r1979865173 ## .github/workflows/extended.yml: ## @@ -33,16 +33,46 @@ on: push: branches: - main + issue_comment: +types: [created] + +permissions: + pull-requests: write jobs: + # Check issue comment and notify that extended tests are running + check_issue_comment: +name: Check issue comment +runs-on: ubuntu-latest +if: github.event.issue.pull_request && github.event.comment.body == 'run extended tests' +steps: + - uses: actions/github-script@v7 +with: + github-token: ${{secrets.GITHUB_TOKEN}} + script: | +github.rest.issues.createComment({ + issue_number: context.issue.number, + owner: context.repo.owner, + repo: context.repo.repo, + body: "Running extended tests..." +}) + # Check crate compiles and base cargo check passes linux-build-lib: name: linux build test runs-on: ubuntu-latest container: image: amd64/rust +if: | + github.event_name == 'push' || + (github.event_name == 'issue_comment' && github.event.issue.pull_request && github.event.comment.body == 'run extended tests') steps: - uses: actions/checkout@v4 +with: + # Check out the pull request branch if triggered by a comment + ref: ${{ github.event_name == 'issue_comment' && github.event.issue.pull_request.head.ref || github.ref }} Review Comment: Apparently it did run them, it's just not visible in GitHub UI. I have just double checked it with a bit different implementation on the same branch. Unfortunately, it seems that `issue_comment` event trigger is associated with default branch, and hence it does not add anything to the checks or UI of the PR itself. I can take a look if there is a better way to launch check associated with PR during the weekend, but from the top of my head, there are a couple of options: 1. Use `workflow_dispatch` event and trigger extended tests for the branch from the GitHub actions UI tab instead of doing so by comment on the PR 2. Use [3rd party action](https://github.com/myrotvorets/set-commit-status-action) to set check statuses when the run is triggered by a comment 3. Maybe [slash command 3rd party action](https://github.com/marketplace/actions/slash-command-dispatch) can also work for triggering commands a bit better since it uses dispatch event under the hood as well -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] [WIP] chore: Add detailed error for sum::coerce_type [datafusion]
dentiny commented on PR #14710: URL: https://github.com/apache/datafusion/pull/14710#issuecomment-2698883962 I'm really sorry for the delay; I got stuck with something else. I will get back to the PR by EoW. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Release sqlparser-rs version `0.55.0` [datafusion-sqlparser-rs]
comphead commented on issue #1671: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1671#issuecomment-2698903812 +1 Verified on M3 Mac Thanks Andrew 27 February 2025, 09:37:51, From "Andrew Lamb": I made a release candidate. Here is the draft email. However, the dist.apache.org site appears to be down so I will try again later Hi, I would like to propose a release of Apache DataFusion sqlparser-rs version 0.55.0. This release candidate is based on commit: ed416548dcfe4a73a3240bbf625fb9010a4925c8 [1] The proposed release tarball and signatures are hosted at [2]. The changelog is located at [3]. Please download, verify checksums and signatures, run the unit tests, and vote on the release. The vote will be open for at least 72 hours. Only votes from PMC members are binding, but all members of the community are encouraged to test the release and vote with "(non-binding)". The standard verification procedure is documented at https://github.com/apache/datafusion-sqlparser-rs/blob/main/dev/release/README.md#verifying-release-candidates. [ ] +1 Release this as Apache DataFusion sqlparser-rs 0.55.0 [ ] +0 [ ] -1 Do not release this as Apache DataFusion sqlparser-rs 0.55.0 because... Here is my vote: +1 [1]: https://github.com/apache/datafusion-sqlparser-rs/tree/ed416548dcfe4a73a3240bbf625fb9010a4925c8 [2]: https://dist.apache.org/repos/dist/dev/datafusion/apache-datafusion-sqlparser-rs-0.55.0-rc1 [3]: https://github.com/apache/datafusion-sqlparser-rs/blob/ed416548dcfe4a73a3240bbf625fb9010a4925c8/CHANGELOG.md
Re: [I] Expose to `AccumulatorArgs` whether all the groups are sorted [datafusion]
alamb commented on issue #14991: URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2698391616 ```sql > CREATE TABLE test_table ( col_i32 INT, col_u32 INT UNSIGNED ) as VALUES ( NULL,NULL), ( -2147483648, 0), ( -2147483648, 0), ( 100, 100), ( 2147483647, 4294967295), ( NULL,NULL), ( -2147483648, 0), ( -2147483648, 0), ( 100, 100), ( 2147483646, 4294967294), ( 2147483647, 4294967295 ) ; 0 row(s) fetched. Elapsed 0.010 seconds. > explain select col_i32, sum(col_u32) sum_col_u32 from (select * from test_table order by col_i32 limit 10) group by col_i32; +---+---+ | plan_type | plan | +---+---+ | logical_plan | Projection: test_table.col_i32, sum(test_table.col_u32) AS sum_col_u32| | | Aggregate: groupBy=[[test_table.col_i32]], aggr=[[sum(CAST(test_table.col_u32 AS UInt64))]] | | | Sort: test_table.col_i32 ASC NULLS LAST, fetch=10 | | | TableScan: test_table projection=[col_i32, col_u32] | | physical_plan | ProjectionExec: expr=[col_i32@0 as col_i32, sum(test_table.col_u32)@1 as sum_col_u32] | | | AggregateExec: mode=FinalPartitioned, gby=[col_i32@0 as col_i32], aggr=[sum(test_table.col_u32)], ordering_mode=Sorted | | | SortExec: expr=[col_i32@0 ASC NULLS LAST], preserve_partitioning=[true] | | | CoalesceBatchesExec: target_batch_size=8192 | | | RepartitionExec: partitioning=Hash([col_i32@0], 16), input_partitions=16 | | | RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1 | | | AggregateExec: mode=Partial, gby=[col_i32@0 as col_i32], aggr=[sum(test_table.col_u32)], ordering_mode=Sorted | | | SortExec: TopK(fetch=10), expr=[col_i32@0 ASC NULLS LAST], preserve_partitioning=[false]| | | DataSourceExec: partitions=1, partition_sizes=[1] | | | | +---+---+ 2 row(s) fetched. Elapsed 0.006 seconds. ``` I guess I forgot that we did eventually unify the implementation 🎉 In that case then I think it would make sense to pass existing ordering information along to the aggregates 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Table function supports non-literal args [datafusion]
Lordworms commented on issue #14958: URL: https://github.com/apache/datafusion/issues/14958#issuecomment-2698453390 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Improved error for expand wildcard rule [datafusion]
Jiashu-Hu commented on issue #15004: URL: https://github.com/apache/datafusion/issues/15004#issuecomment-2698791268 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Refactor CometScanRule to avoid duplication and improve fallback messages [datafusion-comet]
andygrove commented on PR #1474: URL: https://github.com/apache/datafusion-comet/pull/1474#issuecomment-2698801229 @mbutrovich, Could you review this since this affects the new experimental native scan support? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]
vbarua commented on code in PR #13511: URL: https://github.com/apache/datafusion/pull/13511#discussion_r1979981659 ## datafusion/expr/src/expr.rs: ## @@ -295,6 +295,8 @@ pub enum Expr { /// See also [`ExprFunctionExt`] to set these fields. /// /// [`ExprFunctionExt`]: crate::expr_fn::ExprFunctionExt +/// +/// cf. `WITHIN GROUP` is converted to `ORDER BY` internally in `datafusion/sql/src/expr/function.rs` Review Comment: minor/opinionated: I'm not sure if it's worth mentioning this at all here. `WITHIN GROUP` is effectively an `ORDER BY` specified differently. This only matters at the SQL layer, and you handle and explain it there already. ## datafusion/sql/src/expr/function.rs: ## @@ -349,15 +365,49 @@ impl SqlToRel<'_, S> { } else { // User defined aggregate functions (UDAF) have precedence in case it has the same name as a scalar built-in function if let Some(fm) = self.context_provider.get_aggregate_meta(&name) { -let order_by = self.order_by_to_sort_expr( -order_by, -schema, -planner_context, -true, -None, -)?; -let order_by = (!order_by.is_empty()).then_some(order_by); -let args = self.function_args_to_expr(args, schema, planner_context)?; +if fm.is_ordered_set_aggregate() && within_group.is_empty() { +return plan_err!("WITHIN GROUP clause is required when calling ordered set aggregate function({})", fm.name()); +} + +if null_treatment.is_some() && !fm.supports_null_handling_clause() { +return plan_err!( +"[IGNORE | RESPECT] NULLS are not permitted for {}", +fm.name() +); +} Review Comment: Per my point about `[IGNORE | RESPECT] NULLS` being a property of _window functions_, I don't think we need this check here. ## datafusion/sql/src/expr/function.rs: ## @@ -177,8 +180,14 @@ impl FunctionArgs { } } -if !within_group.is_empty() { -return not_impl_err!("WITHIN GROUP is not supported yet: {within_group:?}"); +if !within_group.is_empty() && order_by.is_some() { +return plan_err!("ORDER BY clause is only permitted in WITHIN GROUP clause when a WITHIN GROUP is used"); +} + +if within_group.len() > 1 { +return not_impl_err!( +"Multiple column ordering in WITHIN GROUP clause is not supported" Review Comment: Minor wording suggestion ``` Only a single ordering expression is permitted in a WITHIN GROUP clause ``` which explicitly points users to what they should do, instead of telling them what they can't. ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -51,29 +52,39 @@ create_func!(ApproxPercentileCont, approx_percentile_cont_udaf); /// Computes the approximate percentile continuous of a set of numbers pub fn approx_percentile_cont( -expression: Expr, +within_group: Sort, Review Comment: minor/opinionated: I think `order_by` would be a clearer name for this, as the `WITHIN GROUP` is really just a wrapper around the `ORDER BY` clause. ## datafusion/sql/src/expr/function.rs: ## @@ -177,8 +180,14 @@ impl FunctionArgs { } } -if !within_group.is_empty() { -return not_impl_err!("WITHIN GROUP is not supported yet: {within_group:?}"); +if !within_group.is_empty() && order_by.is_some() { +return plan_err!("ORDER BY clause is only permitted in WITHIN GROUP clause when a WITHIN GROUP is used"); Review Comment: minor: I just noticed that in the block above there is a check for duplicate order bys. 
I think it would be good to fold this into that check ```rust FunctionArgumentClause::OrderBy(oby) => { if order_by.is_some() { // can check for within group here return not_impl_err!("Calling {name}: Duplicated ORDER BY clause in function arguments"); } order_by = Some(oby); } ``` to consolidate the handling into one place. ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -131,6 +147,19 @@ impl ApproxPercentileCont { args: AccumulatorArgs, ) -> Result { let percentile = validate_input_percentile_expr(&args.exprs[1])?; + +let is_descending = args +.ordering_req +.first() +.map(|sort_expr| sort_expr.options.descending) +.unwrap_or(false); + +let percentile = if is_descending { +1.0 - percentile +} el
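A standalone illustration of the `1.0 - percentile` adjustment shown in the diff above; the nearest-rank helper below is a hypothetical stand-in, not DataFusion's t-digest based implementation. Asking for percentile `p` when the `WITHIN GROUP` ordering is descending corresponds to asking for percentile `1 - p` under the ascending ordering, which is why the accumulator can keep working on ascending data and simply flip the fraction.

```rust
// Hypothetical nearest-rank percentile over already-sorted values.
fn percentile_nearest_rank(sorted: &[f64], p: f64) -> f64 {
    let idx = ((sorted.len() - 1) as f64 * p).round() as usize;
    sorted[idx]
}

fn main() {
    let ascending = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0];
    let mut descending = ascending;
    descending.reverse();

    let p = 0.75;
    // Percentile p over the descending order...
    let from_desc = percentile_nearest_rank(&descending, p);
    // ...matches percentile 1 - p over the ascending order (both 3.0 here).
    let from_asc = percentile_nearest_rank(&ascending, 1.0 - p);
    assert_eq!(from_desc, from_asc);
}
```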
Re: [PR] Refactor test suite in EnforceDistribution, to use standard test config. [datafusion]
wiedld commented on code in PR #15010: URL: https://github.com/apache/datafusion/pull/15010#discussion_r1980473165 ## datafusion/core/tests/physical_optimizer/enforce_distribution.rs: ## @@ -371,46 +371,91 @@ macro_rules! plans_matches_expected { } } +fn test_suite_default_config_options() -> ConfigOptions { +let mut config = ConfigOptions::new(); + +// By default, will not repartition / resort data if it is already sorted. +config.optimizer.prefer_existing_sort = false; + +// By default, will attempt to convert Union to Interleave. +config.optimizer.prefer_existing_union = false; + +// By default, will not repartition file scans. +config.optimizer.repartition_file_scans = false; +config.optimizer.repartition_file_min_size = 1024; + +// By default, set query execution concurrency to 10. +config.execution.target_partitions = 10; + +// Use a small batch size, to trigger RoundRobin in tests +config.execution.batch_size = 1; + +config +} + +/// How the optimizers are run. +#[derive(PartialEq, Clone)] +enum DoFirst { +/// Runs: (EnforceDistribution, EnforceDistribution, EnforceSorting) +Distribution, +/// Runs: (EnforceSorting, EnforceDistribution, EnforceDistribution) +Sorting, +} + +#[derive(Clone)] +struct TestConfig { +config: ConfigOptions, +optimizers_to_run: DoFirst, +} + +impl TestConfig { +fn new(optimizers_to_run: DoFirst) -> Self { +Self { +config: test_suite_default_config_options(), +optimizers_to_run, +} +} + +/// If preferred, will not repartition / resort data if it is already sorted. +fn with_prefer_existing_sort(mut self) -> Self { +self.config.optimizer.prefer_existing_sort = true; +self +} + +/// If preferred, will not attempt to convert Union to Interleave. +fn with_prefer_existing_union(mut self) -> Self { +self.config.optimizer.prefer_existing_union = true; +self +} + +/// If preferred, will repartition file scans. +/// Accepts a minimum file size to repartition. +fn with_prefer_repartition_file_scans(mut self, file_min_size: usize) -> Self { +self.config.optimizer.repartition_file_scans = true; +self.config.optimizer.repartition_file_min_size = file_min_size; +self +} + +/// Set the preferred target partitions for query execution concurrency. +fn with_query_execution_partitions(mut self, target_partitions: usize) -> Self { +self.config.execution.target_partitions = target_partitions; +self +} +} + /// Runs the repartition optimizer and asserts the plan against the expected /// Arguments /// * `EXPECTED_LINES` - Expected output plan /// * `PLAN` - Input plan -/// * `FIRST_ENFORCE_DIST` - -/// true: (EnforceDistribution, EnforceDistribution, EnforceSorting) -/// false: else runs (EnforceSorting, EnforceDistribution, EnforceDistribution) -/// * `PREFER_EXISTING_SORT` (optional) - if true, will not repartition / resort data if it is already sorted -/// * `TARGET_PARTITIONS` (optional) - number of partitions to repartition to -/// * `REPARTITION_FILE_SCANS` (optional) - if true, will repartition file scans -/// * `REPARTITION_FILE_MIN_SIZE` (optional) - minimum file size to repartition -/// * `PREFER_EXISTING_UNION` (optional) - if true, will not attempt to convert Union to Interleave Review Comment: These are now replaced by the required `DoFirst` (for one-of 2 possible optimizer run patterns), then all the optional settings. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]
vbarua commented on PR #13511: URL: https://github.com/apache/datafusion/pull/13511#issuecomment-2699344130 Pinging @Dandandan for committer review as they filed the ticket this fix is for. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Incorrect URLs in Cargo.toml files [datafusion-ballista]
andygrove opened a new issue, #1193: URL: https://github.com/apache/datafusion-ballista/issues/1193 **Describe the bug** We still use Arrow URLs in some places: ``` homepage = "https://github.com/apache/arrow-ballista"; repository = "https://github.com/apache/arrow-ballista"; ``` **To Reproduce** **Expected behavior** **Additional context** -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Support datatype cast for insert api same as insert into sql [datafusion]
zhuqi-lucas commented on issue #15015: URL: https://github.com/apache/datafusion/issues/15015#issuecomment-2699799674 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Minor: cleanup unused code [datafusion]
qazxcdswe123 opened a new pull request, #15016: URL: https://github.com/apache/datafusion/pull/15016 ## Which issue does this PR close? - Closes nothing ## Rationale for this change Clean up unused dead code ## What changes are included in this PR? Clean up unused dead code ## Are these changes tested? yes ## Are there any user-facing changes? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE [datafusion]
zhuqi-lucas commented on code in PR #14922: URL: https://github.com/apache/datafusion/pull/14922#discussion_r1980680580 ## datafusion/sqllogictest/test_files/insert_to_external.slt: ## @@ -456,13 +456,16 @@ explain insert into table_without_values select c1 from aggregate_test_100 order logical_plan 01)Dml: op=[Insert Into] table=[table_without_values] -02)--Projection: aggregate_test_100.c1 AS c1 +02)--Projection: CAST(aggregate_test_100.c1 AS Utf8View) AS c1 Review Comment: Created a follow-up ticket for it; we should also support casting in the insert API, consistent with INSERT INTO SQL. https://github.com/apache/datafusion/issues/15015 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
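A standalone sketch of the kind of coercion the follow-up ticket asks for, not DataFusion's actual insert path: `coerce_to_target` is a hypothetical helper, and the example assumes an arrow-rs version whose `cast` kernel supports `Utf8` to `Utf8View`. The idea is simply to cast each input column to the target table's type before writing, and reject the insert when no cast exists.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

/// Cast an input column to the target table's data type, if needed.
fn coerce_to_target(input: ArrayRef, target: &DataType) -> Result<ArrayRef, ArrowError> {
    if input.data_type() == target {
        Ok(input)
    } else {
        // `cast` errors if no kernel exists for this pair of types, which is
        // roughly when an insert should be rejected as incompatible.
        cast(&input, target)
    }
}

fn main() -> Result<(), ArrowError> {
    // The table schema stores Utf8View, but the input batch carries Utf8.
    let col: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
    let coerced = coerce_to_target(col, &DataType::Utf8View)?;
    assert_eq!(coerced.data_type(), &DataType::Utf8View);
    Ok(())
}
```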
Re: [PR] Bug: Fix multi-lines printing issue for datafusion-cli [datafusion]
zhuqi-lucas commented on code in PR #14954: URL: https://github.com/apache/datafusion/pull/14954#discussion_r1980745507 ## datafusion-cli/tests/cli_integration.rs: ## @@ -51,6 +51,163 @@ fn init() { ["--command", "show datafusion.execution.batch_size", "--format", "json", "-q", "-b", "1"], "[{\"name\":\"datafusion.execution.batch_size\",\"value\":\"1\"}]\n" )] + Review Comment: Added rich test cases in cli_integration; this helps avoid regressions for datafusion-cli changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]
Garamda commented on code in PR #13511: URL: https://github.com/apache/datafusion/pull/13511#discussion_r1980826096 ## datafusion/sql/src/expr/function.rs: ## @@ -349,15 +365,49 @@ impl SqlToRel<'_, S> { } else { // User defined aggregate functions (UDAF) have precedence in case it has the same name as a scalar built-in function if let Some(fm) = self.context_provider.get_aggregate_meta(&name) { -let order_by = self.order_by_to_sort_expr( -order_by, -schema, -planner_context, -true, -None, -)?; -let order_by = (!order_by.is_empty()).then_some(order_by); -let args = self.function_args_to_expr(args, schema, planner_context)?; +if fm.is_ordered_set_aggregate() && within_group.is_empty() { +return plan_err!("WITHIN GROUP clause is required when calling ordered set aggregate function({})", fm.name()); +} + +if null_treatment.is_some() && !fm.supports_null_handling_clause() { +return plan_err!( +"[IGNORE | RESPECT] NULLS are not permitted for {}", +fm.name() +); +} Review Comment: > This was smelling odd, so I dug a bit deeper. I think you've inadvertantly stumbled into something even weirder than you anticipated > ... > The RESPECT NULLS | IGNORE NULLS options is only a property of certain window functions, hence we shouldn't need to track it for aggregate functions. > > I'm going to file a ticket for the above. > ... > Per my point about [IGNORE | RESPECT] NULLS being a property of window functions, I don't think we need this check here. I understood and agree with your guidance. I will track what is decided in the issue you filed, and will remove some codes out after determination. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]
Garamda commented on code in PR #13511: URL: https://github.com/apache/datafusion/pull/13511#discussion_r1980817866 ## datafusion/sql/src/expr/function.rs: ## @@ -349,15 +365,49 @@ impl SqlToRel<'_, S> { } else { // User defined aggregate functions (UDAF) have precedence in case it has the same name as a scalar built-in function if let Some(fm) = self.context_provider.get_aggregate_meta(&name) { -let order_by = self.order_by_to_sort_expr( -order_by, -schema, -planner_context, -true, -None, -)?; -let order_by = (!order_by.is_empty()).then_some(order_by); -let args = self.function_args_to_expr(args, schema, planner_context)?; +if fm.is_ordered_set_aggregate() && within_group.is_empty() { +return plan_err!("WITHIN GROUP clause is required when calling ordered set aggregate function({})", fm.name()); +} + +if null_treatment.is_some() && !fm.supports_null_handling_clause() { +return plan_err!( +"[IGNORE | RESPECT] NULLS are not permitted for {}", +fm.name() +); +} Review Comment: > One big question we might need to answer before merging is if we need a migration strategy for this. Because we now require WITHIN GROUP for these functions, any users who have queries stored outside of DataFusion will experience breakages that they can't work around. If we want to provide a migration path, we may need to support having both forms of calling these functions, as in > > SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name; > SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name; > > for at least 1 release so folks can migrate their queries. This is one of the biggest concerns when I started to work on this feature. If the community decides the migration strategy like that, then I will make both syntax supported. Also, I will file an issue to track the plan so that the current syntax can be excluded as scheduled. (if I am authorized to do so) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Table function supports non-literal args [datafusion]
Lordworms commented on issue #14958: URL: https://github.com/apache/datafusion/issues/14958#issuecomment-2700016967 I have done some basic research on how Postgres deals with table functions that take column arguments; for example, something like this works in PostgreSQL ``` SELECT t.id, t.start_value, t.end_value, g.num FROM table_a t, LATERAL generate_series(t.start_value, t.end_value) AS g(num); ``` So I think the first step is to implement support for LATERAL. I see PRs like https://github.com/apache/datafusion/pull/14595 that support basic LATERAL JOIN using decorrelated subqueries, but I think we should also support some other LATERAL use cases like ``` > SELECT u.id, u.name, sub.age_plus_10 FROM users u, LATERAL (SELECT u.age + 10 AS age_plus_10) sub; This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn(Int32, Column { relation: Some(Bare { table: "u" }), name: "age" }) ``` After we have support for LATERAL, we can push any table scan down to the table function scan, or pull the table function scan up to serve as input for the TableScan. I'll probably start with supporting basic LATERAL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Table function supports non-literal args [datafusion]
alamb commented on issue #14958: URL: https://github.com/apache/datafusion/issues/14958#issuecomment-2699172543 I think this one is likely pretty tricky -- it might be worth working on a design / writeup at first of how it would work / what the plans would look like -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Add array reading support to native_datafusion scan [datafusion-comet]
andygrove closed pull request #1324: feat: Add array reading support to native_datafusion scan URL: https://github.com/apache/datafusion-comet/pull/1324 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Add dependency checks to verify-release-candidate script [datafusion]
waynexia opened a new pull request, #15009: URL: https://github.com/apache/datafusion/pull/15009 ## Which issue does this PR close? - Closes #. ## Rationale for this change Check that all build dependencies are available before running the script. Background: I use a separate Docker container to verify the release candidate, and I always forget to install some of the build dependencies. Then I need to install them and restart from the very beginning. I listed all the dependencies I'm aware of (or that need to be installed in a fresh Ubuntu image) and check them before the build starts. ## What changes are included in this PR? A new function `check_dependencies` in `dev/release/verify-release-candidate.sh` that checks build dependencies in advance. ## Are these changes tested? Yes, I used this version to verify the 46.0.0 RC2 release. ## Are there any user-facing changes? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Expr simplifier doesn't simplify exprs that are same if you swap lhs with rhs regardless of cycles [datafusion]
alamb commented on issue #14943: URL: https://github.com/apache/datafusion/issues/14943#issuecomment-2698393138 I think it will not be present in DataFusion 46 (we have the RC out for voting now) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Simplify Between expression to Eq [datafusion]
jayzhan211 merged PR #14994: URL: https://github.com/apache/datafusion/pull/14994 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Simplify Between expression to Eq [datafusion]
jayzhan211 commented on PR #14994: URL: https://github.com/apache/datafusion/pull/14994#issuecomment-2699326547 > I don't think it closes https://github.com/apache/datafusion/issues/14943 I agree. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]
vbarua commented on code in PR #13511: URL: https://github.com/apache/datafusion/pull/13511#discussion_r1980465883 ## datafusion/expr/src/udaf.rs: ## @@ -845,6 +861,19 @@ pub trait AggregateUDFImpl: Debug + Send + Sync { ScalarValue::try_from(data_type) } +/// If this function supports `[IGNORE NULLS | RESPECT NULLS]` clause, return true +/// If the function does not, return false +/// Otherwise return None (the default) +fn supports_null_handling_clause(&self) -> Option { +None +} Review Comment: Filed https://github.com/apache/datafusion/issues/15006 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Refactor test suite in EnforceDistribution, to use standard test config. [datafusion]
wiedld opened a new pull request, #15010: URL: https://github.com/apache/datafusion/pull/15010 ## Which issue does this PR close? Part of #15003 ## Rationale for this change Make it easier to determine the differences in test configuration for each test case. Also clearly define the defaults used for the test suite. ## What changes are included in this PR? Define `TestConfig`, which requires one of the `DoFirst` optimizer run configurations. Most test cases use this test config with the default settings. Provide a set of interfaces to define which configurations get tweaked. ## Are these changes tested? Yes ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Refactor test suite in EnforceDistribution, to use standard test config. [datafusion]
wiedld commented on code in PR #15010: URL: https://github.com/apache/datafusion/pull/15010#discussion_r1980468330 ## datafusion/core/tests/physical_optimizer/enforce_distribution.rs: ## @@ -371,46 +371,95 @@ macro_rules! plans_matches_expected { } } +fn test_suite_default_config_options() -> ConfigOptions { +let mut config = ConfigOptions::new(); + +// By default, will not repartition / resort data if it is already sorted. +config.optimizer.prefer_existing_sort = false; + +// By default, will attempt to convert Union to Interleave. +config.optimizer.prefer_existing_union = false; + +// By default, will not repartition file scans. +config.optimizer.repartition_file_scans = false; +config.optimizer.repartition_file_min_size = 1024; + +// By default, set query execution concurrency to 10. +config.execution.target_partitions = 10; Review Comment: These match the defaults on main: https://github.com/apache/datafusion/blob/c61e7e597f80a3abd5015c996dc896a084112ab6/datafusion/core/tests/physical_optimizer/enforce_distribution.rs#L388 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Refactor test suite in EnforceDistribution, to use standard test config. [datafusion]
wiedld commented on code in PR #15010: URL: https://github.com/apache/datafusion/pull/15010#discussion_r1980470202 ## datafusion/core/tests/physical_optimizer/enforce_distribution.rs: ## @@ -371,46 +371,95 @@ macro_rules! plans_matches_expected { } } +fn test_suite_default_config_options() -> ConfigOptions { +let mut config = ConfigOptions::new(); + +// By default, will not repartition / resort data if it is already sorted. +config.optimizer.prefer_existing_sort = false; + +// By default, will attempt to convert Union to Interleave. +config.optimizer.prefer_existing_union = false; + +// By default, will not repartition file scans. +config.optimizer.repartition_file_scans = false; +config.optimizer.repartition_file_min_size = 1024; + +// By default, set query execution concurrency to 10. +config.execution.target_partitions = 10; + +// Use a small batch size, to trigger RoundRobin in tests +config.execution.batch_size = 1; + +config +} + +/// How the optimizers are run. +#[derive(PartialEq, Clone)] +enum DoFirst { +/// Runs: (EnforceDistribution, EnforceDistribution, EnforceSorting) +Distribution, +/// Runs: (EnforceSorting, EnforceDistribution, EnforceDistribution) +Sorting, +} + +#[derive(Clone)] +struct TestConfig { +config: ConfigOptions, +optimizers_to_run: DoFirst, +} + +impl TestConfig { +fn new(optimizers_to_run: DoFirst) -> Self { +Self { +config: test_suite_default_config_options(), +optimizers_to_run, +} +} + +/// If preferred, will not repartition / resort data if it is already sorted. +fn with_prefer_existing_sort(mut self) -> Self { +self.config.optimizer.prefer_existing_sort = true; +self +} + +/// If preferred, will not attempt to convert Union to Interleave. +fn with_prefer_existing_union(mut self) -> Self { +self.config.optimizer.prefer_existing_union = true; +self +} + +/// If preferred, will repartition file scans. +/// Accepts a minimum file size to repartition. +fn with_prefer_repartition_file_scans(mut self, file_min_size: usize) -> Self { +self.config.optimizer.repartition_file_scans = true; +self.config.optimizer.repartition_file_min_size = file_min_size; +self +} + +/// Set the preferred target partitions for query execution concurrency. +fn with_query_execution_partitions(mut self, target_partitions: usize) -> Self { +self.config.execution.target_partitions = target_partitions; +self +} +} + /// Runs the repartition optimizer and asserts the plan against the expected /// Arguments /// * `EXPECTED_LINES` - Expected output plan /// * `PLAN` - Input plan -/// * `FIRST_ENFORCE_DIST` - -/// true: (EnforceDistribution, EnforceDistribution, EnforceSorting) -/// false: else runs (EnforceSorting, EnforceDistribution, EnforceDistribution) -/// * `PREFER_EXISTING_SORT` (optional) - if true, will not repartition / resort data if it is already sorted -/// * `TARGET_PARTITIONS` (optional) - number of partitions to repartition to -/// * `REPARTITION_FILE_SCANS` (optional) - if true, will repartition file scans -/// * `REPARTITION_FILE_MIN_SIZE` (optional) - minimum file size to repartition -/// * `PREFER_EXISTING_UNION` (optional) - if true, will not attempt to convert Union to Interleave +/// * `CONFIG` - [`TestConfig`] macro_rules! 
assert_optimized { -($EXPECTED_LINES: expr, $PLAN: expr, $FIRST_ENFORCE_DIST: expr) => { -assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, false, 10, false, 1024, false); -}; - -($EXPECTED_LINES: expr, $PLAN: expr, $FIRST_ENFORCE_DIST: expr, $PREFER_EXISTING_SORT: expr) => { -assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, $PREFER_EXISTING_SORT, 10, false, 1024, false); -}; - -($EXPECTED_LINES: expr, $PLAN: expr, $FIRST_ENFORCE_DIST: expr, $PREFER_EXISTING_SORT: expr, $PREFER_EXISTING_UNION: expr) => { -assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, $PREFER_EXISTING_SORT, 10, false, 1024, $PREFER_EXISTING_UNION); -}; - -($EXPECTED_LINES: expr, $PLAN: expr, $FIRST_ENFORCE_DIST: expr, $PREFER_EXISTING_SORT: expr, $TARGET_PARTITIONS: expr, $REPARTITION_FILE_SCANS: expr, $REPARTITION_FILE_MIN_SIZE: expr) => { -assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, $PREFER_EXISTING_SORT, $TARGET_PARTITIONS, $REPARTITION_FILE_SCANS, $REPARTITION_FILE_MIN_SIZE, false); +($EXPECTED_LINES: expr, $PLAN: expr) => { +assert_optimized!($EXPECTED_LINES, $PLAN, &TestConfig::new(DoFirst::Distribution)); Review Comment: Note: this option is technically not used, so I could remove it. It's basically a 1-to-1 replacement of the "default options" provided for this macro on main. -- This is an automated message from the Apache Git Service. To respond to
[I] Support datatype cast for insert api same as insert into sql [datafusion]
zhuqi-lucas opened a new issue, #15015: URL: https://github.com/apache/datafusion/issues/15015 ### Is your feature request related to a problem or challenge? Currently, `INSERT INTO` SQL automatically casts the input datatypes to match the table's datatypes; we should support the same datatype cast for the insert API as well, so that the insert behaviour is consistent. ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
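For illustration, a minimal sketch of the inconsistency being described, assuming a table `t` backed by an insert-capable provider (e.g. a `MemTable`) with a single nullable `Int32` column `a`; the helper name `insert_example` is made up for this sketch. The SQL path coerces the value automatically, while, per the issue, the DataFrame path does not, so the input must be cast by hand:

```rust
use std::sync::Arc;

use datafusion::arrow::array::Int64Array;
use datafusion::arrow::compute::cast;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;

async fn insert_example(ctx: &SessionContext) -> datafusion::error::Result<()> {
    // SQL path: the literal is coerced to the table's Int32 column automatically.
    ctx.sql("INSERT INTO t VALUES (1)").await?.collect().await?;

    // API path: an Int64 batch currently has to be cast by hand to match the
    // table schema before it can be written; the request is for this cast to
    // happen automatically, as in the SQL path.
    let input = RecordBatch::try_new(
        Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)])),
        vec![Arc::new(Int64Array::from(vec![1_i64]))],
    )?;
    let casted = cast(input.column(0), &DataType::Int32)?;
    let casted_batch = RecordBatch::try_new(
        Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, true)])),
        vec![casted],
    )?;
    ctx.read_batch(casted_batch)?
        .write_table("t", DataFrameWriteOptions::new())
        .await?;
    Ok(())
}
```

With the requested change, the insert API would perform the same coercion the SQL path already does, and the manual `cast` step above would no longer be necessary.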
Re: [I] Project Ideas for GSoC 2025 (Google Summer of Code) [datafusion]
oznur-synnada commented on issue #14478: URL: https://github.com/apache/datafusion/issues/14478#issuecomment-2700032123 Hi @mkarbo and @waynexia - we have been approved as a mentoring organization for GSoC 2025. I'm going to invite you to the GSoC portal as mentors so could you share your email addresses with me? You can reach me at oznur.ha...@synnada.ai -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE [datafusion]
2010YOUY01 commented on PR #14922: URL: https://github.com/apache/datafusion/pull/14922#issuecomment-2700038867 > Thank you @2010YOUY01 for review, addressed comments in latsest PR. LGTM, thank you. I don't know this code well so let's wait for others to approve it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]
Garamda commented on code in PR #13511: URL: https://github.com/apache/datafusion/pull/13511#discussion_r1980817133 ## datafusion/sql/src/expr/function.rs: ## @@ -349,15 +365,49 @@ impl SqlToRel<'_, S> { } else { // User defined aggregate functions (UDAF) have precedence in case it has the same name as a scalar built-in function if let Some(fm) = self.context_provider.get_aggregate_meta(&name) { -let order_by = self.order_by_to_sort_expr( -order_by, -schema, -planner_context, -true, -None, -)?; -let order_by = (!order_by.is_empty()).then_some(order_by); -let args = self.function_args_to_expr(args, schema, planner_context)?; +if fm.is_ordered_set_aggregate() && within_group.is_empty() { +return plan_err!("WITHIN GROUP clause is required when calling ordered set aggregate function({})", fm.name()); +} + +if null_treatment.is_some() && !fm.supports_null_handling_clause() { +return plan_err!( +"[IGNORE | RESPECT] NULLS are not permitted for {}", +fm.name() +); +} Review Comment: > Thanks for bearing with me as a reviewed this, it was a good chance for me to look at new parts of DataFusion. I appreciate your elaborate review again. 👍 This PR has become much simpler, clearer, and better now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support WITHIN GROUP syntax to standardize certain existing aggregate functions [datafusion]
Garamda commented on code in PR #13511: URL: https://github.com/apache/datafusion/pull/13511#discussion_r1980829367 ## datafusion/sql/src/expr/function.rs: ## @@ -349,15 +365,49 @@ impl SqlToRel<'_, S> { } else { // User defined aggregate functions (UDAF) have precedence in case it has the same name as a scalar built-in function if let Some(fm) = self.context_provider.get_aggregate_meta(&name) { -let order_by = self.order_by_to_sort_expr( -order_by, -schema, -planner_context, -true, -None, -)?; -let order_by = (!order_by.is_empty()).then_some(order_by); -let args = self.function_args_to_expr(args, schema, planner_context)?; +if fm.is_ordered_set_aggregate() && within_group.is_empty() { +return plan_err!("WITHIN GROUP clause is required when calling ordered set aggregate function({})", fm.name()); +} + +if null_treatment.is_some() && !fm.supports_null_handling_clause() { +return plan_err!( +"[IGNORE | RESPECT] NULLS are not permitted for {}", +fm.name() +); +} Review Comment: cf) I have applied all reviews that you tagged 'minor', since I was also convinced. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: instrument spawned tasks with current tracing span when `tracing` feature is enabled [datafusion]
geoffreyclaude commented on PR #14547: URL: https://github.com/apache/datafusion/pull/14547#issuecomment-2697021679 @alamb: I've gone ahead and refactored to allow "injecting" the tracing behavior at runtime. As predicted, the code is a bit scary looking, especially due to the Box/Unbox dance needed to get static Sized closures. But it works (see `datafusion-examples/examples/tracing.rs`) without needing the `tracing` crate (optionally or not) inside `datafusion-common-runtime`! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
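For readers unfamiliar with the pattern, here is a rough sketch of the runtime-injection approach described above; it is not the code from #14547, and the names (`BoxFut`, `SpawnHook`, `set_spawn_hook`, `spawn_instrumented`) are made up for illustration. A boxed closure stored in a `OnceLock` lets an application wrap every spawned future, e.g. with the current tracing span, without the runtime crate itself depending on `tracing`:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::OnceLock;

// Boxed future and hook types; the Box/pin dance is what turns an arbitrary
// closure into a plain `Sized`, `'static` value that can live in a static.
type BoxFut = Pin<Box<dyn Future<Output = ()> + Send + 'static>>;
type SpawnHook = Box<dyn Fn(BoxFut) -> BoxFut + Send + Sync + 'static>;

static SPAWN_HOOK: OnceLock<SpawnHook> = OnceLock::new();

/// Install a hook that can wrap every spawned future (no-op if already set).
pub fn set_spawn_hook(hook: SpawnHook) {
    let _ = SPAWN_HOOK.set(hook);
}

/// Spawn a future, routing it through the installed hook if there is one.
pub fn spawn_instrumented<F>(fut: F) -> tokio::task::JoinHandle<()>
where
    F: Future<Output = ()> + Send + 'static,
{
    let boxed: BoxFut = Box::pin(fut);
    let wrapped = match SPAWN_HOOK.get() {
        Some(hook) => hook(boxed),
        None => boxed,
    };
    tokio::spawn(wrapped)
}

// An application that uses `tracing` could then install, for example:
//
//     set_spawn_hook(Box::new(|fut| {
//         Box::pin(tracing::Instrument::instrument(fut, tracing::Span::current()))
//     }));
```

The extra allocation and dynamic dispatch per spawn are the price of keeping the hook type-erased, which is presumably what makes the real implementation look a bit scary.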
Re: [PR] Fix doc logo [datafusion]
khushishukla2813 commented on code in PR #14989: URL: https://github.com/apache/datafusion/pull/14989#discussion_r1979134788 ## datafusion/macros/src/macros.rs: ## @@ -0,0 +1,10 @@ +#[macro_export] Review Comment: > Maybe add a `proc-macros` crate, move the proc macros there and re-export them from the current macros crate? Well... I removed `proc-macro = true` so `macro_rules!` works from a normal library crate. Let me know if it needs any other changes! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
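The underlying constraint, for context: a crate compiled with `proc-macro = true` may only export procedural macros, so a declarative `macro_rules!` macro has to live in (or be re-exported from) a normal library crate. A minimal, hypothetical example (the macro name, body, and crate name are made up, not the ones in the PR):

```rust
// In a regular library crate (no `proc-macro = true` under [lib] in Cargo.toml),
// a declarative macro can be exported for downstream crates with #[macro_export].
#[macro_export]
macro_rules! example_doc {
    ($name:expr, $desc:expr) => {
        format!("{}: {}", $name, $desc)
    };
}

// Downstream usage (assuming the crate is named `my_macros`):
// let doc = my_macros::example_doc!("abs", "absolute value");
```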
[I] Bug: calling "with_new_exprs" on join after optimization unexpectedly fails [datafusion]
niebayes opened a new issue, #14999: URL: https://github.com/apache/datafusion/issues/14999 ### Describe the bug Before optimization, specifically the `ExtractEquijoinPredicate` rule, calling `with_new_exprs` succeeds. However, after optimized by `ExtractEquijoinPredicate`, calling `with_new_exprs` unexpectedly fails with the following error: ``` "The front part expressions should be an binary equality expression, actual:t1.a" ``` ### To Reproduce ``` rust #[tokio::test] async fn test_join_with_new_exprs() -> Result<()> { // Creates a default session context. let session_context = SessionContext::new(); // Registers two tables. session_context.register_table( "t1", Arc::new(EmptyTable::new( Schema::new(vec![Field::new("a", DataType::Int32, true)]).into(), )), )?; session_context.register_table( "t2", Arc::new(EmptyTable::new( Schema::new(vec![Field::new("a", DataType::Int32, true)]).into(), )), )?; // Creates a plan consisting of a join operator and extracts the join operator. let join = session_context .sql("select * from t1 join t2 on t1.a = t2.a") .await? .logical_plan() .inputs() .remove(0) .clone(); assert!(matches!(join, LogicalPlan::Join(_))); let LogicalPlan::Join(before) = &join else { unreachable!() }; // Should be: "[]". println!("on before opt: {:?}", before.on); // Should be: "Some(BinaryExpr(BinaryExpr { left: Column(Column { relation: Some(Bare { table: "t1" }), name: "a" }), op: Eq, // right: Column(Column { relation: Some(Bare { table: "t2" }), name: "a" }) }))". println!("filter before opt: {:?}", before.filter); // Invokes `with_new_exprs` before optimization should succeed. let res = join.with_new_exprs( join.expressions(), join.inputs().into_iter().map(|x| x.clone()).collect(), ); assert!(res.is_ok()); print!("\n\n"); // Optimizes join. let join = session_context.state().optimize(&join)?; let LogicalPlan::Join(after) = &join else { unreachable!() }; // Should be: "[(Column(Column { relation: Some(Bare { table: "t1" }), name: "a" }), Column(Column { relation: Some(Bare { table: "t2" }), name: "a" }))]". println!("on after opt: {:?}", after.on); // Should be: "None". println!("filter after opt: {:?}", after.filter); // Invokes `with_new_exprs` after optimization unexpectedly fails. let res = join.with_new_exprs( join.expressions(), join.inputs().into_iter().map(|x| x.clone()).collect(), ); assert!(res.is_err()); assert_contains!( res.unwrap_err().to_string(), "The front part expressions should be an binary equality expression, actual:t1.a" ); Ok(()) } ``` ### Expected behavior _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Bug: Fix multi-lines printing issue for datafusion-cli [datafusion]
zhuqi-lucas commented on PR #14954: URL: https://github.com/apache/datafusion/pull/14954#issuecomment-2696829174 I am continuing to polish the code, in addition to https://github.com/apache/datafusion/issues/14886, which will add the streaming state struct. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Split out avro, parquet, json and csv into individual crates [datafusion]
AdamGS commented on code in PR #14951: URL: https://github.com/apache/datafusion/pull/14951#discussion_r1979201859 ## datafusion/core/src/datasource/file_format/avro.rs: ## @@ -15,163 +15,31 @@ // specific language governing permissions and limitations // under the License. -//! [`AvroFormat`] Apache Avro [`FileFormat`] abstractions - -use std::any::Any; -use std::collections::HashMap; -use std::fmt; -use std::sync::Arc; - -use super::file_compression_type::FileCompressionType; -use super::FileFormat; -use super::FileFormatFactory; -use crate::datasource::avro_to_arrow::read_avro_schema_from_reader; -use crate::datasource::physical_plan::AvroSource; -use crate::error::Result; -use crate::physical_plan::ExecutionPlan; -use crate::physical_plan::Statistics; - -use arrow::datatypes::Schema; -use arrow::datatypes::SchemaRef; -use async_trait::async_trait; -use datafusion_catalog::Session; -use datafusion_common::internal_err; -use datafusion_common::parsers::CompressionTypeVariant; -use datafusion_common::GetExt; -use datafusion_common::DEFAULT_AVRO_EXTENSION; -use datafusion_datasource::file::FileSource; -use datafusion_datasource::file_scan_config::FileScanConfig; -use datafusion_physical_expr::PhysicalExpr; -use object_store::{GetResultPayload, ObjectMeta, ObjectStore}; - -#[derive(Default)] -/// Factory struct used to create [AvroFormat] -pub struct AvroFormatFactory; - -impl AvroFormatFactory { -/// Creates an instance of [AvroFormatFactory] -pub fn new() -> Self { -Self {} -} -} - -impl FileFormatFactory for AvroFormatFactory { -fn create( -&self, -_state: &dyn Session, -_format_options: &HashMap, -) -> Result> { -Ok(Arc::new(AvroFormat)) -} +//! Re-exports the [`datafusion_datasource_avro::file_format`] module, and contains tests for it. -fn default(&self) -> Arc { -Arc::new(AvroFormat) -} - -fn as_any(&self) -> &dyn Any { -self -} -} - -impl fmt::Debug for AvroFormatFactory { -fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { -f.debug_struct("AvroFormatFactory").finish() -} -} - -impl GetExt for AvroFormatFactory { -fn get_ext(&self) -> String { -// Removes the dot, i.e. ".parquet" -> "parquet" -DEFAULT_AVRO_EXTENSION[1..].to_string() -} -} - -/// Avro `FileFormat` implementation. -#[derive(Default, Debug)] -pub struct AvroFormat; - -#[async_trait] -impl FileFormat for AvroFormat { -fn as_any(&self) -> &dyn Any { -self -} - -fn get_ext(&self) -> String { -AvroFormatFactory::new().get_ext() -} - -fn get_ext_with_compression( -&self, -file_compression_type: &FileCompressionType, -) -> Result { -let ext = self.get_ext(); -match file_compression_type.get_variant() { -CompressionTypeVariant::UNCOMPRESSED => Ok(ext), -_ => internal_err!("Avro FileFormat does not support compression."), -} -} - -async fn infer_schema( -&self, -_state: &dyn Session, -store: &Arc, -objects: &[ObjectMeta], -) -> Result { -let mut schemas = vec![]; -for object in objects { -let r = store.as_ref().get(&object.location).await?; -let schema = match r.payload { -GetResultPayload::File(mut file, _) => { -read_avro_schema_from_reader(&mut file)? -} -GetResultPayload::Stream(_) => { -// TODO: Fetching entire file to get schema is potentially wasteful -let data = r.bytes().await?; -read_avro_schema_from_reader(&mut data.as_ref())? 
-} -}; -schemas.push(schema); -} -let merged_schema = Schema::try_merge(schemas)?; -Ok(Arc::new(merged_schema)) -} - -async fn infer_stats( -&self, -_state: &dyn Session, -_store: &Arc, -table_schema: SchemaRef, -_object: &ObjectMeta, -) -> Result { -Ok(Statistics::new_unknown(&table_schema)) -} - -async fn create_physical_plan( -&self, -_state: &dyn Session, -conf: FileScanConfig, -_filters: Option<&Arc>, -) -> Result> { -Ok(conf.with_source(self.file_source()).build()) -} - -fn file_source(&self) -> Arc { -Arc::new(AvroSource::new()) -} -} +pub use datafusion_datasource_avro::file_format::*; #[cfg(test)] -#[cfg(feature = "avro")] mod tests { Review Comment: Oh I didn't realize that was a thing. Glad to move everything there (in this PR or a different one). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please c
Re: [I] Allow sorting to improve `FixedSizeBinary` filtering [datafusion]
samuelcolvin closed issue #11170: Allow sorting to improve `FixedSizeBinary` filtering URL: https://github.com/apache/datafusion/issues/11170 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Allow sorting to improve `FixedSizeBinary` filtering [datafusion]
samuelcolvin commented on issue #11170: URL: https://github.com/apache/datafusion/issues/11170#issuecomment-2696717653 Closed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Split out avro, parquet, json and csv into individual crates [datafusion]
alamb commented on PR #14951: URL: https://github.com/apache/datafusion/pull/14951#issuecomment-2697208596 Thanks again @AdamGS and @logan-keede -- this is amazing progress. Something I have thought was important for the last t -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Split out avro, parquet, json and csv into individual crates [datafusion]
alamb merged PR #14951: URL: https://github.com/apache/datafusion/pull/14951 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Split out avro, parquet, json and csv into individual crates [datafusion]
alamb commented on PR #14951: URL: https://github.com/apache/datafusion/pull/14951#issuecomment-2697209311 I am merging this one in so it doesn't accumulate conflicts -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org