[ https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Grove closed ARROW-7480.
-----------------------------
    Resolution: Fixed

Fixed by https://github.com/apache/arrow/pull/6625

> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7480
>                 URL: https://issues.apache.org/jira/browse/ARROW-7480
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>            Reporter: Kyle McCarthy
>            Assignee: Andy Grove
>            Priority: Major
>             Fix For: 1.0.0
>
>
> There are two scenarios that cause problems but are related to the queries
> with aggregate expressions and the SQL planner. The aggregate_test_100
> dataset is used for both of the queries.
> At a high level, the issue is basically that queries containing aggregate
> expressions may generate the wrong schema.
>
> *Scenario 1*
> Columns are grouped by but not selected.
> Query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
> Error:
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of fields in schema")){noformat}
> While the error is an ArrowError, it actually looks like it is because the
> wrong schema is generated. In the src/sql/planner.rs file the impl for
> SqlToRel is defined. In the sql_to_rel method, it checks if the query
> contains aggregate expressions, and if it does it generates the schema from
> the columns included in group expressions and aggregate expressions.
> This in turn generates the following schema:
> {code:java}
> Schema {
>     fields: [
>         Field {
>             name: "c1",
>             data_type: Utf8,
>             nullable: false,
>         },
>         Field {
>             name: "c13",
>             data_type: Utf8,
>             nullable: false,
>         },
>         Field {
>             name: "MIN",
>             data_type: Float64,
>             nullable: true,
>         },
>     ],
>     metadata: {},
> }{code}
> I am not super familiar with how DataFusion works under the hood, but I would
> assume that this schema is actually correct for the Aggregate logical plan,
> and that the data isn't being projected correctly on top of it, which results
> in the wrong query result schema?
>
> *Scenario 2*
> Columns are selected, but not grouped or part of an aggregate function. This
> query actually will run, but the wrong schema is produced.
> Query:
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
>     fields: [
>         Field {
>             name: "c0",
>             data_type: Utf8,
>             nullable: true,
>         },
>         Field {
>             name: "c1",
>             data_type: Float64,
>             nullable: true,
>         },
>         Field {
>             name: "c1",
>             data_type: Float64,
>             nullable: true,
>         },
>     ],
>     metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MIN,
> Float64).
>
> ----
> Scenario 2 is questionable since some DBMS will run the query (e.g. MySQL) but
> others (Postgres) require every selected column to either appear in the
> GROUP BY clause or be used in an aggregate function.
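
For anyone reading this in the archive, here is a minimal sketch of the reporter's hypothesis: the aggregate node's schema is built from the group-by columns plus the aggregate expressions, and a separate projection step then narrows it to the SELECT list. Everything below (the `Field` struct, `aggregate_schema`, `project`) is a hypothetical, std-only model written for illustration. It is not DataFusion's API, it omits nullability, and it does not claim to reproduce what the fix in the linked pull request does.

```rust
// Hypothetical, simplified model of the two scenarios in ARROW-7480.
// These types and functions are illustrative assumptions, not DataFusion code.

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Float64,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Schema of the aggregate node: group-by columns first, then aggregate outputs.
fn aggregate_schema(group_by: &[Field], aggregates: &[Field]) -> Vec<Field> {
    group_by.iter().chain(aggregates.iter()).cloned().collect()
}

/// Projection on top of the aggregate: keep only the selected names, in SELECT
/// order. A name that the aggregate does not produce is rejected, which is the
/// stricter (Postgres-like) reading of Scenario 2.
fn project(schema: &[Field], select: &[&str]) -> Result<Vec<Field>, String> {
    select
        .iter()
        .map(|name| {
            schema
                .iter()
                .find(|f| f.name == *name)
                .cloned()
                .ok_or_else(|| {
                    format!(
                        "column \"{}\" must appear in GROUP BY or be used in an aggregate function",
                        name
                    )
                })
        })
        .collect()
}

fn main() {
    // Scenario 1: SELECT c1, MIN(c12) ... GROUP BY c1, c13
    let agg = aggregate_schema(
        &[field("c1", DataType::Utf8), field("c13", DataType::Utf8)],
        &[field("MIN", DataType::Float64)],
    );
    // The aggregate produces three columns, but the query result needs only two.
    let result = project(&agg, &["c1", "MIN"]).expect("both columns exist");
    println!("scenario 1 result schema: {:?}", result);

    // Scenario 2: SELECT c1, c13, MIN(c12) ... GROUP BY c1
    let agg = aggregate_schema(
        &[field("c1", DataType::Utf8)],
        &[field("MIN", DataType::Float64)],
    );
    // c13 is neither grouped nor aggregated, so this model reports an error.
    println!("scenario 2: {:?}", project(&agg, &["c1", "c13", "MIN"]));
}
```

Rejecting the ungrouped column in Scenario 2 is a design choice made for this sketch. A MySQL-like planner could instead accept the query, as noted in the issue.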