Re: [PR] Remove inline table scan analyzer rule [datafusion]
alamb commented on PR #15201: URL: https://github.com/apache/datafusion/pull/15201#issuecomment-2740798896 > I will port the tests on our next iteration, but would appreciate a recommendation for a good suite to add them to as I rewrite them - they are currently only validating the `InlineTableScan` analyser, so would need to be ported to a suite that has access to `DataFrameTableProvider` and can do logical planning. Thanks! I think unit tests on InlineTableScan is likely not as helpful as something like building multiple nested levels of `Views` with DataFrame (or SQL or however you are creating your plans) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
alamb commented on PR #15201: URL: https://github.com/apache/datafusion/pull/15201#issuecomment-2738300722 @aditanase perhaps you can make a PR with whatever tests were breaking in your fork so we can make sure they work here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
aditanase commented on PR #15201: URL: https://github.com/apache/datafusion/pull/15201#issuecomment-2739305725 The fork was not compiling, we were on 46.0.0 but the latest changes made my rebase more difficult than usual, we'll figure it out. I will port the tests on our next iteration, but would appreciate a recommendation for a good suite to add them to as I rewrite them - they are currently only validating the `InlineTableScan` analyser, so would need to be ported to a suite that has access to `DataFrameTableProvider` and can do logical planning. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
jayzhan211 commented on PR #15201: URL: https://github.com/apache/datafusion/pull/15201#issuecomment-2736677639 @aditanase It works too ``` statement count 0 create table t(a int) as values (1), (2), (3); statement count 0 create view v as select a, count(a) from t group by a; query II rowsort select * from v; 1 1 2 1 3 1 statement count 0 create view v2 as select "count(t.a)", a from v; query II select a, "count(t.a)" from v2; 1 1 3 1 2 1 statement ok create view v3 as select a, "count(t.a)" from v2; query II select * from v2; 1 1 1 2 1 3 query II select * from v3; 3 1 2 1 1 1 query TT explain select * from v3; logical_plan 01)SubqueryAlias: v3 02)--Projection: v2.a, v2.count(t.a) 03)SubqueryAlias: v2 04)--Projection: v.count(t.a), v.a 05)SubqueryAlias: v 06)--Aggregate: groupBy=[[t.a]], aggr=[[count(t.a)]] 07)TableScan: t projection=[a] physical_plan 01)AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[count(t.a)] 02)--CoalesceBatchesExec: target_batch_size=8192 03)RepartitionExec: partitioning=Hash([a@0], 4), input_partitions=4 04)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 05)AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[count(t.a)] 06)--DataSourceExec: partitions=1, partition_sizes=[1] query II rowsort select "count(t.a)", a from v; 1 1 1 2 1 3 statement count 0 drop view v; statement count 0 drop view v2; statement ok drop view v3; statement count 0 drop table t; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
aditanase commented on PR #15201: URL: https://github.com/apache/datafusion/pull/15201#issuecomment-273664 @jayzhan211 can you try with a 3rd view? 2 levels were also working on the original inlining code. Here you only need to inline v inside v2 and you get the final plan. if you had v3 -> v2 -> v -> t, then if you just inline v2, you will get a logical plan that will have v as a leaf node instead of t. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
jayzhan211 commented on PR #15201: URL: https://github.com/apache/datafusion/pull/15201#issuecomment-2736240081 ``` statement count 0 create table t(a int) as values (1), (2), (3); statement count 0 create view v as select a, count(a) from t group by a; query II rowsort select * from v; 1 1 2 1 3 1 statement count 0 create view v2 as select "count(t.a)", a from v; query II select * from v2; 1 3 1 1 1 2 query TT explain select * from v2; logical_plan 01)SubqueryAlias: v2 02)--Projection: v.count(t.a), v.a 03)SubqueryAlias: v 04)--Aggregate: groupBy=[[t.a]], aggr=[[count(t.a)]] 05)TableScan: t projection=[a] physical_plan 01)ProjectionExec: expr=[count(t.a)@1 as count(t.a), a@0 as a] 02)--AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[count(t.a)] 03)CoalesceBatchesExec: target_batch_size=8192 04)--RepartitionExec: partitioning=Hash([a@0], 4), input_partitions=4 05)RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 06)--AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[count(t.a)] 07)DataSourceExec: partitions=1, partition_sizes=[1] query II rowsort select "count(t.a)", a from v; 1 1 1 2 1 3 statement count 0 drop view v; statement count 0 drop view v2; statement count 0 drop table t; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
aditanase commented on PR #15201: URL: https://github.com/apache/datafusion/pull/15201#issuecomment-2736011205 Hi @jayzhan211 @alamb, I'm a bit concerned with this patch, I was just in the process of submitting an enhancement to `InlineTableScan` that was dealing with multiple levels of views registered with `into_view`. https://github.com/hstack/arrow-datafusion/commit/0f79c370373cae04c541c91862cb9bf0ab2c737f The proposal on the original implementation was that we should be transforming down instead of up, to allow the child plans to be expanded all the way down. The downside with not doing inlining all the way to the original scan is that the various optimizer stages don't have full visibility and will run in separate stages. For example from the 2nd level of nesting, logical planning will only be triggered in the physical planning of the 2nd `DataFrameTableProvider`. What are your thoughts around this? How are you dealing with a query plan that is built incrementally from 2-3 levels of using `into_view`? (resulting in chained `DataFrameTableProvider` sources along the way). I'm not seeing any tests handling this use case yet, and my first attempt of cherry-picking your commit breaks in other places for our fork. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
jayzhan211 merged PR #15201: URL: https://github.com/apache/datafusion/pull/15201 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
jayzhan211 commented on PR #15201: URL: https://github.com/apache/datafusion/pull/15201#issuecomment-2734924833 Thanks @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
jayzhan211 commented on code in PR #15201:
URL: https://github.com/apache/datafusion/pull/15201#discussion_r2000904176
##
datafusion/core/tests/dataframe/mod.rs:
##
@@ -1571,14 +1571,18 @@ async fn with_column_join_same_columns() -> Result<()> {
assert_snapshot!(
df_with_column.logical_plan(),
-@r###"
+@r"
Projection: t1.c1, t2.c1, Boolean(true) AS new_column
Limit: skip=0, fetch=1
Sort: t1.c1 ASC NULLS FIRST
Inner Join: t1.c1 = t2.c1
-TableScan: t1
-TableScan: t2
-"###
+SubqueryAlias: t1
Review Comment:
I explain it in the above
https://github.com/apache/datafusion/pull/15201#discussion_r1996487546
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
alamb commented on code in PR #15201:
URL: https://github.com/apache/datafusion/pull/15201#discussion_r1995496219
##
datafusion/sqllogictest/test_files/ddl.slt:
##
@@ -855,3 +855,29 @@ DROP TABLE t1;
statement ok
DROP TABLE t2;
+
+statement count 0
+create table t(a int) as values (1), (2), (3);
+
+statement count 0
+create view v as select a, count(a) from t group by a;
+
+query II rowsort
+select * from v;
+
+1 1
+2 1
+3 1
+
+query II rowsort
+select "count(t.a)", a from v;
+
+1 1
+1 2
+1 3
+
+query error DataFusion error: Execution error: Table 'v' doesn't exist\.
Review Comment:
I think it should be `drop view v` rather than `drop table`
##
datafusion/expr/src/logical_plan/builder.rs:
##
@@ -465,9 +465,21 @@ impl LogicalPlanBuilder {
projection: Option>,
filters: Vec,
) -> Result {
-TableScan::try_new(table_name, table_source, projection, filters, None)
-.map(LogicalPlan::TableScan)
-.map(Self::new)
+let table_scan =
+TableScan::try_new(table_name, table_source, projection, filters,
None)?;
+
+// Inline TableScan
+if table_scan.filters.is_empty() {
+if let Some(p) = table_scan.source.get_logical_plan() {
+let sub_plan = p.into_owned();
+// Ensures that the reference to the inlined table remains the
+// same, meaning we don't have to change any of the parent
nodes
+// that reference this table.
+return Self::new(sub_plan).alias(table_scan.table_name);
+}
+}
+
+Ok(Self::new(LogicalPlan::TableScan(table_scan)))
Review Comment:
seems like the same logic should apply to scan_with_filters_fetch below
##
datafusion/core/tests/dataframe/mod.rs:
##
@@ -1563,8 +1563,12 @@ async fn with_column_join_same_columns() -> Result<()> {
\n Limit: skip=0, fetch=1\
\nSort: t1.c1 ASC NULLS FIRST\
\n Inner Join: t1.c1 = t2.c1\
-\nTableScan: t1\
-\nTableScan: t2",
+\nSubqueryAlias: t1\
Review Comment:
I am still worried about this change -- I don't undersrtand why the subquery
allias has not been removed or if that is a problem 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
jayzhan211 commented on code in PR #15201:
URL: https://github.com/apache/datafusion/pull/15201#discussion_r1996487546
##
datafusion/core/tests/dataframe/mod.rs:
##
@@ -1563,8 +1563,12 @@ async fn with_column_join_same_columns() -> Result<()> {
\n Limit: skip=0, fetch=1\
\nSort: t1.c1 ASC NULLS FIRST\
\n Inner Join: t1.c1 = t2.c1\
-\nTableScan: t1\
-\nTableScan: t2",
+\nSubqueryAlias: t1\
Review Comment:
You can check the `into_optimized_plan` below, we have the consistent result.
It includes analyzer and optimizer. That is why the main branch doesn't have
subquery before optimization, and we have it after `inline_tablescan` is called.
In this change, since inline table scan is called inside plan building, so
we got this subquery before optimization
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
alamb commented on code in PR #15201:
URL: https://github.com/apache/datafusion/pull/15201#discussion_r1996808102
##
datafusion/core/tests/dataframe/mod.rs:
##
@@ -1571,14 +1571,18 @@ async fn with_column_join_same_columns() -> Result<()> {
assert_snapshot!(
df_with_column.logical_plan(),
-@r###"
+@r"
Projection: t1.c1, t2.c1, Boolean(true) AS new_column
Limit: skip=0, fetch=1
Sort: t1.c1 ASC NULLS FIRST
Inner Join: t1.c1 = t2.c1
-TableScan: t1
-TableScan: t2
-"###
+SubqueryAlias: t1
Review Comment:
I am concerned about this change -- it seems like a regression to me.
Maybe it is ok, but I don't understand the difference/ why there are a bunch
of projections / SubqueryAlias nows
If these are fixed by physical plan time it is fine, but from here it looks
like something is wrong
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
jayzhan211 commented on code in PR #15201:
URL: https://github.com/apache/datafusion/pull/15201#discussion_r1996487546
##
datafusion/core/tests/dataframe/mod.rs:
##
@@ -1563,8 +1563,12 @@ async fn with_column_join_same_columns() -> Result<()> {
\n Limit: skip=0, fetch=1\
\nSort: t1.c1 ASC NULLS FIRST\
\n Inner Join: t1.c1 = t2.c1\
-\nTableScan: t1\
-\nTableScan: t2",
+\nSubqueryAlias: t1\
Review Comment:
You can check the `into_optimized_plan` below, we have the consistent result.
The process includes both an analyzer and an optimizer. This is why the main
branch does not contain a subquery before optimization, but one appears after
inline_tablescan is called.
In this change, since inline_tablescan is invoked during plan building, the
subquery is introduced before the optimization step
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
jayzhan211 commented on code in PR #15201:
URL: https://github.com/apache/datafusion/pull/15201#discussion_r1996463077
##
datafusion/core/tests/dataframe/mod.rs:
##
@@ -1563,8 +1563,12 @@ async fn with_column_join_same_columns() -> Result<()> {
\n Limit: skip=0, fetch=1\
\nSort: t1.c1 ASC NULLS FIRST\
\n Inner Join: t1.c1 = t2.c1\
-\nTableScan: t1\
-\nTableScan: t2",
+\nSubqueryAlias: t1\
Review Comment:
I thought the subquery alias is what shows we correctly inline tablescan,
since adding alias is the only thing we have done.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
alamb commented on PR #15201: URL: https://github.com/apache/datafusion/pull/15201#issuecomment-2725551994 I ran planning performance benchmarks and I would say they showed no discernible difference with this branch Details ``` group inline-table-mv main - --- logical_aggregate_with_join 1.00 1527.0±20.63µs? ?/sec 1.00 1522.8±16.03µs? ?/sec logical_select_all_from_1000 1.00141.6±0.75ms? ?/sec 1.00142.3±0.19ms? ?/sec logical_select_one_from_700 1.00 1201.5±27.95µs? ?/sec 1.01 1210.6±24.70µs? ?/sec logical_trivial_join_high_numbered_columns1.00 1167.5±28.21µs? ?/sec 1.02 1188.7±29.84µs? ?/sec logical_trivial_join_low_numbered_columns 1.00 1149.0±15.67µs? ?/sec 1.01 1165.3±15.72µs? ?/sec physical_intersection 1.05 1846.6±111.33µs? ?/sec1.00 1752.4±24.26µs? ?/sec physical_join_consider_sort 1.02 2.4±0.02ms? ?/sec 1.00 2.3±0.01ms? ?/sec physical_join_distinct1.03 1142.4±13.14µs? ?/sec 1.00 .5±12.97µs? ?/sec physical_many_self_joins 1.01 11.8±0.06ms? ?/sec 1.00 11.7±0.08ms? ?/sec physical_plan_clickbench_all 1.02 255.1±10.58ms? ?/sec 1.00251.0±1.41ms? ?/sec physical_plan_clickbench_q1 1.02 3.9±0.07ms? ?/sec 1.00 3.8±0.04ms? ?/sec physical_plan_clickbench_q10 1.02 4.8±0.06ms? ?/sec 1.00 4.7±0.07ms? ?/sec physical_plan_clickbench_q11 1.02 4.9±0.10ms? ?/sec 1.00 4.9±0.04ms? ?/sec physical_plan_clickbench_q12 1.03 5.1±0.13ms? ?/sec 1.00 5.0±0.05ms? ?/sec physical_plan_clickbench_q13 1.05 4.8±0.10ms? ?/sec 1.00 4.6±0.05ms? ?/sec physical_plan_clickbench_q14 1.05 5.1±0.13ms? ?/sec 1.00 4.9±0.09ms? ?/sec physical_plan_clickbench_q15 1.05 5.0±0.09ms? ?/sec 1.00 4.7±0.05ms? ?/sec physical_plan_clickbench_q16 1.01 4.6±0.08ms? ?/sec 1.00 4.6±0.06ms? ?/sec physical_plan_clickbench_q17 1.02 4.8±0.11ms? ?/sec 1.00 4.7±0.04ms? ?/sec physical_plan_clickbench_q18 1.01 4.2±0.07ms? ?/sec 1.00 4.2±0.05ms? ?/sec physical_plan_clickbench_q19 1.01 5.3±0.10ms? ?/sec 1.00 5.2±0.06ms? ?/sec physical_plan_clickbench_q2 1.02 4.1±0.10ms? ?/sec 1.00 4.1±0.07ms? ?/sec physical_plan_clickbench_q20 1.01 3.8±0.07ms? ?/sec 1.00 3.8±0.07ms? ?/sec physical_plan_clickbench_q21 1.05 4.3±0.20ms? ?/sec 1.00 4.1±0.04ms? ?/sec physical_plan_clickbench_q22 1.01 5.0±0.09ms? ?/sec 1.00 4.9±0.07ms? ?/sec physical_plan_clickbench_q23 1.02 5.4±0.10ms? ?/sec 1.00 5.3±0.03ms? ?/sec physical_plan_clickbench_q24 1.01 7.5±0.10ms? ?/sec 1.00 7.5±0.05ms? ?/sec physical_plan_clickbench_q25 1.01 4.3±0.08ms? ?/sec 1.00 4.3±0.05ms? ?/sec physical_plan_clickbench_q26 1.00 3.9±0.10ms? ?/sec 1.01 4.0±0.06ms? ?/sec physical_plan_clickbench_q27 1.00 4.3±0.08ms? ?/sec 1.01 4.3±0.03ms? ?/sec physical_plan_clickbench_q28 1.00 5.1±0.07ms? ?/sec 1.02 5.2±0.09ms? ?/sec physical_plan_clickbench_q29 1.00 6.1±0.14ms? ?/sec 1.01 6.2±0.05ms? ?/sec physical_plan_clickbench_q3 1.04 4.1±0.07ms? ?/sec 1.00 4.0±0.04ms? ?/sec physical_plan_clickbench_q30 1.00 18.3±0.25ms? ?/sec 1.01 18.4±0.10ms? ?/sec physical_plan_clickbench_q31 1.00 5.1±0.09ms? ?/sec 1.05 5.3±0.31ms? ?/sec physical_plan_clickbench_q32 1.00 5.1±0.09ms? ?/sec 1.03 5.2±0.09ms? ?/sec physical_plan_cli
Re: [PR] Remove inline table scan analyzer rule [datafusion]
jayzhan211 commented on code in PR #15201:
URL: https://github.com/apache/datafusion/pull/15201#discussion_r1994540603
##
datafusion/sql/src/select.rs:
##
@@ -550,7 +550,29 @@ impl SqlToRel<'_, S> {
0 => Ok(LogicalPlanBuilder::empty(true).build()?),
1 => {
let input = from.remove(0);
-self.plan_table_with_joins(input, planner_context)
+let table = self.plan_table_with_joins(input,
planner_context)?;
Review Comment:
both dataframe and sql works well now
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Remove inline table scan analyzer rule [datafusion]
alamb commented on code in PR #15201:
URL: https://github.com/apache/datafusion/pull/15201#discussion_r1993384623
##
datafusion/sql/src/select.rs:
##
@@ -550,7 +550,29 @@ impl SqlToRel<'_, S> {
0 => Ok(LogicalPlanBuilder::empty(true).build()?),
1 => {
let input = from.remove(0);
-self.plan_table_with_joins(input, planner_context)
+let table = self.plan_table_with_joins(input,
planner_context)?;
Review Comment:
This will have the effect of only inlining table scans for SQL
Perhaps we could move the code into LogicalPlanBuilder so it also applies to
dataframes?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
