Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-04-29 Thread via GitHub


EmeraldShift commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2840412850

   > They are currently exploring the possibility of using it alongside projections (a feature in ClickHouse akin to materialized views) to create secondary indexes and similar functionalities. Relevant PR: https://github.com/ClickHouse/ClickHouse/pull/78429#issuecomment-2777130157.
   
   There's a follow-up PR that makes the query in that comment work 
consistently: https://github.com/ClickHouse/ClickHouse/pull/79471
   
   Also, I had asked for something that sounds equivalent to the dynamic filter in TopK in ClickHouse, in 
https://github.com/ClickHouse/ClickHouse/issues/75098 and 
https://github.com/ClickHouse/ClickHouse/issues/75774#issuecomment-2656469210. 
So it seems ClickHouse does not perform that optimization yet; for now it has lazy materialization, 
and soon a manual version of it via query rewriting and "projection indexes".





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-04-23 Thread via GitHub


acking-you commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2824964842

   > Relevant: https://clickhouse.com/blog/clickhouse-gets-lazier-and-faster-introducing-lazy-materialization
   
   Thank you so much for sharing this blog link—it’s truly an excellent 
learning resource! I hadn’t come across it before; I was just looking at the 
source code of this related optimization 
[PR](https://github.com/ClickHouse/ClickHouse/pull/55518) and running 
performance tests on the hits dataset. They introduced a new type called 
[ColumnLazy](https://github.com/wudidapaopao/ClickHouse/blob/0dc1c3fdba7c1a1fab932e87d6da06c17d49427e/src/Columns/ColumnLazy.h#L14-L31)
 to implement lazy materialization of columns. However, in my testing, I’ve 
noticed that performance starts to degrade when the `limit` value becomes 
large. For instance, degradation begins at `limit 10` when the page cache 
is cleared. In fully cached scenarios, performance degradation occurs as early 
as `limit 256`. This is why they’ve set 
[query_plan_max_limit_for_lazy_materialization=10](https://github.com/wudidapaopao/ClickHouse/blob/0dc1c3fdba7c1a1fab932e87d6da06c17d49427e/src/Core/SettingsChangesHistory.cpp#L72) as the default value to determine whether to apply the optimization.
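   
   To experiment with where the crossover happens, the threshold can presumably be overridden per query via ClickHouse's `SETTINGS` clause (a sketch only; the setting name is taken from the file linked above, and I have not verified that every build accepts it this way):
   ```sql
   -- Hypothetical sketch: force lazy materialization for a larger LIMIT to observe the
   -- degradation described above, instead of relying on the default threshold of 10.
   SELECT * FROM hits ORDER BY "EventTime" LIMIT 1000
   SETTINGS query_plan_max_limit_for_lazy_materialization = 1000;
   ```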
   
   However, I noticed that using the following SQL rewrite approach hardly 
leads to any performance degradation. I even tested it with `limit 40` 
while clearing the page cache, and it still maintained excellent performance. 
Below are the performance test results for a limit of 100,000:
   ```sql
   -- Q1:
   SELECT * from hits ORDER BY "EventTime" LIMIT 10;
   10 rows in set. Elapsed: 70.314 sec. Processed 103.89 million rows, 70.28 GB (1.48 million rows/s., 999.50 MB/s.)
   Peak memory usage: 27.66 GiB.
   -- Q2:
   SELECT * FROM hits WHERE (_part, _part_offset) IN
       (SELECT _part, _part_offset from hits ORDER BY "EventTime" LIMIT 10);
   10 rows in set. Elapsed: 5.639 sec. Processed 82.02 million rows, 3.39 GB (14.55 million rows/s., 601.81 MB/s.)
   Peak memory usage: 715.57 MiB.
   ```
   The direct SQL rewrite approach mentioned above might still have some issues 
in ClickHouse at the moment. They are currently exploring the possibility of 
using it alongside projections (a feature in ClickHouse akin to materialized 
views) to create secondary indexes and similar functionalities. Relevant PR: 
https://github.com/ClickHouse/ClickHouse/pull/78429#issuecomment-2777130157.
   
   I find this truly remarkable, and I’m also investigating why the direct SQL 
rewrite performs faster than ColumnLazy.





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-04-23 Thread via GitHub


acking-you commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2824790924

   > I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec vs 12sec)
   > 
   > > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
   > 
   > Elapsed 5.320 seconds.
   > 
   > Here is what I think the rewrite is
   > 
   > > SELECT * from 'hits_partitioned' WHERE "WatchID" IN (
   > >   SELECT "WatchID" FROM 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10
   > > );
   > 
   > Elapsed 12.023 seconds.
   > 
   > WatchID is a unique key
   > 
   > > select count(distinct "WatchID"), count(*) from 'hits_partitioned';
   > 
   > +------------------------------------------+----------+
   > | count(DISTINCT hits_partitioned.WatchID) | count(*) |
   > +------------------------------------------+----------+
   > | 7493                                     | 7497     |
   > +------------------------------------------+----------+
   > 
   > I also double checked the output
   > 
   > ## orig
   > datafusion-cli -c "SELECT * FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER BY \"EventTime\" LIMIT 10;" > orig.out
   > 
   > ## rewrite
   > datafusion-cli -c "SELECT * from 'hits_partitioned' WHERE \"WatchID\" IN (SELECT \"WatchID\" FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER BY \"EventTime\" LIMIT 10);" > rewrite.out
   > 
   > ## check
   > sort orig.out > orig.out.sort
   > sort rewrite.out > rewrite.out.sort
   > diff orig.out.sort rewrite.out.sort
   > 
   > 7c7
   > < Elapsed 5.649 seconds.
   > ---
   > > Elapsed 11.067 seconds.
   
   I recently took a detailed look at this optimization in ClickHouse, and it might offer you some insights, @alamb.
   
   ## rewrite SQL in ClickHouse
   First, in ClickHouse, each row of data can be located using the two virtual 
columns `_part` and `_part_offset` (this applies only to MergeTree tables).  
   Specifically, for the optimization you mentioned, you can compare the 
performance of the following two queries in ClickHouse:
   ```sql
   -- Q1:
   SELECT * from hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
   10 rows in set. Elapsed: 34.907 sec. Processed 18.63 million rows, 11.77 GB (533.61 thousand rows/s., 337.25 MB/s.)
   Peak memory usage: 1.31 GiB.
   -- Q2:
   SELECT * FROM hits WHERE (_part, _part_offset) IN
       (SELECT _part, _part_offset from hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10);
   10 rows in set. Elapsed: 0.334 sec. Processed 18.68 million rows, 3.13 GB (55.88 million rows/s., 9.37 GB/s.)
   Peak memory usage: 236.42 MiB.
   ```
   I measured that Q1 took 34 seconds, while Q2 only took 7.2 seconds (both 
cleared the page cache before running).  
   This performance improvement may be attributed to ClickHouse correctly 
utilizing the result of the subquery to filter parts and quickly pinpoint the 
relevant data.
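   
   One way to confirm that the subquery result really prunes parts is `EXPLAIN` with index information enabled (a sketch; `indexes = 1` is the documented ClickHouse syntax, but I have not captured its output here):
   ```sql
   -- Shows which parts/granules remain after the (_part, _part_offset) filter is applied.
   EXPLAIN indexes = 1
   SELECT * FROM hits WHERE (_part, _part_offset) IN
       (SELECT _part, _part_offset FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10);
   ```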
   
   However, in earlier versions of ClickHouse (such as 23.12), the 
aforementioned query Q2 would actually degrade in performance. But at that 
time, if I split Q2 into two separate statements and executed them manually, it 
would still work perfectly fine.
   ```sql
   -- Initial (clear page cache)
   -- step 1
   SELECT _part, _part_offset FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
   10 rows in set. Elapsed: 6.254 sec. Processed 18.63 million rows, 3.10 GB (2.98 million rows/s., 495.76 MB/s.)
   Peak memory usage: 190.79 MiB.
   -- step 2
   SELECT * FROM hits WHERE
       (_part = 'all_1_210_3' AND _part_offset IN (20223140, 20223142, 20223144, 1972, 15188338, 13322137, 19741966, 3076201))
       OR
       (_part = 'all_211_216_1' AND _part_offset IN (692957, 692958));
   10 rows in set. Elapsed: 0.731 sec. Processed 65.04 thousand rows, 36.83 MB (89.02 thousand rows/s., 50.41 MB/s.)
   Peak memory usage: 46.23 MiB.
   
   -- The total time is 6.98 seconds.
   ```
   
   ## Conclusion
   - Accessing rows by row_id efficiently might require some optimization techniques, such as merging row_id conditions so that whole RowGroups can be skipped quickly (see the sketch below).
   - After implementing this optimization, the performance improvement on ClickBench may not be significant. I noticed that when the columns are fully loaded into the page cache, the gain from lazy materialization is not very noticeable, and ClickBench only clears the page cache before executing the first SQL query.
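   
   As a purely illustrative sketch of what "merging row_id conditions" could look like (reusing the offsets from step 2 above): adjacent offsets are collapsed into ranges that can be pruned with row-group/granule min-max statistics, while the rest stay in an `IN` list.
   ```sql
   -- Illustrative only: the contiguous offsets (692957, 692958) become a range the reader
   -- can prune with min/max statistics; the remaining offsets stay as an IN list.
   SELECT * FROM hits WHERE
       (_part = 'all_1_210_3' AND _part_offset IN
           (1972, 3076201, 13322137, 15188338, 19741966, 20223140, 20223142, 20223144))
       OR
       (_part = 'all_211_216_1' AND _part_offset BETWEEN 692957 AND 692958);
   ```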
   



Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-04-05 Thread via GitHub


Dandandan commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740754487

   Ah actually, the query given by @xudong963 is slightly off, I think it 
should be the following (without the explicit join):
   
   ```
   > EXPLAIN (WITH ids AS (SELECT row_id, a FROM t ORDER BY a LIMIT 10)
   SELECT t.* FROM t WHERE t.row_id IN (SELECT row_id FROM ids));

   +---------------+-------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                      |
   +---------------+-------------------------------------------------------------------------------------------+
   | logical_plan  | LeftSemi Join: t.row_id = __correlated_sq_1.row_id                                       |
   |               |   TableScan: t projection=[a, b, row_id]                                                 |
   |               |   SubqueryAlias: __correlated_sq_1                                                       |
   |               |     SubqueryAlias: ids                                                                   |
   |               |       Projection: t.row_id                                                               |
   |               |         Sort: t.a ASC NULLS LAST, fetch=10                                               |
   |               |           Projection: t.row_id, t.a                                                      |
   |               |             TableScan: t projection=[a, row_id]                                          |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                                              |
   |               |   HashJoinExec: mode=Partitioned, join_type=LeftSemi, on=[(row_id@2, row_id@0)]          |
   |               |     DataSourceExec: partitions=1, partition_sizes=[0]                                    |
   |               |     ProjectionExec: expr=[row_id@0 as row_id]                                            |
   |               |       SortExec: TopK(fetch=10), expr=[a@1 ASC NULLS LAST], preserve_partitioning=[false] |
   |               |         DataSourceExec: partitions=1, partition_sizes=[0]                                |
   |               |                                                                                           |
   +---------------+-------------------------------------------------------------------------------------------+
   2 row(s) fetched. 
   Elapsed 0.004 seconds.
   ```





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-04-04 Thread via GitHub


adriangb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740806655

   I think the difference is that DuckDB _dynamically_ pushes down the current 
state of the TopK heap into file opening as described in #15037 and implemented 
in #15301





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-22 Thread via GitHub


alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740983293

   BTW combined with @adriangb's PR here
   - https://github.com/apache/datafusion/pull/15301
   
   It will likely go crazy fast 🚀 





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub


Dandandan commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2741469672

   I traced this down to an issue in the planner, which uses `PartitionMode::Auto` only when statistics are collected (`datafusion.execution.collect_statistics`).
   We can, however, still use file- and plan-derived statistics, so let's change that:
   https://github.com/apache/datafusion/pull/15339
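   
   A minimal way to see the effect in datafusion-cli (a sketch; it assumes the option can be toggled via `SET`, which is how other configuration options are exposed there):
   ```sql
   -- With statistics collection on, the planner can see the build side is tiny (10 rows)
   -- and pick CollectLeft instead of repartitioning both hash-join inputs.
   SET datafusion.execution.collect_statistics = true;
   EXPLAIN SELECT * FROM 'hits_partitioned' WHERE "WatchID" IN (
       SELECT "WatchID" FROM 'hits_partitioned' WHERE "URL" LIKE '%google%'
       ORDER BY "EventTime" LIMIT 10
   );
   ```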





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub


alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740980237

   > Thanks for checking [@alamb](https://github.com/alamb)!
   > 
   > I think a large portion is spent in the hash join (repartitioning the right side input) - I think because it runs as a `Partitioned` hash join, instead of realizing it could use `CollectLeft` (see repartition_time=55.648133838s) for the small side (10 rows).
   
   I also think the `"URL"`, `"WatchID"` and `"EventDate"` columns will be decoded twice. To make that join go really fast we would need something like:
   -  https://github.com/apache/datafusion/issues/7955





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub


alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740900315

   I am not really sure where the time is going 🤔 
   output of explain analyze:  
[explain.txt](https://github.com/user-attachments/files/19370532/explain.txt)
   
   
   





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub


alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740888007

   I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec 
vs 12sec)
   
   ```sql
   > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
   Elapsed 5.320 seconds.
   ```
   
   Here is what I think the rewrite is
   ```sql
   > SELECT * from 'hits_partitioned' WHERE "WatchID" IN (
       SELECT "WatchID" FROM 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10
   );
   
   Elapsed 12.023 seconds.
   ```
   
   WatchID is a unique key
   
   ```sql
   > select count(distinct "WatchID"), count(*) from 'hits_partitioned';
   +------------------------------------------+----------+
   | count(DISTINCT hits_partitioned.WatchID) | count(*) |
   +------------------------------------------+----------+
   | 7493                                     | 7497     |
   +------------------------------------------+----------+
   ```
   
   I also double checked the output
   
   ```shell
   ## orig
   datafusion-cli -c "SELECT * FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER BY \"EventTime\" LIMIT 10;" > orig.out
   
   ## rewrite
   datafusion-cli -c "SELECT * from 'hits_partitioned' WHERE \"WatchID\" IN (SELECT \"WatchID\" FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER BY \"EventTime\" LIMIT 10);" > rewrite.out
   
   ## check
   sort orig.out > orig.out.sort
   sort rewrite.out > rewrite.out.sort
   diff orig.out.sort rewrite.out.sort
   
   7c7
   < Elapsed 5.649 seconds.
   ---
   > Elapsed 11.067 seconds.
   
   ```
   
   
   





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub


Dandandan commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740936826

   Thanks for checking @alamb!
   
   I think a large portion is spent in the hash join (repartitioning the right input) - I think because it runs as a `Partitioned` hash join, instead of realizing it could use `CollectLeft` (see repartition_time=55.648133838s) for the small side.





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub


alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740855773

   > I did not fully get this part. DF has semi join support and some rewrites to utilize it in similar cases?
   
   > The query transformation in SQL as given by @xudong963 is optimized to a SEMI join + TopK, so I think it could be implemented as a logical optimization rule (i.e. adding a filter with a subquery on the ids).
   
   @Dandandan  -- 
   
   
   I think it would be interesting to try rewriting Q23 manually into that pattern and see how fast it goes.
   
   I suspect (but have not measured) that if we implemented this rewrite, it would run much more slowly than the existing code, because the entire input file (all columns) would be decoded and all but 10 rows thrown away.
   
   To avoid this we need to push the join filters into the scan (and get predicate pushdown on by default).
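   
   For reference, a sketch of trying the existing Parquet filter pushdown manually in datafusion-cli (this assumes the `datafusion.execution.parquet.pushdown_filters` option, which is not on by default):
   ```sql
   -- Evaluates the predicate inside the Parquet reader, so the non-filter columns are
   -- only decoded for rows that pass the filter (a form of late materialization).
   SET datafusion.execution.parquet.pushdown_filters = true;
   SELECT * FROM 'hits_partitioned' WHERE "URL" LIKE '%google%'
   ORDER BY "EventTime" LIMIT 10;
   ```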





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub


Dandandan commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740739642

   > Note that late materialization (the join / semi join rewrite) needs join operator support that DataFusion doesn't yet have (we could add it, but it will take non-trivial effort)
   
   I did not fully get this part. DF has semi join support and some rewrites to utilize it in similar cases?
   The query transformation in SQL as given by @xudong963 is optimized to a SEMI join + TopK, so I think it could be implemented as a logical optimization rule (i.e. adding a filter with a subquery on the ids).
   
   ```
   > CREATE TABLE t (a int, b int, row_id int);
   0 row(s) fetched. 
   Elapsed 0.004 seconds.

   > EXPLAIN (WITH ids AS (SELECT row_id, a FROM t ORDER BY a LIMIT 10)
   SELECT t.* FROM t JOIN ids WHERE t.row_id IN (SELECT row_id FROM ids));

   +---------------+---------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                        |
   +---------------+---------------------------------------------------------------------------------------------+
   | logical_plan  | LeftSemi Join: t.row_id = __correlated_sq_1.row_id                                         |
   |               |   Cross Join:                                                                               |
   |               |     TableScan: t projection=[a, b, row_id]                                                  |
   |               |     SubqueryAlias: ids                                                                      |
   |               |       Projection:                                                                           |
   |               |         Sort: t.a ASC NULLS LAST, fetch=10                                                  |
   |               |           TableScan: t projection=[a]                                                       |
   |               |   SubqueryAlias: __correlated_sq_1                                                          |
   |               |     SubqueryAlias: ids                                                                      |
   |               |       Projection: t.row_id                                                                  |
   |               |         Sort: t.a ASC NULLS LAST, fetch=10                                                  |
   |               |           Projection: t.row_id, t.a                                                         |
   |               |             TableScan: t projection=[a, row_id]                                             |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                 |
   |               |   HashJoinExec: mode=Partitioned, join_type=LeftSemi, on=[(row_id@2, row_id@0)]             |
   |               |     CrossJoinExec                                                                           |
   |               |       DataSourceExec: partitions=1, partition_sizes=[0]                                     |
   |               |       ProjectionExec: expr=[]                                                               |
   |               |         SortExec: TopK(fetch=10), expr=[a@0 ASC NULLS LAST], preserve_partitioning=[false]  |
   |               |           DataSourceExec: partitions=1, partition_sizes=[0]                                 |
   |               |     ProjectionExec: expr=[row_id@0 as row_id]                                               |
   |               |       SortExec: TopK(fetch=10), expr=[a@1 ASC NULLS LAST], preserve_partitioning=[false]    |
   |               |         DataSourceExec: partitions=1, partition_sizes=[0]                                   |
   |               |                                                                                             |
   +---------------+---------------------------------------------------------------------------------------------+
   2 row(s) fetched. 
   Elapsed 0.005 seconds.
   ```
   





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub


alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2718103422

   > We can split the idea to the query:
   
   I agree -- this is what I meant by "late materialization". Your example / explanation is much better than mine, @xudong963 🙏 





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub


xudong963 commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2718074072

   There is a similar thought named `prewhere`: 
https://clickhouse.com/docs/sql-reference/statements/select/prewhere.
   
   Even though it is aimed at filtering, the idea is similar. For example:
   
   Table `t` has 100 columns, one of which is `a`. For the SQL `select * from t where a = 1`, it will do the following steps:
   1. First, read only the data in column `a`.
   2. Apply the `a = 1` filter to find the matching rows.
   3. Read the remaining 99 columns only for those matching rows.
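   
   A minimal sketch of the same thing written explicitly (ClickHouse also moves qualifying predicates into `PREWHERE` automatically when `optimize_move_to_prewhere` is enabled):
   ```sql
   -- Only column a is read for every row; the other 99 columns are read just for the
   -- rows where a = 1.
   SELECT * FROM t PREWHERE a = 1;
   ```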
   
   
   Back to TopK: for `select * from t order by a limit 10`:
   1. First, read only the data in column `a`.
   2. Perform a sort to find the row_ids of the top 10 rows.
   3. Use the row identifiers determined in step 2 to selectively read only the other columns of those 10 rows.
   
   We can split the idea into the query:
   ```sql
   WITH ids AS (SELECT row_id, a FROM t ORDER BY a LIMIT 10)
   SELECT t.* FROM t JOIN ids WHERE t.row_id IN (SELECT row_id FROM ids)
   ```
   





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub


alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717635254

   Note that late materialization (the join / semi join rewrite) needs join operator support that DataFusion doesn't yet have (we could add it, but it will take non-trivial effort).
   
   My suggested order of implementation is:
   1. https://github.com/apache/datafusion/issues/3463 with @XiangpengHao  (so 
that we can actually evaluate the topk filter during scan)
   2. Then implement topk filtering 
https://github.com/apache/datafusion/issues/15037
   
   I actually think that will likely make this quite fast. I am not sure how much additional improvement late-materialized joins would bring without a specialized file format.
   
   I don't have time to help plan out late-materializing joins at the moment, but I am quite interested in pushing the predicate pushdown work along.





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub


robert3005 commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717583565

   There are two optimizations here that go together; if you check the ClickBench results, DuckDB on its own format is significantly faster than on Parquet. The two optimizer rules that do this are 1) TopN 
https://github.com/duckdb/duckdb/blob/main/src/optimizer/topn_optimizer.cpp#L105
 and 2) late materialization 
https://github.com/duckdb/duckdb/blob/main/src/optimizer/late_materialization.cpp#L180
 (joining the filter result back to obtain the rest of the columns).





Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub


alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717477809

   BTW apparently DuckDB uses the "late materialization" technique with its own 
native format. Here is an explain courtesy of Joe Issacs and Robert Kruszewski
   
   ```
   ┌─┐
   │┌───┐│
   ││   Physical Plan   ││
   │└───┘│
   └─┘
   ┌───┐
   │ PROJECTION│
   │   │
   │ #0│
   │__internal_decompress_integ│
   │ral_smallint(#1, 0)│
   │ #2│
   │__internal_decompress_integ│
   │ral_smallint(#3, 1)│
   │ #4│
   │ #5│
   │ #6│
   │ #7│
   │ #8│
   │ #9│
   │__internal_decompress_integ│
   │ral_smallint(#10, 0)   │
   │__internal_decompress_integ│
   │ral_smallint(#11, 0)   │
   │__internal_decompress_integ│
   │ral_smallint(#12, 0)   │
   │#13│
   │#14│
   │...│
   │#90│
   │#91│
   │#92│
   │#93│
   │#94│
   │#95│
   │#96│
   │#97│
   │#98│
   │#99│
   │#100   │
   │__internal_decompress_integ│
   │   ral_smallint(#101, 0)   │
   │#102   │
   │#103   │
   │#104   │
   │   │
   │   ~7497 Rows  │
   └─┬─┘
   ┌─┴─┐
   │  ORDER_BY │
   │   │
   │memory.main.hits_tb│
   │   .EventTime ASC  │
   └─┬─┘
   ┌─┴─┐
   │ PROJECTION│
   │   │
   │ #0│
   │__internal_compress_integra│
   │ l_utinyint(#1, 0) │
   │ #2│
   │__internal_compress_integra│
   │ l_utinyint(#3, 1) │
   │ #4│
   │ #5│
   │ #6│
   │ #7│
   │ #8│
   │ #9│
   │__internal_compress_integra│
   │ l_utinyint(#10, 0)│
   │__internal_compress_integra│
   │ l_utinyint(#11, 0)│
   │__internal_compress_integra│
   │ l_utinyint(#12, 0)│
   │#13│
   │#14│
   │...│
   │#90│
   │#91│
   │#92│
   │#93│
   │#94│
   │#95│
   │#96│
   │#97│
   │#98│
   │#99│
   │#100   │
   │__internal_compress_integra│
   │l_utinyint(#101, 0)│
   │#102   │
   │#103   │
   │#104   │
   │   │
   │   ~7497 Rows  │
   └─┬─┘
   ┌─┴─┐
   │ PROJECTION│
   │   │
   │  WatchID  │
   │ JavaEnable│
   │   Title   │
   │ GoodEvent │
   │ EventTime │
   │ EventDate │
   │ CounterID │
   │  ClientIP │
   │  RegionID │
   │   UserID  │
   │CounterClass   │
   │ OS│
   │ UserAgent │
   │URL│
   │  Referer  │
   │...│
   │  ParamCurrencyID  │
   │OpenstatServiceName│
   │ OpenstatCampaignID│
   │OpenstatAdID   │
   │  OpenstatSourceID │
   │ UTMSource │
   │ UTMMedium │
   │UTMCampaign│
   │ UTMContent│
   │  UTMTerm  │
   │  FromTag  │
   │  HasGCLID │
   │RefererHash│
   │  URLHash  │
   │CLID   │
   │   │
   │   ~7497 Rows  │
   └─┬─┘
   ┌─┴─┐
   │ HASH_JOIN │
   │   │
   │  Join Type: SEMI  │
   │   │
   │Conditions:├──┐
   │   rowid = rowid   │  │
   │
   ```

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-12 Thread via GitHub


alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2717443917

   Here is the DuckDB plan, and it shows what they are doing!
   
   The key is this line:
   ```
   │  Filters: │
   │ optional: Dynamic Filter  │
   │(EventTime)│
   ```
   
   What I think this is referring to is what @adriangb is describing in:
   - https://github.com/apache/datafusion/issues/15037
   
   Specifically, the TOP_N operator passes a filter down into the scan. The filter is "dynamic" in the sense that:
   1. the TOP_N operator knows the largest value currently in its heap (the 10th-smallest EventTime seen so far)
   2. the scan therefore only needs to return rows whose timestamp is less than that value
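   
   Conceptually, the scan ends up evaluating something like the sketch below, except that the threshold is not a constant: it is the current heap boundary and keeps tightening as the scan progresses (the literal value here is purely illustrative, in whatever type EventTime is stored as):
   ```sql
   SELECT * FROM 'hits_partitioned'
   WHERE "URL" LIKE '%google%'
     AND "EventTime" < 1372714800  -- stands in for the current TopK heap threshold
   ORDER BY "EventTime"
   LIMIT 10;
   ```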
   
   
   
   ```
   ┌─┐
   │┌───┐│
   ││   Physical Plan   ││
   │└───┘│
   └─┘
   ┌───┐
   │   TOP_N   │
   │   │
   │  Top: 10  │
   │   │
   │ Order By: │
   │ memory.main.hits.EventTime│
   │ ASC   │
   └─┬─┘
   ┌─┴─┐
   │   FILTER  │
   │   │
   │  contains(URL, 'google')  │
   │   │
   │   ~2000 Rows  │
   └─┬─┘
   ┌─┴─┐
   │   PARQUET_SCAN│
   │   │
   │ Function: │
   │PARQUET_SCAN   │
   │   │
   │Projections:   │
   │  WatchID  │
   │ JavaEnable│
   │   Title   │
   │ GoodEvent │
   │ EventTime │
   │ EventDate │
   │ CounterID │
   │  ClientIP │
   │  RegionID │
   │   UserID  │
   │CounterClass   │
   │ OS│
   │ UserAgent │
   │URL│
   │...│
   │  ParamCurrencyID  │
   │OpenstatServiceName│
   │ OpenstatCampaignID│
   │OpenstatAdID   │
   │  OpenstatSourceID │
   │ UTMSource │
   │ UTMMedium │
   │UTMCampaign│
   │ UTMContent│
   │  UTMTerm  │
   │  FromTag  │
   │  HasGCLID │
   │RefererHash│
   │  URLHash  │
   │CLID   │
   │   │
   │  Filters: │
   │ optional: Dynamic Filter  │
   │(EventTime)│
   │   │
   │  ~1 Rows  │
   └───┘
   ```

