my-ship-it commented on issue #1234:
URL: https://github.com/apache/cloudberry/issues/1234#issuecomment-3116302842

   > > > Yes!
   > > > I want to believe that one day we could reduce the number of rows in SubqueryScan, and use forward and backward filter passes :)
   > > > That's what inspires me - [Debunking the Myth of Join Ordering: Toward 
Robust SQL Analytics](https://arxiv.org/pdf/2502.15181)
   > > > Here is the DuckDB implementation: [duckdb/duckdb#17326](https://github.com/duckdb/duckdb/pull/17326)
   > > > What I cannot understand right now is how to use a bloom filter in an MPP environment. Is it enough to create local bloom filters?
   > > 
   > > 
   > > Hi [@leborchuk](https://github.com/leborchuk), thanks for the suggestions and for pointing to other reference implementations in the industry, which are helpful.
   > > In Cloudberry, for queries with Motion, a bloom filter can be applied after the Motion to reduce the amount of data entering the Hash Join and improve performance, for example:
   > > ```
   > > -----------------------------------------------------------------------------------------------------
   > >  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1582.50 rows=4203 width=8)
   > >    ->  Hash Join  (cost=0.00..1582.37 rows=1401 width=8)
   > >          Hash Cond: ((r.v + 1) = l.v)
   > >          ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..1148.63 rows=14933 width=4)
   > >                Hash Key: (r.v + 1)
   > >                  Rows Removed by Pushdown Runtime Filter: 127
   > >                ->  Seq Scan on tbl2 r  (cost=0.00..1148.44 rows=14933 width=4)
   > >          ->  Hash  (cost=431.01..431.01 rows=334 width=4)
   > >                ->  Seq Scan on tbl1 l  (cost=0.00..431.01 rows=334 width=4)
   > >  Optimizer: GPORCA
   > > ```
   > > 
   > > However, the volume of data moved by the Motion remains relatively large. A better approach is to filter before the data is sent, for example:
   > > ```
   > > -----------------------------------------------------------------------------------------------------
   > >  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1582.50 rows=4203 width=8)
   > >    ->  Hash Join  (cost=0.00..1582.37 rows=1401 width=8)
   > >          Hash Cond: ((r.v + 1) = l.v)
   > >          ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..1148.63 rows=14933 width=4)
   > >                Hash Key: (r.v + 1)
   > >                ->  Seq Scan on tbl2 r  (cost=0.00..1148.44 rows=14933 width=4)
   > >                          Rows Removed by Pushdown Runtime Filter: 127
   > >          ->  Hash  (cost=431.01..431.01 rows=334 width=4)
   > >                ->  Seq Scan on tbl1 l  (cost=0.00..431.01 rows=334 width=4)
   > >  Optimizer: GPORCA
   > > ```
   > > 
   > > However, Motion in Cloudberry only supports unidirectional transmission of data from the sender to the receiver; it cannot send data from the receiver back to the sender. Could we modify Motion to support sending the bloom filter from the receiver to the sender?
   > 
   > For this issue, I believe there is a more elegant and broadly applicable solution, one that is suitable not only for sharing bloom filter information across nodes but also for globally sharing all runtime state information.
   > 
   > There is a paper we can learn from: [Anser: Adaptive Information Sharing Framework of AnalyticDB](https://dl.acm.org/doi/10.14778/3611540.3611553).
   
   Thanks, Lirong, for the valuable insights. The paper deserves further study on our side; it offers inspiration on how to share information in distributed systems.
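
The question raised earlier in the thread, whether local bloom filters are enough, can be illustrated with a toy Python sketch. This is not Cloudberry code: the `BloomFilter` class, `segment_of`, the modulo distribution, and the sample values are all assumptions made for illustration. It builds one filter per segment from the build side (`tbl1 l`), merges them with a bitwise OR, and compares filtering the probe side (`tbl2 r`) before the Motion with only the local filter versus with the merged one.

```python
import hashlib

NUM_SEGMENTS = 3  # mirrors "segments: 3" in the plans quoted above


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def union(self, other):
        # Filters with identical parameters merge with a bitwise OR.
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def segment_of(value):
    # Hypothetical distribution policy: a row lives on segment (value mod 3).
    return value % NUM_SEGMENTS


build_values = [10, 11, 12, 13, 14, 15]  # l.v, distributed by v
probe_values = list(range(30))           # r.v, distributed by v (not by v + 1)

# Each segment builds a filter from the build rows it stores locally.
local_filters = [BloomFilter() for _ in range(NUM_SEGMENTS)]
for v in build_values:
    local_filters[segment_of(v)].add(v)

# A global filter is the union of all local filters.
global_filter = BloomFilter()
for f in local_filters:
    global_filter = global_filter.union(f)

# Ground truth for the join condition (r.v + 1) = l.v.
true_matches = [v for v in probe_values if v + 1 in build_values]

# Filtering before the Motion: each probe row is tested on the segment that stores it.
local_only = [v for v in probe_values
              if local_filters[segment_of(v)].might_contain(v + 1)]
with_global = [v for v in probe_values if global_filter.might_contain(v + 1)]

print("true matches:         ", true_matches)
print("kept by local filter: ", local_only)
print("kept by global filter:", with_global)
```

In this toy layout every probe row sits on a different segment than its matching build row, so a local-only filter applied before the Motion can drop true matches, while the merged filter never does. After the Redistribute Motion the rows are co-located by join key, which is why a local filter is sufficient there; pushing the filter below the Motion appears to need some way of getting the merged filter to the sender side.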


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

