avamingli opened a new pull request, #1284:
URL: https://github.com/apache/cloudberry/pull/1284
In CBDB, row estimation is determined by the relation's rows and cluster
segments.
However, when there is a parallel subquery scan path, each worker will
process fewer rows (divided by parallel_workers).
```sql
set enable_parallel = off;
explain SELECT e.name
FROM employees e
WHERE e.salary > (
SELECT AVG(salary)
FROM employees
WHERE department_id = e.department_id);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=163.42..307.76 rows=3767
width=218)
-> Hash Join (cost=163.42..257.54 rows=1256 width=218)
Hash Cond: (e.department_id = "Expr_SUBQUERY".csq_c0)
Join Filter: (e.salary > "Expr_SUBQUERY".csq_c1)
-> Seq Scan on employees e (cost=0.00..71.67 rows=3767 width=254)
-> Hash (cost=150.92..150.92 rows=1000 width=36)
-> Broadcast Motion 3:3 (slice2; segments: 3)
(cost=130.09..150.92 rows=1000 width=36)
-> Subquery Scan on "Expr_SUBQUERY"
(cost=130.09..137.59 rows=333 width=36)
-> Finalize HashAggregate (cost=130.09..134.26
rows=333 width=36)
Group Key: employees.department_id
-> Redistribute Motion 3:3 (slice3;
segments: 3) (cost=90.50..122.67 rows=990 width=36)
Hash Key: employees.department_id
-> Partial HashAggregate
(cost=90.50..102.87 rows=990 width=36)
Group Key:
employees.department_id
-> Seq Scan on employees
(cost=0.00..71.67 rows=3767 width=36)
Optimizer: Postgres query optimizer
(16 rows)
```
Subquery Scan on "Expr_SUBQUERY" (cost=130.09..137.59 **rows=333** width=36)
While, a parallel Subquery Scan has the same rows though cost is less than
that.
```sql
set enable_parallel = on;
set min_parallel_table_scan_size = 0;
explain SELECT e.name
FROM employees e
WHERE e.salary > (
SELECT AVG(salary)
FROM employees
WHERE department_id = e.department_id);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Gather Motion 6:1 (slice1; segments: 6) (cost=131.17..245.45 rows=3767
width=218)
-> Parallel Hash Join (cost=131.17..201.50 rows=628 width=218)
Hash Cond: (e.department_id = "Expr_SUBQUERY".csq_c0)
Join Filter: (e.salary > "Expr_SUBQUERY".csq_c1)
-> Parallel Seq Scan on employees e (cost=0.00..52.83 rows=1883
width=254)
-> Parallel Hash (cost=118.67..118.67 rows=1000 width=36)
-> Broadcast Workers Motion 6:6 (slice2; segments: 6)
(cost=99.92..118.67 rows=1000 width=36)
-> Subquery Scan on "Expr_SUBQUERY"
(cost=99.92..105.33 rows=333 width=36)
-> HashAggregate (cost=99.92..102.00 rows=167
width=36)
Group Key: employees.department_id
-> Redistribute Motion 6:6 (slice3;
segments: 6) (cost=0.00..90.50 rows=1883 width=36)
Hash Key: employees.department_id
Hash Module: 3
-> Parallel Seq Scan on employees
(cost=0.00..52.83 rows=1883 width=36)
Optimizer: Postgres query optimizer
(15 rows)
```
This commit fixes that issue.
The correction not only makes parallel subquery estimation more accurate,
but also enables the entire plan to be as parallel as possible, particularly
for subqueries in complex queries.
Authored-by: Zhang Mingli [email protected]
<!-- Thank you for your contribution to Apache Cloudberry (Incubating)! -->
Fixes #ISSUE_Number
### What does this PR do?
<!-- Brief overview of the changes, including any major features or fixes -->
### Type of Change
- [ ] Bug fix (non-breaking change)
- [ ] New feature (non-breaking change)
- [ ] Breaking change (fix or feature with breaking changes)
- [ ] Documentation update
### Breaking Changes
<!-- Remove if not applicable. If yes, explain impact and migration path -->
### Test Plan
<!-- How did you test these changes? -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Passed `make installcheck`
- [ ] Passed `make -C src/test installcheck-cbdb-parallel`
### Impact
<!-- Remove sections that don't apply -->
**Performance:**
<!-- Any performance implications? -->
**User-facing changes:**
<!-- Any changes visible to users? -->
**Dependencies:**
<!-- New dependencies or version changes? -->
### Checklist
- [ ] Followed [contribution
guide](https://cloudberry.apache.org/contribute/code)
- [ ] Added/updated documentation
- [ ] Reviewed code for security implications
- [ ] Requested review from [cloudberry
committers](https://github.com/orgs/apache/teams/cloudberry-committers)
### Additional Context
<!-- Any other information that would help reviewers? Remove if none -->
### CI Skip Instructions
<!--
To skip CI builds, add the appropriate CI skip identifier to your PR title.
The identifier must:
- Be in square brackets []
- Include the word "ci" and either "skip" or "no"
- Only use for documentation-only changes or when absolutely necessary
-->
---
<!-- Join our community:
- Mailing list:
[[email protected]](https://lists.apache.org/[email protected])
(subscribe: [email protected])
- Discussions: https://github.com/apache/cloudberry/discussions -->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]