MisterRaindrop commented on PR #1571:
URL: https://github.com/apache/cloudberry/pull/1571#issuecomment-3949722894

   > > > Overall, FDW parallel scan is a direction worth exploring, but this 
approach is too rough. The core problems are:
   > > > 
   > > > 1. locus transition semantics for Gather in an MPP context haven't 
been thought through, and the changes are too broad.
   > > > 2. FDW is a black box from the database's perspective.
   > > >    For heap tables we have parallel scan (divide work by pages), for 
AO/AOCS we have parallel scan (divide work by files) — the work partitioning is 
well-defined.
   > > >    But for FDWs, the parallel behavior depends entirely on the FDW's 
own implementation. If an FDW (say file_fdw) sets parallel_safe = true 
following planner's parallel logic but doesn't actually implement the DSM 
parallel callbacks (EstimateDSMForeignScan, InitializeDSMForeignScan, 
InitializeWorkerForeignScan), then multiple workers will each scan the full 
dataset, producing duplicate rows.
   > > 
   > > 
   > > I'm not very familiar with Cloudberry and am still learning.
   > > An FDW is itself a black box; its behavior largely depends on how the 
user implements it. My understanding is that users need to take 
responsibility for their own implementations. Additionally, I should enable 
Gather only for FDWs and leave it false in all other cases. Would that affect 
the parallel processing advantages of PostgreSQL?
   > > I've also looked into other aspects of FDW parallelism. Currently, 
there seems to be no optimal solution.
   > > So, should we aim to implement parallelism that is transparent to users? 
Or are there better approaches? Could you share some ideas?
   > 
   > Neither PostgreSQL nor Cloudberry supports parallel FDW scans. That's a 
deliberate decision, not an oversight.
   > 
   > On the implementation side: having the kernel generate partial paths for 
FDW will cause FDWs that don't implement parallel scan callbacks to silently 
produce wrong results (e.g. duplicate rows). That's a kernel bug, not a user 
error — we can't shift that responsibility to FDW authors. And mixing Gather 
with CBDB-style parallelism remains fundamentally broken — the locus handling 
is wrong, and none of the issues I raised (joins, locus transitions, the overly 
broad execMain.c change) have been addressed.
   > 
   > More importantly, before discussing how, we need to answer why. What 
real-world problem does this solve in an MPP system where FDW is already used 
across segments? And given the risks I mentioned above — broken locus 
transitions, silent wrong results for existing FDWs, untested join/subquery 
interactions — even if it can be done, is it worth the complexity? If you want 
to push this forward, you need to make the case clearly: what's the motivation, 
and convince us that all the issues raised have sound solutions.
   
   Parallel FDW scans primarily address slow data loading. This 
functionality was already implemented in earlier versions of PostgreSQL, and I 
am now trying to bring it to MPP systems. In simple tests, parallelization 
delivered a performance improvement of one to two times. Gains of that size 
matter for performance-sensitive business scenarios, which is why I am working 
on this feature. Alternatively, we could discuss the implementation plan in 
the issue tracker.
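
To make the duplicate-row concern quoted above concrete, here is a minimal standalone sketch in C. It does not use any PostgreSQL or Cloudberry API: `SharedScanState`, `claim_chunk`, `scan_uncoordinated`, and `scan_coordinated` are hypothetical names, and the workers are simulated sequentially in one process rather than as real parallel workers. The shared cursor only stands in for the kind of state an FDW would be expected to set up through the DSM callbacks (EstimateDSMForeignScan, InitializeDSMForeignScan, InitializeWorkerForeignScan).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared scan state; stands in for what an FDW would keep in DSM. */
typedef struct SharedScanState {
    size_t next_row;   /* first row that no worker has claimed yet */
    size_t total_rows; /* number of rows in the simulated foreign table */
} SharedScanState;

/*
 * No coordination: every worker scans the whole table on its own,
 * so each row is emitted once per worker.
 */
size_t scan_uncoordinated(size_t nworkers, size_t total_rows)
{
    size_t emitted = 0;
    for (size_t w = 0; w < nworkers; w++)
        for (size_t r = 0; r < total_rows; r++)
            emitted++;
    return emitted;
}

/* Claim up to `chunk` unclaimed rows; returns how many rows were claimed. */
static size_t claim_chunk(SharedScanState *s, size_t chunk, size_t *start)
{
    assert(s != NULL && start != NULL);
    if (s->next_row >= s->total_rows)
        return 0;
    size_t remaining = s->total_rows - s->next_row;
    size_t n = remaining < chunk ? remaining : chunk;
    *start = s->next_row;
    s->next_row += n;
    return n;
}

/*
 * Coordination through a shared cursor: workers take turns claiming
 * disjoint chunks, so each row is emitted exactly once in total.
 */
size_t scan_coordinated(size_t nworkers, size_t total_rows, size_t chunk)
{
    SharedScanState shared = { 0, total_rows };
    size_t emitted = 0;
    int progress = 1;
    while (progress) {
        progress = 0;
        for (size_t w = 0; w < nworkers; w++) {
            size_t start;
            size_t n = claim_chunk(&shared, chunk, &start);
            if (n > 0) {
                emitted += n; /* worker w would return rows [start, start + n) */
                progress = 1;
            }
        }
    }
    return emitted;
}
```

With 4 simulated workers and 1000 rows, `scan_uncoordinated(4, 1000)` counts 4000 emitted rows, while `scan_coordinated(4, 1000, 64)` counts 1000. In a real parallel scan the cursor update would need to be atomic or lock-protected, which this sequential sketch leaves out.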


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

