JunWang222 opened a new issue, #761:
URL: https://github.com/apache/wayang/issues/761

   ## Problem
   
   `SqlToRddOperator` currently converts JDBC query results into Spark RDDs by:
   
   1. Executing the JDBC query
   2. Collecting the entire `ResultSet` into driver memory
   3. Calling `sc.parallelize(...)`
   
   Relevant code:
   
   See `SqlToRddOperator.java`, around lines 90–93.
   
   ## Why this is problematic
   
   This implementation is not scalable for large datasets because:
   
   * All JDBC results are materialized in the Spark driver JVM.
   * The driver can become a memory bottleneck.
   * Spark JDBC partitioning is not used.
   * Data transfer is centralized instead of distributed.
   
   For analytical engines such as Trino, BigQuery, and Presto, large query results 
are expected, so this execution shape can significantly limit scalability.
   
   ## Suggested direction
   Instead of collecting the full `ResultSet` into the driver, the operator 
could leverage Spark's JDBC reader APIs and partitioned reads where possible.
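
One possible shape for the partitioned-read idea is sketched below. It only builds the per-partition `WHERE` predicates, using nothing beyond the Java standard library. The class `JdbcPartitionPredicates`, the method `build`, and the assumption that the query exposes a numeric column with roughly known bounds are all hypothetical; none of this is taken from the Wayang code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Wayang or Spark.
final class JdbcPartitionPredicates {

    private JdbcPartitionPredicates() {}

    /**
     * Splits the range [lowerBound, upperBound) of a numeric column into at most
     * numPartitions non-overlapping predicates. The first and last predicates are
     * open-ended (and the first also takes NULLs), so rows outside the bounds
     * are still read by exactly one partition.
     */
    static List<String> build(String column, long lowerBound, long upperBound, int numPartitions) {
        if (numPartitions < 1 || upperBound <= lowerBound) {
            throw new IllegalArgumentException("need numPartitions >= 1 and lowerBound < upperBound");
        }
        long span = upperBound - lowerBound;
        // Never create more partitions than there are distinct values in the range.
        int parts = (int) Math.min((long) numPartitions, span);
        long stride = span / parts;
        long remainder = span % parts; // the first `remainder` partitions get one extra value

        List<String> predicates = new ArrayList<>(parts);
        long start = lowerBound;
        for (int i = 0; i < parts; i++) {
            long end = start + stride + (i < remainder ? 1 : 0);
            boolean first = (i == 0);
            boolean last = (i == parts - 1);
            String predicate;
            if (first && last) {
                predicate = "1 = 1"; // single partition: read everything
            } else if (first) {
                predicate = column + " < " + end + " OR " + column + " IS NULL";
            } else if (last) {
                predicate = column + " >= " + start;
            } else {
                predicate = column + " >= " + start + " AND " + column + " < " + end;
            }
            predicates.add(predicate);
            start = end;
        }
        return predicates;
    }
}
```

An array of such predicates could then be handed to Spark's `DataFrameReader.jdbc(url, table, predicates, connectionProperties)` overload, which creates one partition per predicate and lets each executor open its own JDBC connection. Whether this fits `SqlToRddOperator`, which receives an arbitrary SQL query rather than a table name, is an open question; the query would probably need to be passed as a subquery alias, and a suitable partition column is not always available.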

