JunWang222 opened a new issue, #761: URL: https://github.com/apache/wayang/issues/761
## Problem

`SqlToRddOperator` currently converts JDBC query results into Spark RDDs by:

1. Executing the JDBC query
2. Collecting the entire `ResultSet` into driver memory
3. Calling `sc.parallelize(...)`

Relevant code: see `SqlToRddOperator.java` around lines 90–93

## Why this is problematic

This implementation does not scale to large datasets because:

* All JDBC results are materialized in the Spark driver JVM.
* The driver can become a memory bottleneck.
* Spark JDBC partitioning is not used.
* Data transfer is centralized instead of distributed.

Analytical engines such as Trino, BigQuery, and Presto are expected to return large query results, so this execution shape can significantly limit scalability.

## Suggested direction

Instead of collecting the full `ResultSet` into the driver, the operator could leverage Spark's JDBC reader APIs and partitioned reads where possible.
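As a rough illustration of the suggested direction, the sketch below builds the option map that Spark's JDBC data source accepts for a partitioned read. The class `PartitionedJdbcReadSketch`, the method `jdbcOptions`, and the alias `wayang_src` are hypothetical names chosen for this example and do not exist in Wayang. The sketch also assumes the query exposes a numeric column that can be used for partitioning and that the target database accepts a parenthesized subquery with an alias.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Wayang): options for a partitioned Spark JDBC read. */
final class PartitionedJdbcReadSketch {

    /**
     * Builds options for Spark's "jdbc" data source. The option keys are the ones
     * documented by Spark; everything else here is an assumption of this sketch.
     */
    static Map<String, String> jdbcOptions(String url, String query, String partitionColumn,
                                           long lowerBound, long upperBound, int numPartitions) {
        if (numPartitions < 1 || lowerBound > upperBound) {
            throw new IllegalArgumentException("invalid partitioning parameters");
        }
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("url", url);
        // Wrap the query as a derived table so it can stand in for a table name.
        opts.put("dbtable", "(" + query + ") wayang_src");
        // The bounds only determine the partition stride; they do not filter rows.
        opts.put("partitionColumn", partitionColumn);
        opts.put("lowerBound", Long.toString(lowerBound));
        opts.put("upperBound", Long.toString(upperBound));
        opts.put("numPartitions", Integer.toString(numPartitions));
        return opts;
    }

    // With Spark on the classpath, the map could be used roughly like this:
    //   Dataset<Row> rows = spark.read().format("jdbc").options(opts).load();
    //   JavaRDD<Row> rdd = rows.javaRDD();
    // Each executor then reads its own slice over its own JDBC connection,
    // so the full result never has to pass through the driver.
}
```

When no suitable partition column is available, a single-partition read through the same reader would still avoid calling `sc.parallelize(...)` on a driver-side collection, although the read itself would not be parallel.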
