> In case a table has a few million records, it all goes through the driver.
This sounds right for JDBC mode: the driver fetches all the rows and then spreads the resulting RDD over the executors. I'd say most use cases run SQL to aggregate huge datasets and then retrieve a small number of rows to be transformed for ML tasks. In that case, going through JDBC offers the robustness of Hive to produce a small aggregated dataset inside Spark, whereas Spark SQL uses RDDs to produce the small result from the huge table.

What is still not clear to me is how Spark SQL deals with a huge Hive table. Does it load everything into memory and crash, or can that never happen?
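For what it's worth, here is a minimal sketch of the two paths being compared. The table name, schema, and HiveServer2 URL are made up for illustration; the point is only where the heavy data lives in each case:

```scala
import org.apache.spark.sql.SparkSession

object HiveVsJdbcSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark SQL read Hive tables directly:
    // executors read the underlying files in parallel, so the full
    // table never funnels through the driver.
    val spark = SparkSession.builder()
      .appName("hive-vs-jdbc-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Spark SQL path: the aggregation over the huge table is executed
    // on the executors; only the small aggregated result would ever be
    // collected to the driver.
    val viaSparkSql = spark.sql(
      """SELECT customer_id, sum(amount) AS total
        |FROM sales.transactions   -- hypothetical Hive table
        |GROUP BY customer_id""".stripMargin)

    // JDBC path: HiveServer2 runs the aggregation itself, and the
    // JDBC connection streams back only the (small) result set.
    val viaJdbc = spark.read
      .format("jdbc")
      .option("url", "jdbc:hive2://hiveserver:10000/sales") // hypothetical host
      .option("query",
        "SELECT customer_id, sum(amount) AS total FROM transactions GROUP BY customer_id")
      .load()

    viaSparkSql.show(10)
    viaJdbc.show(10)
    spark.stop()
  }
}
```

In the first path the "huge to small" reduction happens inside Spark; in the second it happens inside Hive, and Spark only ever sees the small result.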