My take on this might sound a bit different. Here are a few points to consider:
1. Going through Hive JDBC means the application is restricted by the number of queries that can be compiled. HS2 can only compile one SQL statement at a time, and if users submit bad SQL, compilation alone (before any map reduce work) can take a long time. This reduces query throughput, i.e. the number of queries you can fire through JDBC.

2. Going through Hive JDBC does have the advantage that the HMS service is protected. The JIRA https://issues.apache.org/jira/browse/HIVE-13884 protects HMS from crashing: at the end of the day, retrieving metadata for a Hive table with millions, or simply put thousands, of partitions hits the JVM limit on the size of the array holding the retrieved metadata. When that JVM array-size limit is hit, HMS crashes. So in effect this limit is good to have, to protect both HMS and the relational database on its back end. Note: the Hive community has proposed moving the database to HBase, which scales, but I don't think this will be implemented any time soon.

3. Going through the SparkContext, it interfaces directly with the Hive Metastore. I have put the sequence of the code flow below. The bit I didn't have time to dive into is that I believe if the table is really large, i.e. it has more partitions than fit in a short (32K), then some sort of slicing occurs (I didn't have time to track down this piece of code, but from experience it does seem to happen).

Code flow:
Spark uses the Hive external catalog -> goo.gl/7CZcDw
The HiveClient version of getPartitions is -> goo.gl/ZAEsqQ
The HiveClientImpl version of getPartitions is -> goo.gl/msPrr5
The Hive call is made at -> goo.gl/TB4NFU
ThriftHiveMetastore.java -> get_partitions_ps_with_auth

A value of -1 is passed from Spark all the way through to the Hive Metastore thrift layer. So in effect, for large tables, 32K partitions are retrieved at a time. This has also led to a few HMS crashes, but I am yet to confirm whether it is really the cause.

Based on the 3 points above, I would prefer to use SparkContext.
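To make the -1 / short point above concrete, here is a minimal sketch of the slicing behaviour I am describing. This is NOT the actual Spark or Hive code; the class and method names are made up for illustration. The assumptions are only: (a) the thrift signature takes a `short` max_parts, so a single call can name at most Short.MAX_VALUE (32767) partitions, and (b) -1 is the sentinel meaning "all partitions", which a client would have to satisfy by fetching in slices:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionBatching {
    // A single call can ask for at most Short.MAX_VALUE (32767) partitions,
    // because the thrift parameter is a short.
    static final short MAX_PARTS_PER_CALL = Short.MAX_VALUE;

    // Hypothetical stand-in for one metastore round trip, e.g. something
    // like get_partitions_ps_with_auth(db, tbl, ..., max_parts).
    static List<Integer> fetchPartitions(List<Integer> all, int offset, short maxParts) {
        int end = Math.min(all.size(), offset + maxParts);
        return new ArrayList<>(all.subList(offset, end));
    }

    // maxParts == -1 means "all partitions": retrieve in slices of 32767
    // until the table is exhausted. A non-negative maxParts is a hard cap.
    static List<Integer> listPartitions(List<Integer> all, short maxParts) {
        if (maxParts >= 0) {
            return fetchPartitions(all, 0, maxParts);
        }
        List<Integer> result = new ArrayList<>();
        int offset = 0;
        while (offset < all.size()) {
            List<Integer> batch = fetchPartitions(all, offset, MAX_PARTS_PER_CALL);
            result.addAll(batch);
            offset += batch.size();
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> parts = new ArrayList<>();
        for (int i = 0; i < 100000; i++) parts.add(i);
        // 100000 partitions with -1 take four round trips: 32767+32767+32767+1699.
        System.out.println(listPartitions(parts, (short) -1).size());   // 100000
        System.out.println(listPartitions(parts, (short) 1000).size()); // 1000
    }
}
```

The point of the sketch is only that with -1 the client still ends up making one metastore call per 32K partitions, so a very large table means many large metadata responses arriving at HMS back to back, which is consistent with the crashes I mentioned.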
If the cause of the crashes is indeed the retrieval of a high number of partitions, then I may opt for the JDBC route.

Thanks,
Kabeer.

On Fri, 13 Oct 2017 09:22:37 +0200, Nicolas Paris wrote:
>> In case a table has a few
>> million records, it all goes through the driver.
>
> This sounds clear in JDBC mode: the driver gets all the rows and then it
> spreads the RDD over the executors.
>
> I'd say that most use cases deal with SQL to aggregate huge datasets,
> and retrieve a small number of rows to then be transformed for ML tasks.
> Using JDBC then offers the robustness of Hive to produce a small aggregated
> dataset in Spark, while using Spark SQL uses RDDs to produce the small
> one from the huge.
>
> It is not very clear how Spark SQL deals with a huge Hive table. Does it
> load everything into memory and crash, or does this never happen?
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Sent using Dekko from my Ubuntu device