I guess it really depends on your configuration.  The Hive metastore provides just the metadata/schema for your database, not the actual data storage.  Hive runs on top of Hadoop. If you configure Spark to run on the same Hadoop cluster using YARN, your SQL DataFrame in Spark will automatically be written out as data files in HDFS, in the table format recorded in the metastore.  So HDFS is the connecting medium between Spark and Hive.
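As a rough sketch (assuming hive-site.xml pointing at your metastore and the Hadoop configs are on Spark's classpath; the database/table names here are made up), saving a DataFrame looks like this -- Spark writes the files into the table's HDFS location and registers the schema in the metastore, with no HiveServer2 in the path:

import org.apache.spark.sql.SparkSession

// Minimal sketch: assumes hive-site.xml (metastore URI) and the HDFS configs
// are available, so Spark can reach the metastore and write files to HDFS.
val spark = SparkSession.builder()
  .appName("hive-write-example")
  .enableHiveSupport()          // use the Hive metastore for table metadata
  .getOrCreate()

val df = spark.range(0, 100).toDF("id")

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")   // hypothetical database

// Spark itself writes the data files into the table's HDFS location and
// records the schema in the metastore; no HiveServer2 is involved here.
df.write.mode("overwrite").saveAsTable("demo_db.ids")

// Reading back uses the metastore for metadata and HDFS for the data.
spark.sql("SELECT count(*) FROM demo_db.ids").show()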

The default Spark distribution is bundled with a Thrift server, so you can save/retrieve DataFrames to/from Hive tables using a standalone metastore, without a Hive server present.  However, you do have to run a script (not provided by Spark) to initialize the standalone metastore.  In this standalone Hive mode, Spark reads/writes these Hive tables directly; they are stored as plain text files on disk (not Parquet).
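A minimal sketch of that standalone setup (assuming a local metastore you have already initialized with Hive's schematool, e.g. embedded Derby, and an illustrative warehouse path):

import org.apache.spark.sql.SparkSession

// Standalone-Hive sketch: no Hive server running, just a local metastore and a
// warehouse directory. Paths and table names are placeholders.
val spark = SparkSession.builder()
  .appName("standalone-metastore-example")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  // where table files land
  .enableHiveSupport()
  .getOrCreate()

// In Spark 2.x, a table created through the Hive catalog without an explicit
// format defaults to Hive's TEXTFILE storage, i.e. plain text files on disk.
spark.sql("CREATE TABLE IF NOT EXISTS events (id INT, name STRING)")
spark.sql("INSERT INTO events VALUES (1, 'a'), (2, 'b')")
spark.sql("SELECT * FROM events").show()
// The data now sits under /tmp/spark-warehouse/events/ as text files, and the
// schema sits in the standalone metastore -- no HiveServer2 involved.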

The benefit of using a Thrift server is the option of saving/retrieving Spark data from third-party applications directly via JDBC.  Since most people run Spark without Hadoop, and you can run a standalone Hive metastore to achieve the same outcome, I'm not sure what benefit running Spark on HDFS and Hive would bring, considering the huge admin overhead associated with both Hadoop and Hive.  Hope this helps...
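For example, after starting the bundled Thrift server (sbin/start-thriftserver.sh), a third-party JVM application could query those tables over JDBC with the Hive JDBC driver, roughly like this (host, port, and the "events" table are assumptions for illustration):

import java.sql.DriverManager

// Rough sketch: the Spark Thrift Server speaks the HiveServer2 protocol
// (default port 10000), so the standard Hive JDBC driver can connect to it.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SELECT id, name FROM events LIMIT 10")
while (rs.next()) {
  println(s"${rs.getInt("id")} -> ${rs.getString("name")}")
}
rs.close(); stmt.close(); conn.close()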


On 3/14/22 1:54 PM, Venkatesan Muniappan wrote:
Hi Team,

I wanted to understand how Spark connects to Hive. Does it connect to the Hive metastore directly, bypassing the Hive server? Let's say we are inserting data into a Hive table whose I/O format is Parquet. Does Spark create the Parquet file from the DataFrame/RDD/Dataset, put it in the table's HDFS location, and update the metastore about the new Parquet file? Or does it simply run the insert statement on HiveServer (through JDBC or some other means)?

We are using Spark 2.4.3 and Hive 2.1.1 in our cluster.

Is there a document that explains this? Please share.

Thanks,
Venkat
2016173438
