I guess it really depends on your configuration. The Hive metastore
provides just the metadata/schema for your tables, not the actual
data storage. Hive runs on top of Hadoop. If you configure your
Spark to run on the same Hadoop cluster using YARN, your SQL DataFrame
in Spark will be written automatically as data files in HDFS, in the
format recorded in the metastore. So HDFS is the connecting medium
between Spark and Hive.
The default Spark distribution package is bundled with a Thrift server,
so you can save/retrieve DataFrames to/from Hive tables using a
standalone metastore, without a Hive server present. However, you do
have to run a script (not provided by Spark) to initialize the
standalone metastore. In the standalone Hive mode, Spark reads/writes
these Hive tables directly, and they are stored as plain text files
on disk (not Parquet).
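As an ops-level sketch of that setup, the sequence might look like the commands below. The paths, the Derby `-dbType`, and the default port are assumptions; adjust them to your environment (this is a config fragment, not something runnable as-is):

```shell
# Initialize the standalone metastore schema once, using Hive's
# schematool (shipped with Hive, not with Spark; Derby is the
# simplest backing database for a single-node setup).
$HIVE_HOME/bin/schematool -dbType derby -initSchema

# Start the Thrift JDBC/ODBC server bundled with the Spark distribution.
$SPARK_HOME/sbin/start-thriftserver.sh

# Third-party applications can now reach Spark SQL tables over JDBC,
# e.g. with the beeline client against the default port 10000.
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000
```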
The benefit of using a Thrift server is having the option of
saving/retrieving Spark data from third-party applications via JDBC
directly. Since most people run Spark without Hadoop, and you can
run standalone Hive to achieve the same outcome, I'm not sure what
benefits running Spark on HDFS and Hive would bring, considering the
huge admin overhead associated with both Hadoop and Hive. Hope this
helps...
On 3/14/22 1:54 PM, Venkatesan Muniappan wrote:
hi Team,
I wanted to understand how Spark connects to Hive. Does it connect to
the Hive metastore directly, bypassing the Hive server? Let's say we are
inserting data into a Hive table whose I/O format is Parquet. Does
Spark create the Parquet file from the DataFrame/RDD/Dataset, put
it in its HDFS location, and update the metastore about the new Parquet
file? Or does it simply run the insert statement on HiveServer (through
JDBC or some other means)?
We are using Spark 2.4.3 and Hive 2.1.1 in our cluster.
Is there a document that explains this? Please share.
Thanks,
Venkat