FYI, about integration of Spark + Hive 4 HTH Dr Mich Talebzadeh, Data Scientist | Distributed Systems (Spark) | Financial Forensics & Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based Analytics
view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> ---------- Forwarded message --------- From: Mich Talebzadeh <[email protected]> Date: Tue, 17 Mar 2026 at 19:16 Subject: Spark + Hive 4 Integration Guide (Practical Approach) To: Spark dev list <[email protected]>, user @spark < [email protected]> Spark + Hive 4 Integration Guide (Practical Approach) 1. Overview This write-up was motivated by my personal use of Hive as a stable and reliable data warehouse for storing my bank records, which span over a decade. Migrating away from Hive would have been both time-consuming and impractical, so I needed to find a way to upgrade my lab environment—including Hadoop, Hive, and Spark—while preserving the existing architecture. The objective was to identify a cost-effective and practical solution that maintains the current setup, which relies on Airflow as the workflow orchestration tool to populate Hive tables on demand. This document explains how Apache Spark interacts with Apache Hive 4 in a modern architecture. Due to incompatibility between Spark (3.x) and the Hive 4 metastore APIs, a decoupled approach is required. The environment used consists of *Spark 3.5.5, Hadoop 3.4.1, and Hive 4.0.1*. 2. Architecture Diagram +------------------+ | Spark | | (Processing) | +--------+---------+ | | Write (Parquet) v +------------------+ | HDFS | | (Data Storage) | +--------+---------+ | | Read v +------------------+ | Hive | | (SQL / Metadata) | +------------------+ 3. Key Principle Spark does NOT talk to Hive metastore directly. Instead: - Spark writes data to HDFS - Hive reads data from HDFS using external tables 4. Real Example (Based on My Pipeline) Spark transformation and write: val df = spark.read .option("inferSchema", "true") .option("header", "false") .csv("hdfs://rhes75:9000/data/stg/...") val finalDf = df .filter($"_c0" =!= "Date") .select( to_date($"_c0","dd/MM/yyyy").as("transactiondate"), $"_c1".as("description") ) finalDf.write.mode("overwrite") .parquet("hdfs://.../parquet/path") Hive table creation via JDBC: CREATE EXTERNAL TABLE elayer.bank_123456_stg ( transactiondate DATE, description STRING ) STORED AS PARQUET LOCATION 'hdfs://.../parquet/path'; 5. Hive JDBC Example Class.forName("org.apache.hive.jdbc.HiveDriver") val conn = DriverManager.getConnection("jdbc:hive2://rhes75:10099/default") val stmt = conn.createStatement() stmt.execute("SELECT COUNT(*) FROM table") 6. Troubleshooting (Based on My Issues) Problem: Invalid method name 'get_table' Cause: Spark trying to use Hive 4 metastore (not supported) Problem: No data in staging table Cause: Table already existed with old LOCATION Fix: DROP TABLE elayer.bank_123456_stg before recreating Problem: HDFS write errors (map.xml) Cause: DataNode capacity / replication issue 7. Best Practices - Always use EXTERNAL tables for Spark-written data - Always control LOCATION explicitly - Drop & recreate staging tables when path changes - Keep Spark and Hive responsibilities separate 8. Summary Spark + Hive 4 integration works reliably when using HDFS as the bridge. Avoid direct metastore usage from Spark. HTH Dr Mich Talebzadeh, Data Scientist | Distributed Systems (Spark) | Financial Forensics & Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based Analytics view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
