Hello,
It is a pleasure to hear from you. Thank you for sharing your insights.
We have adopted a similar approach to address component stack upgrade and 
evolution challenges, and we are delighted to see that the community is 
actively advancing this work as well.
Our general strategy focuses on two key principles:
1. Decoupling the Engine-Data Relationship
This appears to be a consensus among industry peers. Given that our data 
processing remains predominantly structured, the data lake technology stack 
naturally became our preferred choice, as it inherently preserves data schema.
In our production environment, we utilize HadoopCatalog to migrate data 
previously managed by legacy HMS versions into Iceberg. While alternative 
RestCatalog implementations are certainly worth considering, our preference for 
minimizing component stack complexity led us to favor a FileSystemCatalog-like 
solution.
We are grateful that the community has incorporated similar experiences—HIVE 
now supports both HadoopCatalog and LocationBasedIcebergTable.
Leveraging these features, we have achieved seamless data interoperability 
between Spark and HIVE4 in our production environment. This liberates us from 
concerns about engine upgrades potentially corrupting data or compromising 
accessibility. For Trino and other MPP databases, we are now implementing a 
compatibility layer adhering to RestCatalog specifications, enabling these 
engines to access FileSystemCatalog-based tables.
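To illustrate the kind of wiring involved, here is a minimal sketch in Python that generates the Spark properties for an Iceberg HadoopCatalog and a Hive DDL statement that exposes the same table by location. The catalog name, warehouse path, and table name are hypothetical placeholders, and the helper functions are illustrative only; this is not a copy of the production configuration described above. The property keys and the storage handler class follow the Iceberg documentation for Spark and Hive.

```python
def spark_hadoop_catalog_conf(catalog, warehouse):
    """Spark properties that register an Iceberg HadoopCatalog under `catalog`."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",          # file-system based catalog, no HMS needed
        f"{prefix}.warehouse": warehouse,    # root directory holding all tables
    }


def hive_location_based_ddl(table, location):
    """Hive DDL that exposes an existing Iceberg table by its storage location."""
    return (
        f"CREATE EXTERNAL TABLE {table}\n"
        "STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('iceberg.catalog'='location_based_table')"
    )


# Hypothetical values, chosen only for illustration.
WAREHOUSE = "hdfs://nameservice1/warehouse/iceberg"

conf = spark_hadoop_catalog_conf("hadoop_cat", WAREHOUSE)
# HadoopCatalog lays tables out as <warehouse>/<namespace>/<table>.
ddl = hive_location_based_ddl("db1.events", f"{WAREHOUSE}/db1/events")

for key, value in conf.items():
    print(f"--conf {key}={value}")
print(ddl)
```

Because both engines resolve the table from the same directory, neither side depends on the other's metastore; dropping and recreating the Hive table definition only re-registers the location.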
Our current production architecture maintains separate deployments: Spark 3.x 
operates with legacy HMS 3.x, while HIVE 4.x runs independently. This 
arrangement allows us to upgrade Spark and gracefully phase out the legacy 
HMS versions (3.x, 2.x, ...) at our discretion, and use HMS 4.x. Should we need to 
reconfigure extensively—even clearing all metadata and redeploying engines—the 
data remains secure within Iceberg. We simply re-establish the connection 
between Iceberg and the engine.
2. Minimizing Hadoop Cluster Dependencies
Similar to Spark-on-Kubernetes deployments, our approach involves bundling 
essential runtime dependencies within the engine's self-contained libraries, 
ensuring the engine operates exclusively with its own libraries during 
execution.
This method has effectively decoupled Hadoop versioning from our engines. 
Provided the base APIs remain stable, we can successfully run engines dependent on 
newer Hadoop versions atop older Hadoop/YARN infrastructures.
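As a rough sketch of what "self-contained libraries" can look like on YARN, the Python snippet below assembles the settings that tell Spark and Tez to use an engine-supplied archive instead of the cluster's Hadoop jars. The archive URIs and helper names are hypothetical, and this is an assumption about one possible setup, not necessarily the procedure in the documentation linked below. The property names themselves (spark.yarn.archive, spark.yarn.populateHadoopClasspath, tez.lib.uris, tez.use.cluster.hadoop-libs) are standard Spark and Tez settings.

```python
def spark_self_contained_conf(archive_uri):
    """Spark-on-YARN settings that avoid the cluster's Hadoop classpath."""
    return {
        # Archive containing Spark's jars, including its bundled Hadoop client.
        "spark.yarn.archive": archive_uri,
        # Do not add the cluster's yarn/mapreduce application classpath.
        "spark.yarn.populateHadoopClasspath": "false",
    }


def tez_self_contained_conf(tez_tarball_uri):
    """Tez settings so that Hive-on-Tez ships its own Hadoop libraries."""
    return {
        # Tarball with Tez plus the Hadoop jars it was built against.
        "tez.lib.uris": tez_tarball_uri,
        # Rely only on the libraries inside the tarball.
        "tez.use.cluster.hadoop-libs": "false",
    }


# Hypothetical archive locations, chosen only for illustration.
spark_conf = spark_self_contained_conf("hdfs:///apps/spark/spark-libs.zip")
tez_conf = tez_self_contained_conf("hdfs:///apps/tez/tez-with-hadoop.tar.gz")

for name, props in (("spark", spark_conf), ("tez", tez_conf)):
    for key, value in props.items():
        print(f"[{name}] {key}={value}")
```

The design point is the same in both cases: the cluster provides only resource scheduling and storage, while every class the engine loads comes from an archive the engine owns.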
We were honored to contribute this approach to the official documentation:
https://hive.apache.org/docs/latest/admin/manual-installation/#installing-with-old-version-hadoopgreater-than-or-equal-310
Employing these techniques, we currently operate three or more distinct 
HIVE+HMS version combinations in production. In our customer engagements, we 
have similarly enabled HIVE4 (dependent on Hadoop 3.4+) to run within Hadoop 
2.x environments.
The above reflects our humble experience and observations. We would be 
delighted to exchange ideas should you have alternative approaches or insights 
to share. Please kindly point out any misconceptions or areas for improvement.
Warm regards,
Lisoda
