Hello everyone,

For those of you not familiar with AWS Glue Catalog<https://aws.amazon.com/glue/>, it’s a Hive Metastore implemented as a web service. The Glue service is composed of different components, but the one I’m interested in is the Catalog.

Today, there’s a Hive metastore implementation, and you can plug the catalog into Spark as instructed here<https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html>. Basically, the Hive metastore Java class is swapped with an implementation that calls into Glue’s web service.
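
For context, the swap described above comes down to a single Hive client property. Here is a minimal sketch of the cluster configuration, assuming the `spark-hive-site` classification and the factory class name given in the EMR guide linked above. The helper name `glue_spark_hive_site` is made up for illustration and is not part of any AWS or Spark API.

```python
import json

# Factory class named in the EMR release guide; it replaces the default
# Hive metastore client with one that calls the Glue web service.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_spark_hive_site():
    """Build the EMR configuration list that points Spark's Hive client at Glue.

    Hypothetical helper for illustration only.
    """
    return [
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_FACTORY,
            },
        }
    ]


if __name__ == "__main__":
    # Print the JSON that would be supplied as the cluster configuration.
    print(json.dumps(glue_spark_hive_site(), indent=2))
```

Note that Spark still talks to the Hive client interface here; only the class behind it changes.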
I don’t like this implementation because:

* It puts Hive as a middle-man between Spark and Glue
* It prevents Glue-specific implementations

As an example of the second issue, the Hive version embedded in Spark today does not support partition pruning for columns of fractional or timestamp types. I have a pull request to fix this<https://github.com/apache/spark/pull/20100>, but as rxin correctly pointed out, I have to fake a new Hive version called Glue or something and put the fix under the Hive shim for it.

I have locally implemented a version of ExternalCatalog<https://github.com/apache/spark/blob/2fd12af4372a1e2c3faf0eb5d0a1cf530abc0016/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala> on top of Glue and would like to productionize it and submit it as a pull request. You can set the spark.catalog.implementation config to “glue” and then it will use Glue instead of either the in-memory catalog or Hive. Rudimentary tests are promising, and I can hook up Parquet tables directly without going through Hive at all.

I really need this because I need to fix a data consistency issue with the InsertIntoHiveTable command when data is backed by S3, but that’s a different topic.

The biggest challenge is that I had to upgrade the AWS SDK to a newer version so that it includes the Glue client, since Glue is a new service. So far, I haven’t seen any jar hell issues, but that’s the main drawback I can see. I’ve made sure the version is in sync with the Kinesis client used by the spark-streaming module.

Are there any objections to this? Any guidance around upgrading the AWS client? Who would be a good person to review this pull request?

Thanks,
-Ameen