Sorry, but my original solution is incorrect.

1. Glue Crawlers are not supposed to set the spark.sql.sources.schema.* properties; Spark SQL should. The default for spark.sql.hive.caseSensitiveInferenceMode in Spark 2.4 is INFER_AND_SAVE, which means that Spark infers the schema from the underlying files and alters the table to add the spark.sql.sources.schema.* properties to SERDEPROPERTIES. In our case, Spark failed to do so because of an "IllegalArgumentException: Can not create a Path from an empty string" exception. The exception is thrown because the Hive database class instance has an empty locationUri property string, which in turn happens because the Glue database does not have a Location property. After the schema is saved, Spark reads it from the table.

2. There could be a way around this by setting INFER_ONLY, which should only infer the schema from the files and not attempt to alter the table SERDEPROPERTIES. However, this doesn't work because of a Spark bug, where the inferred schema is then lowercased [1].
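For anyone who wants to try the different inference modes themselves, here is a minimal sketch of how the setting could be passed to spark-submit. The helper name inference_mode_conf and the constants are hypothetical and only for illustration; the property name and its three values (INFER_AND_SAVE, INFER_ONLY, NEVER_INFER) are the ones Spark 2.4 defines. It only shows how to pass the setting and does not work around the lowercasing bug described in point 2.

```python
# Hypothetical helper for illustration; only the property name and its
# three allowed values come from Spark itself.
INFERENCE_MODE_KEY = "spark.sql.hive.caseSensitiveInferenceMode"
VALID_MODES = ("INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER")


def inference_mode_conf(mode):
    """Return the spark-submit arguments that set the schema inference mode."""
    normalized = mode.strip().upper()
    if normalized not in VALID_MODES:
        raise ValueError(
            "unknown mode %r, expected one of %s" % (mode, ", ".join(VALID_MODES))
        )
    return ["--conf", "%s=%s" % (INFERENCE_MODE_KEY, normalized)]


if __name__ == "__main__":
    # Print the arguments to append to a spark-submit command line.
    print(" ".join(inference_mode_conf("infer_only")))
```

The same key and value can also be set when the session is built, for example through the builder's config(key, value) call, as long as this happens before the table is first read.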
[1] https://github.com/apache/spark/blob/c1b6fe479649c482947dfce6b6db67b159bd78a3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L284