Ohad Raviv created SPARK-34416:
----------------------------------

             Summary: Support avroSchemaUrl in addition to avroSchema
                 Key: SPARK-34416
                 URL: https://issues.apache.org/jira/browse/SPARK-34416
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0, 2.3.0, 3.2.0
            Reporter: Ohad Raviv
We have a use case in which we read a huge table in Avro format, with about 30k columns. Using the default Hive reader, `AvroGenericRecordReader`, it just hangs forever; after 4 hours not even one task had finished.

We tried instead to use `spark.read.format("com.databricks.spark.avro").load(..)`, but it failed with:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema ..
  at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
  at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
  ... 53 elided
```
because the files' schema contains column names that are duplicates when compared case-insensitively.

So we wanted to provide a user schema with non-duplicated fields, but the schema is huge (a few MBs), so it is not practical to pass it inline as a JSON string. We therefore patched spark-avro to also accept an `avroSchemaUrl` option in addition to `avroSchema`, and it worked perfectly.
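Until such an option exists, a workaround with the same effect is to load the schema file on the driver and pass its JSON through the existing `avroSchema` option. The sketch below is only illustrative: the paths (`/schemas/huge_table.avsc`, `/data/huge_table`) are hypothetical, and it assumes the built-in `avro` source of Spark 2.4+ (use `com.databricks.spark.avro` on 2.3):
```scala
import org.apache.avro.Schema
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object AvroSchemaUrlWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("avro-schema-url-workaround").getOrCreate()

    // Hypothetical location of the de-duplicated, user-provided Avro schema file.
    val schemaPath = new Path("hdfs:///schemas/huge_table.avsc")
    val fs = schemaPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // Read and parse the .avsc file on the driver, then serialize it back to JSON.
    val in = fs.open(schemaPath)
    val schemaJson =
      try new Schema.Parser().parse(in).toString
      finally in.close()

    // Pass the schema string through the existing `avroSchema` option.
    val df = spark.read
      .format("avro")                   // "com.databricks.spark.avro" on Spark 2.3
      .option("avroSchema", schemaJson)
      .load("hdfs:///data/huge_table")  // hypothetical table location

    df.printSchema()
  }
}
```
With the option proposed here, the same read would reduce to a single call along the lines of `spark.read.format("avro").option("avroSchemaUrl", "hdfs:///schemas/huge_table.avsc").load(...)`, with the URL resolved by the data source itself rather than by user code.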