Ohad Raviv created SPARK-34416:
----------------------------------

             Summary: Support avroSchemaUrl in addition to avroSchema
                 Key: SPARK-34416
                 URL: https://issues.apache.org/jira/browse/SPARK-34416
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0, 2.3.0, 3.2.0
            Reporter: Ohad Raviv


We have a use case in which we read a huge table in Avro format, with about 
30k columns.

Using the default Hive reader, `AvroGenericRecordReader`, the job just hangs 
forever: after 4 hours, not even one task had finished.

We then tried to use 
`spark.read.format("com.databricks.spark.avro").load(..)` instead, but it failed with:

```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
..
 at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
 at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
 ... 53 elided
```

This happens because the files' schema contains duplicate column names when 
the names are compared case-insensitively.
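
To see which fields collide, a small stdlib-only Python helper along these lines could be run against the schema file. This is only a sketch: the function name and the three-field example schema are hypothetical, and it checks top-level record fields only, not nested records.

```python
import json
from collections import defaultdict


def case_insensitive_duplicates(avro_schema_json: str) -> dict:
    """Group top-level field names that collide once lowercased."""
    schema = json.loads(avro_schema_json)
    groups = defaultdict(list)
    for field in schema.get("fields", []):
        groups[field["name"].lower()].append(field["name"])
    # Keep only the lowercased names shared by more than one field.
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical schema, for illustration only (not the reporter's table).
example_schema = json.dumps({
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "userId", "type": "string"},
        {"name": "UserID", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

print(case_insensitive_duplicates(example_schema))
# prints {'userid': ['userId', 'UserID']}
```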

So we wanted to provide a user schema with non-duplicated fields, but the 
schema is huge (a few MBs), so it is not practical to pass it inline in JSON 
format.

So we patched spark-avro to also accept an `avroSchemaUrl` option in addition 
to `avroSchema`, and it worked perfectly.
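
For comparison, without the patch the closest approximation is probably to read the schema file on the driver and pass its text through the existing `avroSchema` option. The PySpark sketch below assumes the schema file is readable from the driver's local filesystem; the helper names are hypothetical, and the proposed `avroSchemaUrl` option itself is not used here.

```python
import json
from pathlib import Path


def load_avro_schema_text(schema_path: str) -> str:
    """Read an Avro schema file and check that it is valid JSON."""
    text = Path(schema_path).read_text(encoding="utf-8")
    json.loads(text)  # fail early on a malformed schema file
    return text


def read_avro_with_schema_file(spark, data_path: str, schema_path: str):
    """Load Avro data with a user schema taken from a file.

    `spark` is an existing SparkSession. The schema text is passed via the
    `avroSchema` option; the file must be local to the driver (assumption).
    """
    schema_text = load_avro_schema_text(schema_path)
    return (
        spark.read.format("com.databricks.spark.avro")
        .option("avroSchema", schema_text)
        .load(data_path)
    )
```

This keeps the multi-MB schema out of the application code, but it is still sent as an option string, which is what a dedicated `avroSchemaUrl` option would avoid.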


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
