Hello Spark Developers,

I'd like to draw attention to a significant limitation in the current *open-source from_avro implementation* within Apache Spark SQL, specifically regarding its integration with the common *Kafka + Avro* ecosystem.
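To make the gap concrete, this is roughly how an Avro-encoded Kafka topic is consumed with the current API - a minimal sketch, assuming the spark-sql-kafka-0-10 and spark-avro artifacts are on the classpath, with an illustrative broker address, topic name, and schema. The Avro schema is a fixed JSON string baked into the query:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("avro-static-schema").getOrCreate()

// The schema must be written out by hand and cannot change while the query runs.
val staticSchema =
  """{"type":"record","name":"User","fields":[{"name":"id","type":"long"},{"name":"name","type":"string"}]}"""

val users = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "users")
  .load()
  // from_avro expects plain Avro binary; records framed with the Confluent
  // wire format (magic byte + schema ID prefix) are not understood as-is.
  .select(from_avro(col("value"), staticSchema).as("user"))
```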
The current design, which is largely "naive" in that it requires a manually supplied, static schema, falls short of supporting the most basic and prevalent streaming scenario: reading an *Avro-encoded Kafka topic with schema evolution*.

The Core Problem: Missing "Automatic" Schema Resolution

When an Avro record is paired with a *Schema Registry* (like Confluent's), the standard procedure is:

1. The record bytes contain a *Schema ID* header.
2. The consumer (Spark) uses this ID to fetch the corresponding *writer schema* from the registry.
3. The consumer also uses its desired *reader schema* (often the latest version).
4. The Avro library's core read function performs *schema resolution* using both the writer and reader schemas. This is what handles *schema evolution*, by automatically dropping old fields or applying default values for new fields.

*Crucially, this entire process is currently missing from the open-source Spark core*, leaving every user to hand-roll it, as sketched below.
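To illustrate steps 1-4, here is a minimal sketch of the per-record logic a consumer has to implement by hand today (or pull in from a third-party library). It assumes Confluent's wire format and registry client; the registry URL, subject name, and cache size are illustrative, error handling and caching are omitted, and the exact client method names vary between Confluent client versions:

```scala
import java.nio.ByteBuffer

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

val registry = new CachedSchemaRegistryClient("http://schema-registry:8081", 128)

// Reader schema: the shape we want rows to have, typically the latest registered version.
val readerSchema: Schema =
  new Schema.Parser().parse(registry.getLatestSchemaMetadata("users-value").getSchema)

def decode(confluentFramedBytes: Array[Byte]): GenericRecord = {
  val buf = ByteBuffer.wrap(confluentFramedBytes)
  require(buf.get() == 0, "unexpected magic byte")          // step 1: wire format header ...
  val schemaId = buf.getInt()                               // ... magic byte + 4-byte schema ID
  val writerSchema: Schema = registry.getById(schemaId)     // step 2: fetch the writer schema
  val payload = new Array[Byte](buf.remaining())
  buf.get(payload)
  // steps 3 + 4: Avro resolves writer against reader schema, which is what
  // provides schema evolution (defaults for new fields, removed fields dropped).
  val datumReader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  datumReader.read(null, DecoderFactory.get().binaryDecoder(payload, null))
}
```

Wrapping this in a UDF and maintaining it per pipeline is exactly the burden that native support could remove.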
Why This Is a Critical Gap:

- It forces users to rely on non-standard, and sometimes poorly maintained, third-party libraries (like the now-partially-stalled ABRiS project) or on proprietary vendor extensions (like those available in Databricks, where support is likewise only partial).
- The absence of this feature makes the out-of-the-box Kafka-to-Spark data pipeline for Avro highly brittle, non-compliant with standard Avro/Schema Registry practices, and cumbersome to maintain when schemas inevitably change.

Proposed Path Forward

Given that this is an essential and ubiquitous pattern for using Spark with Kafka, I strongly believe that *native Schema Registry integration and automatic schema resolution must become a core feature of Apache Spark*. This enhancement would not only bring Spark up to parity with standard data engineering expectations but also significantly lower the barrier to entry for building robust, schema-compliant streaming pipelines.

I encourage the community to consider dedicating resources to integrating this fundamental Avro deserialization logic into the core from_avro function - I'll be happy to take part in the effort.

Thank you for considering this proposal to make Spark an even more powerful and streamlined tool for streaming data.

Nimrod