Hello Spark Developers,

I'd like to draw attention to a significant limitation in the current
*open-source from_avro implementation* in Apache Spark SQL, specifically
its integration with the common *Kafka + Avro* ecosystem.

The current design is "naive" in the sense that it requires a manually
supplied, static schema, and so falls short of the most basic and
prevalent streaming scenario: reading an *Avro-encoded Kafka topic with
schema evolution*.

The Core Problem: Missing "Automatic" Schema Resolution

When an Avro record is paired with a *Schema Registry* (such as
Confluent's), the standard consumption procedure is:

   1. The record bytes carry a *Schema ID* header.

   2. The consumer (Spark) uses this ID to fetch the corresponding *writer
   schema* from the registry.

   3. The consumer also has its desired *reader schema* (often the latest
   registered version).

   4. The Avro library's core read path performs *schema resolution* using
   both the writer and reader schemas. This is what handles *schema
   evolution*: removed fields are dropped and default values are applied
   for newly added fields, automatically (see the sketch after this list).
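To make step 4 concrete, here is a minimal sketch of that resolution step
using the plain Avro Java library, outside Spark. The wire-format layout
(one magic byte followed by a 4-byte schema ID) is the standard Confluent
one; fetchWriterSchema stands in for a Schema Registry client lookup and
is simply passed in as a function here.

    import java.nio.ByteBuffer
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.io.DecoderFactory

    def decode(kafkaValue: Array[Byte],
               readerSchema: Schema,
               fetchWriterSchema: Int => Schema): GenericRecord = {
      val buffer = ByteBuffer.wrap(kafkaValue)
      require(buffer.get() == 0, "unexpected magic byte") // step 1: header
      val schemaId = buffer.getInt()                      // step 1: 4-byte schema ID
      val writerSchema = fetchWriterSchema(schemaId)      // step 2: writer schema

      // Steps 3 + 4: Avro resolves the writer schema against the reader
      // schema, dropping removed fields and filling in defaults for newly
      // added ones.
      val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
      val decoder = DecoderFactory.get().binaryDecoder(
        kafkaValue, buffer.position(), kafkaValue.length - buffer.position(), null)
      reader.read(null, decoder)
    }

This is essentially what Confluent's own deserializers do per record
(with schema caching), and it is the logic from_avro would need in order
to support the scenario above.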

*Crucially, this entire process is currently missing from the open-source
Spark core.*

Why This Is a Critical Gap:


   - It forces users to rely on non-standard, and sometimes poorly
   maintained, third-party libraries (such as the now-partially-stalled
   ABRiS project) or on proprietary vendor extensions (such as those
   available in Databricks, where support is also only partial).

   - Without it, the out-of-the-box Kafka-to-Spark data pipeline for Avro
   is highly brittle, non-compliant with standard Avro/Schema Registry
   practices, and cumbersome to maintain when schemas inevitably change.

Proposed Path Forward

Given that this is an essential and ubiquitous pattern for using Spark with
Kafka, I strongly believe that *native Schema Registry integration and
automatic schema resolution must become a core feature of Apache Spark*.

This enhancement would not only bring Spark up to parity with standard data
engineering expectations but also significantly lower the barrier to entry
for building robust, schema-compliant streaming pipelines.
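As a purely illustrative strawman (none of the option names below exist in
Spark today, and kafkaDf stands for the Kafka source DataFrame from the
earlier sketch), the user-facing API could stay very close to the current
from_avro, with the registry coordinates passed as options and the
per-record writer schema resolved automatically:

    // Hypothetical API sketch only; these options are not an existing
    // Spark API and the overload taking only an options map does not
    // exist today.
    val users = kafkaDf.select(
      from_avro(
        col("value"),
        Map(
          "schema.registry.url"     -> "https://registry.example.com", // illustrative
          "schema.registry.subject" -> "users-value",                  // illustrative
          "reader.schema.version"   -> "latest"                        // illustrative
        )
      ).as("user")
    )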

I encourage the community to consider dedicating resources to integrating
this fundamental Avro deserialization logic into the core from_avro
function; I'd be happy to take part in that effort.

Thank you for considering this proposal to make Spark an even more powerful
and streamlined tool for streaming data.

Nimrod
