[jira] [Updated] (SPARK-55787) Add built-in function to detect all-null structs

ASF GitHub Bot (Jira) Sun, 12 Apr 2026 13:20:05 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-55787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated SPARK-55787:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add built-in function to detect all-null structs
> ------------------------------------------------
>
>                 Key: SPARK-55787
>                 URL: https://issues.apache.org/jira/browse/SPARK-55787
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Joaquín Antonio De Vicente López
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Problem
> Spark currently lacks a built-in way to determine whether a struct contains 
> only NULL values, without explicitly checking every field or performing 
> expensive serialization.
> In Structured Streaming pipelines using Kafka + Avro (via ABRiS), when 
> deserialization fails and {{PermissiveRecordExceptionHandler}} is used, the 
> handler produces a struct that is *non-null but has ALL
> fields set to null*:
> {code:scala}
>   class PermissiveRecordExceptionHandler() extends 
> DeserializationExceptionHandler with Logging {
>     def handle(exception: Throwable, deserializer: AbrisAvroDeserializer, 
> readerSchema: Schema): Any = {
>       logWarning("Malformed record detected. Replacing with full null row.", 
> exception)
>       val record = new GenericData.Record(readerSchema)
>       deserializer.deserialize(record)
>     }
>   }  {code}
> Currently, the only way to detect these "empty" structs is:
> {code:scala}
>   df.filter(col("value").isNotNull && to_json(col("value")) =!= "{}"){code}
> This is problematic because:
>  * {{to_json}} forces *full JSON serialization* of every row (Jackson 
> allocation + byte array + UTF8String)
>  * It is *not type-aware* — it compares against a magic string {{"{}"}}
>  * It creates significant GC pressure in high-throughput streaming (100K+ 
> msg/sec)
>  * It breaks whole-stage codegen ({{{}StructsToJson{}}} uses 
> {{RuntimeReplaceable}} → falls back to {{{}Invoke{}}})
> In streaming ingestion pipelines, if these all-null structs are not filtered 
> out, they may propagate downstream to sinks such as Apache Kudu. When struct 
> fields correspond to primary key columns, this can cause runtime failures due 
> to NULL values in primary keys, aborting the streaming query. Preventing such 
> rows in an efficient and schema-agnostic way is therefore not only a 
> performance concern but also a correctness and reliability requirement.
> This functionality is currently not expressible in a schema-agnostic and 
> optimizer-friendly way using existing built-in functions.
> h3. Proposed Solution
> Add two new built-in SQL functions:
> {code:sql}
>   is_struct_empty(struct)     -- TRUE if non-null struct has all null fields
>   is_struct_non_empty(struct) -- TRUE if non-null struct has at least one 
> non-null field{code}
> *Usage:*
> {code:scala}
>   // New — zero-cost null-bitmap check
>   df.filter(col("value").isNotNull && is_struct_non_empty(col("value")))
>   // SQL syntax
>   SELECT * FROM kafka_messages WHERE value IS NOT NULL AND 
> is_struct_non_empty(value)  {code}
> h3. Semantics
> ||Input State||is_struct_empty||is_struct_non_empty||
> |NULL struct|NULL|NULL|
> |Non-null, all fields null|TRUE|FALSE|
> |Non-null, at least one field non-null|FALSE|TRUE|
> |Non-null, zero fields (empty schema)|TRUE|FALSE|
>  * Follows SQL three-valued logic: NULL input → NULL output
>  * Shallow check only: a nested struct that is itself non-null (even if its 
> children are all null) counts as non-null at the parent level
>  * Non-null empty arrays/maps inside the struct count as non-null (they are 
> values, not null)
> h3. Performance Comparison
> ||Dimension||to_json =!= "{}"||is_struct_non_empty||
> |Object allocation|JacksonGenerator, StringWriter, JsonGenerator, UTF8String, 
> byte[] per row|*Zero*|
> |Serialization|Full JSON of ALL fields|*None* — bitwise null-check only|
> |Short-circuit|No — must serialize entire struct|Yes — stops at first 
> non-null field|
> |CPU cost|O(fields × type_dispatch + string_compare)|O(1) best case, 
> O(fields) worst case|
> |GC pressure|High|*None*|
> |Codegen|Falls back to Invoke (no tight codegen)|Full whole-stage codegen, 
> inlined null checks|
> h3. References
>  * ABRiS (Avro Bridge for Spark): [https://github.com/AbsaOSS/ABRiS]
>  * Similar concepts in other engines:
>  ** BigQuery: struct comparison semantics handle NULL fields natively
>  ** DuckDB: struct null-awareness in predicates



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-55787) Add built-in function to detect all-null structs

Reply via email to