Raghu Angadi created SPARK-42406:
------------------------------------

             Summary: [PROTOBUF] Recursive field handling is incompatible with 
delta
                 Key: SPARK-42406
                 URL: https://issues.apache.org/jira/browse/SPARK-42406
             Project: Spark
          Issue Type: Bug
          Components: Protobuf
    Affects Versions: 3.4.0
            Reporter: Raghu Angadi
             Fix For: 3.4.1


Protobuf deserializer (`from_protobuf()` function()) optionally supports 
recursive fields by limiting the depth to certain level. See example below. It 
assigns a 'NullType' for such a field when allowed depth is reached. 

It causes a few issues. E.g. a repeated field as in the following example 
results in a Array field with 'NullType'. Delta does not support null type in a 
complex type.

Actually `Array[NullType]` is not really useful anyway.

How about this fix: Drop the recursive field when the limit reached rather than 
using a NullType. 

The example below makes it clear:

Consider a recursive Protobuf:
```
message TreeNode {
  string value = 1;
  repeated TreeNode children = 2;
}
```

Allow depth of 2: 

```python
   df.select(
    'proto',
     messageName = 'TreeNode',
    options = { ... "recursive.fields.max.depth" : "2" }
  ).printSchema()
```
Schema looks like this:
```
root
 |-- from_protobuf(proto): struct (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- children: array (nullable = false)
 |    |    |-- element: struct (containsNull = false)
 |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- children: array (nullable = false)
 |    |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |    |-- children: array (nullable = false). [ === 
Proposed fix: Drop this field === ] 
 |    |    |    |    |    |    |-- element: void (containsNull = false) [ === 
NOTICE 'void' HERE === ] 
```

When we try to write this to a delta table, we get an error:
```
AnalysisException:  Found nested NullType in column 
from_protobuf(proto).children which is of ArrayType. Delta doesn't support 
writing NullType in complex types.
```
 
We could just drop the field 'element' when recursion depth is reached. It is 
simpler and does not need to deal with NullType. We are ignoring the value 
anyway. There is no use in keeping the field. 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to