Raghu Angadi created SPARK-42406: ------------------------------------ Summary: [PROTOBUF] Recursive field handling is incompatible with delta Key: SPARK-42406 URL: https://issues.apache.org/jira/browse/SPARK-42406 Project: Spark Issue Type: Bug Components: Protobuf Affects Versions: 3.4.0 Reporter: Raghu Angadi Fix For: 3.4.1
Protobuf deserializer (`from_protobuf()` function()) optionally supports recursive fields by limiting the depth to certain level. See example below. It assigns a 'NullType' for such a field when allowed depth is reached. It causes a few issues. E.g. a repeated field as in the following example results in a Array field with 'NullType'. Delta does not support null type in a complex type. Actually `Array[NullType]` is not really useful anyway. How about this fix: Drop the recursive field when the limit reached rather than using a NullType. The example below makes it clear: Consider a recursive Protobuf: ``` message TreeNode { string value = 1; repeated TreeNode children = 2; } ``` Allow depth of 2: ```python df.select( 'proto', messageName = 'TreeNode', options = { ... "recursive.fields.max.depth" : "2" } ).printSchema() ``` Schema looks like this: ``` root |-- from_protobuf(proto): struct (nullable = true) | |-- value: string (nullable = true) | |-- children: array (nullable = false) | | |-- element: struct (containsNull = false) | | | |-- value: string (nullable = true) | | | |-- children: array (nullable = false) | | | | |-- element: struct (containsNull = false) | | | | | |-- value: string (nullable = true) | | | | | |-- children: array (nullable = false). [ === Proposed fix: Drop this field === ] | | | | | | |-- element: void (containsNull = false) [ === NOTICE 'void' HERE === ] ``` When we try to write this to a delta table, we get an error: ``` AnalysisException: Found nested NullType in column from_protobuf(proto).children which is of ArrayType. Delta doesn't support writing NullType in complex types. ``` We could just drop the field 'element' when recursion depth is reached. It is simpler and does not need to deal with NullType. We are ignoring the value anyway. There is no use in keeping the field. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org