[ https://issues.apache.org/jira/browse/SPARK-42406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-42406: ------------------------------------ Assignee: Apache Spark (was: Raghu Angadi) > [PROTOBUF] Recursive field handling is incompatible with delta > -------------------------------------------------------------- > > Key: SPARK-42406 > URL: https://issues.apache.org/jira/browse/SPARK-42406 > Project: Spark > Issue Type: Bug > Components: Protobuf > Affects Versions: 3.4.0 > Reporter: Raghu Angadi > Assignee: Apache Spark > Priority: Major > Fix For: 3.4.0 > > > Protobuf deserializer (`from_protobuf()` function()) optionally supports > recursive fields by limiting the depth to certain level. See example below. > It assigns a 'NullType' for such a field when allowed depth is reached. > It causes a few issues. E.g. a repeated field as in the following example > results in a Array field with 'NullType'. Delta does not support null type in > a complex type. > Actually `Array[NullType]` is not really useful anyway. > How about this fix: Drop the recursive field when the limit reached rather > than using a NullType. > The example below makes it clear: > Consider a recursive Protobuf: > > {code:python} > message TreeNode { > string value = 1; > repeated TreeNode children = 2; > } > {code} > Allow depth of 2: > > {code:python} > df.select( > 'proto', > messageName = 'TreeNode', > options = { ... "recursive.fields.max.depth" : "2" } > ).printSchema() > {code} > Schema looks like this: > {noformat} > root > |– from_protobuf(proto): struct (nullable = true)| > | |– value: string (nullable = true)| > | |– children: array (nullable = false)| > | | |– element: struct (containsNull = false)| > | | | |– value: string (nullable = true)| > | | | |– children: array (nullable = false)| > | | | | |– element: struct (containsNull = false)| > | | | | | |– value: string (nullable = true)| > | | | | | |– children: array (nullable = false). [ === Proposed fix: Drop > this field === ]| > | | | | | | |– element: void (containsNull = false) [ === NOTICE 'void' HERE > === ] > {noformat} > When we try to write this to a delta table, we get an error: > {noformat} > AnalysisException: Found nested NullType in column > from_protobuf(proto).children which is of ArrayType. Delta doesn't support > writing NullType in complex types. > {noformat} > > We could just drop the field 'element' when recursion depth is reached. It is > simpler and does not need to deal with NullType. We are ignoring the value > anyway. There is no use in keeping the field. > Another issue is setting for 'recursive.fields.max.depth': It is not enforced > correctly. '0' does not make sense. > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org