Github user vofque commented on the issue:

    https://github.com/apache/spark/pull/22708
  
    The original problem is described here: https://issues.apache.org/jira/browse/SPARK-21402
    
    I'll try to explain what happens in detail.
    
    Let's consider this data structure:
    ```
    root
     |-- intervals: array
     |    |-- element: struct
     |    |    |-- startTime: long
     |    |    |-- endTime: long
    ```
    
    And let's say we have a Java bean class with the corresponding structure.
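
    For reference, here is a minimal sketch of such bean classes (the names _Interval_ and _IntervalContainer_ are assumed for illustration; Scala's _@BeanProperty_ is used for brevity, the original being plain Java beans of the same shape):
    ```
    import scala.beans.BeanProperty

    // assumed bean classes matching the schema above
    class Interval {
      @BeanProperty var startTime: Long = _
      @BeanProperty var endTime: Long = _
    }

    class IntervalContainer {
      @BeanProperty var intervals: java.util.List[Interval] = _
    }
    ```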
    
    When building a deserializer for the field _intervals_ in _JavaTypeInference.deserializerFor_, we construct a _MapObjects_ expression to convert the structs to Java beans:
    ```
    case c if listType.isAssignableFrom(typeToken) =>
      val et = elementType(typeToken)
      // element-wise conversion: each struct is deserialized into a bean;
      // the element DataType is inferred from the Java type by inferDataType
      MapObjects(
        p => deserializerFor(et, Some(p)),
        getPath,
        inferDataType(et)._1,
        customCollectionCls = Some(c))
    ```
    
    _MapObjects_ requires the _DataType_ of the array elements. It is extracted from the Java element type by _JavaTypeInference.inferDataType_, which reads the Java bean properties and maps them to _StructField_s:
    ```
    case other =>
      // some more code goes here
      val properties = getJavaBeanReadableProperties(other)
      val fields = properties.map { property =>
        val returnType = typeToken.method(property.getReadMethod).getReturnType
        val (dataType, nullable) = inferDataType(returnType, seenTypeSet + other)
        new StructField(property.getName, dataType, nullable)
      }
    ```
    
    The order of the properties in the resulting _StructType_ may not correspond to their declaration order, since the declaration order is simply not available through bean introspection. So the resulting element _StructType_ may look like this:
    ```
    root
     |-- endTime: long
     |-- startTime: long
    ```
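
    This can be observed directly with _java.beans.Introspector_, on which _getJavaBeanReadableProperties_ is based (a quick sketch; Spark additionally filters out the synthetic _class_ property):
    ```
    import java.beans.Introspector

    val props = Introspector.getBeanInfo(classOf[Interval]).getPropertyDescriptors
    println(props.map(_.getName).mkString(", "))
    // prints "class, endTime, startTime" -- alphabetical, not declaration order
    ```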
    
    This _StructType_ is passed to _MapObjects_ and then to its loop variable _LambdaVariable_.
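
    Built by hand with the public types API, the element type that _MapObjects_ (and hence the _LambdaVariable_) ends up with is effectively:
    ```
    import org.apache.spark.sql.types._

    // alphabetical field order, unlike the data, which has startTime first
    val elementType = StructType(Seq(
      StructField("endTime", LongType, nullable = false),
      StructField("startTime", LongType, nullable = false)))
    ```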
    
    To deserialize the individual array elements, an _InitializeJavaBean_ expression is created. It contains an _UnresolvedExtractValue_ expression for each field, each having the _LambdaVariable_ as its child. These are resolved during analysis:
    ```
    case UnresolvedExtractValue(child, fieldName) if child.resolved =>
      ExtractValue(child, fieldName, resolver)
    ```  
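
    The ordinal computation can be approximated with _StructType.fieldIndex_ (a simplification of what _ExtractValue_ does with the resolver):
    ```
    // using the hand-built elementType from above
    elementType.fieldIndex("endTime")    // 0
    elementType.fieldIndex("startTime")  // 1, although the data has startTime at 0
    ```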
    
    For each field (_startTime_ and _endTime_) an ordinal is calculated. For that, the child's _DataType_ is used, and in our case this is the _StructType_ of the _LambdaVariable_ with the incorrect field order.
    As a result we get _GetStructField_ expressions with ordinal 0 for _endTime_ and ordinal 1 for _startTime_, so the ordinals no longer match the actual layout of the data.
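
    A hypothetical end-to-end reproduction of the symptom (input path and class names are assumed): with these ordinals the two field values come back swapped on read.
    ```
    import org.apache.spark.sql.{Encoders, SparkSession}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // hypothetical input file with the schema shown at the top
    val df = spark.read.parquet("/path/to/intervals")
    val ds = df.as(Encoders.bean(classOf[IntervalContainer]))
    val first = ds.head().getIntervals.get(0)
    // before the fix, first.getStartTime returns the stored endTime value
    ```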

