Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22938#discussion_r231762733

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
    @@ -550,15 +550,33 @@ case class JsonToStructs(
             s"Input schema ${nullableSchema.catalogString} must be a struct, an array or a map.")
         }

    -  // This converts parsed rows to the desired output by the given schema.
       @transient
    -  lazy val converter = nullableSchema match {
    -    case _: StructType =>
    -      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
    -    case _: ArrayType =>
    -      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
    -    case _: MapType =>
    -      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
    +  private lazy val castRow = nullableSchema match {
    +    case _: StructType => (row: InternalRow) => row
    +    case _: ArrayType => (row: InternalRow) =>
    +      if (row.isNullAt(0)) {
    +        new GenericArrayData(Array())
    --- End diff --

    I think this is where `from_json` differs from the JSON data source. A data source must produce data as rows, while `from_json` can return an array or a map. I think the previous behavior also makes sense: for array/map there is no corrupted column, so returning null is reasonable. Actually I prefer null over an empty array/map, but this behavior needs more discussion.
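    To make the tradeoff concrete, here is a minimal sketch (not part of the PR; the
    DataFrame, column name, and session setup are illustrative) of where the two
    behaviors diverge, assuming a Spark version (2.4+) in which `from_json` accepts
    a top-level array schema:

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.from_json
        import org.apache.spark.sql.types.{ArrayType, IntegerType}

        object FromJsonNullVsEmpty {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .master("local[1]")
              .appName("from_json-null-vs-empty")
              .getOrCreate()
            import spark.implicits._

            // One well-formed row and one malformed row.
            val df = Seq("""[1, 2, 3]""", """not json""").toDF("value")

            // With the previous behavior discussed above, the malformed row parses
            // to null; with the change under review, it would instead yield an
            // empty array.
            df.select(from_json($"value", ArrayType(IntegerType)).as("parsed"))
              .show()

            spark.stop()
          }
        }

    The open question in the comment is which output callers should see for the
    second row: `null` (no value could be parsed) or `[]` (an empty but non-null
    collection).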