[GitHub] spark pull request #21073: [SPARK-23936][SQL] Implement map_concat

bersprockets Tue, 01 May 2018 21:52:16 -0700

Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21073#discussion_r185392954
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
    @@ -116,6 +117,161 @@ case class MapValues(child: Expression)
       override def prettyName: String = "map_values"
     }
     
    +/**
    + * Returns the union of all the given maps.
    + */
    +@ExpressionDescription(
    +usage = "_FUNC_(map, ...) - Returns the union of all the given maps",
    +examples = """
    +    Examples:
    +      > SELECT _FUNC_(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'));
    +       [[1 -> "a"], [2 -> "c"], [3 -> "d"]
    +  """, since = "2.4.0")
    +case class MapConcat(children: Seq[Expression]) extends Expression {
    +
    +  override def checkInputDataTypes(): TypeCheckResult = {
    +    // check key types and value types separately to allow 
valueContainsNull to vary
    +    if (children.exists(!_.dataType.isInstanceOf[MapType])) {
    +      TypeCheckResult.TypeCheckFailure(
    +        s"The given input of function $prettyName should all be of type 
map, " +
    +          "but they are " + 
children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
    +    } else if 
(children.map(_.dataType.asInstanceOf[MapType].keyType).distinct.length > 1) {
    +      TypeCheckResult.TypeCheckFailure(
    +        s"The given input maps of function $prettyName should all be the 
same type, " +
    +          "but they are " + 
children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
    +    } else if 
(children.map(_.dataType.asInstanceOf[MapType].valueType).distinct.length > 1) {
    +      TypeCheckResult.TypeCheckFailure(
    +        s"The given input maps of function $prettyName should all be the 
same type, " +
    +          "but they are " + 
children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
    +    } else {
    +      TypeCheckResult.TypeCheckSuccess
    +    }
    +  }
    +
    +  override def dataType: MapType = {
    +    MapType(
    +      keyType = children.headOption
    +        
.map(_.dataType.asInstanceOf[MapType].keyType).getOrElse(StringType),
    +      valueType = children.headOption
    +        
.map(_.dataType.asInstanceOf[MapType].valueType).getOrElse(StringType),
    +      valueContainsNull = children.map { c =>
    +        c.dataType.asInstanceOf[MapType]
    +      }.exists(_.valueContainsNull)
    +    )
    +  }
    +
    +  override def nullable: Boolean = children.exists(_.nullable)
    +
    +  override def eval(input: InternalRow): Any = {
    +    val union = new util.LinkedHashMap[Any, Any]()
    +    children.map(_.eval(input)).foreach { raw =>
    +      if (raw == null) {
    +        return null
    +      }
    +      val map = raw.asInstanceOf[MapData]
    +      map.foreach(dataType.keyType, dataType.valueType, (k, v) =>
    +        union.put(k, v)
    +      )
    --- End diff --
    
    I found an issue. I was preparing to add some more tests when I noticed 
that using maps as keys doesn't work well in interpreted mode (seems to work 
fine in codegen mode, so far).
    
    So, something like this doesn't work in interpreted mode:
    <pre>
    scala> dfmapmap.show(truncate=false)
    
+--------------------------------------------------+---------------------------------------------+
    |mapmap1                                           |mapmap2                 
                     |
    
+--------------------------------------------------+---------------------------------------------+
    |[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102]|[[11 -> 12] -> 103, [1 
-> 2, 3 -> 4] -> 1001]|
    
+--------------------------------------------------+---------------------------------------------+
    scala> dfmapmap.select(map_concat('mapmap1, 
'mapmap2).as('mapmap3)).show(truncate=false)
    
+-----------------------------------------------------------------------------------------------+
    |mapmap3                                                                    
                    |
    
+-----------------------------------------------------------------------------------------------+
    |[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102, [11 -> 12] -> 103, [1 
-> 2, 3 -> 4] -> 1001]|
    
+-----------------------------------------------------------------------------------------------+
    </pre>
    As you can see, the key `[1 -> 2, 3 -> 4]` shows up twice in the new map.
    
    This is because:
    <pre>
      val a1 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new 
GenericArrayData(Array(2, 4)))
      val a2 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new 
GenericArrayData(Array(2, 4)))
      a1 == a2 // will be false
      a1.hashCode() == a2.hashCode() // will be false
    </pre>
    Different instances of ArrayBasedMapData with the exact same data are not 
considered the same key.
    
    So far, seems to work fine for instances of UnsafeMapData (that is, the 
UnsafeMapData version of the above maps are considered the same key).
    
    I might need to add equals and/or hashCode methods to ArrayBasedMapData. 
Either that, or wrap any ArrayBasedMapData with some other object (only during 
its time as a key in the union Map) that does implement equals and/or hashCode 
properly.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21073: [SPARK-23936][SQL] Implement map_concat

Reply via email to