Saurabh Chawla created SPARK-36452:
--------------------------------------

             Summary: Add support in Spark for grouping by a map-type column in the scenarios that work in Hive
                 Key: SPARK-36452
                 URL: https://issues.apache.org/jira/browse/SPARK-36452
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.2, 3.0.3, 3.2.0
            Reporter: Saurabh Chawla
Add support in Spark for grouping by a map-type column in the scenarios that work in Hive. In Hive, the following scenario works:

{code:java}
describe extended complex2;
OK
id                    string
c1                    map<int,string>

Detailed Table Information    Table(tableName:complex2, dbName:default, owner:abc, createTime:1627994412, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:c1, type:map<int,string>, comment:null)], location:/user/hive/warehouse/complex2, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1})

select * from complex2;
OK
1    {1:"u"}
2    {1:"u",2:"uo"}
1    {1:"u",2:"uo"}
Time taken: 0.363 seconds, Fetched: 3 row(s)

select id, c1, count(*) from complex2 group by id, c1;
OK
1    {1:"u"}    1
1    {1:"u",2:"uo"}    1
2    {1:"u",2:"uo"}    1
Time taken: 1.621 seconds, Fetched: 3 row(s)

-- fails only when the map type appears inside an aggregate expression
select id, max(c1), count(*) from complex2 group by id, c1;
FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or complex type containing map<>.
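Hive can group by a map column because grouping only needs a deterministic, order-insensitive notion of equality on map values, not a full ordering (which is why `max(c1)` still fails). A minimal sketch of that idea outside of Spark or Hive, in plain Python with a hypothetical `canonical_key` helper and the same three rows as above:

```python
from collections import Counter

def canonical_key(m):
    # Map entries have no inherent order, so sort them to get a
    # deterministic, hashable grouping key (a tuple; dicts are unhashable).
    return tuple(sorted(m.items()))

rows = [
    ("1", {1: "u"}),
    ("2", {1: "u", 2: "uo"}),
    ("1", {1: "u", 2: "uo"}),
]

# Group by (id, c1) and count, mirroring
#   SELECT id, c1, count(*) FROM complex2 GROUP BY id, c1
counts = Counter((rid, canonical_key(c1)) for rid, c1 in rows)
for (rid, key), n in sorted(counts.items()):
    print(rid, dict(key), n)
```

Note that `canonical_key` treats maps with the same entries as equal regardless of insertion order, which is exactly the equality semantics a group-by key needs; no ordering of whole maps is ever required.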
{code}

But in Spark, this scenario fails even when the map column is only grouped on and selected, without any aggregation over it:

{code:java}
scala> spark.sql("select id, c1, count(*) from complex2 group by id, c1").show
org.apache.spark.sql.AnalysisException: expression spark_catalog.default.complex2.`c1` cannot be used as a grouping expression because its data type map<int,string> is not an orderable data type.;
Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
+- SubqueryAlias spark_catalog.default.complex2
   +- HiveTableRelation [`default`.`complex2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2], Partition Cols: []]

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
{code}

We need to support this scenario: a grouping expression should be allowed to have a map type as long as no aggregate expression references that map column. This helps users migrate from Hive to Spark.

After the code change:

{code:java}
scala> spark.sql("select id, c1, count(*) from complex2 group by id, c1").show
+---+-----------------+--------+
| id|               c1|count(1)|
+---+-----------------+--------+
|  1|         {1 -> u}|       1|
|  2|{1 -> u, 2 -> uo}|       1|
|  1|{1 -> u, 2 -> uo}|       1|
+---+-----------------+--------+
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org