[ https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125710#comment-15125710 ]
Low Chin Wei commented on SPARK-6319:
-------------------------------------

Hi guys, is there any resolution for grouping by a binary column in Spark?

> Should throw analysis exception when using binary type in groupby/join
> ----------------------------------------------------------------------
>
>                 Key: SPARK-6319
>                 URL: https://issues.apache.org/jira/browse/SPARK-6319
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>            Reporter: Cheng Lian
>            Assignee: Liang-Chi Hsieh
>            Priority: Critical
>             Fix For: 1.5.0
>
>
> Spark shell session for reproduction:
> {noformat}
> scala> import sqlContext.implicits._
> scala> import org.apache.spark.sql.types._
> scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
> ...
> CAST(c, BinaryType)
> [B@43f13160
> [B@5018b648
> [B@3be22500
> [B@476fc8a1
> {noformat}
> Spark SQL uses plain byte arrays to represent binary values. However, arrays
> are compared by reference rather than by value. On the other hand, the
> DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check
> for duplicated values. These two facts together cause the problem.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
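Since Spark SQL represents BinaryType values as plain JVM byte arrays, the reference-equality behavior described in the issue can be reproduced directly on the JVM, outside Spark. A minimal Java sketch (illustrative only, not Spark's actual code path):

```java
import java.util.Arrays;
import java.util.HashSet;

public class BinaryEqualityDemo {
    public static void main(String[] args) {
        // Two arrays with identical contents but distinct identities.
        byte[] a = "1".getBytes();
        byte[] b = "1".getBytes();

        // Arrays inherit Object.equals, so they compare by reference:
        System.out.println(a.equals(b));          // false
        // Content comparison requires Arrays.equals:
        System.out.println(Arrays.equals(a, b));  // true

        // A HashSet of byte[] therefore treats equal contents as distinct,
        // which is why DISTINCT over binary values yields duplicates.
        HashSet<byte[]> set = new HashSet<>();
        set.add(a);
        System.out.println(set.contains(b));      // false
    }
}
```

This is why every row in the reproduction above survives `distinct`: each `[B@...` is a separate array object with its own identity hash code.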