[ https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362614#comment-14362614 ]
Yin Huai commented on SPARK-6319:
---------------------------------

I believe that our aggregation and join will also be affected. I am changing the Priority to Blocker.

> DISTINCT doesn't work for binary type
> -------------------------------------
>
>                 Key: SPARK-6319
>                 URL: https://issues.apache.org/jira/browse/SPARK-6319
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1
>            Reporter: Cheng Lian
>            Priority: Blocker
>
> Spark shell session for reproduction:
> {noformat}
> scala> import sqlContext.implicits._
> scala> import org.apache.spark.sql.types._
> scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
> ...
> CAST(c, BinaryType)
> [B@43f13160
> [B@5018b648
> [B@3be22500
> [B@476fc8a1
> {noformat}
> Spark SQL uses plain byte arrays to represent binary values. However, arrays are compared by reference rather than by value, while the DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check for duplicate values. These two facts together cause the problem.
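
To make the root cause concrete, here is a minimal sketch in plain Scala (outside Spark). The variable names and the {{HashSet}} are illustrative stand-ins for what the DISTINCT operator does internally, not Spark's actual implementation:

{noformat}
import scala.collection.mutable

// Two byte arrays with identical contents.
val a: Array[Byte] = "1".getBytes("UTF-8")
val b: Array[Byte] = "1".getBytes("UTF-8")

// Arrays inherit reference-based equals/hashCode from java.lang.Object,
// so they compare unequal even though their contents match.
println(a == b)             // false: reference comparison
println(a.sameElements(b))  // true: element-wise comparison

// A HashSet keyed on the raw arrays therefore cannot detect duplicates,
// which is why every row survives the DISTINCT in the reproduction above.
val refSet = mutable.HashSet[Array[Byte]](a)
println(refSet.contains(b)) // false

// Wrapping the bytes in a Seq restores value-based equality and hashing.
val valueSet = mutable.HashSet[Seq[Byte]](a.toSeq)
println(valueSet.contains(b.toSeq)) // true
{noformat}

The same reference-based hashing would affect any hash-based operator over binary values, which is consistent with the concern above about aggregation and join.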