[ https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091833#comment-17091833 ]
Pablo Langa Blanco commented on SPARK-31500:
--------------------------------------------

Hi [~ewasserman],

This comes from underlying Scala behavior: equality between arrays does not behave as one might expect.
[https://blog.bruchez.name/2013/05/scala-array-comparison-without-phd.html]

I'm going to work on finding a solution, but here is a workaround: change the definition of the case class to use Seq instead of Array, and it will work as expected.

{code:java}
case class R(id: String, value: String, bytes: Seq[Byte])
{code}

> collect_set() of BinaryType returns duplicate elements
> ------------------------------------------------------
>
>                 Key: SPARK-31500
>                 URL: https://issues.apache.org/jira/browse/SPARK-31500
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 2.4.5
>            Reporter: Eric Wasserman
>            Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct
> elements. When the column argument's type is BinaryType this is not the case.
>
> Example:
> {{import org.apache.spark.sql.functions._}}
> {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
> {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
> {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"),
> makeR("b", "fish")).toDF()}}
>
> {{// In the example below "bytesSet" erroneously has duplicates but
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as
> "bytesSet").show(truncate=false)}}
>
> {{// The same problem is displayed when using window functions.}}
> {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding,
> Window.unboundedFollowing)}}
> {{val result = df.select(}}
> collect_set('value).over(win) as "stringSet",
> collect_set('bytes).over(win) as "bytesSet"
> {{)}}
> {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize",
> size('bytesSet) as "bytesSetSize")}}
> {{.show()}}
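
For anyone who wants to see the array-equality behavior from the comment without running Spark, here is a minimal plain-Scala sketch. It does not claim to show how collect_set() is implemented. It only illustrates why a hash-based set treats two Array[Byte] values with the same contents as distinct, and why the Seq workaround avoids that. The object name and the distinctCount helper are hypothetical and exist only for this illustration.

```scala
object ArrayEqualitySketch {
  // Hypothetical helper: counts distinct elements the way a hash-based Set would,
  // i.e. using the elements' own equals/hashCode.
  def distinctCount[A](xs: Seq[A]): Int = xs.toSet.size

  def main(args: Array[String]): Unit = {
    val a: Array[Byte] = "cat".getBytes("UTF-8")
    val b: Array[Byte] = "cat".getBytes("UTF-8")

    // Arrays use reference equality, so equal contents do not make them equal.
    println(a == b)                               // false
    println(distinctCount(Seq(a, b)))             // 2

    // A content comparison has to be requested explicitly.
    println(a.sameElements(b))                    // true
    println(java.util.Arrays.equals(a, b))        // true

    // Seq compares and hashes by content, which is why the workaround helps.
    println(a.toSeq == b.toSeq)                   // true
    println(distinctCount(Seq(a.toSeq, b.toSeq))) // 1
  }
}
```

The Seq variant trades the raw array for a wrapper with content-based equals and hashCode, so duplicates collapse as expected.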