[ https://issues.apache.org/jira/browse/SPARK-32109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147358#comment-17147358 ]
koert kuipers commented on SPARK-32109:
---------------------------------------

the issue is that Row here isn't really a sequence. it represents an object. if you have, say, an object Person(name: String, nickname: String), you would not want Person("john", null) and Person(null, "john") to have the same hashCode. see for example the suggested hashCode implementations in Effective Java by Joshua Bloch. they do something similar to what you suggest to solve this problem. so unfortunately i think our current implementation is flawed :(

PS even for pure sequences i do not think this implementation, as it is right now, is acceptable. but that is less of a worry than the object representation of Row.

> SQL hash function handling of nulls makes collision too likely
> --------------------------------------------------------------
>
>                 Key: SPARK-32109
>                 URL: https://issues.apache.org/jira/browse/SPARK-32109
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: koert kuipers
>            Priority: Minor
>
> this ticket is about org.apache.spark.sql.functions.hash and spark's handling
> of nulls when hashing sequences.
> {code:java}
> scala> spark.sql("SELECT hash('bar', null)").show()
> +---------------+
> |hash(bar, NULL)|
> +---------------+
> |    -1808790533|
> +---------------+
> scala> spark.sql("SELECT hash(null, 'bar')").show()
> +---------------+
> |hash(NULL, bar)|
> +---------------+
> |    -1808790533|
> +---------------+
> {code}
> these are different sequences. e.g. these could be positions 0 and 1 in a
> dataframe, which are different columns with entirely different meanings. the
> hashes should not be the same.
> another example:
> {code:java}
> scala> Seq(("john", null), (null, "john")).toDF("name", "alias").withColumn("hash", hash(col("name"), col("alias"))).show
> +----+-----+---------+
> |name|alias|     hash|
> +----+-----+---------+
> |john| null|487839701|
> |null| john|487839701|
> +----+-----+---------+
> {code}
> instead of ignoring nulls, each null should apply a transform to the hash so that
> the order of elements, including the nulls, matters for the outcome.
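to make the difference concrete, here is a minimal sketch in plain scala (no spark needed). it is not spark's actual implementation: the object name NullAwareHash, the seed, the multiplier 31 and the null marker are all assumptions chosen for illustration. ignoringNulls mimics the behaviour reported in the ticket, where a null leaves the running hash untouched. mixingNulls follows the general idea from Effective Java, where every field, null or not, transforms the running hash.

```scala
// Hypothetical illustration only; names and constants are assumptions,
// and this is not how Spark computes org.apache.spark.sql.functions.hash.
object NullAwareHash {
  private val Seed = 42       // arbitrary starting value for the running hash
  private val NullMarker = 0  // value mixed in for a null field (an assumption)

  // Mimics the reported behaviour: a null field is skipped, so its
  // position has no effect on the result.
  def ignoringNulls(fields: Seq[Any]): Int =
    fields.foldLeft(Seed) { (h, f) =>
      if (f == null) h else 31 * h + f.hashCode
    }

  // Effective Java style combination: every position updates the hash,
  // so moving a null to a different position changes the result.
  def mixingNulls(fields: Seq[Any]): Int =
    fields.foldLeft(Seed) { (h, f) =>
      31 * h + (if (f == null) NullMarker else f.hashCode)
    }
}
```

with these definitions, NullAwareHash.ignoringNulls gives the same value for Seq("john", null) and Seq(null, "john"), while NullAwareHash.mixingNulls gives two different values, because the multiplication by 31 is applied at the null position as well.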