koert kuipers created SPARK-32109:
-------------------------------------

             Summary: SQL hash function handling of nulls makes collision too likely
                 Key: SPARK-32109
                 URL: https://issues.apache.org/jira/browse/SPARK-32109
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: koert kuipers

This ticket is about org.apache.spark.sql.functions.hash and Spark's handling of nulls when hashing sequences.

{code:java}
scala> spark.sql("SELECT hash('bar', null)").show()
+---------------+
|hash(bar, NULL)|
+---------------+
|    -1808790533|
+---------------+

scala> spark.sql("SELECT hash(null, 'bar')").show()
+---------------+
|hash(NULL, bar)|
+---------------+
|    -1808790533|
+---------------+
{code}

These are different sequences. For example, these could be positions 0 and 1 in a DataFrame, which are different columns with entirely different meanings. The hashes should not be the same.

Another example:

{code:java}
scala> Seq(("john", null), (null, "john")).toDF("name", "alias").withColumn("hash", hash(col("name"), col("alias"))).show
+----+-----+---------+
|name|alias|     hash|
+----+-----+---------+
|john| null|487839701|
|null| john|487839701|
+----+-----+---------+
{code}
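
A minimal sketch of one way a collision like this can arise, in plain Scala without Spark. It assumes the row hash is a left fold over the columns that starts from a seed (42 here) and that a null column leaves the running hash unchanged. That is an assumption about the implementation, not something shown in this ticket. The names `NullHashSketch`, `hashColumn`, `hashSkippingNulls`, `hashMixingNulls` and `NullMarker` are hypothetical, and `scala.util.hashing.MurmurHash3` only stands in for Spark's own hash, so the numbers will not match the output above.

```scala
import scala.util.hashing.MurmurHash3

object NullHashSketch {
  // Hypothetical marker mixed in for a null column (any fixed constant would do).
  val NullMarker: Int = 0x5bd1e995

  // Stand-in per-column hash; NOT Spark's actual implementation.
  def hashColumn(value: String, seed: Int): Int =
    MurmurHash3.stringHash(value, seed)

  // Assumed behaviour: each column is hashed with the running hash as its seed,
  // and a null column simply passes the running hash through unchanged.
  def hashSkippingNulls(row: Seq[String], seed: Int = 42): Int =
    row.foldLeft(seed) { (acc, v) =>
      if (v == null) acc else hashColumn(v, acc)
    }

  // Hypothetical alternative: a null column also perturbs the running hash,
  // so the position of the null influences the final value.
  def hashMixingNulls(row: Seq[String], seed: Int = 42): Int =
    row.foldLeft(seed) { (acc, v) =>
      if (v == null) MurmurHash3.mix(acc, NullMarker) else hashColumn(v, acc)
    }

  def main(args: Array[String]): Unit = {
    val a = Seq("john", null)
    val b = Seq(null, "john")
    // Both rows reduce to hashColumn("john", 42), so this prints true.
    println(hashSkippingNulls(a) == hashSkippingNulls(b))
    // Here the null is mixed in before or after "john", depending on its position.
    println(hashMixingNulls(a) == hashMixingNulls(b))
  }
}
```

Under that assumption, any two rows with the same non-null values in the same relative order hash identically, no matter where the nulls sit. The second variant is only an illustration of making the null position count, not a proposed patch.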