[ https://issues.apache.org/jira/browse/SPARK-32109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
koert kuipers updated SPARK-32109:
----------------------------------
    Description: 

this ticket is about org.apache.spark.sql.functions.hash and Spark's handling of nulls when hashing sequences.

{code:java}
scala> spark.sql("SELECT hash('bar', null)").show()
+---------------+
|hash(bar, NULL)|
+---------------+
|    -1808790533|
+---------------+

scala> spark.sql("SELECT hash(null, 'bar')").show()
+---------------+
|hash(NULL, bar)|
+---------------+
|    -1808790533|
+---------------+
{code}

these are different sequences. e.g. these could be positions 0 and 1 in a dataframe, which are different columns with entirely different meanings. the hashes should not be the same.

another example:

{code:java}
scala> Seq(("john", null), (null, "john")).toDF("name", "alias").withColumn("hash", hash(col("name"), col("alias"))).show
+----+-----+---------+
|name|alias|     hash|
+----+-----+---------+
|john| null|487839701|
|null| john|487839701|
+----+-----+---------+
{code}

instead of ignoring nulls, each null should apply a transform to the running hash, so that the order of elements, including the nulls, matters for the outcome.
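the behavior described above can be sketched outside Spark. below is a minimal illustration, not Spark's actual Murmur3 implementation: `mix`, `hashSkippingNulls`, and `hashMixingNulls` are hypothetical stand-ins, showing why a fold that leaves the running hash untouched for null collapses different sequences to the same value, and how mixing in a fixed marker for null keeps element order relevant:

```scala
// Sketch only: `mix` is a stand-in mixing step, not Spark's Murmur3.
object NullHashSketch {
  // simple Murmur3-style mixing step (rotate, multiply, add constant)
  def mix(seed: Int, value: Int): Int = {
    val h = Integer.rotateLeft(seed ^ (value * 0xcc9e2d51), 13)
    h * 5 + 0xe6546b64
  }

  // behavior per this report: a null element leaves the running hash unchanged,
  // so the null's position is lost
  def hashSkippingNulls(xs: Seq[Option[Int]], seed: Int = 42): Int =
    xs.foldLeft(seed) {
      case (h, Some(v)) => mix(h, v)
      case (h, None)    => h
    }

  // direction proposed in this ticket: mix a fixed marker for null so that
  // order, including nulls, affects the outcome (marker value is arbitrary here)
  def hashMixingNulls(xs: Seq[Option[Int]], seed: Int = 42): Int =
    xs.foldLeft(seed) {
      case (h, Some(v)) => mix(h, v)
      case (h, None)    => mix(h, 0)
    }

  def main(args: Array[String]): Unit = {
    val a = Seq(Some(1), None)
    val b = Seq(None, Some(1))
    println(hashSkippingNulls(a) == hashSkippingNulls(b)) // true: collision
    println(hashMixingNulls(a) == hashMixingNulls(b))     // false: order matters
  }
}
```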
> SQL hash function handling of nulls makes collision too likely
> --------------------------------------------------------------
>
>                 Key: SPARK-32109
>                 URL: https://issues.apache.org/jira/browse/SPARK-32109
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: koert kuipers
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)