[ https://issues.apache.org/jira/browse/SPARK-22773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289547#comment-16289547 ]
Apache Spark commented on SPARK-22773:
--------------------------------------

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19971

> Empty arrays are not equal after transformation
> -----------------------------------------------
>
>                 Key: SPARK-22773
>                 URL: https://issues.apache.org/jira/browse/SPARK-22773
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>         Environment: Scala
>            Reporter: Laurent Legrand
>            Priority: Minor
>
> The comparison of a transformed column with another one gives inconsistent
> results when cols contain empty arrays.
> In the following code, the column "equals" of the DF "diff" should have all
> values to true. But two are false.
> {code:scala}
> import org.apache.spark.ml.feature.StopWordsRemover
>
> val tf = new StopWordsRemover().setInputCol("in").setOutputCol("out").setStopWords(Array("a", "b"))
> val df = spark.createDataFrame(Seq(
>   ("foo bar".split(' '), "foo bar".split(' ')),
>   ("foo a bar".split(' '), "foo bar".split(' ')),
>   ("foo bar b".split(' '), "foo bar".split(' ')),
>   ("a foo bar".split(' '), "foo bar".split(' ')),
>   ("a b".split(' '), "".split(' ')),
>   ("a".split(' '), "".split(' ')),
>   ("".split(' '), "".split(' '))
> )).toDF("in", "res")
>
> val res = tf.transform(df)
> res.show()
>
> val diff = res.withColumn("equals", res("res") === res("out"))
>
> diff.show()
>
> diff.printSchema()
>
> println(diff.filter(diff("equals") === false).count())
> {code}
> It gives:
> {{+-------------+----------+----------+
> |           in|       res|       out|
> +-------------+----------+----------+
> |   [foo, bar]|[foo, bar]|[foo, bar]|
> |[foo, a, bar]|[foo, bar]|[foo, bar]|
> |[foo, bar, b]|[foo, bar]|[foo, bar]|
> |[a, foo, bar]|[foo, bar]|[foo, bar]|
> |       [a, b]|        []|        []|
> |          [a]|        []|        []|
> |           []|        []|        []|
> +-------------+----------+----------+
> +-------------+----------+----------+------+
> |           in|       res|       out|equals|
> +-------------+----------+----------+------+
> |   [foo, bar]|[foo, bar]|[foo, bar]|  true|
> |[foo, a, bar]|[foo, bar]|[foo, bar]|  true|
> |[foo, bar, b]|[foo, bar]|[foo, bar]|  true|
> |[a, foo, bar]|[foo, bar]|[foo, bar]|  true|
> |       [a, b]|        []|        []| false|
> |          [a]|        []|        []| false|
> |           []|        []|        []|  true|
> +-------------+----------+----------+------+
> root
>  |-- in: array (nullable = true)
>  |    |-- element: string (containsNull = true)
>  |-- res: array (nullable = true)
>  |    |-- element: string (containsNull = true)
>  |-- out: array (nullable = true)
>  |    |-- element: string (containsNull = true)
> 2
> }}