[ https://issues.apache.org/jira/browse/SPARK-15528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305150#comment-15305150 ]
Takeshi Yamamuro commented on SPARK-15528:
------------------------------------------

I tried this in master and I could reproduce it:
{code}
import org.apache.spark.sql.functions._

val df = Seq(("aaaa", 0), ("aaaa", 1)).toDF("a", "b")
(0 until 10).map(_ => df.select(countDistinct(conv(df("a"), 16, 10))).show)

+-------------------------------+
|count(DISTINCT conv(a, 16, 10))|
+-------------------------------+
|                              1|
+-------------------------------+

+-------------------------------+
|count(DISTINCT conv(a, 16, 10))|
+-------------------------------+
|                              1|
+-------------------------------+

+-------------------------------+
|count(DISTINCT conv(a, 16, 10))|
+-------------------------------+
|                              1|
+-------------------------------+

+-------------------------------+
|count(DISTINCT conv(a, 16, 10))|
+-------------------------------+
|                              2|
+-------------------------------+

+-------------------------------+
|count(DISTINCT conv(a, 16, 10))|
+-------------------------------+
|                              1|
+-------------------------------+
{code}
Both rows hold the same value, so the distinct count should always be 1, yet the query occasionally returns 2.
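One class of bug that produces exactly this symptom is an expression returning a value that is merely a view over an output buffer reused across evaluations: if the caller stores the value without copying it, a later evaluation silently rewrites what was stored. The sketch below is plain Scala and is not Spark's actual internals; `View`, `eval`, and `BufferReuseDemo` are hypothetical names used only to illustrate the hazard.

{code}
import scala.collection.mutable.ArrayBuffer

// A lightweight value that is only a view over some byte buffer.
final class View(private val buf: Array[Byte]) {
  override def toString: String = new String(buf, "UTF-8")
}

object BufferReuseDemo {
  // One shared output buffer, reused by every call to eval.
  private val shared = new Array[Byte](4)

  // Writes the result into the shared buffer and returns a view over it,
  // like an expression that reuses its output buffer across rows.
  // (Assumes 4-byte ASCII inputs for brevity.)
  def eval(s: String): View = {
    val b = s.getBytes("UTF-8")
    System.arraycopy(b, 0, shared, 0, b.length)
    new View(shared)
  }

  def main(args: Array[String]): Unit = {
    // Unsafe: the caller stores the views without copying.
    val results = ArrayBuffer(eval("aaaa"))
    results += eval("bbbb") // overwrites the shared buffer...
    // ...so both stored views now read "bbbb": distinct count is wrong.
    println(results.map(_.toString).distinct.size) // prints 1, not 2

    // Safe: materialize (copy) each value before storing it.
    val copied = ArrayBuffer(eval("aaaa").toString)
    copied += eval("bbbb").toString
    println(copied.distinct.size) // prints 2
  }
}
{code}
With two equal inputs, as in the reproduction above, whether the corruption is visible depends on when each stored value is hashed or copied relative to the next evaluation, which is one way a query over identical rows can return different counts from run to run.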
The explain output is below:
{code}
== Physical Plan ==
*TungstenAggregate(key=[], functions=[(count(conv(a#5, 16, 10)#19),mode=Final,isDistinct=true)], output=[count(DISTINCT conv(a, 16, 10))#15L])
+- Exchange SinglePartition, None
   +- *TungstenAggregate(key=[], functions=[(count(conv(a#5, 16, 10)#19),mode=Partial,isDistinct=true)], output=[count#22L])
      +- *TungstenAggregate(key=[conv(a#5, 16, 10)#19], functions=[], output=[conv(a#5, 16, 10)#19])
         +- Exchange hashpartitioning(conv(a#5, 16, 10)#19, 200), None
            +- *TungstenAggregate(key=[conv(a#5, 16, 10) AS conv(a#5, 16, 10)#19], functions=[], output=[conv(a#5, 16, 10)#19])
               +- LocalTableScan [a#5], [[aaaa],[aaaa]]
{code}

> conv function returns inconsistent result for the same data
> -----------------------------------------------------------
>
>                 Key: SPARK-15528
>                 URL: https://issues.apache.org/jira/browse/SPARK-15528
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Lior Regev
>
> When using F.conv to convert a column from a hexadecimal string to an
> integer, the results are inconsistent:
>
> val col = F.conv(df("some_col"), 16, 10)
> val a = df.select(F.countDistinct("some_col"), F.countDistinct(col)).collect()
> val b = df.select(F.countDistinct("some_col"), F.countDistinct(col)).collect()
>
> This returns:
>
> a: Array[org.apache.spark.sql.Row] = Array([59776,1941936])
> b: Array[org.apache.spark.sql.Row] = Array([59776,1965154])
>
> P.S.
> "some_col" is an md5 hash of some string column, calculated using F.md5

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)