[ https://issues.apache.org/jira/browse/SPARK-15528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305150#comment-15305150 ]

Takeshi Yamamuro commented on SPARK-15528:
------------------------------------------

I tried this in master and could reproduce it:

{code}
import org.apache.spark.sql.functions._
val df = Seq(("aaaa", 0), ("aaaa", 1)).toDF("a", "b")
(0 until 10).map(_ => df.select(countDistinct(conv(df("a"), 16, 10))).show)

+-------------------------------+
|count(DISTINCT conv(a, 16, 10))|
+-------------------------------+
|                              1|
+-------------------------------+

+-------------------------------+
|count(DISTINCT conv(a, 16, 10))|
+-------------------------------+
|                              1|
+-------------------------------+

+-------------------------------+
|count(DISTINCT conv(a, 16, 10))|
+-------------------------------+
|                              1|
+-------------------------------+

+-------------------------------+
|count(DISTINCT conv(a, 16, 10))|
+-------------------------------+
|                              2|
+-------------------------------+

+-------------------------------+
|count(DISTINCT conv(a, 16, 10))|
+-------------------------------+
|                              1|
+-------------------------------+
{code}

Sometimes we incorrectly get '2' instead of the expected '1'.
The physical plan is below:

{code}
== Physical Plan ==
*TungstenAggregate(key=[], functions=[(count(conv(a#5, 16, 10)#19),mode=Final,isDistinct=true)], output=[count(DISTINCT conv(a, 16, 10))#15L])
+- Exchange SinglePartition, None
   +- *TungstenAggregate(key=[], functions=[(count(conv(a#5, 16, 10)#19),mode=Partial,isDistinct=true)], output=[count#22L])
      +- *TungstenAggregate(key=[conv(a#5, 16, 10)#19], functions=[], output=[conv(a#5, 16, 10)#19])
         +- Exchange hashpartitioning(conv(a#5, 16, 10)#19, 200), None
            +- *TungstenAggregate(key=[conv(a#5, 16, 10) AS conv(a#5, 16, 10)#19], functions=[], output=[conv(a#5, 16, 10)#19])
               +- LocalTableScan [a#5], [[aaaa],[aaaa]]
{code}
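
For comparison, the same conversion done eagerly on the driver is stable; a minimal sketch in plain Scala (no Spark involved, using `java.lang.Long.parseLong` to stand in for `conv`) showing the distinct count we would expect from every run:

{code}
// Sanity check outside Spark: both rows hold the same hex string "aaaa",
// so a base-16 -> base-10 conversion must map them to a single distinct
// value, and countDistinct over the converted column should always be 1.
val rows = Seq("aaaa", "aaaa")
val converted = rows.map(s => java.lang.Long.parseLong(s, 16).toString)
assert(converted.distinct.size == 1)  // "aaaa" (base 16) -> "43690" (base 10)
{code}

Since this plain-JVM conversion is deterministic, the flaky counts above point at the distributed aggregation path rather than at the numeric conversion itself.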




> conv function returns inconsistent result for the same data
> -----------------------------------------------------------
>
>                 Key: SPARK-15528
>                 URL: https://issues.apache.org/jira/browse/SPARK-15528
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Lior Regev
>
> When using F.conv to convert a column from a hexadecimal string to an 
> integer, the results are inconsistent:
> val col = F.conv(df("some_col"), 16, 10)
> val a = df.select(F.countDistinct("some_col"), F.countDistinct(col)).collect()
> val b = df.select(F.countDistinct("some_col"), F.countDistinct(col)).collect()
> returns:
> a: Array[org.apache.spark.sql.Row] = Array([59776,1941936])
> b: Array[org.apache.spark.sql.Row] = Array([59776,1965154])
> P.S.
> "some_col" is a md5 hash of some string column calculated using F.md5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
