Heya,

I would like to use countApproxDistinct in pyspark. I know it's an
experimental method and that it is not yet available in pyspark, so I
started by porting the countApproxDistinct unit test to Python, see
https://gist.github.com/drdee/d68eaf0208184d72cbff. Surprisingly, the
results are way off.
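
For reference, the port looks roughly like this (a sketch, not the exact
gist code; I'm assuming the same uniform distribution over 100 distinct
values as the Scala test):

from pyspark import SparkContext

sc = SparkContext("local", "countApproxDistinct test")

# Same shape as the Scala test: 100,000 values covering 100 distinct keys
size = 100
uniform_distro = [i % size for i in range(1, 100001)]
rdd = sc.parallelize(uniform_distro)

# countApproxDistinct is not exposed in pyspark, so go through the Java RDD
print(rdd._jrdd.rdd().countApproxDistinct(4, 0))
print(rdd._jrdd.rdd().countApproxDistinct(8, 0))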

Using Scala, I get the following two counts (using
https://github.com/apache/spark/blob/4c7243e109c713bdfb87891748800109ffbaae07/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L78-87):

scala> simpleRdd.countApproxDistinct(4, 0)
res2: Long = 73

scala> simpleRdd.countApproxDistinct(8, 0)
res3: Long = 99

In Python, with the same RDD (see the gist), I get the following
results:

In [7]: rdd._jrdd.rdd().countApproxDistinct(4, 0)
Out[7]: 29L

In [8]: rdd._jrdd.rdd().countApproxDistinct(8, 0)
Out[8]: 26L


Clearly, I am doing something wrong here :) What is also weird is that
setting p to 8 should give a more accurate number, but the estimate
actually comes out smaller. Any tips or pointers are much appreciated!
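
One thing I have not been able to rule out (just a guess): if pyspark ships
pickled batches of Python objects to the JVM, then the elements of _jrdd
would be serialized byte arrays rather than the original values, and
countApproxDistinct would be estimating distinct blobs instead of distinct
numbers. A quick sanity check along those lines (a sketch, reusing the rdd
from the gist) would be to compare element counts on both sides:

# If these two numbers differ, the JVM side is not seeing individual values
print(rdd.count())              # number of Python elements
print(rdd._jrdd.rdd().count())  # number of elements the JVM RDD holds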
Best,
Diederik


