Heya, I would like to use countApproxDistinct from PySpark. I know it's an experimental method that is not yet exposed in PySpark, so I started by porting the countApproxDistinct unit test to Python: https://gist.github.com/drdee/d68eaf0208184d72cbff. Surprisingly, the results are way off.
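Concretely, the port boils down to something like this (a trimmed-down sketch of what the gist does; the input data is my reading of the Scala suite's simpleRdd, i.e. 100,000 values with 100 distinct ones):

    from pyspark import SparkContext

    sc = SparkContext("local", "countApproxDistinct")

    # Mirrors the Scala suite's simpleRdd, as far as I can tell:
    # 100,000 integers covering 0..99, so the true distinct count is 100.
    rdd = sc.parallelize([i % 100 for i in range(100000)])

    # countApproxDistinct is not exposed on PySpark's RDD, so I reach
    # through to the underlying Scala RDD via the _jrdd handle.
    print(rdd._jrdd.rdd().countApproxDistinct(4, 0))
    print(rdd._jrdd.rdd().countApproxDistinct(8, 0))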
Using Scala (the RDD comes from https://github.com/apache/spark/blob/4c7243e109c713bdfb87891748800109ffbaae07/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L78-87), I get the following two counts:

    scala> simpleRdd.countApproxDistinct(4, 0)
    res2: Long = 73

    scala> simpleRdd.countApproxDistinct(8, 0)
    res3: Long = 99

In Python, with the same RDD (as you can see in the gist), I get:

    In [7]: rdd._jrdd.rdd().countApproxDistinct(4, 0)
    Out[7]: 29L

    In [8]: rdd._jrdd.rdd().countApproxDistinct(8, 0)
    Out[8]: 26L

Clearly, I am doing something wrong here :) What is also weird: as far as I understand, p controls the number of HyperLogLog registers (2^p), so p = 8 should give a more accurate estimate than p = 4, yet the Python estimate actually gets smaller and drifts further from the true count. Any tips or pointers are much appreciated!

Best,
Diederik
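P.S. To frame what I expected: if countApproxDistinct follows the usual HyperLogLog error bound of roughly 1.054 / sqrt(2^p) (that is my reading of the Scala API docs), the expected relative errors work out to:

    import math

    # Expected relative standard deviation of the HyperLogLog estimate,
    # assuming the ~1.054 / sqrt(2^p) bound from the Scala API docs.
    for p in (4, 8):
        print("p = %d: ~%.0f%% relative error" % (p, 100 * 1.054 / math.sqrt(2 ** p)))
    # p = 4: ~26% relative error
    # p = 8: ~7% relative error

The Scala results (73 and 99 against a true count of 100) are in that ballpark; the Python results (29 and 26) are nowhere near it.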