countApprox returns the best answer available within the timeout. Is it
possible that 1 ms is more than enough to count this exactly? Then the
confidence wouldn't matter. That seems far too fast, but you're
counting ranges whose values don't actually matter, and maybe the
Python side is smart enough to exploit that, in which case counting a
partition takes almost no time. Does the call return immediately?
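
To make the reasoning concrete, here is a toy model in plain Python. It
is not Spark's implementation: approx_count, task_millis and the
scale-up rule are all hypothetical, chosen only to show why a generous
timeout gives an exact count no matter what confidence you pass.

```python
def approx_count(partition_sizes, task_millis, timeout_millis):
    """Toy model of a timeout-bounded count (NOT Spark's algorithm).

    partition_sizes: number of elements in each partition
    task_millis:     assumed time each counting task takes
    Returns (estimate, is_exact).
    """
    # Keep only the partitions whose task finished before the timeout.
    finished = [
        size
        for size, ms in zip(partition_sizes, task_millis)
        if ms <= timeout_millis
    ]

    # Every task finished: the answer is exact, confidence is irrelevant.
    if len(finished) == len(partition_sizes):
        return sum(finished), True

    # Nothing finished: there is no basis for an estimate.
    if not finished:
        return None, False

    # Otherwise scale the partial sum by the fraction of partitions seen
    # (an assumed, deliberately simple extrapolation).
    fraction = len(finished) / len(partition_sizes)
    return sum(finished) / fraction, False


# 50 equal partitions, every task faster than the 1 ms timeout.
print(approx_count([2_000_000] * 50, [0.5] * 50, 1))

# Two of four tasks miss the timeout, so the result is only an estimate.
print(approx_count([10, 20, 30, 40], [0.5, 0.5, 5, 5], 1))
```

In this model an approximate result only appears when some task
outlives the timeout, so one way to check the idea on a real RDD would
be to make each task artificially slow and see whether the answer
changes.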

On Thu, Sep 15, 2016 at 6:20 PM, Stefano Lodi <stefano.l...@unibo.it> wrote:
> I am experimenting with countApprox. I created an RDD of 10^8 numbers and ran
> countApprox with different parameters but I failed to generate any
> approximate output. In all runs it returns the exact number of elements.
> What is the effect of approximation in countApprox supposed to be, and for
> what inputs and parameters?
>
>>>> rdd = sc.parallelize([random.choice(range(1000)) for i in range(10**8)],
>>>> 50)
>>>> rdd.countApprox(1, 0.8)
> [Stage 12:>                                                        (0 + 0) /
> 50]16/09/15 15:45:28 WARN TaskSetManager: Stage 12 contains a task of very
> large size (5402 KB). The maximum recommended task size is 100 KB.
> [Stage 12:======================================================> (49 + 1) /
> 50]100000000
>>>> rdd.countApprox(1, 0.01)
> 16/09/15 15:45:45 WARN TaskSetManager: Stage 13 contains a task of very
> large size (5402 KB). The maximum recommended task size is 100 KB.
> [Stage 13:====================================================>   (47 + 3) /
> 50]100000000
>

