It launches two jobs because it doesn't know ahead of time how big your
RDD is, so it doesn't know what the sampling rate should be. After
counting all the records, it can determine the sampling rate, and then
it makes a second pass through the data, sampling at the rate it has
just determined.
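
Conceptually, the two jobs look roughly like this (a simplified sketch
for spark-shell, where sc is predefined; data.txt, num, and seed are
stand-ins, and the real takeSample also over-samples and retries so it
can return exactly num results):

    val num = 100
    val seed = 42L
    val rdd = sc.textFile("data.txt")
    // Job 1: count the records to learn the RDD's size.
    val total = rdd.count()
    // With the size known, derive the per-record sampling rate.
    val fraction = math.min(1.0, num.toDouble / total)
    // Job 2: a second pass over the data, sampling at that rate.
    val sampled = rdd.sample(false, fraction, seed).collect()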

Note that this suggests: (a) if you know the size of your RDD ahead of
time, you can eliminate that first pass, and (b) since the input RDD
ends up being computed twice, it may make sense to cache it.
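
For example (again a sketch for spark-shell; the path, the known count,
and the sample size are all made up):

    val rdd = sc.textFile("data.txt")
    // (b) Cache so the counting job and the sampling job don't both
    // recompute the RDD from scratch.
    rdd.cache()
    val exact = rdd.takeSample(false, 100, 42L)

    // (a) If the size is already known, skip the counting pass and
    // sample by fraction directly. Note that sample() returns
    // approximately, not exactly, the requested number of records.
    val knownCount = 1000000L
    val fraction = math.min(1.0, 100.0 / knownCount)
    val approx = rdd.sample(false, fraction, 42L).collect()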

On Thu, Jun 11, 2015 at 11:43 AM, barmaley <o...@solver.com> wrote:

> I've observed interesting behavior in Spark 1.3.1, the reason for which is
> not clear.
>
> Doing something as simple as sc.textFile("...").takeSample(...) always
> results in two stages:
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n23280/Capture.jpg>
>
