It launches two jobs because it doesn't know ahead of time how big your RDD is, so it doesn't know what the sampling rate should be. After counting all the records, it can determine the sampling rate; it then does a second pass through the data, sampling at the rate it has just determined.
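The two-pass scheme can be sketched in plain Python. This is a simulation of the idea only; the function name, the oversampling factor, and the retry loop are my own illustration, not Spark's actual implementation:

```python
import random

def take_sample(records, num, seed=42):
    """Simulated two-pass takeSample (sampling without replacement).

    Pass 1 counts the records (Spark's first job); pass 2 samples at a
    rate derived from that count (the second job). Plain-Python sketch,
    not Spark's actual code.
    """
    rng = random.Random(seed)

    # Pass 1: a full scan whose only purpose is to learn the total size.
    total = sum(1 for _ in records)
    if num >= total:
        return list(records)

    # Oversample a little so pass 2 usually yields >= num items in one
    # try (the 1.2 factor is an arbitrary choice for this sketch).
    fraction = min(1.0, (num / total) * 1.2)

    # Pass 2: scan the data again, keeping each record with
    # probability `fraction`; retry in the (rare) undersized case.
    sampled = []
    while len(sampled) < num:
        sampled = [r for r in records if rng.random() < fraction]

    rng.shuffle(sampled)
    return sampled[:num]

data = list(range(1000))
sample = take_sample(data, num=10)   # two full scans of `data`
```

In real Spark the two scans mean the input RDD is computed twice, which is why caching it (e.g. calling cache() on the RDD) before takeSample() can pay off, as noted below.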
Note that this suggests: (a) if you know the size of your RDD ahead of time, you could eliminate that first pass, and (b) since you end up computing the input RDD twice, it may make sense to cache it.

On Thu, Jun 11, 2015 at 11:43 AM, barmaley <o...@solver.com> wrote:
> I've observed interesting behavior in Spark 1.3.1, the reason for which is
> not clear.
>
> Doing something as simple as sc.textFile("...").takeSample(...) always
> results in two stages:
>
> [image: Spark's takeSample() results in two stages]
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n23280/Capture.jpg>