But there isn't a 1-to-1 mapping from operations to stages, since multiple operations will be pipelined into a single stage if no shuffle is required. To determine the number of stages in a job you really need to look for the shuffle boundaries.
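
As a rough illustration (a minimal sketch of my own, assuming a word-count-style job and a hypothetical input path, not code from this thread): the narrow transformations below all get pipelined into one stage, while reduceByKey introduces a shuffle dependency and hence a new stage. rdd.toDebugString prints the lineage so you can see where those boundaries fall.

  import org.apache.spark.{SparkConf, SparkContext}

  object StageBoundaryDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("stage-demo").setMaster("local[*]"))

      val counts = sc.textFile("input.txt")      // hypothetical input path
        .flatMap(_.split("\\s+"))                // narrow transformation: pipelined
        .map(word => (word, 1))                  // narrow: pipelined into the same stage
        .reduceByKey(_ + _)                      // shuffle dependency -> new stage
        .map { case (w, n) => s"$w: $n" }        // narrow: pipelined into the post-shuffle stage

      // The lineage printed here is indented at each shuffle dependency,
      // which is where the stage boundaries fall.
      println(counts.toDebugString)

      sc.stop()
    }
  }

Counting the shuffle dependencies in that output (one per reduceByKey, groupBy, join on non-co-partitioned inputs, and so on) gives a reasonable estimate of the number of stages in the job.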
On Wed, Feb 4, 2015 at 11:27 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> You can easily understand the flow by looking at the number of operations
> in your program (like map, groupBy, join etc.), first of all you list out
> the number of operations happening in your application and then from the
> webui you will be able to see how many operations have happened so far.
>
> Thanks
> Best Regards
>
> On Wed, Feb 4, 2015 at 4:33 PM, Joe Wass <jw...@crossref.org> wrote:
>
>> I'm sitting here looking at my application crunching gigabytes of data on
>> a cluster and I have no idea if it's an hour away from completion or a
>> minute. The web UI shows progress through each stage, but not how many
>> stages remaining. How can I work out how many stages my program will take
>> automatically?
>>
>> My application has a slightly interesting DAG (re-use of functions that
>> contain Spark transformations, persistent RDDs). Not that complex, but not
>> 'step 1, step 2, step 3'.
>>
>> I'm guessing that if the driver program runs sequentially sending
>> messages to Spark, then Spark has no knowledge of the structure of the
>> driver program. Therefore it's necessary to execute it on a small test
>> dataset and see how many stages result?
>>
>> When I set spark.eventLog.enabled = true and run on (very small) test
>> data I don't get any stage messages in my STDOUT or in the log file. This
>> is on a `local` instance.
>>
>> Did I miss something obvious?
>>
>> Thanks!
>>
>> Joe
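
For the "how far along am I" part of the question quoted above, one option is to register a SparkListener on the driver. This is only a sketch of my own (the class name is made up); it won't predict how many stages remain, but it does print a running count of submitted and completed stages without digging through the web UI or the event log:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerStageSubmitted}

  // Hypothetical listener that prints stage progress from the driver,
  // independent of the web UI.
  class StageProgressListener extends SparkListener {
    @volatile private var submitted = 0
    @volatile private var completed = 0

    override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
      submitted += 1
      println(s"Stage ${event.stageInfo.stageId} submitted ($submitted submitted so far)")
    }

    override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
      completed += 1
      println(s"Stage ${event.stageInfo.stageId} completed ($completed of $submitted)")
    }
  }

  // Register it before kicking off the job:
  //   sc.addSparkListener(new StageProgressListener())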