Yes, there is no way right now to automatically know how many stages a job will generate. As Mark said, RDD#toDebugString will give you some information about the RDD DAG, and from the dependency types (wide vs. narrow) you can work out where the stage boundaries are.
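For example, in the spark-shell (a rough sketch; the RDDs and the input path here are made up, not from your program):

    // Narrow dependencies (flatMap, map) are pipelined into one stage;
    // reduceByKey introduces a wide dependency, i.e. a shuffle boundary.
    val lines  = sc.textFile("hdfs:///some/input")
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // Each new indentation level in the printed lineage (e.g. where a
    // ShuffledRDD appears) marks a stage boundary.
    println(counts.toDebugString)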
On Thu, Feb 5, 2015 at 1:41 AM, Mark Hamstra <m...@clearstorydata.com> wrote:

> And the Job page of the web UI will give you an idea of stages completed
> out of the total number of stages for the job. That same information is
> also available as JSON. Statically determining how many stages a job
> logically comprises is one thing, but dynamically determining how many
> stages remain to be run to complete a job is a surprisingly tricky
> problem -- take a look at the discussion that went into Josh's Job page
> PR to get an idea of the issues and subtleties involved:
> https://github.com/apache/spark/pull/3009
>
> On Thu, Feb 5, 2015 at 1:27 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> RDD#toDebugString will help.
>>
>> On Thu, Feb 5, 2015 at 1:15 AM, Joe Wass <jw...@crossref.org> wrote:
>>
>>> Thanks Akhil and Mark. I can of course count events (assuming I can
>>> deduce the shuffle boundaries), but as I said, the program isn't
>>> simple and I'd have to do this manually every time I change the code,
>>> so I'd rather find a way of doing this automatically if possible.
>>>
>>> On 4 February 2015 at 19:41, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> But there isn't a 1-to-1 mapping from operations to stages, since
>>>> multiple operations will be pipelined into a single stage if no
>>>> shuffle is required. To determine the number of stages in a job, you
>>>> really need to be looking for shuffle boundaries.
>>>>
>>>> On Wed, Feb 4, 2015 at 11:27 AM, Akhil Das
>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>
>>>>> You can get a rough idea of the flow by looking at the operations in
>>>>> your program (map, groupBy, join, etc.): first list the operations
>>>>> in your application, and then from the web UI you will be able to
>>>>> see how many of them have happened so far.
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Wed, Feb 4, 2015 at 4:33 PM, Joe Wass <jw...@crossref.org> wrote:
>>>>>
>>>>>> I'm sitting here watching my application crunch gigabytes of data
>>>>>> on a cluster, and I have no idea whether it's an hour away from
>>>>>> completion or a minute. The web UI shows progress through each
>>>>>> stage, but not how many stages remain. How can I work out
>>>>>> automatically how many stages my program will take?
>>>>>>
>>>>>> My application has a slightly interesting DAG (re-use of functions
>>>>>> that contain Spark transformations, persistent RDDs). Not that
>>>>>> complex, but not 'step 1, step 2, step 3'.
>>>>>>
>>>>>> I'm guessing that if the driver program runs sequentially, sending
>>>>>> messages to Spark, then Spark has no knowledge of the structure of
>>>>>> the driver program. Is it therefore necessary to execute it on a
>>>>>> small test dataset and see how many stages result?
>>>>>>
>>>>>> When I set spark.eventLog.enabled = true and run on (very small)
>>>>>> test data, I don't get any stage messages in my STDOUT or in the
>>>>>> log file. This is on a `local` instance.
>>>>>>
>>>>>> Did I miss something obvious?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Joe
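P.S. If what you really want is runtime progress rather than a static count, you can also register a SparkListener and count stage events yourself. A rough sketch against the Spark 1.x developer API (the class name StageCountListener is just illustrative):

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener,
      SparkListenerStageCompleted, SparkListenerStageSubmitted}

    // Counts stage submissions and completions as the job runs.
    class StageCountListener extends SparkListener {
      val submitted = new AtomicInteger(0)
      val completed = new AtomicInteger(0)

      override def onStageSubmitted(s: SparkListenerStageSubmitted): Unit =
        println(s"stages submitted so far: ${submitted.incrementAndGet()}")

      override def onStageCompleted(s: SparkListenerStageCompleted): Unit =
        println(s"stages completed so far: ${completed.incrementAndGet()}")
    }

    sc.addSparkListener(new StageCountListener)

Note that this only tells you how many stages have run so far, not how many remain, which is exactly the dynamic problem Mark describes above.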