RDD#toDebugString will help.

On Thu, Feb 5, 2015 at 1:15 AM, Joe Wass <jw...@crossref.org> wrote:
> Thanks Akhil and Mark. I can of course count events (assuming I can deduce
> the shuffle boundaries), but like I said the program isn't simple and I'd
> have to do this manually every time I change the code. So I'd rather find a
> way of doing this automatically if possible.
>
> On 4 February 2015 at 19:41, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> But there isn't a 1-1 mapping from operations to stages since multiple
>> operations will be pipelined into a single stage if no shuffle is
>> required. To determine the number of stages in a job you really need to be
>> looking for shuffle boundaries.
>>
>> On Wed, Feb 4, 2015 at 11:27 AM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> You can easily understand the flow by looking at the number of
>>> operations in your program (like map, groupBy, join etc.). First of all you
>>> list out the number of operations happening in your application and then
>>> from the webui you will be able to see how many operations have happened so
>>> far.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Wed, Feb 4, 2015 at 4:33 PM, Joe Wass <jw...@crossref.org> wrote:
>>>
>>>> I'm sitting here looking at my application crunching gigabytes of data
>>>> on a cluster and I have no idea if it's an hour away from completion or a
>>>> minute. The web UI shows progress through each stage, but not how many
>>>> stages remain. How can I work out how many stages my program will take
>>>> automatically?
>>>>
>>>> My application has a slightly interesting DAG (re-use of functions that
>>>> contain Spark transformations, persistent RDDs). Not that complex, but not
>>>> 'step 1, step 2, step 3'.
>>>>
>>>> I'm guessing that if the driver program runs sequentially sending
>>>> messages to Spark, then Spark has no knowledge of the structure of the
>>>> driver program. Therefore it's necessary to execute it on a small test
>>>> dataset and see how many stages result?
>>>>
>>>> When I set spark.eventLog.enabled = true and run on (very small) test
>>>> data I don't get any stage messages in my STDOUT or in the log file. This
>>>> is on a `local` instance.
>>>>
>>>> Did I miss something obvious?
>>>>
>>>> Thanks!
>>>>
>>>> Joe
>>>>
>>>
>>>
>>
>
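
As a follow-up to the RDD#toDebugString suggestion at the top: one rough way to automate the count is to scan the debug string for shuffle boundaries and add one for the final stage. The sketch below is Python and assumes the indented lineage format in which a shuffle dependency is introduced by a `+-(N)` prefix. The function names and the sample lineage text are made up for illustration (the sample is hand-written to resemble the output, not captured from a real run). Treat the result as an estimate only: a shuffled RDD reachable through two branches is printed twice, and stages whose output is already available may be skipped at run time.

```python
import re

# Assumption: a shuffle dependency shows up in the debug string as a line
# whose tree prefix (spaces and '|') is followed by "+-(<numPartitions>)".
SHUFFLE_MARKER = re.compile(r"^[\s|]*\+-\(\d+\)")

# Hand-written sample resembling a lineage dump; not real program output.
SAMPLE_LINEAGE = """\
(2) ShuffledRDD[4] at reduceByKey at App.scala:12 []
 +-(2) MapPartitionsRDD[3] at map at App.scala:11 []
    |  MapPartitionsRDD[2] at flatMap at App.scala:10 []
    |  MapPartitionsRDD[1] at textFile at App.scala:9 []
    |  HadoopRDD[0] at textFile at App.scala:9 []
"""


def count_shuffle_boundaries(debug_string):
    """Count the lines that open a new shuffle dependency."""
    return sum(
        1 for line in debug_string.splitlines() if SHUFFLE_MARKER.match(line)
    )


def estimate_stage_count(debug_string):
    """Estimate stages for one action: one per shuffle boundary, plus the final stage."""
    return count_shuffle_boundaries(debug_string) + 1


if __name__ == "__main__":
    print(estimate_stage_count(SAMPLE_LINEAGE))
```

In a real driver you would pass in the string returned by toDebugString on the RDD you are about to run an action on, once per action, since each action is its own job.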