And the Job page of the web UI will show you how many stages have completed
out of the total number of stages for the job.  That same information is
also available as JSON.  Statically determining how many stages a job
logically comprises is one thing, but dynamically determining how many
stages remain to be run to complete a job is a surprisingly tricky problem
-- take a look at the discussion that went into Josh's Job page PR to get
an idea of the issues and subtleties involved:
https://github.com/apache/spark/pull/3009
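
If you want to poll that same progress programmatically from the driver, the
status tracker API (available since Spark 1.2) exposes roughly the numbers the
Job page shows. A rough sketch, assuming you already have a SparkContext named
sc; treat the completed-stage count as approximate, since skipped stages will
not show up here:

    import org.apache.spark.SparkContext

    // Poll the driver-side status tracker for active jobs and their stages.
    def printProgress(sc: SparkContext): Unit = {
      for (jobId <- sc.statusTracker.getActiveJobIds()) {
        sc.statusTracker.getJobInfo(jobId).foreach { job =>
          val stageIds = job.stageIds()
          val stageInfos = stageIds.flatMap(sc.statusTracker.getStageInfo)
          val finished = stageInfos.count(s =>
            s.numTasks() > 0 && s.numCompletedTasks() == s.numTasks())
          println(s"Job $jobId: $finished of ${stageIds.length} stages finished")
        }
      }
    }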

On Thu, Feb 5, 2015 at 1:27 AM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> RDD#toDebugString will help.
>
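A quick way to use that from the shell is to print the lineage of your final
RDD before kicking off the action; each indentation step in the output marks a
shuffle boundary, i.e. a new stage. A small sketch (the word-count pipeline and
input path are just placeholders):

    // Build a lineage with one shuffle, then inspect it before running an action.
    val lines = sc.textFile("data.txt")          // input path is an assumption
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                        // reduceByKey introduces a shuffle

    // Prints the RDD lineage; indentation marks shuffle (stage) boundaries.
    println(counts.toDebugString)
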
> On Thu, Feb 5, 2015 at 1:15 AM, Joe Wass <jw...@crossref.org> wrote:
>
>> Thanks Akhil and Mark. I can of course count events (assuming I can
>> deduce the shuffle boundaries), but like I said the program isn't simple
>> and I'd have to do this manually every time I change the code. So I'd
>> rather find a way of doing this automatically if possible.
>>
>> On 4 February 2015 at 19:41, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>
>>> But there isn't a 1-1 mapping from operations to stages since multiple
>>> operations will be pipelined into a single stage if no shuffle is
>>> required.  To determine the number of stages in a job you really need to be
>>> looking for shuffle boundaries.
>>>
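One way to approximate that count without running the job is to walk the
lineage of the final RDD and count ShuffleDependency edges. This is only a
rough sketch: it ignores stages the scheduler skips because their shuffle
output or cached data already exists, so treat the result as an upper bound:

    import scala.collection.mutable
    import org.apache.spark.ShuffleDependency
    import org.apache.spark.rdd.RDD

    // Roughly: number of stages = number of shuffle boundaries + 1 (result stage).
    def approxNumStages(rdd: RDD[_]): Int = {
      val visited = mutable.Set[Int]()
      var shuffles = 0
      def visit(r: RDD[_]): Unit = {
        if (visited.add(r.id)) {
          r.dependencies.foreach { dep =>
            if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) shuffles += 1
            visit(dep.rdd)
          }
        }
      }
      visit(rdd)
      shuffles + 1
    }
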
>>> On Wed, Feb 4, 2015 at 11:27 AM, Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> You can easily understand the flow by looking at the number of
>>>> operations in your program (like map, groupBy, join etc.). First, list
>>>> out the operations happening in your application, and then from the web
>>>> UI you will be able to see how many operations have happened so far.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Wed, Feb 4, 2015 at 4:33 PM, Joe Wass <jw...@crossref.org> wrote:
>>>>
>>>>> I'm sitting here looking at my application crunching gigabytes of data
>>>>> on a cluster and I have no idea if it's an hour away from completion or a
>>>>> minute. The web UI shows progress through each stage, but not how many
>>>>> stages remain. How can I automatically work out how many stages my
>>>>> program will take?
>>>>>
>>>>> My application has a slightly interesting DAG (re-use of functions
>>>>> that contain Spark transformations, persistent RDDs). Not that complex,
>>>>> but not 'step 1, step 2, step 3'.
>>>>>
>>>>> I'm guessing that if the driver program runs sequentially sending
>>>>> messages to Spark, then Spark has no knowledge of the structure of the
>>>>> driver program. Is it therefore necessary to execute it on a small test
>>>>> dataset and see how many stages result?
>>>>>
>>>>> When I set spark.eventLog.enabled = true and run on (very small) test
>>>>> data I don't get any stage messages in my STDOUT or in the log file. This
>>>>> is on a `local` instance.
>>>>>
>>>>> Did I miss something obvious?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Joe
>>>>>
>>>>
>>>>
>>>
>>
>
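
On the event log question further up: spark.eventLog.enabled writes JSON event
files to spark.eventLog.dir rather than to STDOUT, and the per-stage INFO lines
you see in the console come from log4j, so a `local` run with a quiet log4j
config won't print them. A rough sketch of the configuration, with the log
directory being an assumption (it has to exist before the job starts):

    import org.apache.spark.{SparkConf, SparkContext}

    // Event logs end up as JSON files under spark.eventLog.dir, not in STDOUT.
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("stage-count-test")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "/tmp/spark-events")  // directory must already exist

    val sc = new SparkContext(conf)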
