Yes, right now there is no way to know automatically how many stages a job
will generate. As Mark said, RDD#toDebugString will give you some information
about the RDD DAG, and from the dependency types (wide vs. narrow) you can
tell where the stage boundaries are.
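
For example, a minimal sketch (assuming an existing SparkContext named sc, e.g.
in spark-shell, and a placeholder input path; the lineage-walking helper below
is only a rough approximation, since cached RDDs, shared parents, and skipped
stages are not accounted for):

import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

val words   = sc.textFile("input.txt").flatMap(_.split("\\s+"))  // narrow deps: pipelined
val grouped = words.groupBy(_.toLowerCase)                       // wide dep: shuffle boundary

// Indentation changes in the debug string mark shuffle (i.e. stage) boundaries.
println(grouped.toDebugString)

// Rough static estimate: count the shuffle dependencies reachable in the lineage.
def countShuffles(rdd: RDD[_]): Int = rdd.dependencies.map {
  case shuffle: ShuffleDependency[_, _, _] => 1 + countShuffles(shuffle.rdd)
  case narrow                              => countShuffles(narrow.rdd)
}.sum

println(s"Estimated stages for one job over 'grouped': ${countShuffles(grouped) + 1}")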

On Thu, Feb 5, 2015 at 1:41 AM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> And the Job page of the web UI will give you an idea of stages completed
> out of the total number of stages for the job.  That same information is
> also available as JSON.  Statically determining how many stages a job
> logically comprises is one thing, but dynamically determining how many
> stages remain to be run to complete a job is a surprisingly tricky problem
> -- take a look at the discussion that went into Josh's Job page PR to get
> an idea of the issues and subtleties involved:
> https://github.com/apache/spark/pull/3009
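
As an illustrative sketch only (not from that PR): registering a SparkListener
gives you the same completed-vs-total numbers inside the driver, though
concurrent jobs, skipped stages, and resubmitted stages (exactly the subtleties
discussed in the PR) are ignored here.

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}

// Reports completed vs. total stages for the most recently started job.
class StageProgressListener extends SparkListener {
  @volatile private var total = 0
  @volatile private var done  = 0

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    total = jobStart.stageIds.size
    done  = 0
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    done += 1
    println(s"Stage ${stageCompleted.stageInfo.stageId} finished: $done / $total")
  }
}

// Register on the driver: sc.addSparkListener(new StageProgressListener())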
>
> On Thu, Feb 5, 2015 at 1:27 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> RDD#toDebugString will help.
>>
>> On Thu, Feb 5, 2015 at 1:15 AM, Joe Wass <jw...@crossref.org> wrote:
>>
>>> Thanks Akhil and Mark. I can of course count the stages (assuming I can
>>> deduce the shuffle boundaries), but like I said the program isn't simple and
>>> I'd have to redo this by hand every time I change the code. So I'd rather
>>> find a way of doing this automatically if possible.
>>>
>>> On 4 February 2015 at 19:41, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> But there isn't a 1-1 mapping from operations to stages since multiple
>>>> operations will be pipelined into a single stage if no shuffle is
>>>> required.  To determine the number of stages in a job you really need to be
>>>> looking for shuffle boundaries.
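
A small illustration of that point (assuming an existing SparkContext sc and a
made-up input path): several chained operations, but only two stages, because
everything before the shuffle is pipelined.

// map and filter are pipelined with the read into one stage;
// groupBy introduces the single shuffle, so the job has two stages in total.
val lines   = sc.textFile("events.log")            // placeholder path
val cleaned = lines.map(_.trim).filter(_.nonEmpty)
val byKey   = cleaned.groupBy(_.take(8))           // shuffle => second (and last) stage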
>>>>
>>>> On Wed, Feb 4, 2015 at 11:27 AM, Akhil Das <ak...@sigmoidanalytics.com>
>>>> wrote:
>>>>
>>>>> You can get a rough sense of the flow by looking at the operations in your
>>>>> program (map, groupBy, join, etc.): first list the operations your
>>>>> application performs, and then in the web UI you can see how many of them
>>>>> have run so far.
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Wed, Feb 4, 2015 at 4:33 PM, Joe Wass <jw...@crossref.org> wrote:
>>>>>
>>>>>> I'm sitting here watching my application crunch gigabytes of data on a
>>>>>> cluster, and I have no idea whether it's an hour away from completion or a
>>>>>> minute. The web UI shows progress through each stage, but not how many
>>>>>> stages remain. How can I work out, automatically, how many stages my
>>>>>> program will take?
>>>>>>
>>>>>> My application has a slightly interesting DAG (re-use of functions that
>>>>>> contain Spark transformations, persistent RDDs). Not that complex, but not
>>>>>> 'step 1, step 2, step 3'.
>>>>>>
>>>>>> I'm guessing that if the driver program runs sequentially, sending messages
>>>>>> to Spark, then Spark has no knowledge of the structure of the driver
>>>>>> program. Is it therefore necessary to execute it on a small test dataset
>>>>>> and see how many stages result?
>>>>>>
>>>>>> When I set spark.eventLog.enabled = true and run on (very small) test
>>>>>> data I don't get any stage messages in my STDOUT or in the log file. This
>>>>>> is on a `local` instance.
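
A minimal sketch of that setup (the app name and event-log directory are
placeholders; the directory has to exist before the application starts):

import org.apache.spark.{SparkConf, SparkContext}

// Event logging writes JSON event files under spark.eventLog.dir.
val conf = new SparkConf()
  .setAppName("stage-count-test")                   // placeholder name
  .setMaster("local[*]")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/tmp/spark-events")   // placeholder directory

val sc = new SparkContext(conf)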
>>>>>>
>>>>>> Did I miss something obvious?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Joe
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
