RDD#toDebugString will help.
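
For example, printing the final RDD's lineage shows the whole DAG, and the
indentation in the output marks the shuffle (i.e. stage) boundaries. A
rough sketch, assuming an existing SparkContext `sc` and a hypothetical
input file:

  import org.apache.spark.SparkContext._    // pair-RDD functions

  val counts = sc.textFile("input.txt")     // hypothetical path
    .flatMap(_.split(" "))                  // pipelined: same stage
    .map(word => (word, 1))                 // still the same stage
    .reduceByKey(_ + _)                     // shuffle boundary: new stage
  println(counts.toDebugString)

Counting the indented blocks in the output gives the stage count before the
job runs.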

On Thu, Feb 5, 2015 at 1:15 AM, Joe Wass <jw...@crossref.org> wrote:

> Thanks Akhil and Mark. I can of course count events (assuming I can deduce
> the shuffle boundaries), but like I said the program isn't simple and I'd
> have to do this manually every time I change the code. So I'd rather find
> a way of doing this automatically, if possible.
>
> On 4 February 2015 at 19:41, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> But there isn't a 1-1 mapping from operations to stages since multiple
>> operations will be pipelined into a single stage if no shuffle is
>> required.  To determine the number of stages in a job you really need to be
>> looking for shuffle boundaries.
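>>
>> For instance, a rough sketch (assuming a SparkContext `sc` and a
>> hypothetical input file):
>>
>>   sc.textFile("logs.txt")                  // stage 1 starts here
>>     .map(line => (line.split(",")(0), 1))  // pipelined into stage 1
>>     .reduceByKey(_ + _)                    // shuffle boundary: stage 2
>>     .map { case (k, n) => k + ":" + n }    // pipelined into stage 2
>>     .count()                               // action: runs both stages
>>
>> Four operations, but only two stages.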
>>
>> On Wed, Feb 4, 2015 at 11:27 AM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> You can easily understand the flow by looking at the operations in your
>>> program (map, groupBy, join, etc.). First, list the operations your
>>> application performs; then, from the web UI, you can see how many have
>>> run so far.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Wed, Feb 4, 2015 at 4:33 PM, Joe Wass <jw...@crossref.org> wrote:
>>>
>>>> I'm sitting here looking at my application crunching gigabytes of data
>>>> on a cluster and I have no idea if it's an hour away from completion or a
>>>> minute. The web UI shows progress through each stage, but not how many
>>>> stages remain. How can I automatically work out how many stages my
>>>> program will take?
>>>>
>>>> My application has a slightly interesting DAG (re-use of functions that
>>>> contain Spark transformations, persistent RDDs). Not that complex, but not
>>>> 'step 1, step 2, step 3'.
>>>>
>>>> I'm guessing that if the driver program runs sequentially sending
>>>> messages to Spark, then Spark has no knowledge of the structure of the
>>>> driver program. Is it therefore necessary to execute it on a small test
>>>> dataset and see how many stages result?
>>>>
>>>> When I set spark.eventLog.enabled = true and run on (very small) test
>>>> data, I don't get any stage messages in my STDOUT or in the log file.
>>>> This is on a `local` instance.
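>>>>
>>>> Roughly what I'm doing (app name and paths are illustrative):
>>>>
>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>
>>>>   val conf = new SparkConf()
>>>>     .setMaster("local")
>>>>     .setAppName("stage-count-test")        // hypothetical name
>>>>     .set("spark.eventLog.enabled", "true")
>>>>     .set("spark.eventLog.dir", "/tmp/spark-events") // hypothetical dir
>>>>   val sc = new SparkContext(conf)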
>>>>
>>>> Did I miss something obvious?
>>>>
>>>> Thanks!
>>>>
>>>> Joe
>>>>
>>>
>>>
>>
>
