Re: Bridging gap between Spark UI and Code

2021-05-25 Thread Wenchen Fan
You can see the SQL plan node name in the DAG visualization. Please refer
to https://spark.apache.org/docs/latest/web-ui.html for more details. If
anything is still unclear, please let us know and we will keep improving the
documentation.
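
For instance, a minimal sketch (operator names vary by Spark version and
query): run a query, print its physical plan, and match the operator names
against the blocks in the SQL tab's DAG visualization:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").appName("ui-demo").getOrCreate()
  import spark.implicits._

  val agg = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v").groupBy("k").sum("v")

  // Node names printed here (e.g. HashAggregate, Exchange) reappear as
  // blocks in the SQL tab's DAG visualization for this query.
  agg.explain()
  agg.collect()  // run an action so the query shows up in the UI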



Re: Bridging gap between Spark UI and Code

2021-05-24 Thread mhawes
@Wenchen Fan, understood that the mapping from query plan to application code
is very hard. I was wondering if we might instead be able to handle just the
mapping from the final physical plan to the stage graph, so that, for example,
you'd be able to tell which part of the plan generated which stages. I feel
this would provide the most benefit without having to worry about the several
optimisation steps in between.

The main issue as I see it is that currently, if there's a failing stage, it's
almost impossible to track down the part of the plan that generated that
stage. Would this be possible? If not, do you have any other suggestions for
this kind of debugging?

Best,
Matt
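
A partial workaround that exists today (a sketch, assuming Spark 3.0+ with
whole-stage codegen): the codegen ids printed by explain, e.g. "*(3)
HashAggregate", can often be matched against the "WholeStageCodegen (3)"
scopes shown in a stage's DAG visualization, tracing a failing stage back to
a plan fragment:

  val q = spark.range(1000000).selectExpr("id % 10 AS k").groupBy("k").count()

  // "formatted" mode (Spark 3.0+) lists each operator with its codegen id;
  // the same ids label the WholeStageCodegen blocks in the stage DAG view.
  q.explain("formatted")
  q.collect()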






Re: Bridging gap between Spark UI and Code

2021-05-24 Thread Mich Talebzadeh
Plus, some operators can be repeated: if a node dies, Spark needs to rebuild
that state from the RDD lineage.

HTH

Mich
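
A quick way to see the lineage Spark would replay (a small sketch):

  // toDebugString prints the RDD lineage Spark would recompute from
  // to rebuild lost partitions after a node failure.
  val rdd = spark.sparkContext.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)
  println(rdd.toDebugString)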





Re: Bridging gap between Spark UI and Code

2021-05-24 Thread Wenchen Fan
I believe you can already see each plan change Spark made to your query plan
in the debug-level logs. I think it's hard to do in the web UI, as keeping
all these historical query plans is expensive.

Mapping the query plan to your application code is nearly impossible, as so
many optimizations can happen (some operators can be removed, some can be
replaced by different ones, and some can be added by Spark).
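
A sketch of both options (the planChangeLog config assumes Spark 3.1+; check
your version's docs):

  // Log every optimizer rule/batch that changed the plan at WARN, so the
  // changes are visible without enabling debug logging globally (Spark 3.1+).
  spark.conf.set("spark.sql.planChangeLog.level", "WARN")

  // explain(true) prints the parsed, analyzed, optimized, and physical
  // plans side by side for a single query.
  spark.range(10).filter("id > 5").explain(true)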



Re: Bridging gap between Spark UI and Code

2021-05-24 Thread Will Raschkowski
This would be great.

At least for logical nodes, would it be possible to re-use the existing
Utils.getCallSite
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1526>
to populate a field when nodes are created? I suppose most value would come
from eventually passing the call sites along to physical nodes. But maybe,
just as a starting point, Spark could display the call site only with
unoptimized logical plans? Users would still get a better sense of how the
plan's structure relates to their code.
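
A rough sketch of the idea, using the TreeNodeTag mechanism Spark already
has for attaching metadata to plan nodes (internal APIs, so this would live
inside Spark itself; nothing sets this particular tag today -- it is what
the proposal would add):

  import org.apache.spark.sql.catalyst.trees.TreeNodeTag
  import org.apache.spark.util.Utils  // private[spark]: only usable within Spark

  // Hypothetical tag under which a call site could be stored on plan nodes.
  val CALL_SITE_TAG = TreeNodeTag[String]("callSite")

  // At node-creation time the proposal would do roughly:
  //   node.setTagValue(CALL_SITE_TAG, Utils.getCallSite().shortForm)
  // and explain output / the UI could then read it back with:
  //   node.getTagValue(CALL_SITE_TAG)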



Re: Bridging gap between Spark UI and Code

2021-05-21 Thread mhawes
Reviving this thread to ask whether any of the Spark maintainers would
consider helping to scope a solution for this. Michal outlines the problem in
this thread, but to clarify: the issue is that for very complex Spark
applications, where the logical plans often span many pages, it is extremely
hard to figure out how the stages in the Spark UI/RDD operations link to the
logical plan that generated them.

Now, obviously this is a hard problem to solve given the various
optimisations and transformations that go on between these two stages.
However, I wanted to raise it as a potential option as I think it would be
/extremely/ valuable for Spark users.

My two main ideas are either:
 - To carry a reference to the original plan around when
planning/optimising.
 - To maintain a separate mapping for each planning/optimisation step that
maps from source to target. I'm thinking along the lines of JavaScript
source maps (see the sketch below).

It would be great to get the opinion of an experienced Spark maintainer on
this, given the complexity.
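
For the second idea, a very rough sketch of what such a "plan source map"
could look like (all names hypothetical; none of this exists in Spark today):

  // Each planning/optimisation step records, for every node in its output
  // plan, which node in its input plan produced it.
  final case class PlanSourceMap(
      ruleName: String,
      nodeOrigins: Map[Long, Long]) // output node id -> input node id

  // Composing the per-step maps walks a physical operator (or the stage it
  // became) back to the node in the original, unoptimised plan.
  def traceBack(maps: Seq[PlanSourceMap], finalNodeId: Long): Long =
    maps.foldRight(finalNodeId)((m, id) => m.nodeOrigins.getOrElse(id, id))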






Re: Bridging gap between Spark UI and Code

2020-07-21 Thread Michal Sankot
And to be clear: yes, execution plans show exactly what Spark is doing. The
problem is that it's unclear how that relates to the actual Scala/Python
code.




Re: Bridging gap between Spark UI and Code

2020-07-21 Thread Michal Sankot
Yes, the problem is that DAGs only refer to the code line (action) that
invoked them. They don't provide information about how individual
transformations link to the code.

So you can have a dozen stages, each tagged with the same code line that
invoked it, each doing different things. And then we have to guess what each
is actually doing.
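
One mitigation that already works: label jobs yourself around each logical
step, so stages carry a human-readable description in the UI instead of a
dozen entries with the same call site (a sketch; the path is a placeholder):

  val sc = spark.sparkContext

  sc.setJobDescription("load and dedupe users")
  val users = spark.read.parquet("/tmp/users").dropDuplicates("id")
  users.count()  // jobs/stages from this action show the description above

  sc.setJobDescription("aggregate events per user")
  // ... the next action picks up the new description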






Re: Bridging gap between Spark UI and Code

2020-07-21 Thread Russell Spitzer
Have you looked at the DAG visualization? Each block refers to the code line
invoking it.

For DataFrames, the execution plan will tell you explicitly which operations
are in which stages.



Bridging gap between Spark UI and Code

2020-07-21 Thread Michal Sankot

Hi,
when I analyze and debug our Spark batch job executions, it's a pain to find
out how blocks in the Spark UI Jobs/SQL tabs correspond to the actual Scala
code that we write, and how much time they take. Would there be a way to
somehow instruct the compiler, or something similar, to get this information
into the Spark UI?

At the moment, linking Spark UI elements with our code is guesswork driven by
adding and removing lines of code and rerunning the job, which is tedious. A
possibility to make our lives easier, e.g. by running Spark jobs in a
dedicated debug mode where this information would be available, would be
greatly appreciated. (Though I don't know whether it's possible at all.)

Thanks,
Michal
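
Short of a debug mode, one knob that already exists is overriding the call
site shown in the UI (a sketch; the input path is a placeholder):

  val sc = spark.sparkContext

  sc.setCallSite("ingest: parse raw events")
  val parsed = sc.textFile("/tmp/events.log").map(_.split(","))

  sc.setCallSite("ingest: count by key")
  val counts = parsed.map(a => (a(0), 1)).reduceByKey(_ + _)
  counts.count()  // stages from this action carry the labels set above

  sc.clearCallSite()  // back to automatic call-site detection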
