[ https://issues.apache.org/jira/browse/SPARK-28562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-28562:
---------------------------------
    Priority: Minor  (was: Critical)

> PySpark profiling is not understandable
> ---------------------------------------
>
>                 Key: SPARK-28562
>                 URL: https://issues.apache.org/jira/browse/SPARK-28562
>             Project: Spark
>          Issue Type: Question
>          Components: Optimizer
>    Affects Versions: 2.4.0
>            Reporter: Albertus Kelvin
>            Priority: Minor
>
> I was profiling code in PySpark. I set "spark.python.profile" to "true" in 
> the Spark config. I also wrote a simple method consisting of several 
> DataFrame operations, such as "withColumn" and "join". Here's the code 
> sample:
> {code:python}
> from pyspark.sql import functions as F
>
> def join_df(df, df1):
>     # Add a constant column and a column derived from it.
>     df = df.withColumn('rowa', F.lit(100))
>     df = df.withColumn('rowb', df['rowa'] * F.lit(100))
>
>     # Left-join with the second DataFrame on 'rowid'.
>     joined_df = df.join(df1, 'rowid', how='left')
>     return joined_df
> {code}
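> For reference, here is a minimal sketch of how the profiler can be enabled 
> and its output printed; the session setup and app name are illustrative 
> assumptions, not the exact code from my job:
> {code:python}
> from pyspark.sql import SparkSession
>
> # Enable the Python worker profiler via config (must be set before the
> # context is created; assumed setup, illustrative app name).
> spark = (SparkSession.builder
>          .appName("profiling-example")
>          .config("spark.python.profile", "true")
>          .getOrCreate())
> sc = spark.sparkContext
>
> # Run some Python-side RDD work so the profiler has something to record.
> sc.parallelize(range(100)).map(lambda x: x * 2).count()
>
> # Print the accumulated per-RDD profile stats (also shown at driver exit).
> sc.show_profiles()
> {code}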
> However, after the driver exits, the profiler output was not understandable 
> because my filename and the corresponding methods did not appear anywhere. 
> All that showed up were Spark's built-in files and methods, such as 
> "rdd.py", "worker.py", and "serializers.py".
> The question is: how can I see which of my methods are the bottlenecks? For 
> example, using the code sample above, I'd like to know the time taken by 
> the "withColumn" and "join" operations.
> Thanks.



