[ https://issues.apache.org/jira/browse/SPARK-28562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897835#comment-16897835 ]
Hyukjin Kwon commented on SPARK-28562:
--------------------------------------

Please ask questions on the mailing list rather than filing them as issues. See https://spark.apache.org/community.html

> PySpark profiling is not understandable
> ---------------------------------------
>
>                 Key: SPARK-28562
>                 URL: https://issues.apache.org/jira/browse/SPARK-28562
>             Project: Spark
>          Issue Type: Question
>          Components: Optimizer
>    Affects Versions: 2.4.0
>            Reporter: Albertus Kelvin
>            Priority: Minor
>
> I was profiling code in PySpark. I set "spark.python.profile" to "true" in the config and wrote a simple method consisting of several DataFrame operations, such as "withColumn" and "join". Here's the code sample:
> {code:python}
> from pyspark.sql import functions as F
>
> def join_df(df, df1):
>     df = df.withColumn('rowa', F.lit(100))
>     df = df.withColumn('rowb', df['rowa'] * F.lit(100))
>
>     joined_df = df.join(df1, 'rowid', how='left')
>     return joined_df
> {code}
> However, after the driver exited, the profiler output was not understandable: my filename and its methods did not appear at all. Everything listed came from Spark's built-in files, such as "rdd.py", "worker.py", and "serializers.py".
> The question is: how can I make my own methods show up as the bottlenecks? For example, with the code sample above, I'd like to know the time spent in the "withColumn" and "join" operations.
> Thanks.
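As a side note on why only Spark-internal files appear: the Python profiler only instruments code that actually runs in the Python workers, while DataFrame operations such as withColumn and join are planned and executed in the JVM, so they never reach the profiler. Below is a minimal sketch of enabling the profiler and dumping its output; the RDD pipeline is an illustrative placeholder, not taken from the report.

{code:python}
# Minimal sketch: enable the Python profiler, run some Python-side work,
# and dump the collected stats. The RDD and its lambda are illustrative.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.python.profile", "true")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# RDD transformations execute user lambdas inside Python workers,
# so the profiler can see them.
rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
rdd.count()

# Prints cProfile-style stats accumulated per RDD. A DataFrame-only job
# like join_df above leaves nothing here, since its work stays in the JVM.
sc.show_profiles()
{code}

For timing the DataFrame operations themselves, the SQL tab of the Spark UI (or df.explain()) is usually a more informative tool than the Python profiler.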