[ https://issues.apache.org/jira/browse/SPARK-37174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437184#comment-17437184 ]
Hyukjin Kwon commented on SPARK-37174:
--------------------------------------

This is related to the default index type; see also https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type. Spark 3.3 aims to remove such warnings.

> WARN WindowExec: No Partition Defined is being printed 4 times.
> ----------------------------------------------------------------
>
>                 Key: SPARK-37174
>                 URL: https://issues.apache.org/jira/browse/SPARK-37174
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Bjørn Jørgensen
>            Priority: Major
>
> Hi, I use this code:
> {code:java}
> import re
> import pyspark.pandas as ps
>
> # `spark` is the active SparkSession
> f01 = spark.read.json("/home/test_files/falk/flatted110721/F01.json/*.json")
> pf01 = f01.to_pandas_on_spark()
> pf01 = pf01.rename(columns=lambda x: re.sub(':P$', '', x))
> pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"] = ps.to_datetime(pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"])
> pf01.info(){code}
>
> Sometimes it prints
>
> {code:java}
> 21/10/31 20:38:04 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
> 21/10/31 20:38:04 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> 21/10/31 20:38:08 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
> /opt/spark/python/pyspark/sql/pandas/conversion.py:214: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
>   df[column_name] = series
> /opt/spark/python/pyspark/pandas/utils.py:967: UserWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas Series is expected to be small.
>   warnings.warn(message, UserWarning)
> 21/10/31 20:38:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
> 21/10/31 20:38:18 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.{code}
>
> and at other times it "just" prints
>
> {code:java}
> 21/10/31 21:24:13 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
> 21/10/31 21:24:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
> 21/10/31 21:24:22 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
> 21/10/31 21:24:24 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.{code}
>
> Why does it print "df[column_name] = series"?
>
> Can we remove "/opt/spark/python/pyspark/pandas/utils.py:967:" and "warnings.warn(message, UserWarning)"?
> And can we remove 3 of the 4 "WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation." messages?
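
For readers hitting the same output, here is a minimal sketch of two workarounds suggested by the comment above and the linked options page. It is not the fix that landed in Spark. The helper names `quiet_python_warnings` and `use_distributed_index` are hypothetical. The sketch assumes the WindowExec warning comes from the "sequence" default index type, which the linked documentation describes as using a Window function without a partition.

```python
import warnings


def quiet_python_warnings() -> None:
    """Hide the two Python-side warnings quoted in the report (hypothetical helper)."""
    # Emitted from pyspark/pandas/utils.py when data is collected to the driver.
    warnings.filterwarnings(
        "ignore",
        message=".*loads all data into the driver's memory",
        category=UserWarning,
    )
    # pandas' PerformanceWarning is a Warning subclass; matching on the message
    # text avoids importing pandas just to name the category.
    warnings.filterwarnings("ignore", message="DataFrame is highly fragmented")


def use_distributed_index() -> None:
    """Pick a default index type that is not built with an un-partitioned Window."""
    # Imported lazily so that this module loads even where PySpark is absent.
    import pyspark.pandas as ps

    # Assumption: the WARN lines come from the "sequence" default index.
    ps.set_option("compute.default_index_type", "distributed")
```

Call `use_distributed_index()` before `to_pandas_on_spark()` so the option is in effect when the index is attached. Per the options page, a "distributed" index is neither consecutive nor deterministic, so it only suits code that does not rely on index values. The filters in `quiet_python_warnings()` affect Python warnings only; the `WARN WindowExec` lines come from the JVM logger and are not touched by the `warnings` module.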