[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

GitBox Wed, 23 Nov 2022 23:51:30 -0800


EnricoMi commented on code in PR #38223:
URL: https://github.com/apache/spark/pull/38223#discussion_r1031150574



##########
python/pyspark/worker.py:
##########
@@ -146,7 +146,74 @@ def verify_result_type(result):
     )
 
 
-def wrap_cogrouped_map_pandas_udf(f, return_type, argspec):
+def verify_pandas_result(result, return_type, assign_cols_by_name):
+    import pandas as pd
+
+    if not isinstance(result, pd.DataFrame):
+        raise TypeError(
+            "Return type of the user-defined function should be "
+            "pandas.DataFrame, but is {}".format(type(result))
+        )
+
+    # check the schema of the result only if it is not empty or has columns
+    if not result.empty or len(result.columns) != 0:
+        # if any column name of the result is a string
+        # the column names of the result have to match the return type
+        #   see create_array in 
pyspark.sql.pandas.serializers.ArrowStreamPandasSerializer
+        field_names = set([field.name for field in return_type.fields])
+        column_names = set(result.columns)
+        if (
+            assign_cols_by_name
+            and any(isinstance(name, str) for name in result.columns)
+            and column_names != field_names
+        ):
+            limit = 5

Review Comment:
   @HyukjinKwon I have removed the schema and columns limit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

Reply via email to