nicolasazrak commented on a change in pull request #34509:
URL: https://github.com/apache/spark/pull/34509#discussion_r758535773
##########
File path: python/pyspark/sql/pandas/serializers.py
##########

```diff
@@ -169,6 +169,8 @@ def create_array(s, t):
         elif is_categorical_dtype(s.dtype):
             # Note: This can be removed once minimum pyarrow version is >= 0.16.1
             s = s.astype(s.dtypes.categories.dtype)
+        elif t is not None and pa.types.is_string(t):
+            s = s.astype(str)
```

Review comment:
   Now that I think about it again, adding support for `StringArray` in Arrow is the real solution; this change is just a workaround. I don't know much about the Arrow internals, or whether some metadata could be added to support two different string types. If you feel this is better done upstream in Arrow, feel free to close the PR and I'll investigate it from the Arrow side. Otherwise, we can keep this patch, add a comment, and continue investigating in Arrow.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org