[ https://issues.apache.org/jira/browse/SPARK-42751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702394#comment-17702394 ]
Hyukjin Kwon commented on SPARK-42751:
--------------------------------------

cc [~itholic] FYI

> Pyspark.pandas.series.str.findall can't handle tuples that are returned by regex
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-42751
>                 URL: https://issues.apache.org/jira/browse/SPARK-42751
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.3.2
>            Reporter: IonK
>            Priority: Major
>
> When you use the str.findall accessor method on a ps.Series and pass a
> regex pattern that produces match groups, it raises a PyArrow type error.
>
> In pandas the result is:
> {code:java}
> df.to_pandas()[col].str.findall(regex_pattern, flags=re.IGNORECASE)
> returns
> [("value", , , , )],
> [("value", , , , )],
> [(, , ,"value", )]{code}
>
> In pyspark.pandas the result is:
> {code:java}
> org.apache.spark.api.python.PythonException: 'pyarrow.lib.ArrowTypeError:
> Expected bytes, got a 'tuple' object'{code}
>
> My temporary workaround is:
> {code:java}
> df.apply(lambda x: re.findall(regex_pattern, x, flags=re.IGNORECASE)[0]){code}
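The root cause the report describes can be shown with the standard library alone: when a pattern contains more than one capture group, `re.findall` returns a tuple per match rather than a string, and tuples are what PyArrow fails to serialize into a string column. A minimal sketch (the pattern and input strings below are hypothetical, not taken from the report):

```python
import re

# A pattern with two capture groups. Because there is more than one group,
# re.findall returns a list of tuples, one tuple per match -- the same
# tuple values that pyarrow cannot store as a string column.
pattern = r"(\w+)-(\d+)"
matches = re.findall(pattern, "item-1 item-22", flags=re.IGNORECASE)
print(matches)  # [('item', '1'), ('item', '22')]

# Shape of the reported workaround: call re.findall per value and keep
# only the first match, so each cell holds a single tuple instead of a
# list of tuples.
first_match = re.findall(pattern, "item-1 item-22", flags=re.IGNORECASE)[0]
print(first_match)  # ('item', '1')
```

In the reported workaround this per-value logic runs inside `df.apply`, which sidesteps the `str.findall` accessor and its PyArrow conversion path.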