[ https://issues.apache.org/jira/browse/SPARK-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796227#comment-15796227 ]
Nicholas Chammas commented on SPARK-18866:
------------------------------------------

Could be. I guess the issue of aliasing somehow masks the codegen bug with escaping the backslash?

> Codegen fails with cryptic error if regexp_replace() output column is not aliased
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-18866
>                 URL: https://issues.apache.org/jira/browse/SPARK-18866
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2, 2.1.0
>        Environment: Java 8, Python 3.5
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Here's a minimal repro:
> {code}
> import pyspark
> from pyspark.sql import Column
> from pyspark.sql.functions import regexp_replace, lower, col
>
> def normalize_udf(column: Column) -> Column:
>     normalized_column = (
>         regexp_replace(
>             column,
>             pattern='[\s]+',
>             replacement=' ',
>         )
>     )
>     return normalized_column
>
> if __name__ == '__main__':
>     spark = pyspark.sql.SparkSession.builder.getOrCreate()
>     raw_df = spark.createDataFrame(
>         [(' ',)],
>         ['string'],
>     )
>     normalized_df = raw_df.select(normalize_udf('string'))
>     normalized_df_prime = (
>         normalized_df
>         .groupBy(sorted(normalized_df.columns))
>         .count())
>     normalized_df_prime.show()
> {code}
> When I run this I get:
> {code}
> ERROR CodeGenerator: failed to compile:
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80, Column 130: Invalid escape sequence
> {code}
> Followed by a huge barf of generated Java code, _and then the output I expect_. (So despite the scary error, the code actually works!)
> Can you spot the error in my code?
> It's simple: I just need to alias the output of {{normalize_udf()}} and all is forgiven:
> {code}
> normalized_df = raw_df.select(normalize_udf('string').alias('string'))
> {code}
> Of course, it's impossible to tell that from the current error output.
> So my *first question* is: Is there some way we can better communicate to the user what went wrong?
> Another interesting thing I noticed is that if I try this:
> {code}
> normalized_df = raw_df.select(lower('string'))
> {code}
> I immediately get a clean error saying:
> {code}
> py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.lower. Trace:
> py4j.Py4JException: Method lower([class java.lang.String]) does not exist
> {code}
> I can fix this by building a column object:
> {code}
> normalized_df = raw_df.select(lower(col('string')))
> {code}
> So that raises *a second problem/question*: Why does {{lower()}} require that I build a Column object, whereas {{regexp_replace()}} does not? The inconsistency adds to the confusion here.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
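[Editor's note] The "Invalid escape sequence" error is consistent with the backslash in the Python pattern literal being spliced into the generated Java source unescaped. The sketch below only mimics that string handling to illustrate the reading; it is not Spark's actual codegen:

```python
# Sketch of one reading of the codegen error -- this mimics the string
# handling only; it is NOT Spark's actual code generator.

# Python leaves unrecognized escapes alone, so '[\s]+' contains a
# literal backslash followed by 's'.
pattern = '[\s]+'
assert '\\' in pattern

# Splicing that pattern verbatim into Java source reproduces the broken
# literal: "\s" is not a valid escape sequence in Java, so the Janino
# compiler rejects the generated file.
naive_java = 'String pattern = "%s";' % pattern
# naive_java == 'String pattern = "[\s]+";'  (invalid Java)

# Doubling the backslashes first yields a valid Java string literal.
escaped_java = 'String pattern = "%s";' % pattern.replace('\\', '\\\\')
# escaped_java == 'String pattern = "[\\s]+";'  (valid Java)
```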
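[Editor's note] On the second question, one plausible explanation is a dispatch difference in PySpark 2.x: `regexp_replace()` coerces its first argument to a Column, while `lower()` forwarded whatever Python value it received to the JVM, so a bare string reached Java and no `lower(String)` overload existed. The mimic below is hypothetical (names like `_to_java_column` are borrowed for illustration; this is not Spark's real code):

```python
# Hypothetical mimic of the two PySpark call paths -- NOT Spark's real
# code; names are borrowed only to illustrate the shape of the bug.

class Column:
    """Stand-in for pyspark.sql.Column."""
    def __init__(self, name):
        self.name = name

def _to_java_column(col):
    # Accepts either a Column or a column-name string.
    return col if isinstance(col, Column) else Column(col)

def regexp_replace_path(col):
    # regexp_replace() coerces first, so the bare name 'string' works.
    return _to_java_column(col)

def lower_path(col):
    # A function that forwards the raw value lets a bare str reach the
    # JVM, which has no matching lower(String) overload -> Py4JError.
    return col

assert isinstance(regexp_replace_path('string'), Column)
assert isinstance(lower_path('string'), str)
```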