[ https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821073#comment-15821073 ]
Ben edited comment on SPARK-18667 at 1/12/17 2:10 PM:
------------------------------------------------------

I still have the same problem on PySpark 2.1.0 and Python 3.5.2, following the exact same steps as described in the issue.

Additionally, I have the following problem: I have two dataframes, one where I read files and which consequently has input_file_name() containing the source path for each row, and another with other data and a column FILEPATH that already contains the same file paths as static values. I want to join these tables on the file path columns, so I do the following.

First, I add a file path column to the first dataframe, based on input_file_name():

{noformat}
dataFrame = dataFrame.withColumn('FILEPATH', input_file_name())
{noformat}

(the same happens with)

{noformat}
dataFrame = dataFrame.select('*', input_file_name().alias('FILEPATH'))
{noformat}

Then I join the two tables on the FILEPATH column:

{noformat}
dataFrame = dataFrame.join(df2, 'FILEPATH')
{noformat}

Up to here everything works, and the FILEPATH column is not empty:

{noformat}
+--------------+-------+-------+
|      FILEPATH|ColumnA|ColumnB|
+--------------+-------+-------+
|file:/C:/a.xml| StuffA| StuffX|
|file:/C:/b.xml| StuffB| StuffY|
+--------------+-------+-------+
{noformat}

But then, if I apply a UDF to any column of the resulting dataframe, I get a completely empty result. For example, with the following UDF:

{noformat}
def f(text):
    return text
{noformat}

and

{noformat}
dataFrame.selectExpr('')
{noformat}

I get:

{noformat}
+--------------+-------+-------+
|          PATH|      A|      B|
+--------------+-------+-------+
+--------------+-------+-------+
{noformat}

Am I missing something, or is the bug still present?

was (Author: someonehere15): I still have the same problem on pySpark 2.1.0 and Python 3.5.2 with the exact same steps as described in the issue.
> input_file_name function does not work with UDF
> -----------------------------------------------
>
>                 Key: SPARK-18667
>                 URL: https://issues.apache.org/jira/browse/SPARK-18667
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>            Reporter: Hyukjin Kwon
>            Assignee: Liang-Chi Hsieh
>             Fix For: 2.1.0
>
>
> {{input_file_name()}} returns an empty string instead of the file name
> when it is used as input for a UDF in PySpark, as below.
>
> With the data below:
> {code}
> {"a": 1}
> {code}
> the following code:
> {code}
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
>
> def filename(path):
>     return path
>
> sourceFile = udf(filename, StringType())
> spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
> {code}
> prints:
> {code}
> +---------------------------+
> |filename(input_file_name())|
> +---------------------------+
> |                           |
> +---------------------------+
> {code}
> but the code below:
> {code}
> spark.read.json("tmp.json").select(input_file_name()).show()
> {code}
> prints correctly:
> {code}
> +--------------------+
> |   input_file_name()|
> +--------------------+
> |file:///Users/hyu...|
> +--------------------+
> {code}
> This seems to be a PySpark-specific issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)