[ https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821506#comment-15821506 ]
Ben edited comment on SPARK-18667 at 1/13/17 9:53 AM: ------------------------------------------------------ So, I created a new example now, and here is the code for everything: a.xml: {noformat} <root> <x>TEXT</x> <y>TEXT2</y> </root> {noformat} b.xml: {noformat} <root> <file>file:/C:/a.xml</file> <other>AAA</other> </root> {noformat} code: {noformat} from pyspark.sql.functions import udf,input_file_name from pyspark.sql.types import StringType from pyspark.sql import SparkSession def filename(path): return path session = SparkSession.builder.appName('APP').getOrCreate() session.udf.register('sameText',filename) sameText = udf(filename, StringType()) df = session.read.format('xml').load('../../res/Other/a.xml', rowTag='root').select('*',input_file_name().alias('file')) df.select('file').show() df.select(sameText(df['file'])).show() df2 = session.read.format('xml').load('../../res/Other/b.xml', rowTag='root') df3 = df.join(df2, 'file') df.show() df2.show() df3.show() df3.selectExpr('file as FILE','x AS COL1','sameText(y) AS COL2').show() {noformat} and this is the console output: {noformat} +--------------------+ | file| +--------------------+ |file:/C:/Users/SS...| +--------------------+ +--------------+ |filename(file)| +--------------+ | | +--------------+ +----+-----+--------------------+ | x| y| file| +----+-----+--------------------+ |TEXT|TEXT2|file:/C:/Users/SS...| +----+-----+--------------------+ +--------------------+-----+ | file|other| +--------------------+-----+ |file:/C:/Users/SS...| AAA| +--------------------+-----+ +--------------------+----+-----+-----+ | file| x| y|other| +--------------------+----+-----+-----+ |file:/C:/Users/SS...|TEXT|TEXT2| AAA| +--------------------+----+-----+-----+ [Stage 26:> (0 + 4) / 4] [Stage 29:> (0 + 8) / 20] [Stage 29:=================> (6 + 8) / 20] [Stage 29:===================> (7 + 8) / 20] [Stage 29:======================> (8 + 8) / 20] [Stage 29:============================> (10 + 8) / 20] [Stage 29:====================================> (13 + 7) / 20] [Stage 29:=======================================> (14 + 6) / 20] [Stage 29:==========================================> (15 + 5) / 20] [Stage 32:> (0 + 8) / 100] [Stage 32:===> (7 + 8) / 100] [Stage 32:====> (8 + 8) / 100] [Stage 32:=======> (13 + 8) / 100] [Stage 32:========> (15 + 8) / 100] [Stage 32:===========> (20 + 8) / 100] [Stage 32:============> (22 + 8) / 100] [Stage 32:==============> (27 + 8) / 100] [Stage 32:===============> (29 + 8) / 100] [Stage 32:==================> (34 + 8) / 100] [Stage 32:===================> (36 + 8) / 100] [Stage 32:======================> (41 + 8) / 100] [Stage 32:=======================> (42 + 8) / 100] [Stage 32:=========================> (46 + 8) / 100] [Stage 32:==========================> (48 + 8) / 100] [Stage 32:==========================> (49 + 8) / 100] [Stage 32:===========================> (50 + 8) / 100] [Stage 32:=============================> (53 + 8) / 100] [Stage 32:==============================> (55 + 8) / 100] [Stage 32:==============================> (56 + 8) / 100] [Stage 32:===============================> (57 + 8) / 100] [Stage 32:=================================> (60 + 8) / 100] [Stage 32:==================================> (62 + 8) / 100] [Stage 32:==================================> (63 + 8) / 100] [Stage 32:===================================> (65 + 8) / 100] [Stage 32:====================================> (67 + 8) / 100] [Stage 32:=====================================> (69 + 8) / 100] [Stage 32:======================================> (70 + 8) / 100] [Stage 32:=======================================> (72 + 8) / 100] [Stage 32:========================================> (74 + 8) / 100] [Stage 32:=========================================> (76 + 8) / 100] [Stage 32:==========================================> (77 + 8) / 100] [Stage 32:===========================================> (79 + 8) / 100] [Stage 32:============================================> (81 + 8) / 100] [Stage 32:=============================================> (83 + 8) / 100] [Stage 32:==============================================> (84 + 8) / 100] [Stage 32:===============================================> (86 + 8) / 100] [Stage 32:================================================> (88 + 8) / 100] [Stage 32:=================================================> (90 + 8) / 100] [Stage 32:==================================================> (91 + 8) / 100] [Stage 32:===================================================> (93 + 7) / 100] [Stage 32:====================================================> (95 + 5) / 100] [Stage 32:=====================================================> (97 + 3) / 100] [Stage 35:> (0 + 8) / 75] [Stage 35:==> (4 + 11) / 75] [Stage 35:=====> (7 + 8) / 75] [Stage 35:======> (8 + 8) / 75] [Stage 35:=========> (13 + 8) / 75] [Stage 35:==========> (14 + 8) / 75] [Stage 35:===========> (15 + 8) / 75] [Stage 35:===========> (16 + 8) / 75] [Stage 35:==============> (20 + 8) / 75] [Stage 35:===============> (21 + 8) / 75] [Stage 35:================> (22 + 8) / 75] [Stage 35:=================> (23 + 8) / 75] [Stage 35:====================> (27 + 8) / 75] [Stage 35:====================> (28 + 8) / 75] [Stage 35:=====================> (29 + 8) / 75] [Stage 35:======================> (30 + 8) / 75] [Stage 35:=========================> (34 + 8) / 75] [Stage 35:==========================> (35 + 8) / 75] [Stage 35:==========================> (36 + 8) / 75] [Stage 35:===========================> (37 + 8) / 75] [Stage 35:==============================> (41 + 8) / 75] [Stage 35:===============================> (42 + 8) / 75] [Stage 35:================================> (43 + 8) / 75] [Stage 35:================================> (44 + 8) / 75] [Stage 35:===================================> (48 + 8) / 75] [Stage 35:====================================> (49 + 8) / 75] [Stage 35:=====================================> (50 + 8) / 75] [Stage 35:======================================> (51 + 8) / 75] [Stage 35:=========================================> (55 + 8) / 75] [Stage 35:=========================================> (56 + 8) / 75] [Stage 35:==========================================> (57 + 8) / 75] [Stage 35:===========================================> (58 + 8) / 75] [Stage 35:==============================================> (62 + 8) / 75] [Stage 35:===============================================> (63 + 8) / 75] [Stage 35:================================================> (65 + 8) / 75] [Stage 35:===================================================> (69 + 6) / 75] [Stage 35:=====================================================> (72 + 3) / 75] +--------------------+----+-----+ | FILE|COL1| COL2| +--------------------+----+-----+ |file:/C:/Users/SS...|TEXT|TEXT2| +--------------------+----+-----+ SUCCESS: The process with PID 11916 (child process of PID 3592) has been terminated. SUCCESS: The process with PID 3592 (child process of PID 9904) has been terminated. SUCCESS: The process with PID 9904 (child process of PID 5468) has been terminated. {noformat} As you can see, After I add the "file" column, and show it, it's working. But then if I apply an UDF on it, it returns empty. Then I proceed with the join, and show each dataframe again. Everything is OK until the last row, where it takes a very long time considering the amount of data and the speed of the previous processes. And this worries me becauses I don't think it is supposed to take such a long time. Additionally, although in this example it works in the end, in my actual code it is not working. So as I already wrote on the previous post, if I join two dataframes, everything is OK, until I do a select where I apply a UDF, and then the whole query returns empty. The UDF works well if I don't join. The other thing is, that if I do a count on the dataframe after the select just mentioned, it returns the correct number of rows, and not 0, but if I try to show or write the rows, the dataframe comes up empty. I'm not sure why this is happening so I would appreciate any help, and to at least know whether it's a bug or not. was (Author: someonehere15): So, I created a new example now, and here is the code for everything: a.xml: {noformat} <root> <x>TEXT</x> <y>TEXT2</y> </root> {noformat} b.xml: {noformat} <root> <file>file:/C:/a.xml</file> <other>AAA</other> </root> {noformat} code: {noformat} from pyspark.sql.functions import udf,input_file_name from pyspark.sql.types import StringType from pyspark.sql import SparkSession def filename(path): return path session = SparkSession.builder.appName('APP').getOrCreate() session.udf.register('sameText',filename) sameText = udf(filename, StringType()) df = session.read.format('xml').load('../../res/Other/a.xml', rowTag='root').select('*',input_file_name().alias('file')) df.select('file').show() df.select(sameText(df['file'])).show() df2 = session.read.format('xml').load('../../res/Other/b.xml', rowTag='root') df3 = df.join(df2, 'file') df.show() df2.show() df3.show() df3.selectExpr('file as FILE','x AS COL1','sameText(y) AS COL2').show() {noformat} and this is the console output: {noformat} 2017-01-13 10:27:55 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable +--------------------+ | file| +--------------------+ |file:/C:/Users/SS...| +--------------------+ +--------------+ |filename(file)| +--------------+ | | +--------------+ +----+-----+--------------------+ | x| y| file| +----+-----+--------------------+ |TEXT|TEXT2|file:/C:/Users/SS...| +----+-----+--------------------+ +--------------------+-----+ | file|other| +--------------------+-----+ |file:/C:/Users/SS...| AAA| +--------------------+-----+ +--------------------+----+-----+-----+ | file| x| y|other| +--------------------+----+-----+-----+ |file:/C:/Users/SS...|TEXT|TEXT2| AAA| +--------------------+----+-----+-----+ [Stage 26:> (0 + 4) / 4] [Stage 29:> (0 + 8) / 20] [Stage 29:=================> (6 + 8) / 20] [Stage 29:===================> (7 + 8) / 20] [Stage 29:======================> (8 + 8) / 20] [Stage 29:============================> (10 + 8) / 20] [Stage 29:====================================> (13 + 7) / 20] [Stage 29:=======================================> (14 + 6) / 20] [Stage 29:==========================================> (15 + 5) / 20] [Stage 32:> (0 + 8) / 100] [Stage 32:===> (7 + 8) / 100] [Stage 32:====> (8 + 8) / 100] [Stage 32:=======> (13 + 8) / 100] [Stage 32:========> (15 + 8) / 100] [Stage 32:===========> (20 + 8) / 100] [Stage 32:============> (22 + 8) / 100] [Stage 32:==============> (27 + 8) / 100] [Stage 32:===============> (29 + 8) / 100] [Stage 32:==================> (34 + 8) / 100] [Stage 32:===================> (36 + 8) / 100] [Stage 32:======================> (41 + 8) / 100] [Stage 32:=======================> (42 + 8) / 100] [Stage 32:=========================> (46 + 8) / 100] [Stage 32:==========================> (48 + 8) / 100] [Stage 32:==========================> (49 + 8) / 100] [Stage 32:===========================> (50 + 8) / 100] [Stage 32:=============================> (53 + 8) / 100] [Stage 32:==============================> (55 + 8) / 100] [Stage 32:==============================> (56 + 8) / 100] [Stage 32:===============================> (57 + 8) / 100] [Stage 32:=================================> (60 + 8) / 100] [Stage 32:==================================> (62 + 8) / 100] [Stage 32:==================================> (63 + 8) / 100] [Stage 32:===================================> (65 + 8) / 100] [Stage 32:====================================> (67 + 8) / 100] [Stage 32:=====================================> (69 + 8) / 100] [Stage 32:======================================> (70 + 8) / 100] [Stage 32:=======================================> (72 + 8) / 100] [Stage 32:========================================> (74 + 8) / 100] [Stage 32:=========================================> (76 + 8) / 100] [Stage 32:==========================================> (77 + 8) / 100] [Stage 32:===========================================> (79 + 8) / 100] [Stage 32:============================================> (81 + 8) / 100] [Stage 32:=============================================> (83 + 8) / 100] [Stage 32:==============================================> (84 + 8) / 100] [Stage 32:===============================================> (86 + 8) / 100] [Stage 32:================================================> (88 + 8) / 100] [Stage 32:=================================================> (90 + 8) / 100] [Stage 32:==================================================> (91 + 8) / 100] [Stage 32:===================================================> (93 + 7) / 100] [Stage 32:====================================================> (95 + 5) / 100] [Stage 32:=====================================================> (97 + 3) / 100] [Stage 35:> (0 + 8) / 75] [Stage 35:==> (4 + 11) / 75] [Stage 35:=====> (7 + 8) / 75] [Stage 35:======> (8 + 8) / 75] [Stage 35:=========> (13 + 8) / 75] [Stage 35:==========> (14 + 8) / 75] [Stage 35:===========> (15 + 8) / 75] [Stage 35:===========> (16 + 8) / 75] [Stage 35:==============> (20 + 8) / 75] [Stage 35:===============> (21 + 8) / 75] [Stage 35:================> (22 + 8) / 75] [Stage 35:=================> (23 + 8) / 75] [Stage 35:====================> (27 + 8) / 75] [Stage 35:====================> (28 + 8) / 75] [Stage 35:=====================> (29 + 8) / 75] [Stage 35:======================> (30 + 8) / 75] [Stage 35:=========================> (34 + 8) / 75] [Stage 35:==========================> (35 + 8) / 75] [Stage 35:==========================> (36 + 8) / 75] [Stage 35:===========================> (37 + 8) / 75] [Stage 35:==============================> (41 + 8) / 75] [Stage 35:===============================> (42 + 8) / 75] [Stage 35:================================> (43 + 8) / 75] [Stage 35:================================> (44 + 8) / 75] [Stage 35:===================================> (48 + 8) / 75] [Stage 35:====================================> (49 + 8) / 75] [Stage 35:=====================================> (50 + 8) / 75] [Stage 35:======================================> (51 + 8) / 75] [Stage 35:=========================================> (55 + 8) / 75] [Stage 35:=========================================> (56 + 8) / 75] [Stage 35:==========================================> (57 + 8) / 75] [Stage 35:===========================================> (58 + 8) / 75] [Stage 35:==============================================> (62 + 8) / 75] [Stage 35:===============================================> (63 + 8) / 75] [Stage 35:================================================> (65 + 8) / 75] [Stage 35:===================================================> (69 + 6) / 75] [Stage 35:=====================================================> (72 + 3) / 75] +--------------------+----+-----+ | FILE|COL1| COL2| +--------------------+----+-----+ |file:/C:/Users/SS...|TEXT|TEXT2| +--------------------+----+-----+ SUCCESS: The process with PID 11916 (child process of PID 3592) has been terminated. SUCCESS: The process with PID 3592 (child process of PID 9904) has been terminated. SUCCESS: The process with PID 9904 (child process of PID 5468) has been terminated. {noformat} As you can see, After I add the "file" column, and show it, it's working. But then if I apply an UDF on it, it returns empty. Then I proceed with the join, and show each dataframe again. Everything is OK until the last row, where it takes a very long time considering the amount of data and the speed of the previous processes. And this worries me becauses I don't think it is supposed to take such a long time. Additionally, although in this example it works in the end, in my actual code it is not working. So as I already wrote on the previous post, if I join two dataframes, everything is OK, until I do a select where I apply a UDF, and then the whole query returns empty. The UDF works well if I don't join. The other thing is, that if I do a count on the dataframe after the select just mentioned, it returns the correct number of rows, and not 0, but if I try to show or write the rows, the dataframe comes up empty. I'm not sure why this is happening so I would appreciate any help, and to at least know whether it's a bug or not. > input_file_name function does not work with UDF > ----------------------------------------------- > > Key: SPARK-18667 > URL: https://issues.apache.org/jira/browse/SPARK-18667 > Project: Spark > Issue Type: Bug > Components: PySpark > Reporter: Hyukjin Kwon > Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > {{input_file_name()}} does not return the file name but empty string instead > when it is used as input for UDF in PySpark as below: > with the data as below: > {code} > {"a": 1} > {code} > with the codes below: > {code} > from pyspark.sql.functions import * > from pyspark.sql.types import * > def filename(path): > return path > sourceFile = udf(filename, StringType()) > spark.read.json("tmp.json").select(sourceFile(input_file_name())).show() > {code} > prints as below: > {code} > +---------------------------+ > |filename(input_file_name())| > +---------------------------+ > | | > +---------------------------+ > {code} > but the codes below: > {code} > spark.read.json("tmp.json").select(input_file_name()).show() > {code} > prints correctly as below: > {code} > +--------------------+ > | input_file_name()| > +--------------------+ > |file:///Users/hyu...| > +--------------------+ > {code} > This seems PySpark specific issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org