[jira] [Created] (SPARK-31836) input_file_name() gives wrong value following Python UDF usage

Wesley Hildebrandt (Jira) Wed, 27 May 2020 05:29:10 -0700

Wesley Hildebrandt created SPARK-31836:
------------------------------------------


             Summary: input_file_name() gives wrong value following Python UDF usage
                 Key: SPARK-31836
                 URL: https://issues.apache.org/jira/browse/SPARK-31836
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Wesley Hildebrandt


I'm using PySpark for Spark 3.0.0 RC1 with Python 3.6.8.

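The following commands demonstrate that the input_file_name() function 
sometimes returns the wrong filename following usage of a Python UDF. In the 
query below, file1 is computed before the UDF column and file2 after it, so 
the two columns should be identical in every row: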

$ for i in `seq 5`; do echo $i > /tmp/test-file-$i; done

$ pyspark

>>> import pyspark.sql.functions as F

>>> (spark.readStream.text('file:///tmp/test-file-*', wholetext=True)
...     .withColumn('file1', F.input_file_name())
...     .withColumn('udf', F.udf(lambda x: x)('value'))
...     .withColumn('file2', F.input_file_name())
...     .writeStream.trigger(once=True)
...     .foreachBatch(lambda df, _: df.select('file1', 'file2')
...                   .show(truncate=False, vertical=True))
...     .start().awaitTermination())
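
Reading the show() output by eye works for five files but gets tedious with 
more. As a rough sketch (not part of the original report), the batch callback 
could count the rows where the two columns disagree. The names 
mismatched_rows and check_batch are hypothetical helpers introduced here for 
illustration; the only assumption about Spark is that foreachBatch passes the 
batch DataFrame and a batch id to the callback.

```python
def mismatched_rows(pairs):
    """Return the (file1, file2) pairs whose two values differ."""
    return [(f1, f2) for f1, f2 in pairs if f1 != f2]


def check_batch(df, batch_id):
    """Hypothetical foreachBatch callback: report rows where file1 != file2.

    Assumes df has the 'file1' and 'file2' columns built in the query above.
    """
    # collect() is acceptable here only because the test data set is tiny.
    rows = df.select('file1', 'file2').collect()
    pairs = [(r['file1'], r['file2']) for r in rows]
    bad = mismatched_rows(pairs)
    print(f"batch {batch_id}: {len(bad)} of {len(pairs)} rows mismatched")
    for f1, f2 in bad:
        print(f"  file1={f1} file2={f2}")
```

To try it, pass check_batch to .foreachBatch(...) in place of the lambda in 
the command above. If input_file_name() behaved correctly, every batch would 
report 0 mismatched rows.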

A few notes about this bug:
 * It happens with many different files, so it's not related to the file 
contents
 * It also happens loading files from HDFS, so storage location is not a factor
 * It also happens using .csv() to read the files instead of .text(), so input 
format is not a factor
 * I have not been able to cause the error without using readStream, so it 
seems to be related to streaming
 * The bug also happens using spark-submit to send a job to my cluster
 * I haven't tested an older version, but this bug may have been introduced by 
Spark pull requests 24958 and 25321 
(https://github.com/apache/spark/pull/24958, 
https://github.com/apache/spark/pull/25321), which fixed SPARK-28153 
(https://issues.apache.org/jira/browse/SPARK-28153).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

