[ https://issues.apache.org/jira/browse/SPARK-19223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822729#comment-15822729 ]
Apache Spark commented on SPARK-19223:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16585

> InputFileBlockHolder doesn't work with Python UDF for datasource other than
> FileFormat
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-19223
>                 URL: https://issues.apache.org/jira/browse/SPARK-19223
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>            Reporter: Liang-Chi Hsieh
>
> For data sources other than FileFormat, such as spark-xml, which is based on
> BaseRelation and uses HadoopRDD or NewHadoopRDD, InputFileBlockHolder doesn't
> work with Python UDFs.
> To reproduce, run the following code with
> {{bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1}}:
> {code}
> from pyspark.sql.functions import udf, input_file_name
> from pyspark.sql.types import StringType
> from pyspark.sql import SparkSession
>
> # Identity function: returns the path it is given.
> def filename(path):
>     return path
>
> session = SparkSession.builder.appName('APP').getOrCreate()
> session.udf.register('sameText', filename)
> sameText = udf(filename, StringType())
>
> # Read a.xml through spark-xml and add the input file name as column 'file'.
> df = session.read.format('xml').load('a.xml',
>     rowTag='root').select('*', input_file_name().alias('file'))
> df.select('file').show()  # works
> df.select(sameText(df['file'])).show()  # returns empty content
> {code}
> a.xml:
> {code}
> <root>
>   <x>TEXT</x>
>   <y>TEXT2</y>
> </root>
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org