[ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christian Homberg updated SPARK-27966:
--------------------------------------
Description:
I ran into an issue similar to, and probably related to, SPARK-26128: _org.apache.spark.sql.functions.input_file_name_ is sometimes empty.

{code:java}
df.select(input_file_name()).show(5, false)
{code}

{code:java}
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
{code}

My environment is Databricks, and debugging the Log4j output showed me that the issue occurs when the files are being listed in parallel, e.g. when

{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under:{code}

Everything is fine as long as the number of paths stays below the threshold:

{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
{code}

Setting spark.sql.sources.parallelPartitionDiscovery.threshold to 9999 resolves the issue for me.

was:
I ran into an issue similar to, and probably related to, SPARK-26128. The `org.apache.spark.sql.functions.input_file_name` is sometimes empty.
My environment is Databricks, and debugging the Log4j output showed me that the issue occurred when the files are being listed in parallel, e.g. when

19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 127; threshold: 32

This is not an issue when listing fewer than 32 files. Alternatively, setting spark.sql.sources.parallelPartitionDiscovery.threshold to 9999 resolves the issue.


> input_file_name empty when listing files in parallel
> ----------------------------------------------------
>
>                 Key: SPARK-27966
>                 URL: https://issues.apache.org/jira/browse/SPARK-27966
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.0
>         Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>            Reporter: Christian Homberg
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
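
As a minimal sketch of the workaround described in the report: the threshold can be raised per session before reading the data, so that InMemoryFileIndex lists files on the driver instead of in parallel. This assumes a Spark 2.4 Scala session; the input path is hypothetical.

{code:scala}
// Workaround sketch (assumption: run before the read, in the same session).
// Raising the threshold above the number of input paths disables parallel
// leaf-file listing; 9999 is the value used in the report.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", 9999)

import org.apache.spark.sql.functions.input_file_name

val df = spark.read.parquet("/path/to/data")  // hypothetical input path
df.select(input_file_name()).show(5, false)   // file names should now be populated
{code}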