Hi Spark Community,

I am working on an optimization where I need to map Spark partition IDs to their underlying input file names before job execution starts.
My Approach: I access *df.queryExecution.executedPlan -> FileSourceScanExec -> FileScanRDD* <https://github.com/apache/spark/blob/6df8d57b30e7fad18cb9e05309eed4e801128b62/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L80> to extract the file mapping directly from the driver's metadata. This gives me a *Map[PartitionID, Seq[FileName]]* instantly, without triggering a Spark job (sketches of the extraction and the listener follow at the end of this mail). Later, I use a SparkListener to identify finished files.

My Questions:

- *Immutability*: Can I guarantee that the mapping inside FileScanRDD.partitions is immutable for the lifespan of that specific DataFrame/RDD execution plan?

- *Dynamic Allocation & Failures*: If a job runs on a cluster with Dynamic Allocation enabled (executors added/removed), or if node failures cause task retries, is the Partition ID -> Files mapping guaranteed to remain constant? My assumption is that the scheduler may reschedule a task to a different node, but the partition mapping itself stays the same. Is this correct?

- *Adaptive Query Execution (AQE)*: I am aware that spark.sql.adaptive.enabled <https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution> has defaulted to true since Spark 3.2.0. Does AQE ever modify the initial FileSourceScanExec partitions at runtime in a simple Read -> Write flow (without explicit shuffles)?

I understand this relies on internal APIs, but I want to ensure the logic regarding partition ID stability is predictable.

Thanks in advance!
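P.S. For reference, here is roughly what my extraction looks like. This is only a minimal sketch against internal APIs (FileSourceScanExec, FileScanRDD, FilePartition, AdaptiveSparkPlanExec), so details may differ across Spark versions, e.g. PartitionedFile.filePath changed type in Spark 3.4:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec
import org.apache.spark.sql.execution.datasources.FilePartition

def partitionToFiles(df: DataFrame): Map[Int, Seq[String]] = {
  // With AQE enabled, executedPlan is wrapped in AdaptiveSparkPlanExec, which
  // behaves as a leaf node for tree traversal, so unwrap it to reach the scan.
  val plan = df.queryExecution.executedPlan match {
    case adaptive: AdaptiveSparkPlanExec => adaptive.executedPlan
    case other => other
  }
  plan.collect { case scan: FileSourceScanExec => scan }
    .flatMap { scan =>
      // inputRDD is the FileScanRDD; asking for its partitions only performs
      // driver-side planning and does not trigger a Spark job.
      scan.inputRDD.partitions.collect {
        case fp: FilePartition =>
          // filePath is a String up to Spark 3.3 and a SparkPath from 3.4
          // onwards; toString covers both.
          fp.index -> fp.files.map(_.filePath.toString).toSeq
      }
    }.toMap
}
```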

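And the listener side. Again just a sketch: it assumes the write job's result-stage tasks line up one-to-one with the scan partitions (which is exactly what my questions above are trying to confirm) and uses taskInfo.index as that partition index:

```scala
import scala.collection.concurrent.TrieMap

import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// partitionFiles is the Map[Int, Seq[String]] built by partitionToFiles above.
class FinishedFilesListener(partitionFiles: Map[Int, Seq[String]]) extends SparkListener {
  private val finished = TrieMap.empty[String, Unit]

  def finishedFiles: Set[String] = finished.keySet.toSet

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Only count successful attempts; with retries or speculation the same
    // partition index can succeed more than once, which is harmless here.
    if (taskEnd.reason == Success) {
      partitionFiles.getOrElse(taskEnd.taskInfo.index, Seq.empty)
        .foreach(f => finished.put(f, ()))
    }
  }
}
```

I register it with spark.sparkContext.addSparkListener(new FinishedFilesListener(mapping)) before triggering the write.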