Hi,

I believe this is the package:

https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala

And the code:

    import scala.collection.mutable  // needed for mutable.HashMap

    case class FilePartition(index: Int, files: Array[PartitionedFile])
      extends Partition with InputPartition {

      override def preferredLocations(): Array[String] = {
        // Computes total number of bytes that can be retrieved from each host.
        val hostToNumBytes = mutable.HashMap.empty[String, Long]
        files.foreach { file =>
          file.locations.filter(_ != "localhost").foreach { host =>
            hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + file.length
          }
        }

        // Selects the first 3 hosts with the most data to be retrieved.
        hostToNumBytes.toSeq.sortBy {
          case (host, numBytes) => numBytes
        }.reverse.take(3).map {
          case (host, numBytes) => host
        }.toArray
      }
    }

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

On Mon, 8 Apr 2024 at 20:31, Ashley McManamon <ashley.mcmana...@quantcast.com> wrote:

> Hi All,
>
> I've been diving into the source code to get a better understanding of how
> file splitting works from a user perspective. I've hit a dead end at
> `PartitionedFile`, for which I cannot seem to find a definition. It appears
> as though it should be found at
> org.apache.spark.sql.execution.datasources, but I find no definition in the
> entire source code. Am I missing something?
>
> I appreciate there may be an obvious answer here, apologies if I'm being
> naive.
>
> Thanks,
> Ashley McManamon
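
On the original question: as far as I can tell, `PartitionedFile` is a case class declared in FileScanRDD.scala in the same org.apache.spark.sql.execution.datasources package, rather than in a file of its own, which would explain why searching for a PartitionedFile.scala file finds nothing. Please check this against the Spark version you are reading.

To experiment with the host-ranking logic of `preferredLocations` quoted above without a Spark build, here is a minimal standalone sketch. `SimpleFile`, `PreferredLocationsSketch` and `exampleFiles` are hypothetical names made up for illustration; `SimpleFile` models only the `length` and `locations` fields that the quoted code reads and is not Spark's actual `PartitionedFile`.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's PartitionedFile: only the two fields
// read by the quoted preferredLocations code are modelled here.
final case class SimpleFile(length: Long, locations: Array[String])

object PreferredLocationsSketch {

  // Sums the bytes available on each host (ignoring "localhost") and
  // returns at most three hosts, ordered by descending byte count.
  def preferredLocations(files: Seq[SimpleFile]): Array[String] = {
    val hostToNumBytes = mutable.HashMap.empty[String, Long]
    files.foreach { file =>
      file.locations.filter(_ != "localhost").foreach { host =>
        hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + file.length
      }
    }
    hostToNumBytes.toSeq
      .sortBy { case (_, numBytes) => numBytes }
      .reverse
      .take(3)
      .map { case (host, _) => host }
      .toArray
  }

  // Made-up sample data: host-b holds 150 bytes, host-a 100,
  // host-c 10 and host-d 5; "localhost" is filtered out.
  val exampleFiles: Seq[SimpleFile] = Seq(
    SimpleFile(100L, Array("host-a", "host-b")),
    SimpleFile(50L, Array("host-b", "localhost")),
    SimpleFile(10L, Array("host-c")),
    SimpleFile(5L, Array("host-d"))
  )
}
```

Calling `PreferredLocationsSketch.preferredLocations(PreferredLocationsSketch.exampleFiles)` should give host-b, host-a, host-c in that order. The byte counts in the sample are all distinct on purpose: with ties, the relative order of the tied hosts depends on the hash map's iteration order.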