Re: [Spark SQL]: Source code for PartitionedFile

Mich Talebzadeh Mon, 08 Apr 2024 14:41:15 -0700

Hi,

I believe this is the package


https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala

And the code

case class FilePartition(index: Int, files: Array[PartitionedFile])
  extends Partition with InputPartition {
  override def preferredLocations(): Array[String] = {
    // Computes total number of bytes that can be retrieved from each host.
    val hostToNumBytes = mutable.HashMap.empty[String, Long]
    files.foreach { file =>
      file.locations.filter(_ != "localhost").foreach { host =>
        hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) +
file.length
      }
    }

    // Selects the first 3 hosts with the most data to be retrieved.
    hostToNumBytes.toSeq.sortBy {
      case (host, numBytes) => numBytes
    }.reverse.take(3).map {
      case (host, numBytes) => host
    }.toArray
  }
}

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 8 Apr 2024 at 20:31, Ashley McManamon <
[email protected]> wrote:

> Hi All,
>
> I've been diving into the source code to get a better understanding of how
> file splitting works from a user perspective. I've hit a deadend at
> `PartitionedFile`, for which I cannot seem to find a definition? It appears
> though it should be found at
> org.apache.spark.sql.execution.datasources but I find no definition in the
> entire source code. Am I missing something?
>
> I appreciate there may be an obvious answer here, apologies if I'm being
> naive.
>
> Thanks,
> Ashley McManamon
>
>

Re: [Spark SQL]: Source code for PartitionedFile

Reply via email to