Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich,

Thanks for the reply.

I did come across that file, but it doesn't actually contain the definition
of `PartitionedFile`:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala

In fact, the code snippet you shared also references the type
`PartitionedFile`.

There's actually this javadoc.io page for a `PartitionedFile`
at org.apache.spark.sql.execution.datasources for spark-sql_2.12:3.0.2:
https://javadoc.io/doc/org.apache.spark/spark-sql_2.12/3.0.2/org/apache/spark/sql/execution/datasources/PartitionedFile.html.
I double-checked the source code for version 3.0.2 and it doesn't seem to
exist there either:
https://github.com/apache/spark/tree/v3.0.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources
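From that javadoc page, the class appears to have roughly this shape (a simplified standalone sketch reconstructed from the API docs, not the real Spark definition; in Spark itself `partitionValues` is an `org.apache.spark.sql.catalyst.InternalRow`, which I've reduced to `AnyRef` so this compiles on its own):

```scala
// Simplified sketch of the constructor shape shown on the javadoc page
// above; field types are reduced so this compiles standalone.
case class PartitionedFileSketch(
    partitionValues: AnyRef,                 // partition column values (InternalRow in Spark)
    filePath: String,                        // path to the underlying file
    start: Long,                             // byte offset where this split begins
    length: Long,                            // number of bytes to read
    locations: Array[String] = Array.empty)  // hosts holding the data
```
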

Ashley


On Mon, 8 Apr 2024 at 22:41, Mich Talebzadeh wrote:

> Hi,
>
> I believe this is the package
>
>
> https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala
>
> And the code
>
> case class FilePartition(index: Int, files: Array[PartitionedFile])
>   extends Partition with InputPartition {
>   override def preferredLocations(): Array[String] = {
>     // Computes total number of bytes that can be retrieved from each host.
>     val hostToNumBytes = mutable.HashMap.empty[String, Long]
>     files.foreach { file =>
>       file.locations.filter(_ != "localhost").foreach { host =>
>         hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + file.length
>       }
>     }
>
>     // Selects the first 3 hosts with the most data to be retrieved.
>     hostToNumBytes.toSeq.sortBy {
>       case (host, numBytes) => numBytes
>     }.reverse.take(3).map {
>       case (host, numBytes) => host
>     }.toArray
>   }
> }
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice: "one test result is worth one-thousand
> expert opinions" (Werner von Braun).
>
>
> On Mon, 8 Apr 2024 at 20:31, Ashley McManamon <
> ashley.mcmana...@quantcast.com> wrote:
>
>> Hi All,
>>
>> I've been diving into the source code to get a better understanding of
>> how file splitting works from a user perspective. I've hit a dead end at
>> `PartitionedFile`, for which I cannot seem to find a definition. It
>> appears as though it should be found in
>> org.apache.spark.sql.execution.datasources, but I find no definition in
>> the entire source code. Am I missing something?
>>
>> I appreciate there may be an obvious answer here, apologies if I'm being
>> naive.
>>
>> Thanks,
>> Ashley McManamon
>>
>>


Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi,

I believe this is the package

https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala

And the code

case class FilePartition(index: Int, files: Array[PartitionedFile])
  extends Partition with InputPartition {
  override def preferredLocations(): Array[String] = {
    // Computes total number of bytes that can be retrieved from each host.
    val hostToNumBytes = mutable.HashMap.empty[String, Long]
    files.foreach { file =>
      file.locations.filter(_ != "localhost").foreach { host =>
        hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + file.length
      }
    }

    // Selects the first 3 hosts with the most data to be retrieved.
    hostToNumBytes.toSeq.sortBy {
      case (host, numBytes) => numBytes
    }.reverse.take(3).map {
      case (host, numBytes) => host
    }.toArray
  }
}
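To see what that method computes, the same logic can be run standalone with a hypothetical stand-in for `PartitionedFile` (a sketch carrying only the two fields the method actually reads, `length` and `locations`; the host names and sizes below are made up for illustration):

```scala
import scala.collection.mutable

// Hypothetical stand-in for PartitionedFile: just the fields
// preferredLocations reads.
case class FileStub(length: Long, locations: Array[String])

object PreferredLocationsDemo {
  // Mirrors FilePartition.preferredLocations: sum the bytes readable
  // from each host (ignoring "localhost"), then keep the 3 hosts
  // holding the most data.
  def preferredLocations(files: Seq[FileStub]): Array[String] = {
    val hostToNumBytes = mutable.HashMap.empty[String, Long]
    files.foreach { file =>
      file.locations.filter(_ != "localhost").foreach { host =>
        hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + file.length
      }
    }
    hostToNumBytes.toSeq.sortBy { case (_, numBytes) => numBytes }
      .reverse.take(3).map { case (host, _) => host }.toArray
  }

  def main(args: Array[String]): Unit = {
    val files = Seq(
      FileStub(100L, Array("hostA", "hostB")),
      FileStub(300L, Array("hostB", "localhost")),
      FileStub(50L, Array("hostC")))
    // hostB holds 400 bytes, hostA 100, hostC 50.
    println(preferredLocations(files).mkString(", "))  // hostB, hostA, hostC
  }
}
```

Note that `"localhost"` is excluded because a bare `localhost` location carries no scheduling information, and only the top three hosts are kept since that is enough locality hint for the scheduler.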

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom




 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice: "one test result is worth one-thousand
expert opinions" (Werner von Braun).


On Mon, 8 Apr 2024 at 20:31, Ashley McManamon <
ashley.mcmana...@quantcast.com> wrote:

> Hi All,
>
> I've been diving into the source code to get a better understanding of how
> file splitting works from a user perspective. I've hit a dead end at
> `PartitionedFile`, for which I cannot seem to find a definition. It appears
> as though it should be found in org.apache.spark.sql.execution.datasources,
> but I find no definition in the entire source code. Am I missing something?
>
> I appreciate there may be an obvious answer here, apologies if I'm being
> naive.
>
> Thanks,
> Ashley McManamon
>
>


[Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Ashley McManamon
Hi All,

I've been diving into the source code to get a better understanding of how
file splitting works from a user perspective. I've hit a dead end at
`PartitionedFile`, for which I cannot seem to find a definition. It appears
as though it should be found in org.apache.spark.sql.execution.datasources,
but I find no definition in the entire source code. Am I missing something?

I appreciate there may be an obvious answer here, apologies if I'm being
naive.

Thanks,
Ashley McManamon