[ https://issues.apache.org/jira/browse/SPARK-28850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Lotz updated SPARK-28850:
-------------------------------
    Description: 
When making a call to:
{code:java}
sc.binaryFiles(somePath)
{code}
 

It creates a BinaryFileRDD. Some sections of that code are run inside the 
driver container. The current source code for BinaryFileRDD is available here:
 
[https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala]

The problematic line is:

 
{code:java}
conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
Runtime.getRuntime.availableProcessors().toString)
{code}
 

This line sets the number of threads to be used (in the case of multi-threaded 
reading) to the number of cores (including Hyper-Threading ones) available on 
the driver host machine.

This number is wrong, since what really matters is the number of cores 
allocated to the driver container by YARN, not the number of cores available 
on the host machine. This can easily impact the Spark UI and the driver 
application performance, since the number of threads can be far larger than the 
number of cores actually allocated, which increases the number of unnecessary 
preemptions and context switches.
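
Because the value is applied with setIfUnset, a thread count that is already present in the configuration should be left alone, which hints at a possible workaround. The sketch below illustrates that behaviour with a plain mutable map standing in for the Hadoop Configuration. The setIfUnset and applyDefault helpers are illustrative stand-ins, not Spark or Hadoop code, and the key string is assumed to be the value behind FileInputFormat.LIST_STATUS_NUM_THREADS.

```scala
import scala.collection.mutable

object SetIfUnsetSketch {
  // Assumed to be the property key behind FileInputFormat.LIST_STATUS_NUM_THREADS.
  val ListStatusNumThreads = "mapreduce.input.fileinputformat.list-status.num-threads"

  // Stand-in for Configuration.setIfUnset: writes only when the key is absent.
  def setIfUnset(conf: mutable.Map[String, String], key: String, value: String): Unit =
    if (!conf.contains(key)) conf(key) = value

  // Mimics the problematic line: default to the host's processor count.
  def applyDefault(conf: mutable.Map[String, String], hostProcessors: Int): Unit =
    setIfUnset(conf, ListStatusNumThreads, hostProcessors.toString)

  def main(args: Array[String]): Unit = {
    // Nothing set beforehand: the host processor count is used.
    val untouched = mutable.Map.empty[String, String]
    applyDefault(untouched, hostProcessors = 64)
    println(untouched(ListStatusNumThreads))

    // Value set beforehand: the pre-set value survives.
    val preset = mutable.Map(ListStatusNumThreads -> "2")
    applyDefault(preset, hostProcessors = 64)
    println(preset(ListStatusNumThreads))
  }
}
```

In a real job, setting the same key on the Hadoop configuration before calling sc.binaryFiles would be expected to have the same effect; that is an assumption and has not been verified here.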

The solution is to retrieve the number of cores allocated to the Application 
Master by YARN instead.
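
A minimal sketch of that idea, assuming the allocated core count can be obtained somehow (for example from a setting such as spark.driver.cores; where the value should really come from is still to be worked out). The helper name listStatusThreads is hypothetical and not part of Spark.

```scala
object ThreadCountSketch {
  // Hypothetical helper: choose the thread count from the cores allocated to
  // the driver container, falling back to the host's processor count when no
  // allocation is known. The result never exceeds the host count and is at least 1.
  def listStatusThreads(allocatedCores: Option[Int], hostProcessors: Int): Int = {
    val host = math.max(1, hostProcessors)
    allocatedCores match {
      case Some(cores) if cores > 0 => math.min(cores, host)
      case _                        => host
    }
  }

  def main(args: Array[String]): Unit = {
    val host = Runtime.getRuntime.availableProcessors()
    // Example: a driver container that was granted 2 cores.
    println(listStatusThreads(Some(2), host))
    // No allocation known: same behaviour as the current code.
    println(listStatusThreads(None, host))
  }
}
```

The resulting value would then replace Runtime.getRuntime.availableProcessors() in the setIfUnset call shown earlier.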

Once the problem is confirmed, I can work on retrieving that information and 
opening a PR.

> Binary Files RDD allocates false number of threads
> --------------------------------------------------
>
>                 Key: SPARK-28850
>                 URL: https://issues.apache.org/jira/browse/SPARK-28850
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.3
>            Reporter: Marco Lotz
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
