[jira] [Commented] (SPARK-44469) Utils.getOrCreateLocalRootDirs will never take effect after the first call fails, even if the exception is recovered

2023-07-17 Thread Yazhi Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744047#comment-17744047
 ] 

Yazhi Wang commented on SPARK-44469:


I'll work on it.

> Utils.getOrCreateLocalRootDirs will never take effect after the first call 
> fails, even if the exception is recovered
> 
>
> Key: SPARK-44469
> URL: https://issues.apache.org/jira/browse/SPARK-44469
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Yazhi Wang
>Priority: Minor
>
> {code:java}
> private[spark] def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] = {
>   if (localRootDirs == null || localRootDirs.isEmpty) {
>     this.synchronized {
>       if (localRootDirs == null) {
>         localRootDirs = getOrCreateLocalRootDirsImpl(conf)
>       }
>     }
>   }
>   localRootDirs
> }{code}
> localRootDirs is initialized only once in the Executor/Driver life cycle. If
> the first initialization fails due to a FileSystem exception (such as a full
> disk), localRootDirs is assigned an empty array instead of remaining null.
> Even after the FileSystem recovers, localRootDirs is never re-created from
> the FileSystem, so tasks on the Executor keep failing (tasks rely on
> fetching files locally).






[jira] [Created] (SPARK-44469) Utils.getOrCreateLocalRootDirs will never take effect after the first call fails, even if the exception is recovered

2023-07-17 Thread Yazhi Wang (Jira)
Yazhi Wang created SPARK-44469:
--

 Summary: Utils.getOrCreateLocalRootDirs will never take effect 
after the first call fails, even if the exception is recovered
 Key: SPARK-44469
 URL: https://issues.apache.org/jira/browse/SPARK-44469
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Yazhi Wang


{code:java}
private[spark] def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] = {
  if (localRootDirs == null || localRootDirs.isEmpty) {
    this.synchronized {
      if (localRootDirs == null) {
        localRootDirs = getOrCreateLocalRootDirsImpl(conf)
      }
    }
  }
  localRootDirs
}{code}
localRootDirs is initialized only once in the Executor/Driver life cycle. If the
first initialization fails due to a FileSystem exception (such as a full disk),
localRootDirs is assigned an empty array instead of remaining null.

Even after the FileSystem recovers, localRootDirs is never re-created from the
FileSystem, so tasks on the Executor keep failing (tasks rely on fetching files
locally).
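
A minimal sketch of one possible fix (my own illustration, not necessarily the
eventual patch): re-check the empty case under the lock as well, so that a
failed first attempt is retried once the FileSystem recovers:

{code:java}
// Hypothetical sketch: retry initialization while localRootDirs is still
// empty, instead of only when it is null.
private[spark] def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] = {
  if (localRootDirs == null || localRootDirs.isEmpty) {
    this.synchronized {
      // An empty array means the previous attempt failed (e.g. the disk was
      // full) and should be retried, so test isEmpty here too.
      if (localRootDirs == null || localRootDirs.isEmpty) {
        localRootDirs = getOrCreateLocalRootDirsImpl(conf)
      }
    }
  }
  localRootDirs
}{code}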






[jira] [Updated] (SPARK-41192) Task finished before speculative task scheduled leads to holding idle executors

2022-11-17 Thread Yazhi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yazhi Wang updated SPARK-41192:
---
Description: 
When a task finishes before its speculative copy has been scheduled by the
DAGScheduler, the speculative task is still counted as pending in the
calculation of the number of needed executors, which leads to requesting more
executors than needed.
h2. Background & Reproduce

In one of our production jobs, we found that ExecutorAllocationManager was
holding more executors than needed.

We found it difficult to reproduce in the test environment. In order to
reproduce and debug it stably, we temporarily commented out the
speculative-task scheduling code at TaskSetManager:363 so that tasks always
complete before their speculative copies are scheduled.
{code:java}
// Original code
private def dequeueTask(
    execId: String,
    host: String,
    maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, Boolean)] = {
  // Tries to schedule a regular task first; if it returns None, then schedules
  // a speculative task
  dequeueTaskHelper(execId, host, maxLocality, false).orElse(
    dequeueTaskHelper(execId, host, maxLocality, true))
}

// Modified code: a speculative task will never be scheduled
private def dequeueTask(
    execId: String,
    host: String,
    maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, Boolean)] = {
  // Only schedules regular tasks; the speculative fallback is removed
  dequeueTaskHelper(execId, host, maxLocality, false)
}{code}
Using the reproduction example from SPARK-30511:

While the last task is running, we hold 38 executors (see attachment), which is
exactly ceil((149 + 1) / 4) = 38. But only 2 tasks are actually running, which
requires just Math.max(20, ceil(2 / 4)) = 20 executors.
{code:java}
./bin/spark-shell --master yarn \
  --conf spark.speculation=true \
  --conf spark.executor.cores=4 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=20 \
  --conf spark.dynamicAllocation.maxExecutors=1000 {code}
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex { (index: Int, it: Iterator[Int]) =>
  if (index > 3998) {
    Thread.sleep(1000 * 1000)
  } else if (index > 3850) {
    Thread.sleep(50 * 1000) // Fake running tasks
  } else {
    Thread.sleep(100)
  }
  Array.fill[Int](1)(1).iterator
}.collect(){code}
 

I will have a PR ready to fix this issue

  was:
When a task finishes before its speculative copy has been scheduled by the
DAGScheduler, the speculative task is still counted as pending in the
calculation of the number of needed executors, which leads to requesting more
executors than needed.
h2. Background & Reproduce

In one of our production jobs, we found that ExecutorAllocationManager was
holding more executors than needed.

We found it difficult to reproduce in the test environment. In order to
reproduce and debug it stably, we temporarily commented out the
speculative-task scheduling code at TaskSetManager:363 so that tasks always
complete before their speculative copies are scheduled.
{code:java}
// Original code
private def dequeueTask(
    execId: String,
    host: String,
    maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, Boolean)] = {
  // Tries to schedule a regular task first; if it returns None, then schedules
  // a speculative task
  dequeueTaskHelper(execId, host, maxLocality, false).orElse(
    dequeueTaskHelper(execId, host, maxLocality, true))
}

// Modified code: a speculative task will never be scheduled
private def dequeueTask(
    execId: String,
    host: String,
    maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, Boolean)] = {
  // Only schedules regular tasks; the speculative fallback is removed
  dequeueTaskHelper(execId, host, maxLocality, false)
}{code}
Using the reproduction example from SPARK-30511:

While the last task is running, we hold 38 executors (see attachment), which is
exactly ceil((149 + 1) / 4) = 38. But only 2 tasks are actually running, which
requires just Math.max(20, ceil(2 / 4)) = 20 executors.
{code:java}
./bin/spark-shell --master yarn \
  --conf spark.speculation=true \
  --conf spark.executor.cores=4 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=20 \
  --conf spark.dynamicAllocation.maxExecutors=1000 {code}
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex { (index: Int, it: Iterator[Int]) =>
  if (index > 3998) {
    Thread.sleep(1000 * 1000)
  } else if (index > 3850) {
    Thread.sleep(20 * 1000) // Fake running tasks
  } else {
    Thread.sleep(100)
  }
  Array.fill[Int](1)(1).iterator
}.collect(){code}
 

I will have a PR ready to fix this issue


> Task finished before speculative task scheduled leads to holding idle 
> executors
> 

[jira] [Updated] (SPARK-41192) Task finished before speculative task scheduled leads to holding idle executors

2022-11-17 Thread Yazhi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yazhi Wang updated SPARK-41192:
---
Attachment: dynamic-executors

> Task finished before speculative task scheduled leads to holding idle 
> executors
> ---
>
> Key: SPARK-41192
> URL: https://issues.apache.org/jira/browse/SPARK-41192
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Yazhi Wang
>Priority: Minor
>  Labels: dynamic_allocation
> Attachments: dynamic-executors, dynamic-log
>
>
> When a task finishes before its speculative copy has been scheduled by the
> DAGScheduler, the speculative task is still counted as pending in the
> calculation of the number of needed executors, which leads to requesting
> more executors than needed.
> h2. Background & Reproduce
> In one of our production jobs, we found that ExecutorAllocationManager was
> holding more executors than needed.
> We found it difficult to reproduce in the test environment. In order to
> reproduce and debug it stably, we temporarily commented out the
> speculative-task scheduling code at TaskSetManager:363 so that tasks always
> complete before their speculative copies are scheduled.
> {code:java}
> // Original code
> private def dequeueTask(
>     execId: String,
>     host: String,
>     maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, Boolean)] = {
>   // Tries to schedule a regular task first; if it returns None, then schedules
>   // a speculative task
>   dequeueTaskHelper(execId, host, maxLocality, false).orElse(
>     dequeueTaskHelper(execId, host, maxLocality, true))
> }
>
> // Modified code: a speculative task will never be scheduled
> private def dequeueTask(
>     execId: String,
>     host: String,
>     maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, Boolean)] = {
>   // Only schedules regular tasks; the speculative fallback is removed
>   dequeueTaskHelper(execId, host, maxLocality, false)
> }{code}
> Using the reproduction example from SPARK-30511:
> While the last task is running, we hold 38 executors (see attachment), which
> is exactly ceil((149 + 1) / 4) = 38. But only 2 tasks are actually running,
> which requires just Math.max(20, ceil(2 / 4)) = 20 executors.
> {code:java}
> ./bin/spark-shell --master yarn \
>   --conf spark.speculation=true \
>   --conf spark.executor.cores=4 \
>   --conf spark.dynamicAllocation.enabled=true \
>   --conf spark.dynamicAllocation.minExecutors=20 \
>   --conf spark.dynamicAllocation.maxExecutors=1000 {code}
> {code:java}
> val n = 4000
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex { (index: Int, it: Iterator[Int]) =>
>   if (index > 3998) {
>     Thread.sleep(1000 * 1000)
>   } else if (index > 3850) {
>     Thread.sleep(20 * 1000) // Fake running tasks
>   } else {
>     Thread.sleep(100)
>   }
>   Array.fill[Int](1)(1).iterator
> }.collect(){code}
>  
> I will have a PR ready to fix this issue






[jira] [Updated] (SPARK-41192) Task finished before speculative task scheduled leads to holding idle executors

2022-11-17 Thread Yazhi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yazhi Wang updated SPARK-41192:
---
Attachment: dynamic-log

> Task finished before speculative task scheduled leads to holding idle 
> executors
> ---
>
> Key: SPARK-41192
> URL: https://issues.apache.org/jira/browse/SPARK-41192
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Yazhi Wang
>Priority: Minor
>  Labels: dynamic_allocation
> Attachments: dynamic-executors, dynamic-log
>
>
> When a task finishes before its speculative copy has been scheduled by the
> DAGScheduler, the speculative task is still counted as pending in the
> calculation of the number of needed executors, which leads to requesting
> more executors than needed.
> h2. Background & Reproduce
> In one of our production jobs, we found that ExecutorAllocationManager was
> holding more executors than needed.
> We found it difficult to reproduce in the test environment. In order to
> reproduce and debug it stably, we temporarily commented out the
> speculative-task scheduling code at TaskSetManager:363 so that tasks always
> complete before their speculative copies are scheduled.
> {code:java}
> // Original code
> private def dequeueTask(
>     execId: String,
>     host: String,
>     maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, Boolean)] = {
>   // Tries to schedule a regular task first; if it returns None, then schedules
>   // a speculative task
>   dequeueTaskHelper(execId, host, maxLocality, false).orElse(
>     dequeueTaskHelper(execId, host, maxLocality, true))
> }
>
> // Modified code: a speculative task will never be scheduled
> private def dequeueTask(
>     execId: String,
>     host: String,
>     maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, Boolean)] = {
>   // Only schedules regular tasks; the speculative fallback is removed
>   dequeueTaskHelper(execId, host, maxLocality, false)
> }{code}
> Using the reproduction example from SPARK-30511:
> While the last task is running, we hold 38 executors (see attachment), which
> is exactly ceil((149 + 1) / 4) = 38. But only 2 tasks are actually running,
> which requires just Math.max(20, ceil(2 / 4)) = 20 executors.
> {code:java}
> ./bin/spark-shell --master yarn \
>   --conf spark.speculation=true \
>   --conf spark.executor.cores=4 \
>   --conf spark.dynamicAllocation.enabled=true \
>   --conf spark.dynamicAllocation.minExecutors=20 \
>   --conf spark.dynamicAllocation.maxExecutors=1000 {code}
> {code:java}
> val n = 4000
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex { (index: Int, it: Iterator[Int]) =>
>   if (index > 3998) {
>     Thread.sleep(1000 * 1000)
>   } else if (index > 3850) {
>     Thread.sleep(20 * 1000) // Fake running tasks
>   } else {
>     Thread.sleep(100)
>   }
>   Array.fill[Int](1)(1).iterator
> }.collect(){code}
>  
> I will have a PR ready to fix this issue






[jira] [Created] (SPARK-41192) Task finished before speculative task scheduled leads to holding idle executors

2022-11-17 Thread Yazhi Wang (Jira)
Yazhi Wang created SPARK-41192:
--

 Summary: Task finished before speculative task scheduled leads to 
holding idle executors
 Key: SPARK-41192
 URL: https://issues.apache.org/jira/browse/SPARK-41192
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.1, 3.2.2
Reporter: Yazhi Wang


When a task finishes before its speculative copy has been scheduled by the
DAGScheduler, the speculative task is still counted as pending in the
calculation of the number of needed executors, which leads to requesting more
executors than needed.
h2. Background & Reproduce

In one of our production jobs, we found that ExecutorAllocationManager was
holding more executors than needed.

We found it difficult to reproduce in the test environment. In order to
reproduce and debug it stably, we temporarily commented out the
speculative-task scheduling code at TaskSetManager:363 so that tasks always
complete before their speculative copies are scheduled.
{code:java}
// Original code
private def dequeueTask(
    execId: String,
    host: String,
    maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, Boolean)] = {
  // Tries to schedule a regular task first; if it returns None, then schedules
  // a speculative task
  dequeueTaskHelper(execId, host, maxLocality, false).orElse(
    dequeueTaskHelper(execId, host, maxLocality, true))
}

// Modified code: a speculative task will never be scheduled
private def dequeueTask(
    execId: String,
    host: String,
    maxLocality: TaskLocality.Value): Option[(Int, TaskLocality.Value, Boolean)] = {
  // Only schedules regular tasks; the speculative fallback is removed
  dequeueTaskHelper(execId, host, maxLocality, false)
}{code}
Using the reproduction example from SPARK-30511:

While the last task is running, we hold 38 executors (see attachment), which is
exactly ceil((149 + 1) / 4) = 38. But only 2 tasks are actually running, which
requires just Math.max(20, ceil(2 / 4)) = 20 executors.
{code:java}
./bin/spark-shell --master yarn \
  --conf spark.speculation=true \
  --conf spark.executor.cores=4 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=20 \
  --conf spark.dynamicAllocation.maxExecutors=1000 {code}
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex { (index: Int, it: Iterator[Int]) =>
  if (index > 3998) {
    Thread.sleep(1000 * 1000)
  } else if (index > 3850) {
    Thread.sleep(20 * 1000) // Fake running tasks
  } else {
    Thread.sleep(100)
  }
  Array.fill[Int](1)(1).iterator
}.collect(){code}
 

I will have a PR ready to fix this issue
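
To make the arithmetic above concrete, here is a simplified sketch of the
dynamic-allocation target calculation (illustrative only; the names are
simplified and this is not the exact ExecutorAllocationManager code):

{code:java}
// Simplified sketch of how the needed-executor target is derived.
def maxExecutorsNeeded(
    pendingTasks: Int,   // wrongly includes the stale speculative tasks here
    runningTasks: Int,
    tasksPerExecutor: Int,
    minExecutors: Int): Int = {
  val needed =
    math.ceil((pendingTasks + runningTasks).toDouble / tasksPerExecutor).toInt
  math.max(minExecutors, needed)
}

// With the stale speculative tasks counted: 149 pending + 1 running, 4 cores
maxExecutorsNeeded(149, 1, 4, 20)  // = max(20, ceil(150 / 4)) = 38
// What is actually required: only 2 tasks are running
maxExecutorsNeeded(0, 2, 4, 20)    // = max(20, ceil(2 / 4)) = 20
{code}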






[jira] [Updated] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI

2021-11-25 Thread Yazhi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yazhi Wang updated SPARK-37469:
---
Attachment: executor-page.png
            sql-page.png

> Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
> ---
>
> Key: SPARK-37469
> URL: https://issues.apache.org/jira/browse/SPARK-37469
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Yazhi Wang
>Priority: Minor
> Attachments: executor-page.png, sql-page.png
>
>
> The same metric is shown on the Executor/Task page as "Shuffle Read Block
> Time" but on the SQL page as "fetch wait time", which is confusing.
> !image-2021-11-26-12-12-46-896.png!
> !image-2021-11-26-12-15-28-204.png!






[jira] [Updated] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI

2021-11-25 Thread Yazhi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yazhi Wang updated SPARK-37469:
---
Description: 
The same metric is shown on the Executor/Task page as "Shuffle Read Block
Time" but on the SQL page as "fetch wait time", which is confusing.

!executor-page.png!

!sql-page.png!

  was:
The same metric is shown on the Executor/Task page as "Shuffle Read Block
Time" but on the SQL page as "fetch wait time", which is confusing.

!executor-page.png!


> Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
> ---
>
> Key: SPARK-37469
> URL: https://issues.apache.org/jira/browse/SPARK-37469
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Yazhi Wang
>Priority: Minor
> Attachments: executor-page.png, sql-page.png
>
>
> The same metric is shown on the Executor/Task page as "Shuffle Read Block
> Time" but on the SQL page as "fetch wait time", which is confusing.
> !executor-page.png!
> !sql-page.png!






[jira] [Commented] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI

2021-11-25 Thread Yazhi Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449371#comment-17449371
 ] 

Yazhi Wang commented on SPARK-37469:


I'm working on it

> Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
> ---
>
> Key: SPARK-37469
> URL: https://issues.apache.org/jira/browse/SPARK-37469
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Yazhi Wang
>Priority: Minor
> Attachments: executor-page.png, sql-page.png
>
>
> The same metric is shown on the Executor/Task page as "Shuffle Read Block
> Time" but on the SQL page as "fetch wait time", which is confusing.
> !executor-page.png!
> !sql-page.png!






[jira] [Updated] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI

2021-11-25 Thread Yazhi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yazhi Wang updated SPARK-37469:
---
Description: 
The same metric is shown on the Executor/Task page as "Shuffle Read Block
Time" but on the SQL page as "fetch wait time", which is confusing.

!executor-page.png!

  was:
The same metric is shown on the Executor/Task page as "Shuffle Read Block
Time" but on the SQL page as "fetch wait time", which is confusing.

!image-2021-11-26-12-12-46-896.png!

!image-2021-11-26-12-15-28-204.png!


> Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
> ---
>
> Key: SPARK-37469
> URL: https://issues.apache.org/jira/browse/SPARK-37469
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Yazhi Wang
>Priority: Minor
> Attachments: executor-page.png, sql-page.png
>
>
> The same metric is shown on the Executor/Task page as "Shuffle Read Block
> Time" but on the SQL page as "fetch wait time", which is confusing.
> !executor-page.png!






[jira] [Created] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI

2021-11-25 Thread Yazhi Wang (Jira)
Yazhi Wang created SPARK-37469:
--

 Summary: Unified "fetchWaitTime" and "shuffleReadTime" metrics On 
UI
 Key: SPARK-37469
 URL: https://issues.apache.org/jira/browse/SPARK-37469
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.2.0
Reporter: Yazhi Wang


The same metric is shown on the Executor/Task page as "Shuffle Read Block
Time" but on the SQL page as "fetch wait time", which is confusing.

!image-2021-11-26-12-12-46-896.png!

!image-2021-11-26-12-15-28-204.png!






[jira] [Commented] (SPARK-37141) WorkerSuite cannot run on Mac OS

2021-10-28 Thread Yazhi Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435199#comment-17435199
 ] 

Yazhi Wang commented on SPARK-37141:


I'm working on it

> WorkerSuite cannot run on Mac OS
> 
>
> Key: SPARK-37141
> URL: https://issues.apache.org/jira/browse/SPARK-37141
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> After SPARK-35907, running `org.apache.spark.deploy.worker.WorkerSuite` on
> macOS (both M1 and Intel) fails:
> {code:java}
> mvn clean install -DskipTests -pl core -am
> mvn test -pl core -Dtest=none 
> -DwildcardSuites=org.apache.spark.deploy.worker.WorkerSuite
> {code}
> {code:java}
> WorkerSuite:
> - test isUseLocalNodeSSLConfig
> - test maybeUpdateSSLSettings
> - test clearing of finishedExecutors (small number of executors)
> - test clearing of finishedExecutors (more executors)
> - test clearing of finishedDrivers (small number of drivers)
> - test clearing of finishedDrivers (more drivers)
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time:  47.973 s
> [INFO] Finished at: 2021-10-28T13:46:56+08:00
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project 
> spark-core_2.12: There are test failures -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> {code}
> {code:java}
> 21/10/28 13:46:56.133 dispatcher-event-loop-1 ERROR Utils: Failed to create 
> directory /tmp
> java.nio.file.FileAlreadyExistsException: /tmp
>         at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
>         at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>         at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>         at 
> sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
>         at java.nio.file.Files.createDirectory(Files.java:674)
>         at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
>         at java.nio.file.Files.createDirectories(Files.java:727)
>         at org.apache.spark.util.Utils$.createDirectory(Utils.scala:292)
>         at 
> org.apache.spark.deploy.worker.Worker.createWorkDir(Worker.scala:221)
>         at org.apache.spark.deploy.worker.Worker.onStart(Worker.scala:232)
>         at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
>  
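
For context, a likely explanation of this failure (my reading of the stack
trace, not stated in the ticket): on macOS, /tmp is a symbolic link to
/private/tmp, and `Files.createDirectories` treats an existing symlink as
"exists but is not a directory" because it re-checks the path with
NOFOLLOW_LINKS semantics. A minimal sketch that shows the behavior:

{code:java}
import java.nio.file.{Files, LinkOption, Paths}

// On macOS, /tmp is a symlink to /private/tmp.
val tmp = Paths.get("/tmp")
Files.isSymbolicLink(tmp)                          // true on macOS
Files.isDirectory(tmp)                             // true (follows the link)
Files.isDirectory(tmp, LinkOption.NOFOLLOW_LINKS)  // false: it is a link

// createDirectories sees an existing entry that is not a plain directory and
// throws FileAlreadyExistsException, matching the log above.
Files.createDirectories(tmp)
{code}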






[jira] [Updated] (SPARK-35862) Watermark timestamp only can be format in UTC timeZone, unfriendly to users in other time zones

2021-06-23 Thread Yazhi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yazhi Wang updated SPARK-35862:
---
Description: 
Timestamps are formatted in `ProgressReporter` by `formatTimestamp` for the
watermark and eventTime stats, and the timestampFormat is hardcoded to the UTC
time zone:

{code:java}
private val timestampFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") // ISO8601
timestampFormat.setTimeZone(DateTimeUtils.getTimeZone("UTC"))
{code}

When users set a different time zone via the Java option `-Duser.timezone`,
they may be confused by output that mixes time zones, e.g.:

{code:java}
2021-06-23 16:12:07 [stream execution thread for [id = 92f4f363-df85-48e9-aef9-5ea6f2b70316, runId = 5733ef8e-11d1-46c4-95cc-219bde6e7a20]] INFO [MicroBatchExecution:54]: Streaming query made progress: {
  "id" : "92f4f363-df85-48e9-aef9-5ea6f2b70316",
  "runId" : "5733ef8e-11d1-46c4-95cc-219bde6e7a20",
  "name" : null,
  "timestamp" : "2021-06-23T08:11:56.790Z",
  "batchId" : 91740,
  "numInputRows" : 2577,
  "inputRowsPerSecond" : 155.33453887884266,
  "processedRowsPerSecond" : 242.29033471229786,
  "durationMs" : {
    "addBatch" : 8671,
    "getBatch" : 3,
    "getOffset" : 1139,
    "queryPlanning" : 79,
    "triggerExecution" : 10636,
    "walCommit" : 162
  },
  "eventTime" : {
    "avg" : "2021-06-23T08:11:46.307Z",
    "max" : "2021-06-23T08:11:55.000Z",
    "min" : "2021-06-23T08:11:37.000Z",
    "watermark" : "2021-06-23T07:41:39.000Z"
  },
  ...
{code}

The log line itself is stamped 2021-06-23 16:12:07 in the local time zone,
while the eventTime and watermark stats are formatted in UTC (08:11). Maybe we
need to unify the time zone used for these timestamps.

  was:
Timestamps are formatted in `ProgressReporter` by `formatTimestamp` for the
watermark and eventTime stats, and the timestampFormat is hardcoded to the UTC
time zone:

{code:java}
private val timestampFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") // ISO8601
timestampFormat.setTimeZone(DateTimeUtils.getTimeZone("UTC"))
{code}

When users set a different time zone via the Java option `-Duser.timezone`,
they may be confused by output that mixes time zones, e.g.:

{code:java}
2021-06-23 16:12:07 [stream execution thread for [id = 92f4f363-df85-48e9-aef9-5ea6f2b70316, runId = 5733ef8e-11d1-46c4-95cc-219bde6e7a20]] INFO [MicroBatchExecution:54]: Streaming query made progress: {
  "id" : "92f4f363-df85-48e9-aef9-5ea6f2b70316",
  "runId" : "5733ef8e-11d1-46c4-95cc-219bde6e7a20",
  "name" : null,
  "timestamp" : "2021-06-23T08:11:56.790Z",
  "batchId" : 91740,
  "numInputRows" : 2577,
  "inputRowsPerSecond" : 155.33453887884266,
  "processedRowsPerSecond" : 242.29033471229786,
  "durationMs" : {
    "addBatch" : 8671,
    "getBatch" : 3,
    "getOffset" : 1139,
    "queryPlanning" : 79,
    "triggerExecution" : 10636,
    "walCommit" : 162
  },
  "eventTime" : {
    "avg" : "2021-06-23T08:11:46.307Z",
    "max" : "2021-06-23T08:11:55.000Z",
    "min" : "2021-06-23T08:11:37.000Z",
    "watermark" : "2021-06-23T07:41:39.000Z"
  },
  ...
{code}

The log line itself is stamped 2021-06-23 16:12:07 in the local time zone,
while the eventTime and watermark stats are formatted in UTC (08:11). Maybe we
need to unify the time zone used for these timestamps.


> Watermark timestamp only can be format in UTC timeZone, unfriendly to users 
> in other time zones
> ---
>
> Key: SPARK-35862
> URL: https://issues.apache.org/jira/browse/SPARK-35862
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.2
>Reporter: Yazhi Wang
>Priority: Minor
>
> Timestamps are formatted in `ProgressReporter` by `formatTimestamp` for the
> watermark and eventTime stats, and the timestampFormat is hardcoded to the
> UTC time zone:
> {code:java}
> private val timestampFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") // ISO8601
> timestampFormat.setTimeZone(DateTimeUtils.getTimeZone("UTC"))
> {code}
> When users set a different time zone via the Java option `-Duser.timezone`,
> they may be confused by output that mixes time zones, e.g.:
> {code:java}
> 2021-06-23 16:12:07 [stream execution thread for [id = 92f4f363-df85-48e9-aef9-5ea6f2b70316, runId = 5733ef8e-11d1-46c4-95cc-219bde6e7a20]] INFO [MicroBatchExecution:54]: Streaming query made progress: {
>   "id" : "92f4f363-df85-48e9-aef9-5ea6f2b70316",
>   "runId" : "5733ef8e-11d1-46c4-95cc-219bde6e7a20",
>   "name" : null,
>   "timestamp" : "2021-06-23T08:11:56.790Z",
>   "batchId" : 91740,
>   "numInputRows" : 2577,
>   "inputRowsPerSecond" : 155.33453887884266,
>   "processedRowsPerSecond" : 242.29033471229786,
>   "durationMs" : {
>     "addBatch" : 8671,
>     "getBatch" : 3,
>     "getOffset" : 1139,
>     "queryPlanning" : 79,
>     "triggerExecution" : 10636,
>     "walCommit" : 162
>   },
>   "eventTime" : {
>     "avg" : "2021-06-23T08:11:46.307Z",
>     "max" : "2021-06-23T08:11:55.000Z",
>     "min" : "2021-06-23T08:11:37.000Z",
>     "watermark" : "2021-06-23T07:41:39.000Z"
>   },
>   ...
> {code}
> The log line itself is stamped 2021-06-23 16:12:07 in the local time zone,
> while the eventTime and watermark stats are formatted in UTC (08:11). Maybe
> we need to unify the time zone used for these timestamps.




[jira] [Commented] (SPARK-35862) Watermark timestamp only can be format in UTC timeZone, unfriendly to users in other time zones

2021-06-23 Thread Yazhi Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368022#comment-17368022
 ] 

Yazhi Wang commented on SPARK-35862:


I'm working on it

> Watermark timestamp only can be format in UTC timeZone, unfriendly to users 
> in other time zones
> ---
>
> Key: SPARK-35862
> URL: https://issues.apache.org/jira/browse/SPARK-35862
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.2
>Reporter: Yazhi Wang
>Priority: Minor
>
> Timestamps are formatted in `ProgressReporter` by `formatTimestamp` for the
> watermark and eventTime stats, and the timestampFormat is hardcoded to the
> UTC time zone:
> {code:java}
> private val timestampFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") // ISO8601
> timestampFormat.setTimeZone(DateTimeUtils.getTimeZone("UTC"))
> {code}
> When users set a different time zone via the Java option `-Duser.timezone`,
> they may be confused by output that mixes time zones, e.g.:
> {code:java}
> 2021-06-23 16:12:07 [stream execution thread for [id = 92f4f363-df85-48e9-aef9-5ea6f2b70316, runId = 5733ef8e-11d1-46c4-95cc-219bde6e7a20]] INFO [MicroBatchExecution:54]: Streaming query made progress: {
>   "id" : "92f4f363-df85-48e9-aef9-5ea6f2b70316",
>   "runId" : "5733ef8e-11d1-46c4-95cc-219bde6e7a20",
>   "name" : null,
>   "timestamp" : "2021-06-23T08:11:56.790Z",
>   "batchId" : 91740,
>   "numInputRows" : 2577,
>   "inputRowsPerSecond" : 155.33453887884266,
>   "processedRowsPerSecond" : 242.29033471229786,
>   "durationMs" : {
>     "addBatch" : 8671,
>     "getBatch" : 3,
>     "getOffset" : 1139,
>     "queryPlanning" : 79,
>     "triggerExecution" : 10636,
>     "walCommit" : 162
>   },
>   "eventTime" : {
>     "avg" : "2021-06-23T08:11:46.307Z",
>     "max" : "2021-06-23T08:11:55.000Z",
>     "min" : "2021-06-23T08:11:37.000Z",
>     "watermark" : "2021-06-23T07:41:39.000Z"
>   },
>   ...
> {code}
> The log line itself is stamped 2021-06-23 16:12:07 in the local time zone,
> while the eventTime and watermark stats are formatted in UTC (08:11). Maybe
> we need to unify the time zone used for these timestamps.






[jira] [Created] (SPARK-35862) Watermark timestamp only can be format in UTC timeZone, unfriendly to users in other time zones

2021-06-23 Thread Yazhi Wang (Jira)
Yazhi Wang created SPARK-35862:
--

 Summary: Watermark timestamp only can be format in UTC timeZone, 
unfriendly to users in other time zones
 Key: SPARK-35862
 URL: https://issues.apache.org/jira/browse/SPARK-35862
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.1.2
Reporter: Yazhi Wang


Timestamps are formatted in `ProgressReporter` by `formatTimestamp` for the
watermark and eventTime stats, and the timestampFormat is hardcoded to the UTC
time zone:

{code:java}
private val timestampFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") // ISO8601
timestampFormat.setTimeZone(DateTimeUtils.getTimeZone("UTC"))
{code}

When users set a different time zone via the Java option `-Duser.timezone`,
they may be confused by output that mixes time zones, e.g.:

{code:java}
2021-06-23 16:12:07 [stream execution thread for [id = 92f4f363-df85-48e9-aef9-5ea6f2b70316, runId = 5733ef8e-11d1-46c4-95cc-219bde6e7a20]] INFO [MicroBatchExecution:54]: Streaming query made progress: {
  "id" : "92f4f363-df85-48e9-aef9-5ea6f2b70316",
  "runId" : "5733ef8e-11d1-46c4-95cc-219bde6e7a20",
  "name" : null,
  "timestamp" : "2021-06-23T08:11:56.790Z",
  "batchId" : 91740,
  "numInputRows" : 2577,
  "inputRowsPerSecond" : 155.33453887884266,
  "processedRowsPerSecond" : 242.29033471229786,
  "durationMs" : {
    "addBatch" : 8671,
    "getBatch" : 3,
    "getOffset" : 1139,
    "queryPlanning" : 79,
    "triggerExecution" : 10636,
    "walCommit" : 162
  },
  "eventTime" : {
    "avg" : "2021-06-23T08:11:46.307Z",
    "max" : "2021-06-23T08:11:55.000Z",
    "min" : "2021-06-23T08:11:37.000Z",
    "watermark" : "2021-06-23T07:41:39.000Z"
  },
  ...
{code}

The log line itself is stamped 2021-06-23 16:12:07 in the local time zone,
while the eventTime and watermark stats are formatted in UTC (08:11). Maybe we
need to unify the time zone used for these timestamps.
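
A minimal sketch of the kind of change being suggested (illustrative only, not
a committed patch): derive the formatter's time zone from the JVM default set
via `-Duser.timezone`, and print the zone offset explicitly, instead of
hardcoding UTC:

{code:java}
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// Illustrative sketch: format progress timestamps in the JVM default time
// zone (honoring -Duser.timezone) with an explicit ISO8601 offset.
val timestampFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
timestampFormat.setTimeZone(TimeZone.getDefault)

timestampFormat.format(new Date())
// e.g. "2021-06-23T16:11:56.790+08:00" when -Duser.timezone=Asia/Shanghai
{code}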






[jira] [Commented] (SPARK-35796) UT `handles k8s cluster mode` fails on MacOs >= 10.15

2021-06-17 Thread Yazhi Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364819#comment-17364819
 ] 

Yazhi Wang commented on SPARK-35796:


I'm working on it

> UT `handles k8s cluster mode` fails on MacOs >= 10.15
> -
>
> Key: SPARK-35796
> URL: https://issues.apache.org/jira/browse/SPARK-35796
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.3
> Environment: MacOs 10.15.7
>Reporter: Yazhi Wang
>Priority: Minor
>
> When I run SparkSubmitSuite on macOS 10.15.7, the `handles k8s cluster mode`
> test fails with an AssertionError after PR
> [SPARK-35691|https://issues.apache.org/jira/browse/SPARK-35691], because
> `File(path).getCanonicalFile().toURI()` called with an absolute path returns
> a path beginning with /System/Volumes/Data.
> E.g. /home/testjars.jar becomes
> [file:/System/Volumes/Data/home/testjars.jar|file:///System/Volumes/Data/home/testjars.jar]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35796) UT `handles k8s cluster mode` fails on MacOs >= 10.15

2021-06-17 Thread Yazhi Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yazhi Wang updated SPARK-35796:
---
Description: 
When I run SparkSubmitSuite on macOS 10.15.7, the `handles k8s cluster mode`
test fails with an AssertionError after PR
[SPARK-35691|https://issues.apache.org/jira/browse/SPARK-35691], because
`File(path).getCanonicalFile().toURI()` called with an absolute path returns a
path beginning with /System/Volumes/Data.

E.g. /home/testjars.jar becomes
[file:/System/Volumes/Data/home/testjars.jar|file:///System/Volumes/Data/home/testjars.jar]

  was:
When I run SparkSubmitSuite on macOS 10.15.7, the `handles k8s cluster mode`
test fails with an AssertionError after PR SPARK-35691, because
`File(path).getCanonicalFile().toURI()` called with an absolute path returns a
path beginning with /System/Volumes/Data.

E.g. /home/testjars.jar becomes file:/System/Volumes/Data/home/testjars.jar


> UT `handles k8s cluster mode` fails on MacOs >= 10.15
> -
>
> Key: SPARK-35796
> URL: https://issues.apache.org/jira/browse/SPARK-35796
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.1.3
> Environment: MacOs 10.15.7
>Reporter: Yazhi Wang
>Priority: Minor
>
> When I run SparkSubmitSuite on macOS 10.15.7, the `handles k8s cluster mode`
> test fails with an AssertionError after PR
> [SPARK-35691|https://issues.apache.org/jira/browse/SPARK-35691], because
> `File(path).getCanonicalFile().toURI()` called with an absolute path returns
> a path beginning with /System/Volumes/Data.
> E.g. /home/testjars.jar becomes
> [file:/System/Volumes/Data/home/testjars.jar|file:///System/Volumes/Data/home/testjars.jar]






[jira] [Created] (SPARK-35796) UT `handles k8s cluster mode` fails on MacOs >= 10.15

2021-06-17 Thread Yazhi Wang (Jira)
Yazhi Wang created SPARK-35796:
--

 Summary: UT `handles k8s cluster mode` fails on MacOs >= 10.15
 Key: SPARK-35796
 URL: https://issues.apache.org/jira/browse/SPARK-35796
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.1.3
 Environment: MacOs 10.15.7
Reporter: Yazhi Wang


When I run SparkSubmitSuite on macOS 10.15.7, the `handles k8s cluster mode`
test fails with an AssertionError after PR SPARK-35691, because
`File(path).getCanonicalFile().toURI()` called with an absolute path returns a
path beginning with /System/Volumes/Data.

E.g. /home/testjars.jar becomes file:/System/Volumes/Data/home/testjars.jar
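
For illustration, a small sketch of the behavior in question (assumes the
macOS >= 10.15 volume layout with the /System/Volumes/Data firmlink; not part
of the ticket itself):

{code:java}
import java.io.File

// On macOS 10.15+, canonicalization resolves paths through the
// /System/Volumes/Data firmlink, so the canonical URI differs from the
// absolute one even though both refer to the same file.
val path = "/home/testjars.jar"
new File(path).getAbsoluteFile().toURI()
// file:/home/testjars.jar
new File(path).getCanonicalFile().toURI()
// file:/System/Volumes/Data/home/testjars.jar  (on macOS >= 10.15)
{code}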


