[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Description: 
SPARK-24355 introduced syncMode to mitigate SASL timeout issues by throttling 
the maximum number of threads used for sending responses to chunk fetch 
requests. However, it causes severe performance degradation because it reduces 
the throughput of handling chunk fetch requests. SPARK-30623 made the async and 
sync modes configurable and made async mode the default.

SPARK-30512 uses a dedicated boss event loop to mitigate the SASL timeout 
issue, and we rarely see SASL timeouts with async mode in our production 
clusters today.

A few days ago we accidentally turned on sync mode on one cluster and observed 
severe shuffle performance degradation. As a result, we benchmarked async mode 
against sync mode, and *we suggest removing sync mode from the code base* as it 
no longer seems to provide any benefit. We would like to share the benchmark 
results and hear the community's opinion.

 

Benchmark on job run time (sync mode is 2x-3x slower):
 YARN cluster setup: 6 nodes, 18 executors; each executor has 1 core and 3 GB 
of memory; each node manager has a 1 GB heap.

Shuffle stages: 5 GB of shuffle data (400M key-value records), 1000 map tasks 
and 1000 reduce tasks.

Results: shuffle-reading 5 GB of data takes 2-3 minutes in async mode and 6 
minutes in sync mode.

 

Benchmark on external shuffle service metrics:
 YARN cluster setup: 4 nodes in total. I set 2 nodes to async mode and 2 nodes 
to sync mode, shuffling 2.5 GB of data.

Results: in openblockrequestslatencymillis_ratemean and some other metrics, the 
sync-mode nodes are 3x-4x higher than the async-mode nodes. I attached some 
screenshots of the metrics.
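The throughput gap above can be illustrated outside Spark. The sketch below is not Spark code; it only mimics, with a thread pool and a simulated network write, why awaiting each chunk-fetch response before handling the next request (sync mode) serializes writes, while queuing them all (async mode) lets writes overlap. All names are hypothetical.

```python
# Illustrative sketch only (not Spark code): sync mode blocks on every
# response write; async mode queues all writes and drains them concurrently.
from concurrent.futures import ThreadPoolExecutor
import time

def send_response(chunk):
    time.sleep(0.01)  # stand-in for a network write
    return chunk

def handle_sync(chunks, pool):
    # sync mode: wait for each write to finish before the next request
    return [pool.submit(send_response, c).result() for c in chunks]

def handle_async(chunks, pool):
    # async mode: submit every write first, then collect results
    futures = [pool.submit(send_response, c) for c in chunks]
    return [f.result() for f in futures]

pool = ThreadPoolExecutor(max_workers=8)
t0 = time.time(); handle_sync(range(16), pool); sync_t = time.time() - t0
t0 = time.time(); handle_async(range(16), pool); async_t = time.time() - t0
print(sync_t > async_t)  # prints True: sync serializes, async overlaps
```

Both handlers return the same responses; only the degree of overlap differs, which matches the 2x-3x run-time gap seen in the benchmark.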


> Remove await (syncMode) in ChunkFetchRequestHandler
> ---
>
> Key: SPARK-35865
> URL: https://issues.apache.org/jira/browse/SPARK-35865
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: openblock-compare.png, openblock.png
>


[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Attachment: openblock-compare.png

> Remove await (syncMode) in ChunkFetchRequestHandler
> ---
>
> Key: SPARK-35865
> URL: https://issues.apache.org/jira/browse/SPARK-35865
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: openblock-compare.png, openblock.png
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-35865:
---

 Summary: Remove await (syncMode) in ChunkFetchRequestHandler
 Key: SPARK-35865
 URL: https://issues.apache.org/jira/browse/SPARK-35865
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.1.2, 2.4.8
Reporter: Baohe Zhang
 Attachments: openblock-compare.png, openblock.png







[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Attachment: openblock.png

> Remove await (syncMode) in ChunkFetchRequestHandler
> ---
>
> Key: SPARK-35865
> URL: https://issues.apache.org/jira/browse/SPARK-35865
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: openblock-compare.png, openblock.png
>
>






[jira] [Created] (SPARK-35010) nestedSchemaPruning causes issue when reading hive generated Orc files

2021-04-09 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-35010:
---

 Summary: nestedSchemaPruning causes issue when reading hive 
generated Orc files
 Key: SPARK-35010
 URL: https://issues.apache.org/jira/browse/SPARK-35010
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0, 3.0.0
Reporter: Baohe Zhang


In Spark 3, we have spark.sql.orc.impl=native and 
spark.sql.optimizer.nestedSchemaPruning.enabled=true as the default settings, 
and these can cause issues when querying struct fields of Hive-generated ORC 
files.

For example, we got an error when running this query in Spark 3
{code:java}
spark.table("testtable").filter(col("utc_date") === 
"20210122").select(col("open_count.d35")).show(false)
{code}
The error is
{code:java}
Caused by: java.lang.AssertionError: assertion failed: The given data schema 
struct>> has less fields than the 
actual ORC physical schema, no idea which columns were dropped, fail to read.
  at scala.Predef$.assert(Predef.scala:223)
  at 
org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:153)
  at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
  at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
{code}
 

I think the reason is that we apply nestedSchemaPruning to the dataSchema. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L75]


nestedSchemaPruning not only prunes the unused fields of the struct, it also 
prunes unused columns. In my test, the dataSchema originally had 48 columns, 
but after nested schema pruning it was pruned to 1 column. This pruning results 
in an assertion error in 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L159]
 because column pruning is not supported for Hive-generated ORC files.

This issue also seems related to the Hive version: we use Hive 1.2, which 
doesn't include field names in the physical schema.
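Until the pruning logic is fixed, a workaround sketch would be to revert either of the two defaults named above. This assumes a PySpark session bound to a hypothetical variable `spark`; the config keys themselves are the ones quoted in this issue.

```python
# Hedged workaround sketch (assumes an existing SparkSession `spark`):
# either disable nested schema pruning for this session...
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
# ...or fall back to the Hive ORC reader, which skips the native reader's
# physical-schema assertion.
spark.conf.set("spark.sql.orc.impl", "hive")
```

Either setting trades the optimization for compatibility with Hive 1.2 ORC files; they are session-level config fragments, not a fix for the pruning bug itself.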






[jira] [Commented] (SPARK-34779) ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs

2021-03-31 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312498#comment-17312498
 ] 

Baohe Zhang commented on SPARK-34779:
-

Thanks for pointing it out! I wasn't aware that task peak metrics contribute to 
the executor peak metrics.

> ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat 
> occurs
> ---
>
> Key: SPARK-34779
> URL: https://issues.apache.org/jira/browse/SPARK-34779
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Baohe Zhang
>Priority: Major
>
> The current implementation of ExecutorMetricsPoller uses the task count in each 
> stage to decide whether to keep a stage entry or not. When the executor has 
> only 1 core, this can cause:
>  # Missing peak metrics (due to a stage entry being removed within a heartbeat 
> interval)
>  # Unnecessary and frequent hashmap entry removal and insertion.
> Assuming an executor with 1 core has 2 tasks (task1 and task2, both belonging 
> to stage (0,0)) to execute within a heartbeat interval, the workflow in the 
> current ExecutorMetricsPoller implementation would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count 
> incremented to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decremented to 0, stage (0, 0) entry 
> removed, peak metrics lost.
> 4. task2 start -> stage (0, 0) entry created in stageTCMP, task count 
> incremented to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decremented to 0, stage (0, 0) entry 
> removed, peak metrics lost.
> 7. heartbeat() -> empty or inaccurate peak metrics reported for stage (0,0).
> We can fix the issue by keeping entries with task count = 0 in the stageTCMP 
> map until a heartbeat occurs. At the heartbeat, after reporting the peak 
> metrics for each stage, we scan stageTCMP and remove entries with task 
> count = 0.
> After the fix, the workflow would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count 
> incremented to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decremented to 0, but the entry (0,0) 
> still remains.
> 4. task2 start -> task count of stage (0,0) incremented to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decremented to 0, but the entry (0,0) 
> still remains.
> 7. heartbeat() -> accurate peak metrics reported for stage (0, 0). Remove the 
> entry for stage (0,0) from stageTCMP because its task count is 0.
>  
> How to verify the behavior?
> Submit a job with a custom polling interval (e.g., 2s) and 
> spark.executor.cores=1, and check the debug logs of ExecutorMetricsPoller.
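The fixed bookkeeping described in the quoted steps can be sketched in a few lines. This is a hedged illustration, not the Scala implementation: the class and method names are hypothetical, and it models only the task-count lifecycle (a zero-count entry survives until the next heartbeat, which reports and then drops it).

```python
# Sketch of the proposed stageTCMP lifecycle (hypothetical names):
# entries with task count 0 are kept until the next heartbeat.
class StagePollerSketch:
    def __init__(self):
        self.stage_tcmp = {}  # (stageId, attemptId) -> running task count

    def on_task_start(self, stage):
        self.stage_tcmp[stage] = self.stage_tcmp.get(stage, 0) + 1

    def on_task_end(self, stage):
        # decrement only; the entry stays even when the count reaches 0
        self.stage_tcmp[stage] -= 1

    def on_heartbeat(self):
        reported = list(self.stage_tcmp)  # peaks reported for every entry
        for stage in [s for s, n in self.stage_tcmp.items() if n == 0]:
            del self.stage_tcmp[stage]  # drop idle stages after reporting
        return reported

p = StagePollerSketch()
p.on_task_start((0, 0)); p.on_task_end((0, 0))   # task1 on a 1-core executor
p.on_task_start((0, 0)); p.on_task_end((0, 0))   # task2, same stage (0, 0)
print(p.on_heartbeat())  # [(0, 0)]: the stage survived until the heartbeat
print(p.stage_tcmp)      # {}: removed only after its peaks were reported
```

In the pre-fix behavior, `on_task_end` would delete the entry immediately when the count hit 0, so the heartbeat in this scenario would report nothing for stage (0, 0).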






[jira] [Updated] (SPARK-34845) ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of child pids metrics are missing

2021-03-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34845:

Description: 
When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() may return partial metrics (the sum of 
a subset of child pids), instead of an all 0 result. This can be misleading and 
is undesired per the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

How to reproduce it?

This unit test is kind of self-explanatory:

 
{code:java}
val p = new ProcfsMetricsGetter(getTestResourcePath("ProcfsMetrics"))
val mockedP = spy(p)

// proc file of pid 22764 doesn't exist, so partial metrics shouldn't be 
returned
var ptree = Set(26109, 22764, 22763)
when(mockedP.computeProcessTree).thenReturn(ptree)
var r = mockedP.computeAllMetrics
assert(r.jvmVmemTotal == 0)
assert(r.jvmRSSTotal == 0)
assert(r.pythonVmemTotal == 0)
assert(r.pythonRSSTotal == 0)
{code}
In the current implementation, computeAllMetrics will reset the allMetrics to 0 
when processing 22764 because 22764's proc file doesn't exist, but then it will 
continue processing pid 22763, and update allMetrics to procfs metrics of pid 
22763.

Also, a side effect of this bug is that it can lead to verbose warning logs if 
many pids' stat files are missing. Terminating early makes the warning logs 
more concise.

How to solve it?

The issue can be fixed by propagating an IOException up to computeAllMetrics(); 
that way computeAllMetrics can detect that at least one child pid's procfs 
metrics are missing and terminate the metrics reporting.
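The proposed fix can be reduced to a small aggregation sketch. This is not the Scala code from ProcfsMetricsGetter; the function names and the dict standing in for /proc are hypothetical, and only the error-propagation shape matters: one missing child pid aborts the sum and yields all-zero metrics instead of a misleading partial total.

```python
# Hedged sketch of the fix: a missing stat file raises, and the aggregator
# returns zeros instead of the partial sum it has accumulated so far.
def read_proc_metrics(stat_files, pid):
    try:
        return stat_files[pid]  # stands in for parsing /proc/<pid>/stat
    except KeyError:
        raise IOError(f"stat file for pid {pid} is missing")

def compute_all_metrics(stat_files, pids):
    total = 0
    try:
        for pid in pids:
            total += read_proc_metrics(stat_files, pid)
    except IOError:
        # at least one child pid is gone: stop early, report zeros
        return 0
    return total

stats = {26109: 100, 22763: 50}          # pid 22764 intentionally absent
print(compute_all_metrics(stats, [26109, 22764, 22763]))  # prints 0, not 150
```

Without the try/except around the loop, the pid set from the unit test above would produce 150 (the sum of the two surviving pids), which is exactly the partial-metrics behavior this issue reports.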

  was:
When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum of a 
subset of child pids), instead of an all 0 result. This can be misleading and 
is undesired per the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

 

Also, a side effect of it is that it can lead to verbose warning logs if many 
pids' stat files are missing.
{code:java}
e.g.
2021-03-21 16:58:25,422 [pool-26-thread-8] WARN  org.apache.spark.executor.ProcfsMetricsGetter - There was a problem with reading the stat file of the process.
java.io.FileNotFoundException: /proc/742/stat (No such file or directory)
  at java.io.FileInputStream.open0(Native Method)
  at java.io.FileInputStream.open(FileInputStream.java:195)
  at java.io.FileInputStream.<init>(FileInputStream.java:138)
  at org.apache.spark.executor.ProcfsMetricsGetter.openReader$1(ProcfsMetricsGetter.scala:203)
  at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$addProcfsMetricsFromOneProcess$1(ProcfsMetricsGetter.scala:205)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2540)
  at org.apache.spark.executor.ProcfsMetricsGetter.addProcfsMetricsFromOneProcess(ProcfsMetricsGetter.scala:205)
  at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$computeAllMetrics$1(ProcfsMetricsGetter.scala:297)
{code}
The issue can be fixed by updating the flag isAvailable to false when one of 
the child pid's procfs metric is unavailable. Other methods computePid, 
computePageSize, and getChildPids already have this behavior.

Summary: ProcfsMetricsGetter.computeAllMetrics may return partial 
metrics when some of child pids metrics are missing  (was: 
ProcfsMetricsGetter.computeAllMetrics shouldn't return partial metrics when 
some of child pids metrics are missing)

> ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of 
> child pids metrics are missing
> 
>
> Key: SPARK-34845
> URL: https://issues.apache.org/jira/browse/SPARK-34845
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Baohe Zhang
>Priority: Major
>

[jira] [Updated] (SPARK-34845) ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of child pids metrics are missing

2021-03-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34845:

Description: 
When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() may return partial metrics (the sum of 
a subset of child pids), instead of an all 0 result. This can be misleading and 
is undesired per the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

How to reproduce it?

This unit test is kind of self-explanatory:
{code:java}
val p = new ProcfsMetricsGetter(getTestResourcePath("ProcfsMetrics"))
val mockedP = spy(p)

// proc file of pid 22764 doesn't exist, so partial metrics shouldn't be 
returned
var ptree = Set(26109, 22764, 22763)
when(mockedP.computeProcessTree).thenReturn(ptree)
var r = mockedP.computeAllMetrics
assert(r.jvmVmemTotal == 0)
assert(r.jvmRSSTotal == 0)
assert(r.pythonVmemTotal == 0)
assert(r.pythonRSSTotal == 0)
{code}
In the current implementation, computeAllMetrics will reset the allMetrics to 0 
when processing 22764 because 22764's proc file doesn't exist, but then it will 
continue processing pid 22763, and update allMetrics to procfs metrics of pid 
22763.

Also, a side effect of this bug is that it can lead to a verbose warning log if 
many pids' stat files are missing. An early terminating can make the warning 
logs more concise.

How to solve it?

The issue can be fixed by throwing IOException to computeAllMetrics(), in that 
case computeAllMetrics can aware that at lease one child pid's procfs metrics 
is missing and then terminate the metrics reporting.

  was:
When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() may return partial metrics (the sum of 
a subset of child pids), instead of an all 0 result. This can be misleading and 
is undesired per the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

How to reproduce it?

This unit test is kind of self-explanatory:

 
{code:java}
val p = new ProcfsMetricsGetter(getTestResourcePath("ProcfsMetrics"))
val mockedP = spy(p)

// proc file of pid 22764 doesn't exist, so partial metrics shouldn't be 
returned
var ptree = Set(26109, 22764, 22763)
when(mockedP.computeProcessTree).thenReturn(ptree)
var r = mockedP.computeAllMetrics
assert(r.jvmVmemTotal == 0)
assert(r.jvmRSSTotal == 0)
assert(r.pythonVmemTotal == 0)
assert(r.pythonRSSTotal == 0)
{code}
In the current implementation, computeAllMetrics will reset the allMetrics to 0 
when processing 22764 because 22764's proc file doesn't exist, but then it will 
continue processing pid 22763, and update allMetrics to procfs metrics of pid 
22763.

Also, a side effect of this bug is that it can lead to verbose warning logs if 
many pids' stat files are missing. Terminating early makes the warning logs 
more concise.

How to solve it?

The issue can be fixed by propagating the IOException up to computeAllMetrics(); 
computeAllMetrics can then detect that at least one child pid's procfs metrics 
are missing and terminate the metrics reporting.


> ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of 
> child pids metrics are missing
> 
>
> Key: SPARK-34845
> URL: https://issues.apache.org/jira/browse/SPARK-34845
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Baohe Zhang
>Priority: Major
>
> When the procfs metrics of some child pids are unavailable, 
> ProcfsMetricsGetter.computeAllMetrics() may return partial metrics (the sum 
> of a subset of child pids), instead of an all 0 result. This can be 
> misleading and is undesired per the current code comments in 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].
> How to reproduce it?
> This unit test is kind of self-explanatory:
> {code:java}
> val p = new ProcfsMetricsGetter(getTestResourcePath("ProcfsMetrics"))
> val mockedP = spy(p)
> // proc file of pid 22764 doesn't exist, so partial metrics shouldn't be 
> returned
> var ptree = Set(26109, 22764, 22763)
> when(mockedP.computeProcessTree).thenReturn(ptree)
> var r = mockedP.computeAllMetrics
> assert(r.jvmVmemTotal == 0)
> assert(r.jvmRSSTotal == 0)
> assert(r.pythonVmemTotal == 0)
> assert(r.pythonRSSTotal == 0)
> {code}
> In the current implementation, computeAllMetrics will reset the allMetrics to 
> 0 when processing 22764 because 22764's proc file doesn't exist, but then it 
> will continue processing pid 22763, and update allMetrics to procfs metrics of 
> pid 22763.

[jira] [Updated] (SPARK-34845) ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of child pids metrics are missing

2021-03-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34845:

Description: 
When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() may return partial metrics (the sum of 
a subset of child pids), instead of an all 0 result. This can be misleading and 
is undesired per the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

How to reproduce it?

This unit test is kind of self-explanatory:
{code:java}
val p = new ProcfsMetricsGetter(getTestResourcePath("ProcfsMetrics"))
val mockedP = spy(p)

// proc file of pid 22764 doesn't exist, so partial metrics shouldn't be 
returned
var ptree = Set(26109, 22764, 22763)
when(mockedP.computeProcessTree).thenReturn(ptree)
var r = mockedP.computeAllMetrics
assert(r.jvmVmemTotal == 0)
assert(r.jvmRSSTotal == 0)
assert(r.pythonVmemTotal == 0)
assert(r.pythonRSSTotal == 0)
{code}
In the current implementation, computeAllMetrics will reset the allMetrics to 0 
when processing 22764 because 22764's proc file doesn't exist, but then it will 
continue processing pid 22763, and update allMetrics to procfs metrics of pid 
22763.

Also, a side effect of this bug is that it can lead to verbose warning logs if 
many pids' stat files are missing. Terminating early makes the warning logs 
more concise.

How to solve it?

The issue can be fixed by propagating the IOException up to computeAllMetrics(); 
computeAllMetrics can then detect that at least one child pid's procfs metrics 
are missing and terminate the metrics reporting.

  was:
When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() may return partial metrics (the sum of 
a subset of child pids), instead of an all 0 result. This can be misleading and 
is undesired per the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

How to reproduce it?

This unit test is kind of self-explanatory:
{code:java}
val p = new ProcfsMetricsGetter(getTestResourcePath("ProcfsMetrics"))
val mockedP = spy(p)

// proc file of pid 22764 doesn't exist, so partial metrics shouldn't be 
returned
var ptree = Set(26109, 22764, 22763)
when(mockedP.computeProcessTree).thenReturn(ptree)
var r = mockedP.computeAllMetrics
assert(r.jvmVmemTotal == 0)
assert(r.jvmRSSTotal == 0)
assert(r.pythonVmemTotal == 0)
assert(r.pythonRSSTotal == 0)
{code}
In the current implementation, computeAllMetrics will reset the allMetrics to 0 
when processing 22764 because 22764's proc file doesn't exist, but then it will 
continue processing pid 22763, and update allMetrics to procfs metrics of pid 
22763.

Also, a side effect of this bug is that it can lead to verbose warning logs if 
many pids' stat files are missing. Terminating early makes the warning logs 
more concise.

How to solve it?

The issue can be fixed by propagating the IOException up to computeAllMetrics(); 
computeAllMetrics can then detect that at least one child pid's procfs metrics 
are missing and terminate the metrics reporting.


> ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of 
> child pids metrics are missing
> 
>
> Key: SPARK-34845
> URL: https://issues.apache.org/jira/browse/SPARK-34845
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Baohe Zhang
>Priority: Major
>
> When the procfs metrics of some child pids are unavailable, 
> ProcfsMetricsGetter.computeAllMetrics() may return partial metrics (the sum 
> of a subset of child pids), instead of an all 0 result. This can be 
> misleading and is undesired per the current code comments in 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].
> How to reproduce it?
> This unit test is kind of self-explanatory:
> {code:java}
> val p = new ProcfsMetricsGetter(getTestResourcePath("ProcfsMetrics"))
> val mockedP = spy(p)
> // proc file of pid 22764 doesn't exist, so partial metrics shouldn't be 
> returned
> var ptree = Set(26109, 22764, 22763)
> when(mockedP.computeProcessTree).thenReturn(ptree)
> var r = mockedP.computeAllMetrics
> assert(r.jvmVmemTotal == 0)
> assert(r.jvmRSSTotal == 0)
> assert(r.pythonVmemTotal == 0)
> assert(r.pythonRSSTotal == 0)
> {code}
> In the current implementation, computeAllMetrics will reset the allMetrics to 
> 0 when processing 22764 because 22764's proc file doesn't exist, but then it 
> will continue processing pid 22763, and update allMetrics to procfs metrics of 
> pid 22763.

[jira] [Updated] (SPARK-34845) ProcfsMetricsGetter.computeAllMetrics shouldn't return partial metrics when some of child pids metrics are missing

2021-03-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34845:

Description: 
When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum of a 
subset of child pids), instead of an all 0 result. This can be misleading and 
is undesired per the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

 

Also, a side effect of it is that it can lead to verbose warning logs if many 
pids' stat files are missing.
{code:java}
e.g. 2021-03-21 16:58:25,422 [pool-26-thread-8] WARN  
org.apache.spark.executor.ProcfsMetricsGetter  - There was a problem with 
reading the stat file of the process. java.io.FileNotFoundException: 
/proc/742/stat (No such file or directory) at 
java.io.FileInputStream.open0(Native Method) at 
java.io.FileInputStream.open(FileInputStream.java:195) at 
java.io.FileInputStream.<init>(FileInputStream.java:138) at 
org.apache.spark.executor.ProcfsMetricsGetter.openReader$1(ProcfsMetricsGetter.scala:203)
 at 
org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$addProcfsMetricsFromOneProcess$1(ProcfsMetricsGetter.scala:205)
 at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2540) at 
org.apache.spark.executor.ProcfsMetricsGetter.addProcfsMetricsFromOneProcess(ProcfsMetricsGetter.scala:205)
 at 
org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$computeAllMetrics$1(ProcfsMetricsGetter.scala:297)
{code}
The issue can be fixed by updating the flag isAvailable to false when one of 
the child pids' procfs metrics is unavailable. The other methods computePid, 
computePageSize, and getChildPids already behave this way.
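The flag-based variant can be sketched in Python (a minimal model under assumed names; the real fix touches the Scala isAvailable field of ProcfsMetricsGetter):

```python
def compute_all_metrics_flagged(pids, read_stat):
    """Flip an is_available flag on the first missing pid, as computePid,
    computePageSize, and getChildPids already do, and report all zeros.

    Breaking out of the loop on the first failure also keeps the warning
    logs concise when many stat files are missing.
    """
    is_available = True
    vmem_total = rss_total = 0
    for pid in pids:
        try:
            vmem, rss = read_stat(pid)  # raises IOError if the stat file is gone
        except IOError:
            is_available = False
            break  # early termination: stop reading further pids
        vmem_total += vmem
        rss_total += rss
    if not is_available:
        return (0, 0)
    return (vmem_total, rss_total)
```

Either variant (propagating the exception or flipping the flag) yields the same observable result: no partial sums and no cascade of per-pid warnings.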

  was:
When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum of a 
subset of child pids), instead of an all 0 result. This can be misleading and 
is undesired per the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

 

Also, a side effect of it is that it can lead to verbose warning logs if many 
pids' stat files are missing.
{noformat}
e.g. 2021-03-21 16:58:25,422 [pool-26-thread-8] WARN  
org.apache.spark.executor.ProcfsMetricsGetter  - There was a problem with 
reading the stat file of the process. java.io.FileNotFoundException: 
/proc/742/stat (No such file or directory) at 
java.io.FileInputStream.open0(Native Method) at 
java.io.FileInputStream.open(FileInputStream.java:195) at 
java.io.FileInputStream.<init>(FileInputStream.java:138) at 
org.apache.spark.executor.ProcfsMetricsGetter.openReader$1(ProcfsMetricsGetter.scala:203)
 at 
org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$addProcfsMetricsFromOneProcess$1(ProcfsMetricsGetter.scala:205)
 at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2540) at 
org.apache.spark.executor.ProcfsMetricsGetter.addProcfsMetricsFromOneProcess(ProcfsMetricsGetter.scala:205)
 at 
org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$computeAllMetrics$1(ProcfsMetricsGetter.scala:297){noformat}
The issue can be fixed by updating the flag isAvailable to false when one of 
the child pids' procfs metrics is unavailable. The other methods computePid, 
computePageSize, and getChildPids already behave this way.


> ProcfsMetricsGetter.computeAllMetrics shouldn't return partial metrics when 
> some of child pids metrics are missing
> --
>
> Key: SPARK-34845
> URL: https://issues.apache.org/jira/browse/SPARK-34845
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Baohe Zhang
>Priority: Major
>
> When the procfs metrics of some child pids are unavailable, 
> ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum of a 
> subset of child pids), instead of an all 0 result. This can be misleading and 
> is undesired per the current code comments in 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].
>  
> Also, a side effect of it is that it can lead to verbose warning logs if 
> many pids' stat files are missing.
> {code:java}
> e.g.2021-03-21 16:58:25,422 [pool-26-thread-8] WARN 

[jira] [Created] (SPARK-34845) ProcfsMetricsGetter.computeAllMetrics shouldn't return partial metrics when some of child pids metrics are missing

2021-03-23 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-34845:
---

 Summary: ProcfsMetricsGetter.computeAllMetrics shouldn't return 
partial metrics when some of child pids metrics are missing
 Key: SPARK-34845
 URL: https://issues.apache.org/jira/browse/SPARK-34845
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.1, 3.1.0, 3.0.2, 3.0.1, 3.0.0
Reporter: Baohe Zhang


When the procfs metrics of some child pids are unavailable, 
ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum of a 
subset of child pids), instead of an all 0 result. This can be misleading and 
is undesired per the current code comments in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

 

Also, a side effect of it is that it can lead to verbose warning logs if many 
pids' stat files are missing.
{noformat}
e.g. 2021-03-21 16:58:25,422 [pool-26-thread-8] WARN  
org.apache.spark.executor.ProcfsMetricsGetter  - There was a problem with 
reading the stat file of the process. java.io.FileNotFoundException: 
/proc/742/stat (No such file or directory) at 
java.io.FileInputStream.open0(Native Method) at 
java.io.FileInputStream.open(FileInputStream.java:195) at 
java.io.FileInputStream.<init>(FileInputStream.java:138) at 
org.apache.spark.executor.ProcfsMetricsGetter.openReader$1(ProcfsMetricsGetter.scala:203)
 at 
org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$addProcfsMetricsFromOneProcess$1(ProcfsMetricsGetter.scala:205)
 at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2540) at 
org.apache.spark.executor.ProcfsMetricsGetter.addProcfsMetricsFromOneProcess(ProcfsMetricsGetter.scala:205)
 at 
org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$computeAllMetrics$1(ProcfsMetricsGetter.scala:297){noformat}
The issue can be fixed by updating the flag isAvailable to false when one of 
the child pids' procfs metrics is unavailable. The other methods computePid, 
computePageSize, and getChildPids already behave this way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34779) ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs

2021-03-17 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-34779:
---

 Summary: ExecutorMetricsPoller should keep stage entry in stageTCMP 
until a heartbeat occurs
 Key: SPARK-34779
 URL: https://issues.apache.org/jira/browse/SPARK-34779
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.1, 3.1.0, 3.0.2, 3.0.1, 3.0.0
Reporter: Baohe Zhang


The current implementation of ExecutorMetricsPoller uses the task count in each 
stage to decide whether to keep a stage entry or not. When the executor has 
only 1 core, this may cause these issues:
 # Peak metrics missing (due to the stage entry being removed within a heartbeat 
interval)
 # Unnecessary and frequent hashmap entry removal and insertion.

Assuming an executor with 1 core has 2 tasks (task1 and task2, both belong to 
stage (0,0)) to execute in a heartbeat interval, the workflow in current 
ExecutorMetricsPoller implementation would be:

1. task1 start -> stage (0, 0) entry created in stageTCMP, task count 
increments to 1

2. 1st poll() -> update peak metrics of stage (0, 0)

3. task1 end -> stage (0, 0) task count decrements to 0, stage (0, 0) entry 
removed, peak metrics lost.

4. task2 start -> stage (0, 0) entry created in stageTCMP, task count 
increments to 1

5. 2nd poll() -> update peak metrics of stage (0, 0)

6. task2 end -> stage (0, 0) task count decrements to 0, stage (0, 0) entry 
removed, peak metrics lost.

7. heartbeat() -> empty or inaccurate peak metrics for stage (0, 0) reported.

We can fix the issue by keeping entries with task count = 0 in stageTCMP map 
until a heartbeat occurs. At the heartbeat, after reporting the peak metrics 
for each stage, we scan each stage in stageTCMP and remove entries with task 
count = 0.

After the fix, the workflow would be:

1. task1 start -> stage (0, 0) entry created in stageTCMP, task count 
increments to 1

2. 1st poll() -> update peak metrics of stage (0, 0)

3. task1 end -> stage (0, 0) task count decrements to 0, but the entry (0, 0) 
still remains.

4. task2 start -> task count of stage (0, 0) increments to 1

5. 2nd poll() -> update peak metrics of stage (0, 0)

6. task2 end -> stage (0, 0) task count decrements to 0, but the entry (0, 0) 
still remains.

7. heartbeat() -> accurate peak metrics for stage (0, 0) reported. Remove the 
entry for stage (0, 0) from stageTCMP because its task count is 0.

 

How to verify the behavior? 

Submit a job with a custom polling interval (e.g., 2s) and 
spark.executor.cores=1, and check the debug logs of ExecutorMetricsPoller.
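The proposed stageTCMP lifecycle can be modeled with a small Python sketch (hypothetical class and method names; the real change is in the Scala ExecutorMetricsPoller). Entries whose task count reaches 0 survive until the next heartbeat, so peaks polled between task end and heartbeat are not lost:

```python
class StagePollerSketch:
    """Minimal model of stageTCMP: (stage_id, attempt) -> [task_count, peak]."""
    def __init__(self):
        self.stage_tcmp = {}

    def task_start(self, stage):
        self.stage_tcmp.setdefault(stage, [0, 0])[0] += 1

    def task_end(self, stage):
        # Decrement the task count but keep the entry until a heartbeat.
        self.stage_tcmp[stage][0] -= 1

    def poll(self, metric):
        # Update the peak metric of every tracked stage.
        for entry in self.stage_tcmp.values():
            entry[1] = max(entry[1], metric)

    def heartbeat(self):
        # Report peaks for all stages, then prune entries with task count 0.
        report = {stage: peak for stage, (count, peak) in self.stage_tcmp.items()}
        self.stage_tcmp = {s: e for s, e in self.stage_tcmp.items() if e[0] > 0}
        return report
```

Running the two-task workflow above (task1 start, poll, task1 end, task2 start, poll, task2 end, heartbeat) reports an accurate peak for stage (0, 0) and only then removes its entry.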






[jira] [Commented] (SPARK-32924) Web UI sort on duration is wrong

2021-03-04 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295639#comment-17295639
 ] 

Baohe Zhang commented on SPARK-32924:
-

[~dongjoon] This is my Jira id.

> Web UI sort on duration is wrong
> 
>
> Key: SPARK-32924
> URL: https://issues.apache.org/jira/browse/SPARK-32924
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.6, 2.4.7, 3.0.2, 3.2.0, 3.1.1
>Reporter: t oo
>Priority: Major
> Fix For: 2.4.8, 3.2.0, 3.1.2, 3.0.3
>
> Attachments: ui_sort.png
>
>
> See attachment, 9 s(econds) is showing as larger than 8.1min






[jira] [Commented] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

2021-02-26 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291954#comment-17291954
 ] 

Baohe Zhang commented on SPARK-34545:
-

Simpler code to reproduce the error:
{code:python}
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>> 
>>> def udf1(data_type):
... def u1(e):
... return e[0]
... return udf(u1, data_type)
... 
>>> df = spark.createDataFrame(
...   [((1.0, 1.0), (1, 1))],
...['c1', 'c2'])
>>> 
>>> 
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>> 
>>> # Show the results
... df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+

>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
+---+

>>> df.select("c3", "c4").show()
+---++
| c3|  c4|
+---++
|1.0|null|
+---++
{code}

> PySpark Python UDF return inconsistent results when applying 2 UDFs with 
> different return type to 2 columns together
> 
>
> Key: SPARK-34545
> URL: https://issues.apache.org/jira/browse/SPARK-34545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Baohe Zhang
>Priority: Blocker
>
> Python UDF returns inconsistent results between evaluating 2 columns together 
> and evaluating one by one.
> The issue occurs after we upgraded to Spark 3, so it seems it doesn't exist 
> in Spark 2.
> How to reproduce it?
> {code:python}
> df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
> (1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
> "2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
> (3, "3")])], ['c1', 'c2'])
> from pyspark.sql.functions import udf
> from pyspark.sql.types import *
> def getLastElementWithTimeMaster(data_type):
> def getLastElementWithTime(list_elm):
> # x should be a list of (val, time)
> y = sorted(list_elm, key=lambda x: x[1]) # default is ascending
> return y[-1][0]
> return udf(getLastElementWithTime, data_type)
> # Add 2 columns which apply Python UDF
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
> # Show the results
> df.select("c3").show()
> df.select("c4").show()
> df.select("c3", "c4").show()
> {code}
> Results:
> {noformat}
> >>> df.select("c3").show()
> +---+ 
>   
> | c3|
> +---+
> |1.0|
> |2.0|
> |3.1|
> +---+
> >>> df.select("c4").show()
> +---+
> | c4|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> >>> df.select("c3", "c4").show()
> +---++
> | c3|  c4|
> +---++
> |1.0|null|
> |2.0|null|
> |3.1|   3|
> +---++
> {noformat}
> The test was done in branch-3.1 local mode.






[jira] [Commented] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

2021-02-26 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291896#comment-17291896
 ] 

Baohe Zhang commented on SPARK-34545:
-

This is a correctness bug, so I would like to raise the priority to blocker and 
draw more attention from the community.

> PySpark Python UDF return inconsistent results when applying 2 UDFs with 
> different return type to 2 columns together
> 
>
> Key: SPARK-34545
> URL: https://issues.apache.org/jira/browse/SPARK-34545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Baohe Zhang
>Priority: Blocker
>
> Python UDF returns inconsistent results between evaluating 2 columns together 
> and evaluating one by one.
> The issue occurs after we upgraded to Spark 3, so it seems it doesn't exist 
> in Spark 2.
> How to reproduce it?
> {code:python}
> df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
> (1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
> "2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
> (3, "3")])], ['c1', 'c2'])
> from pyspark.sql.functions import udf
> from pyspark.sql.types import *
> def getLastElementWithTimeMaster(data_type):
> def getLastElementWithTime(list_elm):
> # x should be a list of (val, time)
> y = sorted(list_elm, key=lambda x: x[1]) # default is ascending
> return y[-1][0]
> return udf(getLastElementWithTime, data_type)
> # Add 2 columns which apply Python UDF
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
> # Show the results
> df.select("c3").show()
> df.select("c4").show()
> df.select("c3", "c4").show()
> {code}
> Results:
> {noformat}
> >>> df.select("c3").show()
> +---+ 
>   
> | c3|
> +---+
> |1.0|
> |2.0|
> |3.1|
> +---+
> >>> df.select("c4").show()
> +---+
> | c4|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> >>> df.select("c3", "c4").show()
> +---++
> | c3|  c4|
> +---++
> |1.0|null|
> |2.0|null|
> |3.1|   3|
> +---++
> {noformat}
> The test was done in branch-3.1 local mode.






[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

2021-02-26 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34545:

Priority: Blocker  (was: Critical)

> PySpark Python UDF return inconsistent results when applying 2 UDFs with 
> different return type to 2 columns together
> 
>
> Key: SPARK-34545
> URL: https://issues.apache.org/jira/browse/SPARK-34545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Baohe Zhang
>Priority: Blocker
>
> Python UDF returns inconsistent results between evaluating 2 columns together 
> and evaluating one by one.
> The issue occurs after we upgraded to Spark 3, so it seems it doesn't exist 
> in Spark 2.
> How to reproduce it?
> {code:python}
> df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
> (1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
> "2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
> (3, "3")])], ['c1', 'c2'])
> from pyspark.sql.functions import udf
> from pyspark.sql.types import *
> def getLastElementWithTimeMaster(data_type):
> def getLastElementWithTime(list_elm):
> # x should be a list of (val, time)
> y = sorted(list_elm, key=lambda x: x[1]) # default is ascending
> return y[-1][0]
> return udf(getLastElementWithTime, data_type)
> # Add 2 columns which apply Python UDF
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
> # Show the results
> df.select("c3").show()
> df.select("c4").show()
> df.select("c3", "c4").show()
> {code}
> Results:
> {noformat}
> >>> df.select("c3").show()
> +---+ 
>   
> | c3|
> +---+
> |1.0|
> |2.0|
> |3.1|
> +---+
> >>> df.select("c4").show()
> +---+
> | c4|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> >>> df.select("c3", "c4").show()
> +---++
> | c3|  c4|
> +---++
> |1.0|null|
> |2.0|null|
> |3.1|   3|
> +---++
> {noformat}
> The test was done in branch-3.1 local mode.






[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

2021-02-25 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34545:

Summary: PySpark Python UDF return inconsistent results when applying 2 
UDFs with different return type to 2 columns together  (was: PySpark Python UDF 
return inconsistent results when applying UDFs to 2 columns together)

> PySpark Python UDF return inconsistent results when applying 2 UDFs with 
> different return type to 2 columns together
> 
>
> Key: SPARK-34545
> URL: https://issues.apache.org/jira/browse/SPARK-34545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Baohe Zhang
>Priority: Critical
>
> Python UDF returns inconsistent results between evaluating 2 columns together 
> and evaluating one by one.
> The issue occurs after we upgraded to Spark 3, so it seems it doesn't exist 
> in Spark 2.
> How to reproduce it?
> {code:python}
> df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
> (1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
> "2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
> (3, "3")])], ['c1', 'c2'])
> from pyspark.sql.functions import udf
> from pyspark.sql.types import *
> def getLastElementWithTimeMaster(data_type):
> def getLastElementWithTime(list_elm):
> # x should be a list of (val, time)
> y = sorted(list_elm, key=lambda x: x[1]) # default is ascending
> return y[-1][0]
> return udf(getLastElementWithTime, data_type)
> # Add 2 columns which apply Python UDF
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
> # Show the results
> df.select("c3").show()
> df.select("c4").show()
> df.select("c3", "c4").show()
> {code}
> Results:
> {noformat}
> >>> df.select("c3").show()
> +---+ 
>   
> | c3|
> +---+
> |1.0|
> |2.0|
> |3.1|
> +---+
> >>> df.select("c4").show()
> +---+
> | c4|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> >>> df.select("c3", "c4").show()
> +---++
> | c3|  c4|
> +---++
> |1.0|null|
> |2.0|null|
> |3.1|   3|
> +---++
> {noformat}
> The test was done in branch-3.1 local mode.






[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together

2021-02-25 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34545:

Description: 
Python UDF returns inconsistent results between evaluating 2 columns together 
and evaluating one by one.

The issue occurs after we upgraded to Spark 3, so it seems it doesn't exist in 
Spark 2.

How to reproduce it?

{code:python}
df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
(1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
"2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
(3, "3")])], ['c1', 'c2'])

from pyspark.sql.functions import udf
from pyspark.sql.types import *

def getLastElementWithTimeMaster(data_type):
def getLastElementWithTime(list_elm):
# x should be a list of (val, time)
y = sorted(list_elm, key=lambda x: x[1]) # default is ascending
return y[-1][0]
return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
{code}

Results:
{noformat}
>>> df.select("c3").show()
+---+   
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+
>>> df.select("c3", "c4").show()
+---++
| c3|  c4|
+---++
|1.0|null|
|2.0|null|
|3.1|   3|
+---++
{noformat}

The test was done in branch-3.1 local mode.


  was:
Python UDF returns inconsistent results between evaluating 2 columns together 
and evaluating one by one.

The issue occurs after we upgraded to Spark 3, so it seems it doesn't exist in 
Spark 2.

How to reproduce it?

{code:python}
df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
(1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
"2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
(3, "3")])], ['c1', 'c2'])

from pyspark.sql.functions import udf
from pyspark.sql.types import *

def getLastElementWithTimeMaster(data_type):
def getLastElementWithTime(list_elm):
"""x should be a list of (val, time), val can be a single element or a 
list
"""
y = sorted(list_elm, key=lambda x: x[1]) # default is ascending
return y[-1][0]
return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
{code}

Results:
{noformat}
>>> df.select("c3").show()
+---+   
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+
>>> df.select("c3", "c4").show()
+---++
| c3|  c4|
+---++
|1.0|null|
|2.0|null|
|3.1|   3|
+---++
{noformat}

The test was done in branch-3.1 local mode.



> PySpark Python UDF return inconsistent results when applying UDFs to 2 
> columns together
> ---
>
> Key: SPARK-34545
> URL: https://issues.apache.org/jira/browse/SPARK-34545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Baohe Zhang
>Priority: Critical
>
> A Python UDF returns inconsistent results between evaluating 2 columns together
> and evaluating them one by one.
> The issue occurs after we upgraded to Spark 3, so it seems not to exist in Spark 2.
> How to reproduce it?
> {code:python}
> df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
> (1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
> "2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
> (3, "3")])], ['c1', 'c2'])
> from pyspark.sql.functions import udf
> from pyspark.sql.types import *
> def getLastElementWithTimeMaster(data_type):
>     def getLastElementWithTime(list_elm):
>         # list_elm should be a list of (val, time) tuples
>         y = sorted(list_elm, key=lambda x: x[1])  # sorted() is ascending by default
>         return y[-1][0]
>     return udf(getLastElementWithTime, data_type)
> # Add 2 columns which apply the Python UDF
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
> # Show the results
> df.select("c3").show()
> df.select("c4").show()
> df.select("c3", "c4").show()
> 

[jira] [Created] (SPARK-34545) PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together

2021-02-25 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-34545:
---

 Summary: PySpark Python UDF return inconsistent results when 
applying UDFs to 2 columns together
 Key: SPARK-34545
 URL: https://issues.apache.org/jira/browse/SPARK-34545
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Baohe Zhang


A Python UDF returns inconsistent results between evaluating 2 columns together
and evaluating them one by one.

The issue occurs after we upgraded to Spark 3, so it seems not to exist in Spark 2.

How to reproduce it?

{code:python}
df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
(1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
"2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
(3, "3")])], ['c1', 'c2'])

from pyspark.sql.functions import udf
from pyspark.sql.types import *

def getLastElementWithTimeMaster(data_type):
    def getLastElementWithTime(list_elm):
        """list_elm should be a list of (val, time); val can be a single
        element or a list.
        """
        y = sorted(list_elm, key=lambda x: x[1])  # sorted() is ascending by default
        return y[-1][0]
    return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply the Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
{code}

Results:
{noformat}
>>> df.select("c3").show()
+---+   
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+
>>> df.select("c3", "c4").show()
+---++
| c3|  c4|
+---++
|1.0|{color:red}null{color}|
|2.0|{color:red}null{color}|
|3.1|   3|
+---++
{noformat}

The test was done in branch-3.1 local mode.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together

2021-02-25 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34545:

Priority: Critical  (was: Major)

> PySpark Python UDF return inconsistent results when applying UDFs to 2 
> columns together
> ---
>
> Key: SPARK-34545
> URL: https://issues.apache.org/jira/browse/SPARK-34545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Baohe Zhang
>Priority: Critical
>
> A Python UDF returns inconsistent results between evaluating 2 columns together
> and evaluating them one by one.
> The issue occurs after we upgraded to Spark 3, so it seems not to exist in Spark 2.
> How to reproduce it?
> {code:python}
> df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
> (1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
> "2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
> (3, "3")])], ['c1', 'c2'])
> from pyspark.sql.functions import udf
> from pyspark.sql.types import *
> def getLastElementWithTimeMaster(data_type):
>     def getLastElementWithTime(list_elm):
>         """list_elm should be a list of (val, time); val can be a single
>         element or a list.
>         """
>         y = sorted(list_elm, key=lambda x: x[1])  # sorted() is ascending by default
>         return y[-1][0]
>     return udf(getLastElementWithTime, data_type)
> # Add 2 columns which apply the Python UDF
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
> # Show the results
> df.select("c3").show()
> df.select("c4").show()
> df.select("c3", "c4").show()
> {code}
> Results:
> {noformat}
> >>> df.select("c3").show()
> +---+
> | c3|
> +---+
> |1.0|
> |2.0|
> |3.1|
> +---+
> >>> df.select("c4").show()
> +---+
> | c4|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> >>> df.select("c3", "c4").show()
> +---++
> | c3|  c4|
> +---++
> |1.0|null|
> |2.0|null|
> |3.1|   3|
> +---++
> {noformat}
> The test was done in branch-3.1 local mode.






[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together

2021-02-25 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34545:

Component/s: SQL

> PySpark Python UDF return inconsistent results when applying UDFs to 2 
> columns together
> ---
>
> Key: SPARK-34545
> URL: https://issues.apache.org/jira/browse/SPARK-34545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Baohe Zhang
>Priority: Major
>
> A Python UDF returns inconsistent results between evaluating 2 columns together
> and evaluating them one by one.
> The issue occurs after we upgraded to Spark 3, so it seems not to exist in Spark 2.
> How to reproduce it?
> {code:python}
> df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
> (1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
> "2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
> (3, "3")])], ['c1', 'c2'])
> from pyspark.sql.functions import udf
> from pyspark.sql.types import *
> def getLastElementWithTimeMaster(data_type):
>     def getLastElementWithTime(list_elm):
>         """list_elm should be a list of (val, time); val can be a single
>         element or a list.
>         """
>         y = sorted(list_elm, key=lambda x: x[1])  # sorted() is ascending by default
>         return y[-1][0]
>     return udf(getLastElementWithTime, data_type)
> # Add 2 columns which apply the Python UDF
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
> # Show the results
> df.select("c3").show()
> df.select("c4").show()
> df.select("c3", "c4").show()
> {code}
> Results:
> {noformat}
> >>> df.select("c3").show()
> +---+
> | c3|
> +---+
> |1.0|
> |2.0|
> |3.1|
> +---+
> >>> df.select("c4").show()
> +---+
> | c4|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> >>> df.select("c3", "c4").show()
> +---++
> | c3|  c4|
> +---++
> |1.0|null|
> |2.0|null|
> |3.1|   3|
> +---++
> {noformat}
> The test was done in branch-3.1 local mode.






[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together

2021-02-25 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34545:

Description: 
A Python UDF returns inconsistent results between evaluating 2 columns together
and evaluating them one by one.

The issue occurs after we upgraded to Spark 3, so it seems not to exist in Spark 2.

How to reproduce it?

{code:python}
df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
(1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
"2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
(3, "3")])], ['c1', 'c2'])

from pyspark.sql.functions import udf
from pyspark.sql.types import *

def getLastElementWithTimeMaster(data_type):
    def getLastElementWithTime(list_elm):
        """list_elm should be a list of (val, time); val can be a single
        element or a list.
        """
        y = sorted(list_elm, key=lambda x: x[1])  # sorted() is ascending by default
        return y[-1][0]
    return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply the Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
{code}

Results:
{noformat}
>>> df.select("c3").show()
+---+   
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+
>>> df.select("c3", "c4").show()
+---++
| c3|  c4|
+---++
|1.0|null|
|2.0|null|
|3.1|   3|
+---++
{noformat}

The test was done in branch-3.1 local mode.


  was:
A Python UDF returns inconsistent results between evaluating 2 columns together
and evaluating them one by one.

The issue occurs after we upgraded to Spark 3, so it seems not to exist in Spark 2.

How to reproduce it?

{code:python}
df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
(1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
"2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
(3, "3")])], ['c1', 'c2'])

from pyspark.sql.functions import udf
from pyspark.sql.types import *

def getLastElementWithTimeMaster(data_type):
    def getLastElementWithTime(list_elm):
        """list_elm should be a list of (val, time); val can be a single
        element or a list.
        """
        y = sorted(list_elm, key=lambda x: x[1])  # sorted() is ascending by default
        return y[-1][0]
    return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply the Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
{code}

Results:
{noformat}
>>> df.select("c3").show()
+---+   
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+
>>> df.select("c3", "c4").show()
+---++
| c3|  c4|
+---++
|1.0|{color:red}null{color}|
|2.0|{color:red}null{color}|
|3.1|   3|
+---++
{noformat}

The test was done in branch-3.1 local mode.



> PySpark Python UDF return inconsistent results when applying UDFs to 2 
> columns together
> ---
>
> Key: SPARK-34545
> URL: https://issues.apache.org/jira/browse/SPARK-34545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Baohe Zhang
>Priority: Major
>
> A Python UDF returns inconsistent results between evaluating 2 columns together
> and evaluating them one by one.
> The issue occurs after we upgraded to Spark 3, so it seems not to exist in Spark 2.
> How to reproduce it?
> {code:python}
> df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
> (1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
> "2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
> (3, "3")])], ['c1', 'c2'])
> from pyspark.sql.functions import udf
> from pyspark.sql.types import *
> def getLastElementWithTimeMaster(data_type):
>     def getLastElementWithTime(list_elm):
>         """list_elm should be a list of (val, time); val can be a single
>         element or a list.
>         """
>         y = sorted(list_elm, key=lambda x: x[1])  # sorted() is ascending by default
>         return y[-1][0]
>     return udf(getLastElementWithTime, data_type)
> # Add 2 columns which apply the Python UDF
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWit

[jira] [Commented] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance

2021-02-02 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277485#comment-17277485
 ] 

Baohe Zhang commented on SPARK-34336:
-

Full benchmark results are added as txt attachments.

> Use GenericData as Avro serialization data model can improve Avro write/read 
> performance
> 
>
> Key: SPARK-34336
> URL: https://issues.apache.org/jira/browse/SPARK-34336
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: base_read.txt, base_write.txt, generic_data_read.txt, 
> generic_data_write.txt, read_comparison.png, write_comparison.png
>
>
> We found that using "org.apache.avro.generic.GenericData" as Avro 
> serialization data model in Avro writer can significantly improve Avro write 
> performance and slightly improve Avro read performance.
> This optimization was originally put up by [~samkhan]  in this PR 
> https://github.com/apache/spark/pull/29354.
> We re-evaluated the change "Use GenericData instead of ReflectData when 
> writing Avro data" in that PR and verified it can provide performance 
> improvement in Avro write/read benchmarks.
> The base branch is today(2/2/21)'s branch-3.1.
> Besides current Avro read/write benchmarks, I also ran some extra benchmarks 
> for nested structs and arrays read/write, these benchmarks were put up in 
> this PR https://github.com/apache/spark/pull/29352 but haven't been merged.
> Benchmark results are added in the comment.






[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance

2021-02-02 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34336:

Attachment: generic_data_read.txt

> Use GenericData as Avro serialization data model can improve Avro write/read 
> performance
> 
>
> Key: SPARK-34336
> URL: https://issues.apache.org/jira/browse/SPARK-34336
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: base_read.txt, base_write.txt, generic_data_read.txt, 
> generic_data_write.txt, read_comparison.png, write_comparison.png
>
>
> We found that using "org.apache.avro.generic.GenericData" as Avro 
> serialization data model in Avro writer can significantly improve Avro write 
> performance and slightly improve Avro read performance.
> This optimization was originally put up by [~samkhan]  in this PR 
> https://github.com/apache/spark/pull/29354.
> We re-evaluated the change "Use GenericData instead of ReflectData when 
> writing Avro data" in that PR and verified it can provide performance 
> improvement in Avro write/read benchmarks.
> The base branch is today(2/2/21)'s branch-3.1.
> Besides current Avro read/write benchmarks, I also ran some extra benchmarks 
> for nested structs and arrays read/write, these benchmarks were put up in 
> this PR https://github.com/apache/spark/pull/29352 but haven't been merged.
> Benchmark results are added in the comment.






[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance

2021-02-02 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34336:

Attachment: base_read.txt

> Use GenericData as Avro serialization data model can improve Avro write/read 
> performance
> 
>
> Key: SPARK-34336
> URL: https://issues.apache.org/jira/browse/SPARK-34336
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: base_read.txt, base_write.txt, generic_data_read.txt, 
> generic_data_write.txt, read_comparison.png, write_comparison.png
>
>
> We found that using "org.apache.avro.generic.GenericData" as Avro 
> serialization data model in Avro writer can significantly improve Avro write 
> performance and slightly improve Avro read performance.
> This optimization was originally put up by [~samkhan]  in this PR 
> https://github.com/apache/spark/pull/29354.
> We re-evaluated the change "Use GenericData instead of ReflectData when 
> writing Avro data" in that PR and verified it can provide performance 
> improvement in Avro write/read benchmarks.
> The base branch is today(2/2/21)'s branch-3.1.
> Besides current Avro read/write benchmarks, I also ran some extra benchmarks 
> for nested structs and arrays read/write, these benchmarks were put up in 
> this PR https://github.com/apache/spark/pull/29352 but haven't been merged.
> Benchmark results are added in the comment.






[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance

2021-02-02 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34336:

Attachment: generic_data_write.txt

> Use GenericData as Avro serialization data model can improve Avro write/read 
> performance
> 
>
> Key: SPARK-34336
> URL: https://issues.apache.org/jira/browse/SPARK-34336
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: base_read.txt, base_write.txt, generic_data_read.txt, 
> generic_data_write.txt, read_comparison.png, write_comparison.png
>
>
> We found that using "org.apache.avro.generic.GenericData" as Avro 
> serialization data model in Avro writer can significantly improve Avro write 
> performance and slightly improve Avro read performance.
> This optimization was originally put up by [~samkhan]  in this PR 
> https://github.com/apache/spark/pull/29354.
> We re-evaluated the change "Use GenericData instead of ReflectData when 
> writing Avro data" in that PR and verified it can provide performance 
> improvement in Avro write/read benchmarks.
> The base branch is today(2/2/21)'s branch-3.1.
> Besides current Avro read/write benchmarks, I also ran some extra benchmarks 
> for nested structs and arrays read/write, these benchmarks were put up in 
> this PR https://github.com/apache/spark/pull/29352 but haven't been merged.
> Benchmark results are added in the comment.






[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance

2021-02-02 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34336:

Attachment: read_comparison.png

> Use GenericData as Avro serialization data model can improve Avro write/read 
> performance
> 
>
> Key: SPARK-34336
> URL: https://issues.apache.org/jira/browse/SPARK-34336
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: base_read.txt, base_write.txt, generic_data_read.txt, 
> generic_data_write.txt, read_comparison.png, write_comparison.png
>
>
> We found that using "org.apache.avro.generic.GenericData" as Avro 
> serialization data model in Avro writer can significantly improve Avro write 
> performance and slightly improve Avro read performance.
> This optimization was originally put up by [~samkhan]  in this PR 
> https://github.com/apache/spark/pull/29354.
> We re-evaluated the change "Use GenericData instead of ReflectData when 
> writing Avro data" in that PR and verified it can provide performance 
> improvement in Avro write/read benchmarks.
> The base branch is today(2/2/21)'s branch-3.1.
> Besides current Avro read/write benchmarks, I also ran some extra benchmarks 
> for nested structs and arrays read/write, these benchmarks were put up in 
> this PR https://github.com/apache/spark/pull/29352 but haven't been merged.
> Benchmark results are added in the comment.






[jira] [Commented] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance

2021-02-02 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277483#comment-17277483
 ] 

Baohe Zhang commented on SPARK-34336:
-

Column chart comparison on avg time:

Avro write:
 !write_comparison.png! 

Avro read:
 !read_comparison.png! 

> Use GenericData as Avro serialization data model can improve Avro write/read 
> performance
> 
>
> Key: SPARK-34336
> URL: https://issues.apache.org/jira/browse/SPARK-34336
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: read_comparison.png, write_comparison.png
>
>
> We found that using "org.apache.avro.generic.GenericData" as Avro 
> serialization data model in Avro writer can significantly improve Avro write 
> performance and slightly improve Avro read performance.
> This optimization was originally put up by [~samkhan]  in this PR 
> https://github.com/apache/spark/pull/29354.
> We re-evaluated the change "Use GenericData instead of ReflectData when 
> writing Avro data" in that PR and verified it can provide performance 
> improvement in Avro write/read benchmarks.
> The base branch is today(2/2/21)'s branch-3.1.
> Besides current Avro read/write benchmarks, I also ran some extra benchmarks 
> for nested structs and arrays read/write, these benchmarks were put up in 
> this PR https://github.com/apache/spark/pull/29352 but haven't been merged.
> Benchmark results are added in the comment.






[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance

2021-02-02 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34336:

Attachment: base_write.txt

> Use GenericData as Avro serialization data model can improve Avro write/read 
> performance
> 
>
> Key: SPARK-34336
> URL: https://issues.apache.org/jira/browse/SPARK-34336
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: base_read.txt, base_write.txt, generic_data_read.txt, 
> generic_data_write.txt, read_comparison.png, write_comparison.png
>
>
> We found that using "org.apache.avro.generic.GenericData" as Avro 
> serialization data model in Avro writer can significantly improve Avro write 
> performance and slightly improve Avro read performance.
> This optimization was originally put up by [~samkhan]  in this PR 
> https://github.com/apache/spark/pull/29354.
> We re-evaluated the change "Use GenericData instead of ReflectData when 
> writing Avro data" in that PR and verified it can provide performance 
> improvement in Avro write/read benchmarks.
> The base branch is today(2/2/21)'s branch-3.1.
> Besides current Avro read/write benchmarks, I also ran some extra benchmarks 
> for nested structs and arrays read/write, these benchmarks were put up in 
> this PR https://github.com/apache/spark/pull/29352 but haven't been merged.
> Benchmark results are added in the comment.






[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance

2021-02-02 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-34336:

Attachment: write_comparison.png

> Use GenericData as Avro serialization data model can improve Avro write/read 
> performance
> 
>
> Key: SPARK-34336
> URL: https://issues.apache.org/jira/browse/SPARK-34336
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 3.1.2
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: write_comparison.png
>
>
> We found that using "org.apache.avro.generic.GenericData" as Avro 
> serialization data model in Avro writer can significantly improve Avro write 
> performance and slightly improve Avro read performance.
> This optimization was originally put up by [~samkhan]  in this PR 
> https://github.com/apache/spark/pull/29354.
> We re-evaluated the change "Use GenericData instead of ReflectData when 
> writing Avro data" in that PR and verified it can provide performance 
> improvement in Avro write/read benchmarks.
> The base branch is today(2/2/21)'s branch-3.1.
> Besides current Avro read/write benchmarks, I also ran some extra benchmarks 
> for nested structs and arrays read/write, these benchmarks were put up in 
> this PR https://github.com/apache/spark/pull/29352 but haven't been merged.
> Benchmark results are added in the comment.






[jira] [Created] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance

2021-02-02 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-34336:
---

 Summary: Use GenericData as Avro serialization data model can 
improve Avro write/read performance
 Key: SPARK-34336
 URL: https://issues.apache.org/jira/browse/SPARK-34336
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, SQL
Affects Versions: 3.1.2
Reporter: Baohe Zhang


We found that using "org.apache.avro.generic.GenericData" as the Avro serialization
data model in the Avro writer can significantly improve Avro write performance and
slightly improve Avro read performance.

This optimization was originally put up by [~samkhan] in this PR:
https://github.com/apache/spark/pull/29354.

We re-evaluated the change "Use GenericData instead of ReflectData when writing
Avro data" from that PR and verified that it provides a performance improvement in
the Avro write/read benchmarks.

The base branch is today's (2/2/21) branch-3.1.

Besides the current Avro read/write benchmarks, I also ran some extra benchmarks
for nested-struct and array read/write; these benchmarks were put up in
https://github.com/apache/spark/pull/29352 but haven't been merged.

Benchmark results are added in the comment.







[jira] [Commented] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added

2020-12-28 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255685#comment-17255685
 ] 

Baohe Zhang commented on SPARK-33031:
-

New tasks won't be scheduled because the node is marked as blacklisted after 2 
executors on that node are blacklisted. The behavior seems correct if the 
experiment is done on a single node.
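The per-node exclusion rule described above can be modeled in a few lines. This is a hedged sketch, not Spark's internal API: `node_excluded` and `max_failed_per_node` are illustrative names, with the threshold defaulting to 2 as stated in this issue.

```python
def node_excluded(excluded_executors_on_node, max_failed_per_node=2):
    # A node is excluded once the number of excluded executors on it
    # reaches the per-node threshold (default 2). On a single-node
    # cluster, every executor lives on that one node, so the whole
    # cluster stops scheduling once the node is excluded.
    return excluded_executors_on_node >= max_failed_per_node

print(node_excluded(1))  # False
print(node_excluded(2))  # True
```

This is why the newly allocated executor never receives tasks in the single-node reproduction: it lands on the already-excluded node.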

> scheduler with blacklisting doesn't appear to pick up new executor added
> 
>
> Key: SPARK-33031
> URL: https://issues.apache.org/jira/browse/SPARK-33031
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a test with blacklisting  standalone mode and all the executors 
> were initially blacklisted.  Then one of the executors died and we got 
> allocated another one. The scheduler did not appear to pick up the new one 
> and try to schedule on it though.
> You can reproduce this by starting a master and slave on a single node, then 
> launch a shell like where you will get multiple executors (in this case I got 
> 3)
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }
> rdd.collect(){code}
>  
> Note that I tried both with and without dynamic allocation enabled.
>  
> You can see screen shot related on 
> https://issues.apache.org/jira/browse/SPARK-33029






[jira] [Commented] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted

2020-12-28 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255682#comment-17255682
 ] 

Baohe Zhang commented on SPARK-33029:
-

With the blacklist feature enabled, by default a node will be excluded once 2 
executors on that node have been excluded. In this case the node is excluded, 
and we mark all executors on that node as excluded. Since we are running 
standalone mode on a single node, the driver and all executors share the same 
hostname, so the driver gets marked as excluded in AppStatusListener when 
handling the "SparkListenerNodeExcludedForStage" event. We can fix it by 
filtering out the driver entity when handling this event, so the UI won't show 
the driver as excluded.
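The proposed filtering can be sketched like this. This is a hedged illustration of the idea, not the actual AppStatusListener code: Spark identifies the driver as "driver" (SparkContext.DRIVER_IDENTIFIER); the entry class and method names here are invented for the example.

```java
import java.util.*;

// Sketch of the fix: when a node is excluded for a stage, mark only real
// executors on that host, skipping the driver entity so the UI never
// shows the driver as blacklisted.
class NodeExclusionSketch {
    static final String DRIVER_IDENTIFIER = "driver";

    static class ExecutorEntry {
        final String id;
        final String host;
        boolean excluded = false;
        ExecutorEntry(String id, String host) { this.id = id; this.host = host; }
    }

    static void onNodeExcluded(String host, List<ExecutorEntry> executors) {
        for (ExecutorEntry e : executors) {
            // Filter out the driver entity before marking exclusions.
            if (e.host.equals(host) && !e.id.equals(DRIVER_IDENTIFIER)) {
                e.excluded = true;
            }
        }
    }
}
```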

> Standalone mode blacklist executors page UI marks driver as blacklisted
> ---
>
> Key: SPARK-33029
> URL: https://issues.apache.org/jira/browse/SPARK-33029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: Screen Shot 2020-09-29 at 1.52.09 PM.png, Screen Shot 
> 2020-09-29 at 1.53.37 PM.png
>
>
> I am running a spark shell on a 1 node standalone cluster.  I noticed that 
> the executors page ui was marking the driver as blacklisted for the stage 
> that is running.  Attached a screen shot.
> Also, in my case one of the executors died and it doesn't seem like the 
> scheduler picked up the new one.  It doesn't show up on the stages page and 
> just shows it as active but none of the tasks ran there.
>  
> You can reproduce this by starting a master and slave on a single node, then 
> launch a shell such that you get multiple executors (in this case I got 
> 3)
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
>  
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }{code}






[jira] [Commented] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset

2020-12-24 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254669#comment-17254669
 ] 

Baohe Zhang commented on SPARK-33906:
-

The deeper underlying reason seems to be that the stage completes within a 
heartbeat interval, so the heartbeat doesn't piggyback the executor peak memory 
metrics.
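The timing issue can be modeled with a small sketch. This is illustrative only, not Spark's code: ExecutorMetricsPoller tracks peaks per *live* stage, and the stage's entry is dropped from its internal map (stageTCMP) on stage completion, so a later reportHeartbeat sees an empty update.

```java
import java.util.*;

// Toy model of the race: if the stage completes before the next
// heartbeat fires, its metrics entry is already gone and the heartbeat
// has nothing to piggyback.
class MetricsPollerSketch {
    private final Map<Integer, Long> stagePeaks = new HashMap<>();

    void stageStart(int stageId, long peakBytes) { stagePeaks.put(stageId, peakBytes); }

    void stageComplete(int stageId) { stagePeaks.remove(stageId); }

    // What the heartbeat would piggyback: peaks for still-running stages only.
    Map<Integer, Long> getExecutorUpdates() { return new HashMap<>(stagePeaks); }
}
```

With a short job like `sc.makeRDD(1 to 10, 5).count()`, stageComplete runs before the heartbeat, getExecutorUpdates() returns an empty map, and peakMemoryMetrics never reaches the REST API or the UI.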

> SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
> -
>
> Key: SPARK-33906
> URL: https://issues.apache.org/jira/browse/SPARK-33906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Baohe Zhang
>Priority: Blocker
> Attachments: executor-page.png
>
>
> How to reproduce it?
> In mac OS standalone mode, open a spark-shell and run
> $SPARK_HOME/bin/spark-shell --master spark://localhost:7077
> {code:scala}
> val x = sc.makeRDD(1 to 10, 5)
> x.count()
> {code}
> Then open the app UI in the browser, and click the Executors page, will get 
> stuck at this page: 
>  !executor-page.png! 
> Also the return JSON of REST API endpoint 
> http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors 
> miss "peakMemoryMetrics" for executors.
> {noformat}
> [ {
>   "id" : "driver",
>   "hostPort" : "192.168.1.241:50042",
>   "isActive" : true,
>   "rddBlocks" : 0,
>   "memoryUsed" : 0,
>   "diskUsed" : 0,
>   "totalCores" : 0,
>   "maxTasks" : 0,
>   "activeTasks" : 0,
>   "failedTasks" : 0,
>   "completedTasks" : 0,
>   "totalTasks" : 0,
>   "totalDuration" : 0,
>   "totalGCTime" : 0,
>   "totalInputBytes" : 0,
>   "totalShuffleRead" : 0,
>   "totalShuffleWrite" : 0,
>   "isBlacklisted" : false,
>   "maxMemory" : 455501414,
>   "addTime" : "2020-12-24T19:44:18.033GMT",
>   "executorLogs" : { },
>   "memoryMetrics" : {
> "usedOnHeapStorageMemory" : 0,
> "usedOffHeapStorageMemory" : 0,
> "totalOnHeapStorageMemory" : 455501414,
> "totalOffHeapStorageMemory" : 0
>   },
>   "blacklistedInStages" : [ ],
>   "peakMemoryMetrics" : {
> "JVMHeapMemory" : 135021152,
> "JVMOffHeapMemory" : 149558576,
> "OnHeapExecutionMemory" : 0,
> "OffHeapExecutionMemory" : 0,
> "OnHeapStorageMemory" : 3301,
> "OffHeapStorageMemory" : 0,
> "OnHeapUnifiedMemory" : 3301,
> "OffHeapUnifiedMemory" : 0,
> "DirectPoolMemory" : 67963178,
> "MappedPoolMemory" : 0,
> "ProcessTreeJVMVMemory" : 0,
> "ProcessTreeJVMRSSMemory" : 0,
> "ProcessTreePythonVMemory" : 0,
> "ProcessTreePythonRSSMemory" : 0,
> "ProcessTreeOtherVMemory" : 0,
> "ProcessTreeOtherRSSMemory" : 0,
> "MinorGCCount" : 15,
> "MinorGCTime" : 101,
> "MajorGCCount" : 0,
> "MajorGCTime" : 0
>   },
>   "attributes" : { },
>   "resources" : { },
>   "resourceProfileId" : 0,
>   "isExcluded" : false,
>   "excludedInStages" : [ ]
> }, {
>   "id" : "0",
>   "hostPort" : "192.168.1.241:50054",
>   "isActive" : true,
>   "rddBlocks" : 0,
>   "memoryUsed" : 0,
>   "diskUsed" : 0,
>   "totalCores" : 12,
>   "maxTasks" : 12,
>   "activeTasks" : 0,
>   "failedTasks" : 0,
>   "completedTasks" : 5,
>   "totalTasks" : 5,
>   "totalDuration" : 2107,
>   "totalGCTime" : 25,
>   "totalInputBytes" : 0,
>   "totalShuffleRead" : 0,
>   "totalShuffleWrite" : 0,
>   "isBlacklisted" : false,
>   "maxMemory" : 455501414,
>   "addTime" : "2020-12-24T19:44:20.335GMT",
>   "executorLogs" : {
> "stdout" : 
> "http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stdout";,
> "stderr" : 
> "http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stderr";
>   },
>   "memoryMetrics" : {
> "usedOnHeapStorageMemory" : 0,
> "usedOffHeapStorageMemory" : 0,
> "totalOnHeapStorageMemory" : 455501414,
> "totalOffHeapStorageMemory" : 0
>   },
>   "blacklistedInStages" : [ ],
>   "attributes" : { },
>   "resources" : { },
>   "resourceProfileId" : 0,
>   "isExcluded" : false,
>   "excludedInStages" : [ ]
> } ]
> {noformat}
> I debugged it and observed that ExecutorMetricsPoller
> .getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to 
> None in 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345.
>  The possible reason for returning the empty map is that the stage completion 
> time is shorter than the heartbeat interval, so the stage entry in stageTCMP 
> has already been removed before the reportHeartbeat is called.
> How to fix it?
> Check if the peakMemoryMetrics is undefined in executorspage.js.




[jira] [Commented] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset

2020-12-24 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254668#comment-17254668
 ] 

Baohe Zhang commented on SPARK-33906:
-

[~dongjoon] Yes.

> SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
> -
>
> Key: SPARK-33906
> URL: https://issues.apache.org/jira/browse/SPARK-33906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Baohe Zhang
>Priority: Blocker
> Attachments: executor-page.png
>
>
> How to reproduce it?
> In mac OS standalone mode, open a spark-shell and run
> $SPARK_HOME/bin/spark-shell --master spark://localhost:7077
> {code:scala}
> val x = sc.makeRDD(1 to 10, 5)
> x.count()
> {code}
> Then open the app UI in the browser, and click the Executors page, will get 
> stuck at this page: 
>  !executor-page.png! 
> Also the return JSON of REST API endpoint 
> http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors 
> miss "peakMemoryMetrics" for executors.
> {noformat}
> [ {
>   "id" : "driver",
>   "hostPort" : "192.168.1.241:50042",
>   "isActive" : true,
>   "rddBlocks" : 0,
>   "memoryUsed" : 0,
>   "diskUsed" : 0,
>   "totalCores" : 0,
>   "maxTasks" : 0,
>   "activeTasks" : 0,
>   "failedTasks" : 0,
>   "completedTasks" : 0,
>   "totalTasks" : 0,
>   "totalDuration" : 0,
>   "totalGCTime" : 0,
>   "totalInputBytes" : 0,
>   "totalShuffleRead" : 0,
>   "totalShuffleWrite" : 0,
>   "isBlacklisted" : false,
>   "maxMemory" : 455501414,
>   "addTime" : "2020-12-24T19:44:18.033GMT",
>   "executorLogs" : { },
>   "memoryMetrics" : {
> "usedOnHeapStorageMemory" : 0,
> "usedOffHeapStorageMemory" : 0,
> "totalOnHeapStorageMemory" : 455501414,
> "totalOffHeapStorageMemory" : 0
>   },
>   "blacklistedInStages" : [ ],
>   "peakMemoryMetrics" : {
> "JVMHeapMemory" : 135021152,
> "JVMOffHeapMemory" : 149558576,
> "OnHeapExecutionMemory" : 0,
> "OffHeapExecutionMemory" : 0,
> "OnHeapStorageMemory" : 3301,
> "OffHeapStorageMemory" : 0,
> "OnHeapUnifiedMemory" : 3301,
> "OffHeapUnifiedMemory" : 0,
> "DirectPoolMemory" : 67963178,
> "MappedPoolMemory" : 0,
> "ProcessTreeJVMVMemory" : 0,
> "ProcessTreeJVMRSSMemory" : 0,
> "ProcessTreePythonVMemory" : 0,
> "ProcessTreePythonRSSMemory" : 0,
> "ProcessTreeOtherVMemory" : 0,
> "ProcessTreeOtherRSSMemory" : 0,
> "MinorGCCount" : 15,
> "MinorGCTime" : 101,
> "MajorGCCount" : 0,
> "MajorGCTime" : 0
>   },
>   "attributes" : { },
>   "resources" : { },
>   "resourceProfileId" : 0,
>   "isExcluded" : false,
>   "excludedInStages" : [ ]
> }, {
>   "id" : "0",
>   "hostPort" : "192.168.1.241:50054",
>   "isActive" : true,
>   "rddBlocks" : 0,
>   "memoryUsed" : 0,
>   "diskUsed" : 0,
>   "totalCores" : 12,
>   "maxTasks" : 12,
>   "activeTasks" : 0,
>   "failedTasks" : 0,
>   "completedTasks" : 5,
>   "totalTasks" : 5,
>   "totalDuration" : 2107,
>   "totalGCTime" : 25,
>   "totalInputBytes" : 0,
>   "totalShuffleRead" : 0,
>   "totalShuffleWrite" : 0,
>   "isBlacklisted" : false,
>   "maxMemory" : 455501414,
>   "addTime" : "2020-12-24T19:44:20.335GMT",
>   "executorLogs" : {
> "stdout" : 
> "http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stdout";,
> "stderr" : 
> "http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stderr";
>   },
>   "memoryMetrics" : {
> "usedOnHeapStorageMemory" : 0,
> "usedOffHeapStorageMemory" : 0,
> "totalOnHeapStorageMemory" : 455501414,
> "totalOffHeapStorageMemory" : 0
>   },
>   "blacklistedInStages" : [ ],
>   "attributes" : { },
>   "resources" : { },
>   "resourceProfileId" : 0,
>   "isExcluded" : false,
>   "excludedInStages" : [ ]
> } ]
> {noformat}
> I debugged it and observed that ExecutorMetricsPoller
> .getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to 
> None in 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345.
>  The possible reason for returning the empty map is that the stage completion 
> time is shorter than the heartbeat interval, so the stage entry in stageTCMP 
> has already been removed before the reportHeartbeat is called.
> How to fix it?
> Check if the peakMemoryMetrics is undefined in executorspage.js.






[jira] [Updated] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset

2020-12-24 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-33906:

Description: 
How to reproduce it?

In mac OS standalone mode, open a spark-shell and run

$SPARK_HOME/bin/spark-shell --master spark://localhost:7077
{code:scala}
val x = sc.makeRDD(1 to 10, 5)
x.count()
{code}
Then open the app UI in the browser and click the Executors page; it will get 
stuck at this page: 

 !executor-page.png! 

Also, the JSON returned by the REST API endpoint 
http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors 
is missing "peakMemoryMetrics" for the executors.
{noformat}
[ {
  "id" : "driver",
  "hostPort" : "192.168.1.241:50042",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 0,
  "maxTasks" : 0,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 0,
  "totalTasks" : 0,
  "totalDuration" : 0,
  "totalGCTime" : 0,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 455501414,
  "addTime" : "2020-12-24T19:44:18.033GMT",
  "executorLogs" : { },
  "memoryMetrics" : {
"usedOnHeapStorageMemory" : 0,
"usedOffHeapStorageMemory" : 0,
"totalOnHeapStorageMemory" : 455501414,
"totalOffHeapStorageMemory" : 0
  },
  "blacklistedInStages" : [ ],
  "peakMemoryMetrics" : {
"JVMHeapMemory" : 135021152,
"JVMOffHeapMemory" : 149558576,
"OnHeapExecutionMemory" : 0,
"OffHeapExecutionMemory" : 0,
"OnHeapStorageMemory" : 3301,
"OffHeapStorageMemory" : 0,
"OnHeapUnifiedMemory" : 3301,
"OffHeapUnifiedMemory" : 0,
"DirectPoolMemory" : 67963178,
"MappedPoolMemory" : 0,
"ProcessTreeJVMVMemory" : 0,
"ProcessTreeJVMRSSMemory" : 0,
"ProcessTreePythonVMemory" : 0,
"ProcessTreePythonRSSMemory" : 0,
"ProcessTreeOtherVMemory" : 0,
"ProcessTreeOtherRSSMemory" : 0,
"MinorGCCount" : 15,
"MinorGCTime" : 101,
"MajorGCCount" : 0,
"MajorGCTime" : 0
  },
  "attributes" : { },
  "resources" : { },
  "resourceProfileId" : 0,
  "isExcluded" : false,
  "excludedInStages" : [ ]
}, {
  "id" : "0",
  "hostPort" : "192.168.1.241:50054",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 12,
  "maxTasks" : 12,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 5,
  "totalTasks" : 5,
  "totalDuration" : 2107,
  "totalGCTime" : 25,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 455501414,
  "addTime" : "2020-12-24T19:44:20.335GMT",
  "executorLogs" : {
"stdout" : 
"http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stdout";,
"stderr" : 
"http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stderr";
  },
  "memoryMetrics" : {
"usedOnHeapStorageMemory" : 0,
"usedOffHeapStorageMemory" : 0,
"totalOnHeapStorageMemory" : 455501414,
"totalOffHeapStorageMemory" : 0
  },
  "blacklistedInStages" : [ ],
  "attributes" : { },
  "resources" : { },
  "resourceProfileId" : 0,
  "isExcluded" : false,
  "excludedInStages" : [ ]
} ]
{noformat}

I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates 
returns an empty map, which causes peakExecutorMetrics to be None in 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345.
The possible reason for returning the empty map is that the stage completion 
time is shorter than the heartbeat interval, so the stage entry in stageTCMP 
has already been removed before reportHeartbeat is called.

How to fix it?

Check if the peakMemoryMetrics is undefined in executorspage.js.
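The defensive idea behind that check can be sketched as follows. Note the real fix lives in executorspage.js (JavaScript, guarding an `undefined` field); this Java sketch only illustrates the same pattern of treating peakMemoryMetrics as optional, and the class name is invented for the example.

```java
import java.util.*;

// Sketch: treat a possibly-absent peakMemoryMetrics as optional and fall
// back to a default instead of failing, mirroring the undefined-check
// proposed for executorspage.js.
class ExecutorSummarySketch {
    final String id;
    final Optional<Map<String, Long>> peakMemoryMetrics;

    ExecutorSummarySketch(String id, Map<String, Long> peaks) {
        this.id = id;
        this.peakMemoryMetrics = Optional.ofNullable(peaks);
    }

    long jvmHeapMemory() {
        // Absent metrics render as 0 rather than breaking the page.
        return peakMemoryMetrics
            .map(m -> m.getOrDefault("JVMHeapMemory", 0L))
            .orElse(0L);
    }
}
```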

  was:
How to reproduce it?

In mac OS standalone mode, open a spark-shell and run

$SPARK_HOME/bin/spark-shell --master spark://localhost:7077
{code:scala}
val x = sc.makeRDD(1 to 10, 5)
x.count()
{code}
Then open the app UI in the browser, and click the Executors page, will get 
stuck at this page: 

 !executor-page.png! 

Also the return JSON of REST API endpoint 
http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors 
miss "peakMemoryMetrics" for executors.
{noformat}
[ {
  "id" : "driver",
  "hostPort" : "192.168.1.241:50042",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 0,
  "maxTasks" : 0,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 0,
  "totalTasks" : 0,
  "totalDuration" : 0,
  "totalGCTime" : 0,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 455501414,
  "addTime" : "2020-12-24T19:44:18.033GMT",
  "executorLogs" : { },
  "memoryMetrics" : {
"usedOnHeapStorageMemory" : 0,
"usedOffHeapStorageMemory" : 0,
"totalOnHeapStorageMemory" : 45550141

[jira] [Commented] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset

2020-12-24 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254647#comment-17254647
 ] 

Baohe Zhang commented on SPARK-33906:
-

I will put up a PR soon.

> SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
> -
>
> Key: SPARK-33906
> URL: https://issues.apache.org/jira/browse/SPARK-33906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: executor-page.png
>
>
> How to reproduce it?
> In mac OS standalone mode, open a spark-shell and run
> $SPARK_HOME/bin/spark-shell --master spark://localhost:7077
> {code:scala}
> val x = sc.makeRDD(1 to 10, 5)
> x.count()
> {code}
> Then open the app UI in the browser, and click the Executors page, will get 
> stuck at this page: 
>  !executor-page.png! 
> Also the return JSON of REST API endpoint 
> http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors 
> miss "peakMemoryMetrics" for executors.
> {noformat}
> [ {
>   "id" : "driver",
>   "hostPort" : "192.168.1.241:50042",
>   "isActive" : true,
>   "rddBlocks" : 0,
>   "memoryUsed" : 0,
>   "diskUsed" : 0,
>   "totalCores" : 0,
>   "maxTasks" : 0,
>   "activeTasks" : 0,
>   "failedTasks" : 0,
>   "completedTasks" : 0,
>   "totalTasks" : 0,
>   "totalDuration" : 0,
>   "totalGCTime" : 0,
>   "totalInputBytes" : 0,
>   "totalShuffleRead" : 0,
>   "totalShuffleWrite" : 0,
>   "isBlacklisted" : false,
>   "maxMemory" : 455501414,
>   "addTime" : "2020-12-24T19:44:18.033GMT",
>   "executorLogs" : { },
>   "memoryMetrics" : {
> "usedOnHeapStorageMemory" : 0,
> "usedOffHeapStorageMemory" : 0,
> "totalOnHeapStorageMemory" : 455501414,
> "totalOffHeapStorageMemory" : 0
>   },
>   "blacklistedInStages" : [ ],
>   "peakMemoryMetrics" : {
> "JVMHeapMemory" : 135021152,
> "JVMOffHeapMemory" : 149558576,
> "OnHeapExecutionMemory" : 0,
> "OffHeapExecutionMemory" : 0,
> "OnHeapStorageMemory" : 3301,
> "OffHeapStorageMemory" : 0,
> "OnHeapUnifiedMemory" : 3301,
> "OffHeapUnifiedMemory" : 0,
> "DirectPoolMemory" : 67963178,
> "MappedPoolMemory" : 0,
> "ProcessTreeJVMVMemory" : 0,
> "ProcessTreeJVMRSSMemory" : 0,
> "ProcessTreePythonVMemory" : 0,
> "ProcessTreePythonRSSMemory" : 0,
> "ProcessTreeOtherVMemory" : 0,
> "ProcessTreeOtherRSSMemory" : 0,
> "MinorGCCount" : 15,
> "MinorGCTime" : 101,
> "MajorGCCount" : 0,
> "MajorGCTime" : 0
>   },
>   "attributes" : { },
>   "resources" : { },
>   "resourceProfileId" : 0,
>   "isExcluded" : false,
>   "excludedInStages" : [ ]
> }, {
>   "id" : "0",
>   "hostPort" : "192.168.1.241:50054",
>   "isActive" : true,
>   "rddBlocks" : 0,
>   "memoryUsed" : 0,
>   "diskUsed" : 0,
>   "totalCores" : 12,
>   "maxTasks" : 12,
>   "activeTasks" : 0,
>   "failedTasks" : 0,
>   "completedTasks" : 5,
>   "totalTasks" : 5,
>   "totalDuration" : 2107,
>   "totalGCTime" : 25,
>   "totalInputBytes" : 0,
>   "totalShuffleRead" : 0,
>   "totalShuffleWrite" : 0,
>   "isBlacklisted" : false,
>   "maxMemory" : 455501414,
>   "addTime" : "2020-12-24T19:44:20.335GMT",
>   "executorLogs" : {
> "stdout" : 
> "http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stdout";,
> "stderr" : 
> "http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stderr";
>   },
>   "memoryMetrics" : {
> "usedOnHeapStorageMemory" : 0,
> "usedOffHeapStorageMemory" : 0,
> "totalOnHeapStorageMemory" : 455501414,
> "totalOffHeapStorageMemory" : 0
>   },
>   "blacklistedInStages" : [ ],
>   "attributes" : { },
>   "resources" : { },
>   "resourceProfileId" : 0,
>   "isExcluded" : false,
>   "excludedInStages" : [ ]
> } ]
> {noformat}
> I debugged it and observed that ExecutorMetricsPoller
> .getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to 
> None in 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345.
>  The possible reason for returning the empty map is that the stage completion 
> time is shorter than the heartbeat interval, so the stage entry in stageTCMP 
> has already been removed before the reportHeartbeat is called.






[jira] [Updated] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset

2020-12-24 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-33906:

Description: 
How to reproduce it?

In mac OS standalone mode, open a spark-shell and run

$SPARK_HOME/bin/spark-shell --master spark://localhost:7077
{code:scala}
val x = sc.makeRDD(1 to 10, 5)
x.count()
{code}
Then open the app UI in the browser, and click the Executors page, will get 
stuck at this page: 

 !executor-page.png! 

Also the return JSON of REST API endpoint 
http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors 
miss "peakMemoryMetrics" for executors.
{noformat}
[ {
  "id" : "driver",
  "hostPort" : "192.168.1.241:50042",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 0,
  "maxTasks" : 0,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 0,
  "totalTasks" : 0,
  "totalDuration" : 0,
  "totalGCTime" : 0,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 455501414,
  "addTime" : "2020-12-24T19:44:18.033GMT",
  "executorLogs" : { },
  "memoryMetrics" : {
"usedOnHeapStorageMemory" : 0,
"usedOffHeapStorageMemory" : 0,
"totalOnHeapStorageMemory" : 455501414,
"totalOffHeapStorageMemory" : 0
  },
  "blacklistedInStages" : [ ],
  "peakMemoryMetrics" : {
"JVMHeapMemory" : 135021152,
"JVMOffHeapMemory" : 149558576,
"OnHeapExecutionMemory" : 0,
"OffHeapExecutionMemory" : 0,
"OnHeapStorageMemory" : 3301,
"OffHeapStorageMemory" : 0,
"OnHeapUnifiedMemory" : 3301,
"OffHeapUnifiedMemory" : 0,
"DirectPoolMemory" : 67963178,
"MappedPoolMemory" : 0,
"ProcessTreeJVMVMemory" : 0,
"ProcessTreeJVMRSSMemory" : 0,
"ProcessTreePythonVMemory" : 0,
"ProcessTreePythonRSSMemory" : 0,
"ProcessTreeOtherVMemory" : 0,
"ProcessTreeOtherRSSMemory" : 0,
"MinorGCCount" : 15,
"MinorGCTime" : 101,
"MajorGCCount" : 0,
"MajorGCTime" : 0
  },
  "attributes" : { },
  "resources" : { },
  "resourceProfileId" : 0,
  "isExcluded" : false,
  "excludedInStages" : [ ]
}, {
  "id" : "0",
  "hostPort" : "192.168.1.241:50054",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 12,
  "maxTasks" : 12,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 5,
  "totalTasks" : 5,
  "totalDuration" : 2107,
  "totalGCTime" : 25,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 455501414,
  "addTime" : "2020-12-24T19:44:20.335GMT",
  "executorLogs" : {
"stdout" : 
"http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stdout";,
"stderr" : 
"http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stderr";
  },
  "memoryMetrics" : {
"usedOnHeapStorageMemory" : 0,
"usedOffHeapStorageMemory" : 0,
"totalOnHeapStorageMemory" : 455501414,
"totalOffHeapStorageMemory" : 0
  },
  "blacklistedInStages" : [ ],
  "attributes" : { },
  "resources" : { },
  "resourceProfileId" : 0,
  "isExcluded" : false,
  "excludedInStages" : [ ]
} ]
{noformat}

I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates 
returns an empty map, which causes peakExecutorMetrics to be None in 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345.
The possible reason for returning the empty map is that the stage completion 
time is shorter than the heartbeat interval, so the stage entry in stageTCMP 
has already been removed before reportHeartbeat is called.

  was:
How to reproduce it?

In mac OS standalone mode, open a spark-shell and run

$SPARK_HOME/bin/spark-shell --master spark://localhost:7077
{code:scala}
val x = sc.makeRDD(1 to 10, 5)
x.count()
{code}
Then open the app UI in the browser, and click the Executors page, will get 
stuck at this page: 

!image-2020-12-24-14-12-22-983.png!

Also the return JSON of REST API endpoint 
http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors 
miss "peakMemoryMetrics" for executors.
{noformat}
[ {
  "id" : "driver",
  "hostPort" : "192.168.1.241:50042",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 0,
  "maxTasks" : 0,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 0,
  "totalTasks" : 0,
  "totalDuration" : 0,
  "totalGCTime" : 0,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 455501414,
  "addTime" : "2020-12-24T19:44:18.033GMT",
  "executorLogs" : { },
  "memoryMetrics" : {
"usedOnHeapStorageMemory" : 0,
"usedOffHeapStorageMemory" : 0,
"totalOnHeapStorageMemory" : 455501414,
"totalOffHeapStorageMemory" : 0
  },
  "blacklistedInStages" 

[jira] [Updated] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset

2020-12-24 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-33906:

Attachment: executor-page.png

> SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
> -
>
> Key: SPARK-33906
> URL: https://issues.apache.org/jira/browse/SPARK-33906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Baohe Zhang
>Priority: Major
> Attachments: executor-page.png
>
>
> How to reproduce it?
> In mac OS standalone mode, open a spark-shell and run
> $SPARK_HOME/bin/spark-shell --master spark://localhost:7077
> {code:scala}
> val x = sc.makeRDD(1 to 10, 5)
> x.count()
> {code}
> Then open the app UI in the browser, and click the Executors page, will get 
> stuck at this page: 
> !image-2020-12-24-14-12-22-983.png!
> Also the return JSON of REST API endpoint 
> http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors 
> miss "peakMemoryMetrics" for executors.
> {noformat}
> [ {
>   "id" : "driver",
>   "hostPort" : "192.168.1.241:50042",
>   "isActive" : true,
>   "rddBlocks" : 0,
>   "memoryUsed" : 0,
>   "diskUsed" : 0,
>   "totalCores" : 0,
>   "maxTasks" : 0,
>   "activeTasks" : 0,
>   "failedTasks" : 0,
>   "completedTasks" : 0,
>   "totalTasks" : 0,
>   "totalDuration" : 0,
>   "totalGCTime" : 0,
>   "totalInputBytes" : 0,
>   "totalShuffleRead" : 0,
>   "totalShuffleWrite" : 0,
>   "isBlacklisted" : false,
>   "maxMemory" : 455501414,
>   "addTime" : "2020-12-24T19:44:18.033GMT",
>   "executorLogs" : { },
>   "memoryMetrics" : {
> "usedOnHeapStorageMemory" : 0,
> "usedOffHeapStorageMemory" : 0,
> "totalOnHeapStorageMemory" : 455501414,
> "totalOffHeapStorageMemory" : 0
>   },
>   "blacklistedInStages" : [ ],
>   "peakMemoryMetrics" : {
> "JVMHeapMemory" : 135021152,
> "JVMOffHeapMemory" : 149558576,
> "OnHeapExecutionMemory" : 0,
> "OffHeapExecutionMemory" : 0,
> "OnHeapStorageMemory" : 3301,
> "OffHeapStorageMemory" : 0,
> "OnHeapUnifiedMemory" : 3301,
> "OffHeapUnifiedMemory" : 0,
> "DirectPoolMemory" : 67963178,
> "MappedPoolMemory" : 0,
> "ProcessTreeJVMVMemory" : 0,
> "ProcessTreeJVMRSSMemory" : 0,
> "ProcessTreePythonVMemory" : 0,
> "ProcessTreePythonRSSMemory" : 0,
> "ProcessTreeOtherVMemory" : 0,
> "ProcessTreeOtherRSSMemory" : 0,
> "MinorGCCount" : 15,
> "MinorGCTime" : 101,
> "MajorGCCount" : 0,
> "MajorGCTime" : 0
>   },
>   "attributes" : { },
>   "resources" : { },
>   "resourceProfileId" : 0,
>   "isExcluded" : false,
>   "excludedInStages" : [ ]
> }, {
>   "id" : "0",
>   "hostPort" : "192.168.1.241:50054",
>   "isActive" : true,
>   "rddBlocks" : 0,
>   "memoryUsed" : 0,
>   "diskUsed" : 0,
>   "totalCores" : 12,
>   "maxTasks" : 12,
>   "activeTasks" : 0,
>   "failedTasks" : 0,
>   "completedTasks" : 5,
>   "totalTasks" : 5,
>   "totalDuration" : 2107,
>   "totalGCTime" : 25,
>   "totalInputBytes" : 0,
>   "totalShuffleRead" : 0,
>   "totalShuffleWrite" : 0,
>   "isBlacklisted" : false,
>   "maxMemory" : 455501414,
>   "addTime" : "2020-12-24T19:44:20.335GMT",
>   "executorLogs" : {
> "stdout" : 
> "http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stdout";,
> "stderr" : 
> "http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stderr";
>   },
>   "memoryMetrics" : {
> "usedOnHeapStorageMemory" : 0,
> "usedOffHeapStorageMemory" : 0,
> "totalOnHeapStorageMemory" : 455501414,
> "totalOffHeapStorageMemory" : 0
>   },
>   "blacklistedInStages" : [ ],
>   "attributes" : { },
>   "resources" : { },
>   "resourceProfileId" : 0,
>   "isExcluded" : false,
>   "excludedInStages" : [ ]
> } ]
> {noformat}
> I debugged it and observed that ExecutorMetricsPoller
> .getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to 
> None in 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345.
>  The possible reason for returning the empty map is that the stage completion 
> time is shorter than the heartbeat interval, so the stage entry in stageTCMP 
> has already been removed before the reportHeartbeat is called.






[jira] [Created] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset

2020-12-24 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-33906:
---

 Summary: SPARK UI Executors page stuck when 
ExecutorSummary.peakMemoryMetrics is unset
 Key: SPARK-33906
 URL: https://issues.apache.org/jira/browse/SPARK-33906
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.2.0
Reporter: Baohe Zhang


How to reproduce it?

In mac OS standalone mode, open a spark-shell and run

$SPARK_HOME/bin/spark-shell --master spark://localhost:7077
{code:scala}
val x = sc.makeRDD(1 to 10, 5)
x.count()
{code}
Then open the app UI in the browser and click the Executors page; it will get 
stuck at this page: 

!image-2020-12-24-14-12-22-983.png!

Also, the JSON returned by the REST API endpoint 
http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors 
is missing "peakMemoryMetrics" for the executors.
{noformat}
[ {
  "id" : "driver",
  "hostPort" : "192.168.1.241:50042",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 0,
  "maxTasks" : 0,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 0,
  "totalTasks" : 0,
  "totalDuration" : 0,
  "totalGCTime" : 0,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 455501414,
  "addTime" : "2020-12-24T19:44:18.033GMT",
  "executorLogs" : { },
  "memoryMetrics" : {
"usedOnHeapStorageMemory" : 0,
"usedOffHeapStorageMemory" : 0,
"totalOnHeapStorageMemory" : 455501414,
"totalOffHeapStorageMemory" : 0
  },
  "blacklistedInStages" : [ ],
  "peakMemoryMetrics" : {
"JVMHeapMemory" : 135021152,
"JVMOffHeapMemory" : 149558576,
"OnHeapExecutionMemory" : 0,
"OffHeapExecutionMemory" : 0,
"OnHeapStorageMemory" : 3301,
"OffHeapStorageMemory" : 0,
"OnHeapUnifiedMemory" : 3301,
"OffHeapUnifiedMemory" : 0,
"DirectPoolMemory" : 67963178,
"MappedPoolMemory" : 0,
"ProcessTreeJVMVMemory" : 0,
"ProcessTreeJVMRSSMemory" : 0,
"ProcessTreePythonVMemory" : 0,
"ProcessTreePythonRSSMemory" : 0,
"ProcessTreeOtherVMemory" : 0,
"ProcessTreeOtherRSSMemory" : 0,
"MinorGCCount" : 15,
"MinorGCTime" : 101,
"MajorGCCount" : 0,
"MajorGCTime" : 0
  },
  "attributes" : { },
  "resources" : { },
  "resourceProfileId" : 0,
  "isExcluded" : false,
  "excludedInStages" : [ ]
}, {
  "id" : "0",
  "hostPort" : "192.168.1.241:50054",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 12,
  "maxTasks" : 12,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 5,
  "totalTasks" : 5,
  "totalDuration" : 2107,
  "totalGCTime" : 25,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 455501414,
  "addTime" : "2020-12-24T19:44:20.335GMT",
  "executorLogs" : {
"stdout" : 
"http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stdout",
"stderr" : 
"http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stderr"
  },
  "memoryMetrics" : {
"usedOnHeapStorageMemory" : 0,
"usedOffHeapStorageMemory" : 0,
"totalOnHeapStorageMemory" : 455501414,
"totalOffHeapStorageMemory" : 0
  },
  "blacklistedInStages" : [ ],
  "attributes" : { },
  "resources" : { },
  "resourceProfileId" : 0,
  "isExcluded" : false,
  "excludedInStages" : [ ]
} ]
{noformat}

I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates 
returns an empty map, which causes peakExecutorMetrics to be None in 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345.
 The possible reason for returning the empty map is that the stage completion 
time is shorter than the heartbeat interval, so the stage entry in stageTCMP 
has already been removed before the reportHeartbeat is called.
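The race described above can be sketched in a few lines. This is a hedged, self-contained illustration: the class and map below are simplified stand-ins for Spark's ExecutorMetricsPoller and its stageTCMP map (names and types here are illustrative, not the real implementation). If every stage entry has already been removed by the time the heartbeat asks for updates, the update map is empty and the peak metrics come back absent:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for ExecutorMetricsPoller / stageTCMP (illustrative only).
class PollerSketch {
    // stage id -> running task count; the entry is dropped on stage completion.
    private final Map<String, Integer> stageTCMP = new ConcurrentHashMap<>();

    void onStageStart(String stageId) {
        stageTCMP.merge(stageId, 1, Integer::sum);
    }

    void onStageComplete(String stageId) {
        stageTCMP.remove(stageId);
    }

    // getExecutorUpdates analogue: nothing to report once stageTCMP is empty.
    Map<String, Integer> getExecutorUpdates() {
        return new HashMap<>(stageTCMP);
    }

    // LiveEntity analogue: peak metrics are absent when there are no updates,
    // which is what leaves "peakMemoryMetrics" out of the REST response.
    Optional<Map<String, Integer>> peakExecutorMetrics() {
        Map<String, Integer> updates = getExecutorUpdates();
        return updates.isEmpty() ? Optional.empty() : Optional.of(updates);
    }
}
```

So a stage that completes within one heartbeat interval leaves nothing for the heartbeat to report, exactly as observed.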






[jira] [Commented] (SPARK-26399) Add new stage-level REST APIs and parameters

2020-12-04 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244341#comment-17244341
 ] 

Baohe Zhang commented on SPARK-26399:
-

It seems the work of this ticket has already been done by 
https://issues.apache.org/jira/browse/SPARK-23431 and 
https://issues.apache.org/jira/browse/SPARK-32446.

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http://<history server>:18080/api/v1/applications/<app id>/<stage id>/<stage attempt>/executorSummary{code}
> Add parameters to the stages REST API to specify:
> * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
> * task metric quantiles, and adding the task summary if specified
> * executor metric quantiles, and adding the executor summary if specified






[jira] [Created] (SPARK-33215) Speed up event log download by skipping UI rebuild

2020-10-21 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-33215:
---

 Summary: Speed up event log download by skipping UI rebuild
 Key: SPARK-33215
 URL: https://issues.apache.org/jira/browse/SPARK-33215
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.1, 2.4.7
Reporter: Baohe Zhang


Right now, when we want to download event logs from the Spark history 
server (SHS), SHS needs to parse the entire event log to rebuild the UI, and 
this is done just for view permission checks. UI rebuilding is a time-consuming 
and memory-intensive task, especially for large logs, and this process is 
unnecessary for event log download.

This patch enables SHS to check UI view permissions of a given app/attempt for 
a given user, without rebuilding the UI. This is achieved by adding a method 
"checkUIViewPermissions(appId: String, attemptId: Option[String], user: 
String): Boolean" to many layers of history server components.

With this patch, the UI rebuild can be skipped when downloading event logs from 
the history server. Thus the time to download a GB-scale event log can be 
reduced from several minutes to several seconds, and the memory consumption of 
UI rebuilding is avoided.
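A minimal sketch of the permission check described above. Only the method name checkUIViewPermissions comes from the description; the provider class and the ACL representation below are hypothetical, meant to show how the check can be answered from lightweight per-application metadata instead of replaying the whole log:

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical provider: answers view-permission checks from per-application
// ACL metadata (read once from the log's ACL events) without rebuilding the UI.
class AclOnlyHistoryProvider {
    // appId -> set of users allowed to view that application.
    private final Map<String, Set<String>> viewAcls;

    AclOnlyHistoryProvider(Map<String, Set<String>> viewAcls) {
        this.viewAcls = viewAcls;
    }

    // Mirrors the signature sketched in the description above.
    boolean checkUIViewPermissions(String appId, Optional<String> attemptId, String user) {
        Set<String> allowed = viewAcls.get(appId);
        return allowed != null && allowed.contains(user);
    }
}
```

The point of the design is that this lookup is O(1) over small metadata, while the old path paid the full cost of replaying the event log just to learn the same answer.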






[jira] [Updated] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore

2020-07-17 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-32350:

Description: 
The idea is to improve the performance of HybridStore by adding batch write 
support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 
introduces HybridStore. HybridStore will write data to InMemoryStore at first 
and use a background thread to dump data to LevelDB once the writing to 
InMemoryStore is completed. In the comments section of 
[https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned 
using batch writing can improve the performance of this dumping process and he 
wrote the code of writeAll().

I compared the HybridStore switching time between one-by-one writes and batch 
writes on an HDD. When the disk is idle, batch writes give around a 25% 
improvement; when the disk is 100% busy, batch writes give a 7x - 10x 
improvement.

when the disk is at 0% utilization:

 
||log size, jobs and tasks per job||original switching time, with 
write()||switching time with writeAll()||
|133m, 400 jobs, 100 tasks per job|16s|13s|
|265m, 400 jobs, 200 tasks per job|30s|23s|
|1.3g, 1000 jobs, 400 tasks per job|136s|108s|

 

when the disk is at 100% utilization:
||log size, jobs and tasks per job||original switching time, with 
write()||switching time with writeAll()||
|133m, 400 jobs, 100 tasks per job|116s|17s|
|265m, 400 jobs, 200 tasks per job|251s|26s|

I also ran some write-related benchmarking tests in LevelDBBenchmark.java and 
measured the total time of writing 1024 objects.

when the disk is at 0% utilization:

 
||Benchmark test||with write(), ms||with writeAll(), ms ||
|randomUpdatesIndexed|213.060|157.356|
|randomUpdatesNoIndex|57.869|35.439|
|randomWritesIndexed|298.854|229.274|
|randomWritesNoIndex|66.764|38.361|
|sequentialUpdatesIndexed|87.019|56.219|
|sequentialUpdatesNoIndex|61.851|41.942|
|sequentialWritesIndexed|94.044|56.534|
|sequentialWritesNoIndex|118.345|66.483|

 

when the disk is at 50% utilization:
||Benchmark test||with write(), ms||with writeAll(), ms||
|randomUpdatesIndexed|230.386|180.817|
|randomUpdatesNoIndex|58.935|50.113|
|randomWritesIndexed|315.241|254.400|
|randomWritesNoIndex|96.709|41.164|
|sequentialUpdatesIndexed|89.971|70.387|
|sequentialUpdatesNoIndex|72.021|53.769|
|sequentialWritesIndexed|103.052|67.358|
|sequentialWritesNoIndex|76.194|99.037|

  was:
The idea is to improve the performance of HybridStore by adding batch write 
support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 
introduces HybridStore. HybridStore will write data to InMemoryStore at first 
and use a background thread to dump data to LevelDB once the writing to 
InMemoryStore is completed. In the comments section of 
[https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned 
using batch writing can improve the performance of this dumping process and he 
wrote the code of writeAll().

I did the comparison of the HybridStore switching time between one-by-one write 
and batch write on an HDD disk. When the disk is free, the batch-write has 
around 25% improvement, and when the disk is 100% busy, the batch-write has 7x 
- 10x improvement.

when the disk is at 0% utilization:

 
||log size, jobs and tasks per job||original switching time, with 
write()||switching time with writeAll()||
|133m, 400 jobs, 100 tasks per job|16s|13s|
|265m, 400 jobs, 200 tasks per job|30s|23s|
|1.3g, 1000 jobs, 400 tasks per job|136s|108s|

 

when the disk is at 100% utilization:
||log size, jobs and tasks per job||original switching time, with 
write()||switching time with writeAll()||
|133m, 400 jobs, 100 tasks per job|116s|17s|
|265m, 400 jobs, 200 tasks per job|251s|26s|

I also ran some write related benchmarking tests on LevelDBBenchmark.java and 
measured the total time of writing 1024 objects. The test was conducted when 
disk at 0% utilization.
||Benchmark test||with write(), ms||with writeAll(), ms||
|randomUpdatesIndexed|230.386|180.817|
|randomUpdatesNoIndex|58.935|50.113|
|randomWritesIndexed|315.241|254.400|
|randomWritesNoIndex|96.709|41.164|
|sequentialUpdatesIndexed|89.971|70.387|
|sequentialUpdatesNoIndex|72.021|53.769|
|sequentialWritesIndexed|103.052|67.358|
|sequentialWritesNoIndex|76.194|99.037|


> Add batch write support on LevelDB to improve performance of HybridStore
> 
>
> Key: SPARK-32350
> URL: https://issues.apache.org/jira/browse/SPARK-32350
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Baohe Zhang
>Priority: Major
>
> The idea is to improve the performance of HybridStore by adding batch write 
> support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 
> introduce

[jira] [Created] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore

2020-07-17 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-32350:
---

 Summary: Add batch write support on LevelDB to improve performance 
of HybridStore
 Key: SPARK-32350
 URL: https://issues.apache.org/jira/browse/SPARK-32350
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.1, 3.1.0
Reporter: Baohe Zhang


The idea is to improve the performance of HybridStore by adding batch write 
support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 
introduces HybridStore. HybridStore will write data to InMemoryStore at first 
and use a background thread to dump data to LevelDB once the writing to 
InMemoryStore is completed. In the comments section of 
[https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned 
using batch writing can improve the performance of this dumping process and he 
wrote the code of writeAll().

I compared the HybridStore switching time between one-by-one writes and batch 
writes on an HDD. When the disk is idle, batch writes give around a 25% 
improvement; when the disk is 100% busy, batch writes give a 7x - 10x 
improvement.

when the disk is at 0% utilization:

 
||log size, jobs and tasks per job||original switching time, with 
write()||switching time with writeAll()||
|133m, 400 jobs, 100 tasks per job|16s|13s|
|265m, 400 jobs, 200 tasks per job|30s|23s|
|1.3g, 1000 jobs, 400 tasks per job|136s|108s|

 

when the disk is at 100% utilization:
||log size, jobs and tasks per job||original switching time, with 
write()||switching time with writeAll()||
|133m, 400 jobs, 100 tasks per job|116s|17s|
|265m, 400 jobs, 200 tasks per job|251s|26s|

I also ran some write-related benchmarking tests in LevelDBBenchmark.java and 
measured the total time of writing 1024 objects. The test was conducted with 
the disk at 0% utilization.
||Benchmark test||with write(), ms||with writeAll(), ms||
|randomUpdatesIndexed|230.386|180.817|
|randomUpdatesNoIndex|58.935|50.113|
|randomWritesIndexed|315.241|254.400|
|randomWritesNoIndex|96.709|41.164|
|sequentialUpdatesIndexed|89.971|70.387|
|sequentialUpdatesNoIndex|72.021|53.769|
|sequentialWritesIndexed|103.052|67.358|
|sequentialWritesNoIndex|76.194|99.037|
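Why writeAll() helps can be illustrated with a toy store that counts commits. This is not LevelDB's API, just a hedged sketch of the amortization: with per-record write() every record pays a commit (expensive on a busy HDD), while a batch pays once:

```java
import java.util.ArrayList;
import java.util.List;

// Toy key-value sink: each call to write()/writeAll() models one disk commit.
class ToyStore {
    final List<String> data = new ArrayList<>();
    int commits = 0;

    // One-by-one writes: a commit per record.
    void write(String record) {
        data.add(record);
        commits++;
    }

    // Batched writes: one commit per batch, same data in the end.
    void writeAll(List<String> records) {
        data.addAll(records);
        commits++;
    }
}
```

Writing 1024 objects one by one costs 1024 commits in this model versus one for the batch, which matches why the gap widens dramatically when the disk is already saturated.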






[jira] [Created] (SPARK-31664) Race in YARN scheduler shutdown leads to uncaught SparkException "Could not find CoarseGrainedScheduler"

2020-05-08 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-31664:
---

 Summary: Race in YARN scheduler shutdown leads to uncaught 
SparkException "Could not find CoarseGrainedScheduler"
 Key: SPARK-31664
 URL: https://issues.apache.org/jira/browse/SPARK-31664
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 3.0.0, 3.0.1, 3.1.0
Reporter: Baohe Zhang


I used this command to run SparkPi on a YARN cluster with dynamicAllocation 
enabled: "$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster 
--class org.apache.spark.examples.SparkPi ./spark-examples.jar 1000", and 
received the error log below every time.

 
{code:java}
20/05/06 16:31:44 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:169)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:684)
at 
org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66)
at 
org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
20/05/06 16:31:45 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
20/05/06 16:31:45 INFO MemoryStore: MemoryStore cleared
20/05/06 16:31:45 INFO BlockManager

[jira] [Updated] (SPARK-31608) Add a hybrid KVStore to make UI loading faster

2020-04-30 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-31608:

Description: 
This is a follow-up for the work done by Hieu Huynh in 2019.

Add a new class HybridKVStore to make the history server faster when loading 
event files. When rebuilding the application state from event logs, 
HybridKVStore first writes data to an in-memory store and has a background 
thread that keeps pushing the changes to LevelDB.

I ran some tests on 3.0.1 on mac os:
||kvstore type / log size||100m||200m||500m||1g||2g||
|HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
leveldb|
|LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|
For example, when loading a 1g file, HybridKVStore takes 23s to parse (meaning 
users only need to wait 23s to see the UI), and the background thread then runs 
for another 17s to copy the data to LevelDB. After that, the in-memory store 
can be closed and the entire store lives in LevelDB. In general, this gives a 
3x - 4x UI loading speed improvement.

  was:
This is a follow-up for the work done by Hieu Huynh in 2019.

Add a new class HybridKVStore to make the history server faster when loading 
event files. When writing to this kvstore, it will first write to an in-memory 
store and having a background thread that keeps pushing the change to levelDB.

I ran some tests on 3.0.1 on mac os:
||kvstore type / log size||100m||200m||500m||1g||2g||
|HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
leveldb|
|LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|

 


> Add a hybrid KVStore to make UI loading faster
> --
>
> Key: SPARK-31608
> URL: https://issues.apache.org/jira/browse/SPARK-31608
> Project: Spark
>  Issue Type: Story
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Priority: Major
>
> This is a follow-up for the work done by Hieu Huynh in 2019.
> Add a new class HybridKVStore to make the history server faster when loading 
> event files. When rebuilding the application state from event logs, 
> HybridKVStore will first write data to an in-memory store and having a 
> background thread that keeps pushing the change to levelDB.
> I ran some tests on 3.0.1 on mac os:
> ||kvstore type / log size||100m||200m||500m||1g||2g||
> |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
> leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
> leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
> leveldb|
> |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|
> For example when loading a 1g file, HybridKVStore takes 23s to parse (that 
> means, users only need to wait for 23s to see the UI), the background thread 
> will still run 17s to copy data to leveldb. And after that, the in memory 
> store can be closed, the entire store now moves to leveldb. So in general, it 
> has 3x - 4x UI loading speed improvement.
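The hybrid pattern described above can be sketched as follows. This is a hedged stand-in, not Spark's HybridKVStore: two plain maps model the InMemoryStore and LevelDB sides, and a thread models the background dump that runs after parsing completes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified hybrid store: reads are served from memory right away while a
// background thread copies the finished data into the slow store.
class HybridSketch {
    final Map<String, String> memStore = new ConcurrentHashMap<>();  // fast, serves the UI
    final Map<String, String> diskStore = new ConcurrentHashMap<>(); // stand-in for LevelDB

    // Fast path used while parsing the event log.
    void write(String key, String value) {
        memStore.put(key, value);
    }

    // Called once parsing completes: dump everything to the slow store in the
    // background, then wait for the switch so memStore can be closed.
    void switchAndWait() {
        Thread dumper = new Thread(() -> diskStore.putAll(memStore));
        dumper.start();
        try {
            dumper.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The design trade-off is exactly the one in the numbers above: the user-visible latency is just the in-memory parse, and the LevelDB copy happens off the critical path.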






[jira] [Updated] (SPARK-31608) Add a hybrid KVStore to make UI loading faster

2020-04-29 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-31608:

Description: 
This is a follow-up for the work done by Hieu Huynh in 2019.

Add a new class HybridKVStore to make the history server faster when loading 
event files. When writing to this kvstore, it first writes to an in-memory 
store and has a background thread that keeps pushing the changes to LevelDB.

I ran some tests on 3.0.1 on mac os:
||kvstore type / log size||100m||200m||500m||1g||2g||
|HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
leveldb|
|LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|

 

  was:
Add a new class HybridKVStore to make the history server faster when loading 
event files. When writing to this kvstore, it will first write to an in-memory 
store and having a background thread that keeps pushing the change to levelDB.

I ran some tests on 3.0.1 on mac os:
||kvstore type / log size||100m||200m||500m||1g||2g||
|HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
leveldb|
|LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|

 


> Add a hybrid KVStore to make UI loading faster
> --
>
> Key: SPARK-31608
> URL: https://issues.apache.org/jira/browse/SPARK-31608
> Project: Spark
>  Issue Type: Story
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Priority: Major
>
> This is a follow-up for the work done by Hieu Huynh in 2019.
> Add a new class HybridKVStore to make the history server faster when loading 
> event files. When writing to this kvstore, it will first write to an 
> in-memory store and having a background thread that keeps pushing the change 
> to levelDB.
> I ran some tests on 3.0.1 on mac os:
> ||kvstore type / log size||100m||200m||500m||1g||2g||
> |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
> leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
> leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
> leveldb|
> |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|
>  






[jira] [Created] (SPARK-31608) Add a hybrid KVStore to make UI loading faster

2020-04-29 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-31608:
---

 Summary: Add a hybrid KVStore to make UI loading faster
 Key: SPARK-31608
 URL: https://issues.apache.org/jira/browse/SPARK-31608
 Project: Spark
  Issue Type: Story
  Components: Web UI
Affects Versions: 3.0.1
Reporter: Baohe Zhang


Add a new class HybridKVStore to make the history server faster when loading 
event files. When writing to this kvstore, it first writes to an in-memory 
store and has a background thread that keeps pushing the changes to LevelDB.

I ran some tests on 3.0.1 on mac os:
||kvstore type / log size||100m||200m||500m||1g||2g||
|HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
leveldb|
|LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|

 






[jira] [Updated] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore

2020-04-27 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-31584:

Attachment: errorstack.txt

> NullPointerException when parsing event log with InMemoryStore
> --
>
> Key: SPARK-31584
> URL: https://issues.apache.org/jira/browse/SPARK-31584
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Priority: Minor
> Fix For: 3.0.1
>
> Attachments: errorstack.txt
>
>
> I compiled with the current branch-3.0 source and tested it in mac os. A 
> java.lang.NullPointerException will be thrown when below conditions are met: 
>  # Using InMemoryStore as kvstore when parsing the event log file (e.g., when 
> spark.history.store.path is unset). 
>  # At least one stage in this event log has task number greater than 
> spark.ui.retainedTasks (by default is 10). In this case, kvstore needs to 
> delete extra task records.
>  # The job has more than one stage, so parentToChildrenMap in 
> InMemoryStore.java will have more than one key.
> The java.lang.NullPointerException is thrown in InMemoryStore.java :296. In 
> the method deleteParentIndex().
> {code:java}
> private void deleteParentIndex(Object key) {
>   if (hasNaturalParentIndex) {
> for (NaturalKeys v : parentToChildrenMap.values()) {
>   if (v.remove(asKey(key))) {
> // `v` can be empty after removing the natural key and we can 
> remove it from
> // `parentToChildrenMap`. However, `parentToChildrenMap` is a 
> ConcurrentMap and such
> // checking and deleting can be slow.
> // This method is to delete one object with certain key, let's 
> make it simple here.
> break;
>   }
> }
>   }
> }{code}
> In “if (v.remove(asKey(key)))”, if the key is not contained in v, 
> "v.remove(asKey(key))" returns null, and Java throws a NullPointerException 
> when auto-unboxing the null Boolean in the if condition.
> An exception stack trace is attached.
> This issue can be fixed by updating if statement to
> {code:java}
> if (v.remove(asKey(key)) != null){code}






[jira] [Created] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore

2020-04-27 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-31584:
---

 Summary: NullPointerException when parsing event log with 
InMemoryStore
 Key: SPARK-31584
 URL: https://issues.apache.org/jira/browse/SPARK-31584
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.1
Reporter: Baohe Zhang
 Fix For: 3.0.1


I compiled the current branch-3.0 source and tested it on macOS. A 
java.lang.NullPointerException will be thrown when the conditions below are met: 
 # InMemoryStore is used as the kvstore when parsing the event log file (e.g., 
when spark.history.store.path is unset). 
 # At least one stage in this event log has more tasks than 
spark.ui.retainedTasks (by default 10); in this case, the kvstore needs to 
delete extra task records.
 # The job has more than one stage, so parentToChildrenMap in 
InMemoryStore.java will have more than one key.

The java.lang.NullPointerException is thrown at InMemoryStore.java:296, in the 
method deleteParentIndex().
{code:java}
private void deleteParentIndex(Object key) {
  if (hasNaturalParentIndex) {
for (NaturalKeys v : parentToChildrenMap.values()) {
  if (v.remove(asKey(key))) {
// `v` can be empty after removing the natural key and we can 
remove it from
// `parentToChildrenMap`. However, `parentToChildrenMap` is a 
ConcurrentMap and such
// checking and deleting can be slow.
// This method is to delete one object with certain key, let's make 
it simple here.
break;
  }
}
  }
}{code}
In “if (v.remove(asKey(key)))”, if the key is not contained in v, 
"v.remove(asKey(key))" returns null, and Java throws a NullPointerException 
when auto-unboxing the null Boolean in the if condition.

An exception stack trace is attached.

This issue can be fixed by updating if statement to
{code:java}
if (v.remove(asKey(key)) != null){code}
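The failure mode is ordinary Boolean auto-unboxing: ConcurrentHashMap.remove(key) returns the mapped value or null, and using that result directly as an if-condition unboxes it. A minimal self-contained reproduction (the map and method names here are illustrative, not the InMemoryStore code):

```java
import java.util.concurrent.ConcurrentHashMap;

class RemoveCheck {
    // Buggy pattern: when key is absent, remove() returns null and the
    // if-condition auto-unboxes it, throwing NullPointerException.
    static boolean buggy(ConcurrentHashMap<String, Boolean> m, String key) {
        if (m.remove(key)) {
            return true;
        }
        return false;
    }

    // Fixed pattern, matching the proposed fix: compare against null explicitly.
    static boolean fixed(ConcurrentHashMap<String, Boolean> m, String key) {
        return m.remove(key) != null;
    }
}
```

The explicit null comparison avoids the unboxing entirely, so absent keys simply report "nothing removed" instead of crashing.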






[jira] [Commented] (SPARK-31380) Peak Execution Memory Quantile is not displayed in Spark History Server UI

2020-04-15 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084431#comment-17084431
 ] 

Baohe Zhang commented on SPARK-31380:
-

Did you run your application with Spark 3? I tested it in my local 3.0 build 
and saw that the Spark history server UI can display non-zero Peak Execution 
Memory. 

!image-2020-04-15-18-16-18-254.png!

> Peak Execution Memory Quantile is not displayed in Spark History Server UI
> --
>
> Key: SPARK-31380
> URL: https://issues.apache.org/jira/browse/SPARK-31380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Major
> Attachments: image-2020-04-15-18-16-18-254.png
>
>
> Peak Execution Memory Quantile is displayed in the regular Spark UI 
> correctly. If the same application is viewed in Spark History Server UI, Peak 
> Execution Memory is always displayed as zero.
> The Spark event log for the application seems to contain Peak Execution 
> Memory (under the tag "internal.metrics.peakExecutionMemory") correctly. 
> However, this is not reflected in the History Server UI.
> *Steps to produce non-zero Peak Execution Memory*
> spark.range(0, 20).map\{x => (x , x % 20)}.toDF("a", 
> "b").createOrReplaceTempView("fred")
> spark.range(0, 20).map\{x => (x , x + 1)}.toDF("a", 
> "b").createOrReplaceTempView("phil")
> sql("select p.*, f.* from phil p join fred f on f.b = p.b").count
>  






[jira] [Updated] (SPARK-31380) Peak Execution Memory Quantile is not displayed in Spark History Server UI

2020-04-15 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-31380:

Attachment: image-2020-04-15-18-16-18-254.png

> Peak Execution Memory Quantile is not displayed in Spark History Server UI
> --
>
> Key: SPARK-31380
> URL: https://issues.apache.org/jira/browse/SPARK-31380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Major
> Attachments: image-2020-04-15-18-16-18-254.png
>
>
> Peak Execution Memory Quantile is displayed in the regular Spark UI 
> correctly. If the same application is viewed in Spark History Server UI, Peak 
> Execution Memory is always displayed as zero.
> The Spark event log for the application seems to contain Peak Execution 
> Memory (under the tag "internal.metrics.peakExecutionMemory") correctly. 
> However, this is not reflected in the History Server UI.
> *Steps to produce non-zero Peak Execution Memory*
> spark.range(0, 20).map\{x => (x , x % 20)}.toDF("a", 
> "b").createOrReplaceTempView("fred")
> spark.range(0, 20).map\{x => (x , x + 1)}.toDF("a", 
> "b").createOrReplaceTempView("phil")
> sql("select p.*, f.* from phil p join fred f on f.b = p.b").count
>  


