[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler
[ https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-35865:

Description:
SPARK-24355 introduced syncMode to mitigate SASL timeouts by throttling the maximum number of threads sending responses to chunk fetch requests. But it causes severe performance degradation because it reduces the throughput of handling chunk fetch requests. SPARK-30623 made the async and sync modes configurable, with async as the default. SPARK-30512 uses a dedicated boss event loop to mitigate the SASL timeout issue, and we rarely see SASL timeouts with async mode in our production clusters today.

A few days ago we accidentally turned on sync mode on one cluster and observed severe shuffle performance degradation. As a result, we benchmarked async versus sync mode, and *we suggest removing sync mode from the code base* since it no longer appears to provide any benefit. We would like to share the benchmark results and hear the community's opinion.

Benchmark on job run time (sync mode is 2x - 3x slower):
- YARN cluster setup: 6 nodes, 18 executors; each executor has 1 core and 3 GB memory; each node manager has a 1 GB heap.
- Shuffle stage: 5 GB of shuffle data (400M key-value records), 1000 map tasks and 1000 reduce tasks.
- Results: for the 5 GB shuffle read, async mode takes 2-3 minutes and sync mode takes 6 minutes.

Benchmark on external shuffle service metrics:
- YARN cluster setup: 4 nodes in total; I set 2 nodes to async mode and 2 nodes to sync mode, shuffling 2.5 GB of data.
- Results: in openblockrequestslatencymillis_ratemean and some other metrics, the sync-mode nodes are 3x - 4x higher than the async-mode nodes. I attached screenshots of the metrics.

!openblock.png! !openblock-compare.png!

> Remove await (syncMode) in ChunkFetchRequestHandler
> ---
> Key: SPARK-35865
> URL: https://issues.apache.org/jira/browse/SPARK-35865
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.4.8, 3.1.2
> Reporter: Baohe Zhang
> Priority: Major
> Attachments: openblock-compare.png, openblock.png
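The throughput gap between the two modes can be illustrated outside of Spark and Netty. The following is a minimal Java sketch (not Spark's actual ChunkFetchRequestHandler code; names and timings are invented for illustration): in sync mode the handler thread awaits each response write before taking the next request, so a small throttled handler pool caps the number of in-flight responses; in async mode the handler registers a completion callback and returns immediately, letting writes overlap.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SyncVsAsyncSketch {
    // Emulates the I/O side writing one chunk response; each write takes ~50 ms.
    static CompletableFuture<Void> writeResponse(ExecutorService ioPool) {
        return CompletableFuture.runAsync(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }, ioPool);
    }

    /** Returns {syncMillis, asyncMillis, syncSent, asyncSent}. */
    static long[] run() throws InterruptedException {
        final int requests = 16;
        ExecutorService handlers = Executors.newFixedThreadPool(2); // throttled handler pool
        ExecutorService ioPool = Executors.newFixedThreadPool(requests);

        // Sync mode: each handler thread awaits its write before taking the next
        // request, so at most 2 responses are in flight: total ~ (16 / 2) * 50 ms.
        AtomicInteger syncSent = new AtomicInteger();
        CountDownLatch syncDone = new CountDownLatch(requests);
        long t0 = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            handlers.submit(() -> {
                writeResponse(ioPool).join(); // the "await" this ticket removes
                syncSent.incrementAndGet();
                syncDone.countDown();
            });
        }
        syncDone.await();
        long syncMs = (System.nanoTime() - t0) / 1_000_000;

        // Async mode: register a completion listener and return immediately,
        // so all 16 writes overlap: total ~ 50 ms.
        AtomicInteger asyncSent = new AtomicInteger();
        CountDownLatch asyncDone = new CountDownLatch(requests);
        t0 = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            handlers.submit(() ->
                writeResponse(ioPool).thenRun(() -> {
                    asyncSent.incrementAndGet();
                    asyncDone.countDown();
                }));
        }
        asyncDone.await();
        long asyncMs = (System.nanoTime() - t0) / 1_000_000;

        handlers.shutdown();
        ioPool.shutdown();
        return new long[] {syncMs, asyncMs, syncSent.get(), asyncSent.get()};
    }

    public static void main(String[] args) throws InterruptedException {
        long[] r = run();
        System.out.println("sync: " + r[0] + " ms, async: " + r[1] + " ms");
    }
}
```

The sketch is why removing the await does not drop responses: every write still completes, but handler threads are no longer pinned while it does.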
[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler
[ https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-35865: Attachment: openblock-compare.png

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler
Baohe Zhang created SPARK-35865:
Summary: Remove await (syncMode) in ChunkFetchRequestHandler
Key: SPARK-35865
URL: https://issues.apache.org/jira/browse/SPARK-35865
Project: Spark
Issue Type: Improvement
Components: Shuffle
Affects Versions: 3.1.2, 2.4.8
Reporter: Baohe Zhang
Attachments: openblock-compare.png, openblock.png
[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler
[ https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-35865: Attachment: openblock.png
[jira] [Created] (SPARK-35010) nestedSchemaPruning causes issue when reading hive generated Orc files
Baohe Zhang created SPARK-35010:
Summary: nestedSchemaPruning causes issue when reading hive generated Orc files
Key: SPARK-35010
URL: https://issues.apache.org/jira/browse/SPARK-35010
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Baohe Zhang

In Spark 3, spark.sql.orc.impl=native and spark.sql.optimizer.nestedSchemaPruning.enabled=true are the default settings, and together they cause issues when querying struct fields of Hive-generated ORC files. For example, we got an error when running this query in Spark 3:

{code:java}
spark.table("testtable").filter(col("utc_date") === "20210122").select(col("open_count.d35")).show(false)
{code}

The error is:

{code:java}
Caused by: java.lang.AssertionError: assertion failed: The given data schema struct>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
 at scala.Predef$.assert(Predef.scala:223)
 at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:153)
 at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
 at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
 at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:127)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
{code}

I think the reason is that we apply nestedSchemaPruning to the dataSchema: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L75]. nestedSchemaPruning not only prunes the unused fields of a struct, it also prunes unused columns. In my test, the dataSchema originally had 48 columns, but after nested schema pruning it was reduced to 1 column. This results in the assertion error at [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L159] because column pruning is not supported for Hive-generated ORC files. This issue also seems related to the Hive version: we use Hive 1.2, which doesn't include field names in the physical schema.
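The failing condition reduces to a small positional-matching check. Below is a hypothetical Java sketch (simplified; `canMatchByPosition` is an invented name, not Spark's OrcUtils API): when the ORC physical schema carries no field names, columns can only be matched by position, so the requested data schema must cover every physical column or the reader cannot tell which columns were dropped.

```java
import java.util.*;

public class SchemaPruningSketch {
    // With Hive 1.2 ORC files the physical schema has no field names, so columns
    // can only be matched positionally: the requested data schema must have at
    // least as many top-level fields as the physical schema.
    static boolean canMatchByPosition(List<String> dataSchema, int physicalColumnCount) {
        return dataSchema.size() >= physicalColumnCount;
    }

    public static void main(String[] args) {
        int physicalColumns = 48; // columns the table was written with
        List<String> full = new ArrayList<>();
        for (int i = 0; i < physicalColumns; i++) full.add("col" + i);

        // Untouched data schema: positional matching works.
        System.out.println(canMatchByPosition(full, physicalColumns));   // true

        // nestedSchemaPruning pruned whole columns, not just nested struct
        // fields, leaving a 1-column data schema -> the reader's assertion fails.
        List<String> pruned = List.of("open_count");
        System.out.println(canMatchByPosition(pruned, physicalColumns)); // false
    }
}
```

This is why pruning only the nested fields inside kept columns would be safe here, while dropping whole top-level columns is not.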
[jira] [Commented] (SPARK-34779) ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs
[ https://issues.apache.org/jira/browse/SPARK-34779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312498#comment-17312498 ] Baohe Zhang commented on SPARK-34779: Thanks for pointing it out! I wasn't aware that task peak metrics contribute to the executor peak metrics.

> ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs
> ---
> Key: SPARK-34779
> URL: https://issues.apache.org/jira/browse/SPARK-34779
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
> Reporter: Baohe Zhang
> Priority: Major
>
> The current implementation of ExecutorMetricsPoller uses the task count in each stage to decide whether to keep a stage entry. When the executor has only 1 core, this can cause:
> # Missing peak metrics (the stage entry is removed within a heartbeat interval)
> # Unnecessary and frequent hashmap entry removal and insertion
> Assuming an executor with 1 core runs 2 tasks (task1 and task2, both in stage (0, 0)) within one heartbeat interval, the workflow in the current ExecutorMetricsPoller implementation is:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count increments to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decrements to 0, entry removed, peak metrics lost
> 4. task2 start -> stage (0, 0) entry created in stageTCMP, task count increments to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decrements to 0, entry removed, peak metrics lost
> 7. heartbeat() -> empty or inaccurate peak metrics reported for stage (0, 0)
> We can fix the issue by keeping entries with task count = 0 in the stageTCMP map until a heartbeat occurs. At the heartbeat, after reporting the peak metrics for each stage, we scan stageTCMP and remove entries with task count = 0.
> After the fix, the workflow would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count increments to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decrements to 0, but the entry remains
> 4. task2 start -> task count of stage (0, 0) increments to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decrements to 0, but the entry remains
> 7. heartbeat() -> accurate peak metrics reported for stage (0, 0); the entry is then removed because its task count is 0
>
> How to verify the behavior? Submit a job with a custom polling interval (e.g., 2s) and spark.executor.cores=1, and check the debug logs of ExecutorMetricsPoller.
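The proposed fix can be sketched with a plain map (a hypothetical simplification, not the real ExecutorMetricsPoller, and one peak value standing in for the full metric set): task end only decrements the count, and zero-count entries are purged at heartbeat time, after their peaks have been reported.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class StageTCMPSketch {
    static final class Entry {
        final AtomicLong taskCount = new AtomicLong();
        final AtomicLong peakMetric = new AtomicLong();
    }

    final Map<String, Entry> stageTCMP = new ConcurrentHashMap<>();

    void onTaskStart(String stage) {
        stageTCMP.computeIfAbsent(stage, s -> new Entry()).taskCount.incrementAndGet();
    }

    void onTaskEnd(String stage) {
        // Fix: decrement only; do NOT remove the entry here, or peaks polled since
        // the last heartbeat would be lost on a 1-core executor.
        Entry e = stageTCMP.get(stage);
        if (e != null) e.taskCount.decrementAndGet();
    }

    void poll(String stage, long observed) {
        Entry e = stageTCMP.get(stage);
        if (e != null) e.peakMetric.accumulateAndGet(observed, Math::max);
    }

    // Report peaks first, then drop entries whose task count reached 0.
    Map<String, Long> heartbeat() {
        Map<String, Long> report = new HashMap<>();
        for (Map.Entry<String, Entry> kv : stageTCMP.entrySet()) {
            report.put(kv.getKey(), kv.getValue().peakMetric.get());
        }
        stageTCMP.entrySet().removeIf(kv -> kv.getValue().taskCount.get() == 0);
        return report;
    }

    public static void main(String[] args) {
        StageTCMPSketch p = new StageTCMPSketch();
        // Two back-to-back tasks of stage (0, 0) on a 1-core executor:
        p.onTaskStart("0,0"); p.poll("0,0", 100); p.onTaskEnd("0,0");
        p.onTaskStart("0,0"); p.poll("0,0", 80);  p.onTaskEnd("0,0");
        System.out.println(p.heartbeat());   // peak 100 survives both task ends
        System.out.println(p.stageTCMP.size()); // entry removed after reporting
    }
}
```

Running the 1-core scenario from the description against this sketch shows the peak from the first poll surviving both task ends and reaching the heartbeat.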
[jira] [Updated] (SPARK-34845) ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of child pids metrics are missing
[ https://issues.apache.org/jira/browse/SPARK-34845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34845:

Description:
When the procfs metrics of some child pids are unavailable, ProcfsMetricsGetter.computeAllMetrics() may return partial metrics (the sum over a subset of the child pids) instead of an all-zero result. This can be misleading and is undesired per the current code comments in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

How to reproduce it? This unit test is largely self-explanatory:

{code:java}
val p = new ProcfsMetricsGetter(getTestResourcePath("ProcfsMetrics"))
val mockedP = spy(p)
// proc file of pid 22764 doesn't exist, so partial metrics shouldn't be returned
var ptree = Set(26109, 22764, 22763)
when(mockedP.computeProcessTree).thenReturn(ptree)
var r = mockedP.computeAllMetrics
assert(r.jvmVmemTotal == 0)
assert(r.jvmRSSTotal == 0)
assert(r.pythonVmemTotal == 0)
assert(r.pythonRSSTotal == 0)
{code}

In the current implementation, computeAllMetrics resets allMetrics to 0 when processing 22764 because 22764's proc file doesn't exist, but it then continues with pid 22763 and updates allMetrics to the procfs metrics of pid 22763 alone. A side effect of this bug is verbose warning logs when many pids' stat files are missing; terminating early makes the logs more concise.

How to solve it? The issue can be fixed by letting the IOException propagate to computeAllMetrics(), so that computeAllMetrics knows at least one child pid's procfs metrics are missing and terminates the metrics reporting.

was: When the procfs metrics of some child pids are unavailable, ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum of a subset of child pids) instead of an all-zero result. This can be misleading and is undesired per the current code comments in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214]. A side effect is that it can lead to verbose warning logs if many pids' stat files are missing, e.g.:

{code:java}
2021-03-21 16:58:25,422 [pool-26-thread-8] WARN org.apache.spark.executor.ProcfsMetricsGetter - There was a problem with reading the stat file of the process.
java.io.FileNotFoundException: /proc/742/stat (No such file or directory)
 at java.io.FileInputStream.open0(Native Method)
 at java.io.FileInputStream.open(FileInputStream.java:195)
 at java.io.FileInputStream.(FileInputStream.java:138)
 at org.apache.spark.executor.ProcfsMetricsGetter.openReader$1(ProcfsMetricsGetter.scala:203)
 at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$addProcfsMetricsFromOneProcess$1(ProcfsMetricsGetter.scala:205)
 at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2540)
 at org.apache.spark.executor.ProcfsMetricsGetter.addProcfsMetricsFromOneProcess(ProcfsMetricsGetter.scala:205)
 at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$computeAllMetrics$1(ProcfsMetricsGetter.scala:297)
{code}

The issue can be fixed by updating the flag isAvailable to false when one of the child pids' procfs metrics is unavailable. The other methods computePid, computePageSize, and getChildPids already behave this way.

Summary: ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of child pids metrics are missing (was: ProcfsMetricsGetter.computeAllMetrics shouldn't return partial metrics when some of child pids metrics are missing)

> ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of child pids metrics are missing
> ---
> Key: SPARK-34845
> URL: https://issues.apache.org/jira/browse/SPARK-34845
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
> Reporter: Baohe Zhang
> Priority: Major
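The proposed "all or zeros" behavior can be sketched in a few lines of Java (hypothetical names; the real logic lives in ProcfsMetricsGetter and reads /proc/<pid>/stat fields rather than a single number): the per-process reader propagates IOException instead of swallowing it, so the aggregator stops at the first missing child and reports all zeros rather than a partial sum.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class ProcfsSketch {
    // Reads one process's metric value; a missing stat file propagates as IOException.
    static long readOneProcess(Path procDir, int pid) throws IOException {
        Path stat = procDir.resolve(pid + "/stat");
        return Long.parseLong(Files.readString(stat).trim());
    }

    // Returns either the sum over ALL child pids, or zero -- never a partial sum.
    static long computeAllMetrics(Path procDir, List<Integer> pids) {
        long total = 0;
        try {
            for (int pid : pids) total += readOneProcess(procDir, pid);
        } catch (IOException e) {
            // At least one child's metrics are missing: terminate the reporting
            // early (also avoiding a warning per remaining missing pid).
            return 0;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("procfs");
        Files.createDirectories(dir.resolve("22763"));
        Files.writeString(dir.resolve("22763/stat"), "1024");
        // pid 22764 has no stat file, mirroring the unit test in the description
        System.out.println(computeAllMetrics(dir, List.of(22763)));        // 1024
        System.out.println(computeAllMetrics(dir, List.of(22764, 22763))); // 0, not 1024
    }
}
```

The key contrast with the buggy version is ordering-independence: whether the missing pid is processed first or last, the result is zero, never the metrics of whichever pids happened to be readable.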
[jira] [Updated] (SPARK-34845) ProcfsMetricsGetter.computeAllMetrics shouldn't return partial metrics when some of child pids metrics are missing
[ https://issues.apache.org/jira/browse/SPARK-34845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34845:

Description: When the procfs metrics of some child pids are unavailable, ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum of a subset of the child pids) instead of an all-zero result. This is misleading and undesired per the current code comments in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214]. A side effect is verbose warning logs when many pids' stat files are missing, e.g.:
{noformat}
2021-03-21 16:58:25,422 [pool-26-thread-8] WARN org.apache.spark.executor.ProcfsMetricsGetter - There was a problem with reading the stat file of the process.
java.io.FileNotFoundException: /proc/742/stat (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.spark.executor.ProcfsMetricsGetter.openReader$1(ProcfsMetricsGetter.scala:203)
    at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$addProcfsMetricsFromOneProcess$1(ProcfsMetricsGetter.scala:205)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2540)
    at org.apache.spark.executor.ProcfsMetricsGetter.addProcfsMetricsFromOneProcess(ProcfsMetricsGetter.scala:205)
    at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$computeAllMetrics$1(ProcfsMetricsGetter.scala:297)
{noformat}
The issue can be fixed by setting the flag isAvailable to false when any child pid's procfs metrics are unavailable. The other methods computePid, computePageSize, and getChildPids already behave this way.
[jira] [Created] (SPARK-34845) ProcfsMetricsGetter.computeAllMetrics shouldn't return partial metrics when some of child pids metrics are missing
Baohe Zhang created SPARK-34845:
-----------------------------------

             Summary: ProcfsMetricsGetter.computeAllMetrics shouldn't return partial metrics when some of child pids metrics are missing
                 Key: SPARK-34845
                 URL: https://issues.apache.org/jira/browse/SPARK-34845
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.1, 3.1.0, 3.0.2, 3.0.1, 3.0.0
            Reporter: Baohe Zhang

When the procfs metrics of some child pids are unavailable, ProcfsMetricsGetter.computeAllMetrics() returns partial metrics (the sum of a subset of the child pids) instead of an all-zero result. This is misleading and undesired per the current code comments in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214]. A side effect is verbose warning logs when many pids' stat files are missing, e.g.:
{noformat}
2021-03-21 16:58:25,422 [pool-26-thread-8] WARN org.apache.spark.executor.ProcfsMetricsGetter - There was a problem with reading the stat file of the process.
java.io.FileNotFoundException: /proc/742/stat (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.spark.executor.ProcfsMetricsGetter.openReader$1(ProcfsMetricsGetter.scala:203)
    at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$addProcfsMetricsFromOneProcess$1(ProcfsMetricsGetter.scala:205)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2540)
    at org.apache.spark.executor.ProcfsMetricsGetter.addProcfsMetricsFromOneProcess(ProcfsMetricsGetter.scala:205)
    at org.apache.spark.executor.ProcfsMetricsGetter.$anonfun$computeAllMetrics$1(ProcfsMetricsGetter.scala:297)
{noformat}
The issue can be fixed by setting the flag isAvailable to false when any child pid's procfs metrics are unavailable. The other methods computePid, computePageSize, and getChildPids already behave this way.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
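The flag-based fix described in SPARK-34845 above (mirroring how computePid, computePageSize, and getChildPids already disable the getter on failure) can be sketched as a simplified Python model. Class and parameter names here are illustrative, not Spark's actual API:

```python
class ProcfsMetricsGetterModel:
    """Simplified model: isAvailable gates all metrics reporting; once any
    child pid's procfs data is missing, the getter reports zeros from then on."""

    def __init__(self, stat_readers):
        # stat_readers: dict of pid -> callable returning (vmem, rss),
        # raising OSError if the pid's stat file is unreadable.
        self.stat_readers = stat_readers
        self.is_available = True

    def compute_all_metrics(self, pids):
        if not self.is_available:
            return (0, 0)
        vmem_total = rss_total = 0
        for pid in pids:
            try:
                vmem, rss = self.stat_readers[pid]()
            except (KeyError, OSError):
                # A child pid's stat file is missing: mark procfs metrics
                # unavailable and report zeros, never a partial sum.
                self.is_available = False
                return (0, 0)
            vmem_total += vmem
            rss_total += rss
        return (vmem_total, rss_total)
```

Compared with only throwing an IOException, flipping the flag also prevents repeated warning-log storms on subsequent polls, since later calls short-circuit.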
[jira] [Created] (SPARK-34779) ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs
Baohe Zhang created SPARK-34779:
-----------------------------------

             Summary: ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs
                 Key: SPARK-34779
                 URL: https://issues.apache.org/jira/browse/SPARK-34779
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.1.1, 3.1.0, 3.0.2, 3.0.1, 3.0.0
            Reporter: Baohe Zhang

The current implementation of ExecutorMetricsPoller uses the task count in each stage to decide whether to keep a stage entry. When an executor has only 1 core, this can cause two issues:
# Missing peak metrics (a stage entry can be removed within a heartbeat interval)
# Unnecessary, frequent hashmap entry removal and insertion

Assuming an executor with 1 core executes 2 tasks (task1 and task2, both belonging to stage (0, 0)) within one heartbeat interval, the workflow in the current ExecutorMetricsPoller implementation is:
1. task1 start -> stage (0, 0) entry created in stageTCMP, task count incremented to 1
2. 1st poll() -> peak metrics of stage (0, 0) updated
3. task1 end -> stage (0, 0) task count decremented to 0, stage (0, 0) entry removed, peak metrics lost
4. task2 start -> stage (0, 0) entry recreated in stageTCMP, task count incremented to 1
5. 2nd poll() -> peak metrics of stage (0, 0) updated
6. task2 end -> stage (0, 0) task count decremented to 0, stage (0, 0) entry removed, peak metrics lost
7. heartbeat() -> empty or inaccurate peak metrics reported for stage (0, 0)

We can fix the issue by keeping entries with task count = 0 in the stageTCMP map until a heartbeat occurs. At the heartbeat, after reporting the peak metrics for each stage, we scan stageTCMP and remove entries with task count = 0. After the fix, the workflow is:
1. task1 start -> stage (0, 0) entry created in stageTCMP, task count incremented to 1
2. 1st poll() -> peak metrics of stage (0, 0) updated
3. task1 end -> stage (0, 0) task count decremented to 0, but the entry remains
4. task2 start -> task count of stage (0, 0) incremented to 1
5. 2nd poll() -> peak metrics of stage (0, 0) updated
6. task2 end -> stage (0, 0) task count decremented to 0, but the entry remains
7. heartbeat() -> accurate peak metrics reported for stage (0, 0); the entry for stage (0, 0) is then removed because its task count is 0

How to verify the behavior? Submit a job with a custom polling interval (e.g., 2s) and spark.executor.cores=1, and check the debug logs of ExecutorMetricsPoller.
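The deferred-removal bookkeeping described in SPARK-34779 above can be sketched as a small Python model (hypothetical; the real ExecutorMetricsPoller is Scala and also records per-stage peak executor metrics alongside the task count):

```python
class StageTaskCountModel:
    """Model of stageTCMP: keep a stage entry alive at task count 0 until the
    next heartbeat, so peak metrics gathered by poll() are not lost."""

    def __init__(self):
        self.stage_tcmp = {}  # (stage_id, attempt) -> running task count

    def on_task_start(self, stage):
        self.stage_tcmp[stage] = self.stage_tcmp.get(stage, 0) + 1

    def on_task_end(self, stage):
        # Decrement but do NOT remove the entry here; removal is deferred to
        # the heartbeat, so the entry survives start/end churn between polls.
        self.stage_tcmp[stage] -= 1

    def on_heartbeat(self):
        # Report every known stage, then drop entries with no running tasks.
        reported = list(self.stage_tcmp)
        self.stage_tcmp = {s: c for s, c in self.stage_tcmp.items() if c > 0}
        return reported
```

Replaying the 1-core scenario from the description (two tasks of stage (0, 0) finishing between heartbeats) against this model shows the entry surviving until the heartbeat and being removed only afterwards.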
[jira] [Commented] (SPARK-32924) Web UI sort on duration is wrong
[ https://issues.apache.org/jira/browse/SPARK-32924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295639#comment-17295639 ] Baohe Zhang commented on SPARK-32924: - [~dongjoon] This is my Jira id. > Web UI sort on duration is wrong > > > Key: SPARK-32924 > URL: https://issues.apache.org/jira/browse/SPARK-32924 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.6, 2.4.7, 3.0.2, 3.2.0, 3.1.1 >Reporter: t oo >Priority: Major > Fix For: 2.4.8, 3.2.0, 3.1.2, 3.0.3 > > Attachments: ui_sort.png > > > See attachment, 9 s(econds) is showing as larger than 8.1min
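The SPARK-32924 symptom ("9 s" sorting above "8.1 min") is consistent with comparing the rendered duration strings character by character instead of by numeric value. A sketch of the difference (the parse_duration_seconds helper is hypothetical, not Spark's actual UI code):

```python
def parse_duration_seconds(text):
    # Hypothetical helper: convert a rendered duration like "9 s" or
    # "8.1 min" back to seconds so durations compare numerically.
    value, unit = text.split()
    factors = {"ms": 0.001, "s": 1.0, "min": 60.0, "h": 3600.0}
    return float(value) * factors[unit]

durations = ["9 s", "8.1 min", "45 ms"]
# Lexicographic order compares characters, so "9 s" > "8.1 min" ('9' > '8'):
lexicographic = sorted(durations)
# Numeric order uses the actual duration in seconds:
numeric = sorted(durations, key=parse_duration_seconds)
```

A duration-aware sort key like this (or sorting on a hidden numeric column) is the usual remedy for tables that render human-readable units.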
[jira] [Commented] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together
[ https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291954#comment-17291954 ] Baohe Zhang commented on SPARK-34545:

A simpler code to reproduce the error:
{code:python}
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>>
>>> def udf1(data_type):
...     def u1(e):
...         return e[0]
...     return udf(u1, data_type)
...
>>> df = spark.createDataFrame(
...     [((1.0, 1.0), (1, 1))],
...     ['c1', 'c2'])
>>>
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>>
>>> # Show the results
... df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
+---+
>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
+---+----+
{code}
[jira] [Commented] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together
[ https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291896#comment-17291896 ] Baohe Zhang commented on SPARK-34545:

This is a correctness bug, so I would like to raise the priority to blocker and draw more attention from the community.
[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together
[ https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34545:

Priority: Blocker (was: Critical)
[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together
[ https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34545:

Summary: PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together (was: PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together)
[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together
[ https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34545:

Description: Python UDF returns inconsistent results between evaluating 2 columns together and evaluating them one by one. The issue started occurring after we upgraded to Spark 3, so it doesn't seem to exist in Spark 2.

How to reproduce it?
{code:python}
df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]),
                            ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]),
                            ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")])],
                           ['c1', 'c2'])
from pyspark.sql.functions import udf
from pyspark.sql.types import *

def getLastElementWithTimeMaster(data_type):
    def getLastElementWithTime(list_elm):
        # x should be a list of (val, time)
        y = sorted(list_elm, key=lambda x: x[1])  # default is ascending
        return y[-1][0]
    return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply the Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
{code}
Results:
{noformat}
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+
>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
|2.0|null|
|3.1|   3|
+---+----+
{noformat}
The test was done in branch-3.1 local mode.
[jira] [Created] (SPARK-34545) PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together
Baohe Zhang created SPARK-34545:
-----------------------------------

             Summary: PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together
                 Key: SPARK-34545
                 URL: https://issues.apache.org/jira/browse/SPARK-34545
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.0.0
            Reporter: Baohe Zhang

Python UDF returns inconsistent results between evaluating 2 columns together and evaluating them one by one. The issue started occurring after we upgraded to Spark 3, so it doesn't seem to exist in Spark 2.

How to reproduce it?
{code:python}
df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]),
                            ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]),
                            ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")])],
                           ['c1', 'c2'])
from pyspark.sql.functions import udf
from pyspark.sql.types import *

def getLastElementWithTimeMaster(data_type):
    def getLastElementWithTime(list_elm):
        """x should be a list of (val, time), val can be a single element or a list
        """
        y = sorted(list_elm, key=lambda x: x[1])  # default is ascending
        return y[-1][0]
    return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply the Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
{code}
Results:
{noformat}
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+
>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|{color:red}null{color}|
|2.0|{color:red}null{color}|
|3.1|   3|
+---+----+
{noformat}
The test was done in branch-3.1 local mode.
[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together
[ https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34545: Priority: Critical (was: Major)
[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together
[ https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34545: Component/s: SQL
[jira] [Updated] (SPARK-34545) PySpark Python UDF return inconsistent results when applying UDFs to 2 columns together
[ https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34545: Description: minor formatting edits; the content is otherwise unchanged from the original report above.
[jira] [Commented] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277485#comment-17277485 ] Baohe Zhang commented on SPARK-34336: - Full benchmark results are added as txt attachments.
[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34336: Attachment: generic_data_read.txt
[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34336: Attachment: base_read.txt
[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34336: Attachment: generic_data_write.txt
[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34336: Attachment: read_comparison.png
[jira] [Commented] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277483#comment-17277483 ] Baohe Zhang commented on SPARK-34336: - Column chart comparison on avg time: Avro write: !write_comparison.png! Avro read: !read_comparison.png!
[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34336: Attachment: base_write.txt
[jira] [Updated] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-34336: Attachment: write_comparison.png
[jira] [Created] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
Baohe Zhang created SPARK-34336: --- Summary: Use GenericData as Avro serialization data model can improve Avro write/read performance Key: SPARK-34336 URL: https://issues.apache.org/jira/browse/SPARK-34336 Project: Spark Issue Type: Improvement Components: Input/Output, SQL Affects Versions: 3.1.2 Reporter: Baohe Zhang

We found that using "org.apache.avro.generic.GenericData" as the Avro serialization data model in the Avro writer can significantly improve Avro write performance and slightly improve Avro read performance.

This optimization was originally put up by [~samkhan] in the PR https://github.com/apache/spark/pull/29354. We re-evaluated the change "Use GenericData instead of ReflectData when writing Avro data" from that PR and verified that it improves the Avro write/read benchmarks. The base branch is today's (2/2/21) branch-3.1.

Besides the current Avro read/write benchmarks, I also ran some extra benchmarks for nested struct and array read/write; these benchmarks were put up in the PR https://github.com/apache/spark/pull/29352 but haven't been merged. Benchmark results are added in the comment.
[jira] [Commented] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added
[ https://issues.apache.org/jira/browse/SPARK-33031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255685#comment-17255685 ] Baohe Zhang commented on SPARK-33031: - New tasks won't be scheduled because the node is marked as blacklisted after 2 executors on that node are blacklisted. The behavior seems correct if the experiment is done on a single node.
> scheduler with blacklisting doesn't appear to pick up new executor added
> ---
>
> Key: SPARK-33031
> URL: https://issues.apache.org/jira/browse/SPARK-33031
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Thomas Graves
> Priority: Critical
>
> I was running a test with blacklisting in standalone mode, and all the executors were initially blacklisted. Then one of the executors died and we got allocated another one. The scheduler did not appear to pick up the new one and try to schedule on it, though.
> You can reproduce this by starting a master and slave on a single node, then launching a shell as below, where you will get multiple executors (in this case I got 3):
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 --conf spark.blacklist.enabled=true
>
> From the shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>   val context = TaskContext.get()
>   if (context.attemptNumber() < 2) {
>     throw new Exception("test attempt num")
>   }
>   it
> }
> rdd.collect(){code}
>
> Note that I tried both with and without dynamic allocation enabled.
>
> You can see the related screenshots on https://issues.apache.org/jira/browse/SPARK-33029
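The node-level escalation described in the comment can be modeled in a few lines. This is a hypothetical Python sketch, not Spark's actual scheduler code; the class and method names are illustrative, and the threshold corresponds to the `spark.blacklist.application.maxFailedExecutorsPerNode` setting (default 2):

```python
# Toy model: once 2 executors on a host are blacklisted, the whole host is
# blacklisted, so a replacement executor on the same host is never offered tasks.
MAX_FAILED_EXECUTORS_PER_NODE = 2  # default for the config named above

class BlacklistTracker:
    def __init__(self):
        self.bad_executors_by_host = {}  # host -> set of blacklisted executor ids
        self.bad_hosts = set()

    def executor_blacklisted(self, host, executor_id):
        execs = self.bad_executors_by_host.setdefault(host, set())
        execs.add(executor_id)
        if len(execs) >= MAX_FAILED_EXECUTORS_PER_NODE:
            self.bad_hosts.add(host)  # escalate: exclude the whole node

    def can_schedule_on(self, host):
        return host not in self.bad_hosts

tracker = BlacklistTracker()
tracker.executor_blacklisted("node1", "exec-1")
tracker.executor_blacklisted("node1", "exec-2")
# A freshly allocated executor on node1 still gets no tasks:
print(tracker.can_schedule_on("node1"))  # False
```

On a single-node cluster every executor shares that one host, which is why the newly allocated executor in the report never received tasks.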
[jira] [Commented] (SPARK-33029) Standalone mode blacklist executors page UI marks driver as blacklisted
[ https://issues.apache.org/jira/browse/SPARK-33029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255682#comment-17255682 ] Baohe Zhang commented on SPARK-33029: - With the blacklist feature enabled, by default a node is excluded once 2 executors on that node have been excluded. In this case the node is excluded, and we mark all executors on that node as excluded. Since we are running standalone mode on a single node, the driver and all executors share the same hostname, so the driver gets marked as excluded in AppStatusListener when it handles the "SparkListenerNodeExcludedForStage" event. We can fix this by filtering out the driver entity when handling that event, so the UI won't show the driver as excluded.
> Standalone mode blacklist executors page UI marks driver as blacklisted
> ---
>
> Key: SPARK-33029
> URL: https://issues.apache.org/jira/browse/SPARK-33029
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Thomas Graves
> Priority: Major
> Attachments: Screen Shot 2020-09-29 at 1.52.09 PM.png, Screen Shot 2020-09-29 at 1.53.37 PM.png
>
> I am running a spark shell on a 1-node standalone cluster. I noticed that the Executors page UI was marking the driver as blacklisted for the stage that is running. Attached a screenshot.
> Also, in my case one of the executors died, and it doesn't seem like the scheduler picked up the new one. It doesn't show up on the stages page and just shows as active, but none of the tasks ran there.
>
> You can reproduce this by starting a master and slave on a single node, then launching a shell as below, where you will get multiple executors (in this case I got 3):
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 --conf spark.blacklist.enabled=true
>
> From the shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>   val context = TaskContext.get()
>   if (context.attemptNumber() < 2) {
>     throw new Exception("test attempt num")
>   }
>   it
> }{code}
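The fix proposed in the comment can be sketched as follows. This is illustrative Python, not the actual Scala AppStatusListener; the function signature and the `live_executors` structure are assumptions for the sketch, and `"driver"` stands in for Spark's driver executor id:

```python
# Sketch of the proposed fix: when a node is excluded for a stage, mark only
# the non-driver executors on that host as excluded, so the UI never shows
# the driver as excluded even when it shares a hostname with the executors.
DRIVER_ID = "driver"

def on_node_excluded_for_stage(live_executors, host):
    """live_executors: dict of executor_id -> {'host': str, 'excluded': bool}."""
    for exec_id, info in live_executors.items():
        if info["host"] == host and exec_id != DRIVER_ID:  # skip the driver entity
            info["excluded"] = True
    return live_executors

# Single-node standalone: driver and executor share the same host.
execs = {
    "driver": {"host": "node1", "excluded": False},
    "1": {"host": "node1", "excluded": False},
}
on_node_excluded_for_stage(execs, "node1")
print(execs["driver"]["excluded"], execs["1"]["excluded"])  # False True
```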
[jira] [Commented] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
[ https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254669#comment-17254669 ] Baohe Zhang commented on SPARK-33906: - The more underlay reason seems to be that the stage complete within a heartbeat period, so the heartbeat doesn't piggyback executor peak memory metrics. > SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset > - > > Key: SPARK-33906 > URL: https://issues.apache.org/jira/browse/SPARK-33906 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Baohe Zhang >Priority: Blocker > Attachments: executor-page.png > > > How to reproduce it? > In mac OS standalone mode, open a spark-shell and run > $SPARK_HOME/bin/spark-shell --master spark://localhost:7077 > {code:scala} > val x = sc.makeRDD(1 to 10, 5) > x.count() > {code} > Then open the app UI in the browser, and click the Executors page, will get > stuck at this page: > !executor-page.png! > Also the return JSON of REST API endpoint > http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors > miss "peakMemoryMetrics" for executors. 
> {noformat} > [ { > "id" : "driver", > "hostPort" : "192.168.1.241:50042", > "isActive" : true, > "rddBlocks" : 0, > "memoryUsed" : 0, > "diskUsed" : 0, > "totalCores" : 0, > "maxTasks" : 0, > "activeTasks" : 0, > "failedTasks" : 0, > "completedTasks" : 0, > "totalTasks" : 0, > "totalDuration" : 0, > "totalGCTime" : 0, > "totalInputBytes" : 0, > "totalShuffleRead" : 0, > "totalShuffleWrite" : 0, > "isBlacklisted" : false, > "maxMemory" : 455501414, > "addTime" : "2020-12-24T19:44:18.033GMT", > "executorLogs" : { }, > "memoryMetrics" : { > "usedOnHeapStorageMemory" : 0, > "usedOffHeapStorageMemory" : 0, > "totalOnHeapStorageMemory" : 455501414, > "totalOffHeapStorageMemory" : 0 > }, > "blacklistedInStages" : [ ], > "peakMemoryMetrics" : { > "JVMHeapMemory" : 135021152, > "JVMOffHeapMemory" : 149558576, > "OnHeapExecutionMemory" : 0, > "OffHeapExecutionMemory" : 0, > "OnHeapStorageMemory" : 3301, > "OffHeapStorageMemory" : 0, > "OnHeapUnifiedMemory" : 3301, > "OffHeapUnifiedMemory" : 0, > "DirectPoolMemory" : 67963178, > "MappedPoolMemory" : 0, > "ProcessTreeJVMVMemory" : 0, > "ProcessTreeJVMRSSMemory" : 0, > "ProcessTreePythonVMemory" : 0, > "ProcessTreePythonRSSMemory" : 0, > "ProcessTreeOtherVMemory" : 0, > "ProcessTreeOtherRSSMemory" : 0, > "MinorGCCount" : 15, > "MinorGCTime" : 101, > "MajorGCCount" : 0, > "MajorGCTime" : 0 > }, > "attributes" : { }, > "resources" : { }, > "resourceProfileId" : 0, > "isExcluded" : false, > "excludedInStages" : [ ] > }, { > "id" : "0", > "hostPort" : "192.168.1.241:50054", > "isActive" : true, > "rddBlocks" : 0, > "memoryUsed" : 0, > "diskUsed" : 0, > "totalCores" : 12, > "maxTasks" : 12, > "activeTasks" : 0, > "failedTasks" : 0, > "completedTasks" : 5, > "totalTasks" : 5, > "totalDuration" : 2107, > "totalGCTime" : 25, > "totalInputBytes" : 0, > "totalShuffleRead" : 0, > "totalShuffleWrite" : 0, > "isBlacklisted" : false, > "maxMemory" : 455501414, > "addTime" : "2020-12-24T19:44:20.335GMT", > "executorLogs" : { > "stdout" : > 
"http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stdout";, > "stderr" : > "http://192.168.1.241:8081/logPage/?appId=app-20201224134418-0003&executorId=0&logType=stderr"; > }, > "memoryMetrics" : { > "usedOnHeapStorageMemory" : 0, > "usedOffHeapStorageMemory" : 0, > "totalOnHeapStorageMemory" : 455501414, > "totalOffHeapStorageMemory" : 0 > }, > "blacklistedInStages" : [ ], > "attributes" : { }, > "resources" : { }, > "resourceProfileId" : 0, > "isExcluded" : false, > "excludedInStages" : [ ] > } ] > {noformat} > I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to be set to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. > The likely reason for the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before reportHeartbeat is called. > How to fix it? > Check if the peakMemoryMetrics is undefined in executorspage.js. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
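The timing issue described above (the stage entry is removed from stageTCMP before reportHeartbeat runs, so getExecutorUpdates returns an empty map) can be sketched as follows. This is an illustration only: apart from the names getExecutorUpdates, stageTCMP, and peakExecutorMetrics quoted from the report, all types and methods here are hypothetical stand-ins, not Spark's actual implementation.

```scala
// Minimal sketch of the suspected race: a short-lived stage is dropped from
// the task-count map on completion, so a heartbeat that fires afterwards
// finds no entries and reports no peak metrics.
import scala.collection.concurrent.TrieMap

object ShortStageRaceSketch {
  // (stageId, attemptId) -> running task count; stands in for stageTCMP
  private val stageTCMP = TrieMap.empty[(Int, Int), Int]

  def onTaskStart(stage: (Int, Int)): Unit =
    stageTCMP.put(stage, stageTCMP.getOrElse(stage, 0) + 1)

  def onStageCompleted(stage: (Int, Int)): Unit =
    stageTCMP.remove(stage)

  // Stands in for ExecutorMetricsPoller.getExecutorUpdates: empty once all
  // stage entries are gone.
  def getExecutorUpdates: Map[(Int, Int), Int] = stageTCMP.toMap

  // An empty update map becomes None, i.e. no "peakMemoryMetrics" in the JSON.
  def peakExecutorMetrics: Option[Map[(Int, Int), Int]] = {
    val updates = getExecutorUpdates
    if (updates.isEmpty) None else Some(updates)
  }

  def main(args: Array[String]): Unit = {
    val stage = (0, 0)
    onTaskStart(stage)
    assert(peakExecutorMetrics.isDefined) // heartbeat during the stage: metrics present
    onStageCompleted(stage)               // stage completes before the next heartbeat
    assert(peakExecutorMetrics.isEmpty)   // heartbeat after completion: nothing to report
    println("short stage reported no peak metrics, as in the bug")
  }
}
```

The sketch shows why only short stages (completion time below the heartbeat interval) hit the bug, which matches the 5-partition count() reproduction above.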
[jira] [Commented] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
[ https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254668#comment-17254668 ] Baohe Zhang commented on SPARK-33906: - [~dongjoon] Yes. > SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset > - > > Key: SPARK-33906 > URL: https://issues.apache.org/jira/browse/SPARK-33906 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Baohe Zhang >Priority: Blocker > Attachments: executor-page.png > > > How to reproduce it? > In mac OS standalone mode, open a spark-shell and run > $SPARK_HOME/bin/spark-shell --master spark://localhost:7077 > {code:scala} > val x = sc.makeRDD(1 to 10, 5) > x.count() > {code} > Then open the app UI in the browser, and click the Executors page, will get > stuck at this page: > !executor-page.png! > Also the return JSON of REST API endpoint > http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors > miss "peakMemoryMetrics" for executors. 
> {noformat} > [ ... same executor-summary JSON as quoted above: the "driver" entry carries a "peakMemoryMetrics" block while executor "0" does not ... ] > {noformat} > I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to be set to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. > The likely reason for the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before reportHeartbeat is called. > How to fix it? > Check if the peakMemoryMetrics is undefined in executorspage.js.
[jira] [Updated] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
[ https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-33906: Description: How to reproduce it? In mac OS standalone mode, open a spark-shell and run $SPARK_HOME/bin/spark-shell --master spark://localhost:7077 {code:scala} val x = sc.makeRDD(1 to 10, 5) x.count() {code} Then open the app UI in the browser, and click the Executors page, will get stuck at this page: !executor-page.png! Also the return JSON of REST API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors miss "peakMemoryMetrics" for executors. {noformat} [ ... same executor-summary JSON as quoted above ... ] {noformat} I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to be set to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The likely reason for the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before reportHeartbeat is called. How to fix it? Check if the peakMemoryMetrics is undefined in executorspage.js. was: (previous description: same reproduction steps and executor JSON, without the "How to fix it?" section; the archived copy is truncated)
[jira] [Commented] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
[ https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254647#comment-17254647 ] Baohe Zhang commented on SPARK-33906: - I will put a PR soon. > SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset > - > > Key: SPARK-33906 > URL: https://issues.apache.org/jira/browse/SPARK-33906 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Baohe Zhang >Priority: Major > Attachments: executor-page.png > > > How to reproduce it? > In mac OS standalone mode, open a spark-shell and run > $SPARK_HOME/bin/spark-shell --master spark://localhost:7077 > {code:scala} > val x = sc.makeRDD(1 to 10, 5) > x.count() > {code} > Then open the app UI in the browser, and click the Executors page, will get > stuck at this page: > !executor-page.png! > Also the return JSON of REST API endpoint > http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors > miss "peakMemoryMetrics" for executors. 
> {noformat} > [ ... same executor-summary JSON as quoted above: the "driver" entry carries a "peakMemoryMetrics" block while executor "0" does not ... ] > {noformat} > I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to be set to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. > The likely reason for the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before reportHeartbeat is called.
[jira] [Updated] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
[ https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-33906: Description: How to reproduce it? In mac OS standalone mode, open a spark-shell and run $SPARK_HOME/bin/spark-shell --master spark://localhost:7077 {code:scala} val x = sc.makeRDD(1 to 10, 5) x.count() {code} Then open the app UI in the browser, and click the Executors page, will get stuck at this page: !executor-page.png! Also the return JSON of REST API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors miss "peakMemoryMetrics" for executors. {noformat} [ ... same executor-summary JSON as quoted above ... ] {noformat} I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to be set to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The likely reason for the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before reportHeartbeat is called. was: (previous description: identical except the screenshot was referenced as !image-2020-12-24-14-12-22-983.png!; the archived copy is truncated)
[jira] [Updated] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
[ https://issues.apache.org/jira/browse/SPARK-33906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-33906: Attachment: executor-page.png > SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset > - > > Key: SPARK-33906 > URL: https://issues.apache.org/jira/browse/SPARK-33906 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Baohe Zhang >Priority: Major > Attachments: executor-page.png > > > How to reproduce it? > In mac OS standalone mode, open a spark-shell and run > $SPARK_HOME/bin/spark-shell --master spark://localhost:7077 > {code:scala} > val x = sc.makeRDD(1 to 10, 5) > x.count() > {code} > Then open the app UI in the browser, and click the Executors page, will get > stuck at this page: > !image-2020-12-24-14-12-22-983.png! > Also the return JSON of REST API endpoint > http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors > miss "peakMemoryMetrics" for executors. 
> {noformat} > [ ... same executor-summary JSON as quoted above: the "driver" entry carries a "peakMemoryMetrics" block while executor "0" does not ... ] > {noformat} > I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to be set to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. > The likely reason for the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before reportHeartbeat is called.
[jira] [Created] (SPARK-33906) SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset
Baohe Zhang created SPARK-33906: --- Summary: SPARK UI Executors page stuck when ExecutorSummary.peakMemoryMetrics is unset Key: SPARK-33906 URL: https://issues.apache.org/jira/browse/SPARK-33906 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.2.0 Reporter: Baohe Zhang How to reproduce it? In mac OS standalone mode, open a spark-shell and run $SPARK_HOME/bin/spark-shell --master spark://localhost:7077 {code:scala} val x = sc.makeRDD(1 to 10, 5) x.count() {code} Then open the app UI in the browser, and click the Executors page, will get stuck at this page: !image-2020-12-24-14-12-22-983.png! Also the return JSON of REST API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors miss "peakMemoryMetrics" for executors. {noformat} [ ... same executor-summary JSON as quoted above ... ] {noformat} I debugged it and observed that ExecutorMetricsPoller.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to be set to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The likely reason for the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before reportHeartbeat is called.
[jira] [Commented] (SPARK-26399) Add new stage-level REST APIs and parameters
[ https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244341#comment-17244341 ] Baohe Zhang commented on SPARK-26399: - It seems the work of this ticket is already done by https://issues.apache.org/jira/browse/SPARK-23431 and https://issues.apache.org/jira/browse/SPARK-32446. > Add new stage-level REST APIs and parameters > > > Key: SPARK-26399 > URL: https://issues.apache.org/jira/browse/SPARK-26399 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Priority: Major > > Add the peak values for the metrics to the stages REST API. Also add a new > executorSummary REST API, which will return executor summary metrics for a > specified stage: > {code:java} > curl http://<host>:18080/api/v1/applications/<application id>/<stage id>/<stage attempt>/executorSummary{code} > Add parameters to the stages REST API to specify: > * filtering for task status, and returning tasks that match (for example, > FAILED tasks). > * task metric quantiles, and adding the task summary if specified > * executor metric quantiles, and adding the executor summary if specified
[jira] [Created] (SPARK-33215) Speed up event log download by skipping UI rebuild
Baohe Zhang created SPARK-33215: --- Summary: Speed up event log download by skipping UI rebuild Key: SPARK-33215 URL: https://issues.apache.org/jira/browse/SPARK-33215 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.1, 2.4.7 Reporter: Baohe Zhang Right now, when we want to download event logs from the Spark history server (SHS), SHS needs to parse the entire event log to rebuild the UI, and this is done just for view permission checks. UI rebuilding is a time-consuming and memory-intensive task, especially for large logs; however, it is unnecessary for event log download. This patch enables SHS to check the UI view permissions of a given app/attempt for a given user without rebuilding the UI. This is achieved by adding a method "checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean" to many layers of history server components. With this patch, the UI rebuild can be skipped when downloading event logs from the history server. Thus the time to download a GB-scale event log can be reduced from several minutes to several seconds, and the memory consumption of UI rebuilding is avoided.
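As a hedged sketch of the shape this change adds: only the checkUIViewPermissions signature quoted in the ticket comes from the source; the surrounding trait, the ACL-backed provider, and the download helper are illustrative stand-ins, not Spark's actual history server classes.

```scala
// Sketch: a permission check that consults stored ACLs directly, so the
// download path never replays the event log to rebuild the UI.
trait HistoryProvider {
  // Signature quoted in the ticket description.
  def checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean
}

// Illustrative provider backed by a precomputed ACL map keyed by app/attempt.
final class AclBackedProvider(acls: Map[String, Set[String]]) extends HistoryProvider {
  private def key(appId: String, attemptId: Option[String]): String =
    appId + attemptId.map("/" + _).getOrElse("")

  override def checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean =
    acls.getOrElse(key(appId, attemptId), Set.empty).contains(user)
}

object EventLogDownload {
  // Only the cheap ACL check gates the download; no UI rebuild on this path.
  def download(provider: HistoryProvider, appId: String, attemptId: Option[String], user: String): Either[String, String] =
    if (provider.checkUIViewPermissions(appId, attemptId, user)) Right(s"<event log bytes for $appId>")
    else Left("403 Forbidden")
}
```

The point of threading one boolean-returning hook through the layers is that every caller that previously rebuilt the UI just to consult its ACLs can now answer the permission question in constant time.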
[jira] [Updated] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore
[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-32350: Description: The idea is to improve the performance of HybridStore by adding batch write support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 introduces HybridStore. HybridStore will write data to InMemoryStore at first and use a background thread to dump data to LevelDB once the writing to InMemoryStore is completed. In the comments section of [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned using batch writing can improve the performance of this dumping process and he wrote the code of writeAll(). I did the comparison of the HybridStore switching time between one-by-one write and batch write on an HDD disk. When the disk is free, the batch-write has around 25% improvement, and when the disk is 100% busy, the batch-write has 7x - 10x improvement. when the disk is at 0% utilization: ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()|| |133m, 400 jobs, 100 tasks per job|16s|13s| |265m, 400 jobs, 200 tasks per job|30s|23s| |1.3g, 1000 jobs, 400 tasks per job|136s|108s| when the disk is at 100% utilization: ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()|| |133m, 400 jobs, 100 tasks per job|116s|17s| |265m, 400 jobs, 200 tasks per job|251s|26s| I also ran some write related benchmarking tests on LevelDBBenchmark.java and measured the total time of writing 1024 objects. 
when the disk is at 0% utilization: ||Benchmark test||with write(), ms||with writeAll(), ms || |randomUpdatesIndexed|213.060|157.356| |randomUpdatesNoIndex|57.869|35.439| |randomWritesIndexed|298.854|229.274| |randomWritesNoIndex|66.764|38.361| |sequentialUpdatesIndexed|87.019|56.219| |sequentialUpdatesNoIndex|61.851|41.942| |sequentialWritesIndexed|94.044|56.534| |sequentialWritesNoIndex|118.345|66.483| when the disk is at 50% utilization: ||Benchmark test||with write(), ms||with writeAll(), ms|| |randomUpdatesIndexed|230.386|180.817| |randomUpdatesNoIndex|58.935|50.113| |randomWritesIndexed|315.241|254.400| |randomWritesNoIndex|96.709|41.164| |sequentialUpdatesIndexed|89.971|70.387| |sequentialUpdatesNoIndex|72.021|53.769| |sequentialWritesIndexed|103.052|67.358| |sequentialWritesNoIndex|76.194|99.037| was: The idea is to improve the performance of HybridStore by adding batch write support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 introduces HybridStore. HybridStore will write data to InMemoryStore at first and use a background thread to dump data to LevelDB once the writing to InMemoryStore is completed. In the comments section of [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned using batch writing can improve the performance of this dumping process and he wrote the code of writeAll(). I did the comparison of the HybridStore switching time between one-by-one write and batch write on an HDD disk. When the disk is free, the batch-write has around 25% improvement, and when the disk is 100% busy, the batch-write has 7x - 10x improvement. 
when the disk is at 0% utilization: ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()|| |133m, 400 jobs, 100 tasks per job|16s|13s| |265m, 400 jobs, 200 tasks per job|30s|23s| |1.3g, 1000 jobs, 400 tasks per job|136s|108s| when the disk is at 100% utilization: ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()|| |133m, 400 jobs, 100 tasks per job|116s|17s| |265m, 400 jobs, 200 tasks per job|251s|26s| I also ran some write related benchmarking tests on LevelDBBenchmark.java and measured the total time of writing 1024 objects. The test was conducted when disk at 0% utilization. ||Benchmark test||with write(), ms||with writeAll(), ms|| |randomUpdatesIndexed|230.386|180.817| |randomUpdatesNoIndex|58.935|50.113| |randomWritesIndexed|315.241|254.400| |randomWritesNoIndex|96.709|41.164| |sequentialUpdatesIndexed|89.971|70.387| |sequentialUpdatesNoIndex|72.021|53.769| |sequentialWritesIndexed|103.052|67.358| |sequentialWritesNoIndex|76.194|99.037| > Add batch write support on LevelDB to improve performance of HybridStore > > > Key: SPARK-32350 > URL: https://issues.apache.org/jira/browse/SPARK-32350 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Baohe Zhang >Priority: Major > > The idea is to improve the performance of HybridStore by adding batch write > support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 > introduce
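The reason writeAll() helps can be seen in miniature. The toy class below is only an illustration (it is not Spark's LevelDB wrapper and does not touch a real LevelDB): the `commits` counter stands in for the synchronous disk write that LevelDB performs per operation, so batching many records into one writeAll() call amortizes that cost across the batch.

```java
import java.util.HashMap;
import java.util.Map;

public class BatchWriteSketch {
    final Map<String, String> data = new HashMap<>();
    // Each commit stands in for one synchronous disk write in real LevelDB.
    int commits = 0;

    // One-by-one write: one commit per record.
    void write(String k, String v) {
        data.put(k, v);
        commits++;
    }

    // Batch write: a single commit covers the whole batch.
    void writeAll(Map<String, String> batch) {
        data.putAll(batch);
        commits++;
    }

    public static void main(String[] args) {
        BatchWriteSketch store = new BatchWriteSketch();
        store.write("a", "1");
        store.write("b", "2");             // 2 records -> 2 commits
        Map<String, String> batch = new HashMap<>();
        batch.put("c", "3");
        batch.put("d", "4");
        store.writeAll(batch);             // 2 records -> 1 commit
        System.out.println(store.commits); // prints 3
    }
}
```

This also matches the benchmark shape above: the busier the disk (i.e., the more expensive each commit), the larger the relative gain from batching.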
[jira] [Created] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore
Baohe Zhang created SPARK-32350: --- Summary: Add batch write support on LevelDB to improve performance of HybridStore Key: SPARK-32350 URL: https://issues.apache.org/jira/browse/SPARK-32350 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.1, 3.1.0 Reporter: Baohe Zhang The idea is to improve the performance of HybridStore by adding batch write support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 introduces HybridStore. HybridStore will write data to InMemoryStore at first and use a background thread to dump data to LevelDB once the writing to InMemoryStore is completed. In the comments section of [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned that batch writing can improve the performance of this dumping process, and he wrote the writeAll() code. I compared the HybridStore switching time between one-by-one write and batch write on an HDD disk. When the disk is free, batch write yields around a 25% improvement, and when the disk is 100% busy, batch write yields a 7x - 10x improvement. When the disk is at 0% utilization: ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()|| |133m, 400 jobs, 100 tasks per job|16s|13s| |265m, 400 jobs, 200 tasks per job|30s|23s| |1.3g, 1000 jobs, 400 tasks per job|136s|108s| When the disk is at 100% utilization: ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()|| |133m, 400 jobs, 100 tasks per job|116s|17s| |265m, 400 jobs, 200 tasks per job|251s|26s| I also ran some write-related benchmark tests in LevelDBBenchmark.java and measured the total time of writing 1024 objects. The test was conducted with the disk at 0% utilization. 
||Benchmark test||with write(), ms||with writeAll(), ms|| |randomUpdatesIndexed|230.386|180.817| |randomUpdatesNoIndex|58.935|50.113| |randomWritesIndexed|315.241|254.400| |randomWritesNoIndex|96.709|41.164| |sequentialUpdatesIndexed|89.971|70.387| |sequentialUpdatesNoIndex|72.021|53.769| |sequentialWritesIndexed|103.052|67.358| |sequentialWritesNoIndex|76.194|99.037| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31664) Race in YARN scheduler shutdown leads to uncaught SparkException "Could not find CoarseGrainedScheduler"
Baohe Zhang created SPARK-31664: --- Summary: Race in YARN scheduler shutdown leads to uncaught SparkException "Could not find CoarseGrainedScheduler" Key: SPARK-31664 URL: https://issues.apache.org/jira/browse/SPARK-31664 Project: Spark Issue Type: Bug Components: Scheduler, YARN Affects Versions: 3.0.0, 3.0.1, 3.1.0 Reporter: Baohe Zhang I used this command to run SparkPi on a yarn cluster with dynamicAllocation enabled: "$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi ./spark-examples.jar 1000" and received error log below every time. {code:java} 20/05/06 16:31:44 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message. org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:169) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:684) at org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:66) at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) 
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748) 20/05/06 16:31:45 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 20/05/06 16:31:45 INFO MemoryStore: MemoryStore cleared 20/05/06 16:31:45 INFO BlockManager
[jira] [Updated] (SPARK-31608) Add a hybrid KVStore to make UI loading faster
[ https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-31608: Description: This is a follow-up for the work done by Hieu Huynh in 2019. Add a new class HybridKVStore to make the history server faster when loading event files. When rebuilding the application state from event logs, HybridKVStore will first write data to an in-memory store and have a background thread that keeps pushing the changes to LevelDB. I ran some tests on 3.0.1 on macOS: ||kvstore type / log size||100m||200m||500m||1g||2g|| |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb| |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| For example, when loading a 1g file, HybridKVStore takes 23s to parse (that means users only need to wait 23s to see the UI), while the background thread runs for another 17s to copy data to LevelDB. After that, the in-memory store can be closed and the entire store lives in LevelDB. In general, this gives a 3x - 4x UI loading speed improvement. was: This is a follow-up for the work done by Hieu Huynh in 2019. Add a new class HybridKVStore to make the history server faster when loading event files. When writing to this kvstore, it will first write to an in-memory store and having a background thread that keeps pushing the change to levelDB. 
I ran some tests on 3.0.1 on mac os: ||kvstore type / log size||100m||200m||500m||1g||2g|| |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb| |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| > Add a hybrid KVStore to make UI loading faster > -- > > Key: SPARK-31608 > URL: https://issues.apache.org/jira/browse/SPARK-31608 > Project: Spark > Issue Type: Story > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Baohe Zhang >Priority: Major > > This is a follow-up for the work done by Hieu Huynh in 2019. > Add a new class HybridKVStore to make the history server faster when loading > event files. When rebuilding the application state from event logs, > HybridKVStore will first write data to an in-memory store and having a > background thread that keeps pushing the change to levelDB. > I ran some tests on 3.0.1 on mac os: > ||kvstore type / log size||100m||200m||500m||1g||2g|| > |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to > leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to > leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to > leveldb| > |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| > For example when loading a 1g file, HybridKVStore takes 23s to parse (that > means, users only need to wait for 23s to see the UI), the background thread > will still run 17s to copy data to leveldb. And after that, the in memory > store can be closed, the entire store now moves to leveldb. So in general, it > has 3x - 4x UI loading speed improvement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
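The mechanism behind those numbers can be sketched compactly. The class below is only a rough illustration (the names are not Spark's, and a plain map stands in for LevelDB): reads are served from an in-memory map immediately while a background thread copies everything to the slower store, and once the dump finishes, reads switch over and the memory copy is released.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HybridStoreSketch {
    final Map<String, String> memory = new ConcurrentHashMap<>();
    final Map<String, String> disk = new ConcurrentHashMap<>(); // stands in for LevelDB
    volatile boolean switched = false;

    // During event log parsing, everything lands in the fast in-memory store.
    void put(String k, String v) { memory.put(k, v); }

    // Reads come from memory until the background dump completes.
    String get(String k) { return switched ? disk.get(k) : memory.get(k); }

    // Called once parsing is finished: copy data over on a background thread,
    // then flip the switch and free the in-memory copy.
    Thread startDump() {
        Thread t = new Thread(() -> {
            disk.putAll(memory);   // in Spark this is the (batched) LevelDB write
            switched = true;
            memory.clear();        // the in-memory store can now be closed
        });
        t.start();
        return t;
    }
}
```

This is why the table reports two times per log size: the parse time is when the UI becomes usable (served from memory), and the switch time includes the extra seconds the background thread spends copying to LevelDB.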
[jira] [Updated] (SPARK-31608) Add a hybrid KVStore to make UI loading faster
[ https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-31608: Description: This is a follow-up for the work done by Hieu Huynh in 2019. Add a new class HybridKVStore to make the history server faster when loading event files. When writing to this kvstore, it will first write to an in-memory store and have a background thread that keeps pushing the changes to LevelDB. I ran some tests on 3.0.1 on macOS: ||kvstore type / log size||100m||200m||500m||1g||2g|| |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb| |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| was: Add a new class HybridKVStore to make the history server faster when loading event files. When writing to this kvstore, it will first write to an in-memory store and having a background thread that keeps pushing the change to levelDB. I ran some tests on 3.0.1 on mac os: ||kvstore type / log size||100m||200m||500m||1g||2g|| |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb| |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| > Add a hybrid KVStore to make UI loading faster > -- > > Key: SPARK-31608 > URL: https://issues.apache.org/jira/browse/SPARK-31608 > Project: Spark > Issue Type: Story > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Baohe Zhang >Priority: Major > > This is a follow-up for the work done by Hieu Huynh in 2019. > Add a new class HybridKVStore to make the history server faster when loading > event files. 
When writing to this kvstore, it will first write to an > in-memory store and having a background thread that keeps pushing the change > to levelDB. > I ran some tests on 3.0.1 on mac os: > ||kvstore type / log size||100m||200m||500m||1g||2g|| > |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to > leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to > leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to > leveldb| > |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31608) Add a hybrid KVStore to make UI loading faster
Baohe Zhang created SPARK-31608: --- Summary: Add a hybrid KVStore to make UI loading faster Key: SPARK-31608 URL: https://issues.apache.org/jira/browse/SPARK-31608 Project: Spark Issue Type: Story Components: Web UI Affects Versions: 3.0.1 Reporter: Baohe Zhang Add a new class HybridKVStore to make the history server faster when loading event files. When writing to this kvstore, it will first write to an in-memory store and have a background thread that keeps pushing the changes to LevelDB. I ran some tests on 3.0.1 on macOS: ||kvstore type / log size||100m||200m||500m||1g||2g|| |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to leveldb| |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-31584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-31584: Attachment: errorstack.txt > NullPointerException when parsing event log with InMemoryStore > -- > > Key: SPARK-31584 > URL: https://issues.apache.org/jira/browse/SPARK-31584 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Baohe Zhang >Priority: Minor > Fix For: 3.0.1 > > Attachments: errorstack.txt > > > I compiled with the current branch-3.0 source and tested it in mac os. A > java.lang.NullPointerException will be thrown when below conditions are met: > # Using InMemoryStore as kvstore when parsing the event log file (e.g., when > spark.history.store.path is unset). > # At least one stage in this event log has task number greater than > spark.ui.retainedTasks (by default is 10). In this case, kvstore needs to > delete extra task records. > # The job has more than one stage, so parentToChildrenMap in > InMemoryStore.java will have more than one key. > The java.lang.NullPointerException is thrown in InMemoryStore.java :296. In > the method deleteParentIndex(). > {code:java} > private void deleteParentIndex(Object key) { > if (hasNaturalParentIndex) { > for (NaturalKeys v : parentToChildrenMap.values()) { > if (v.remove(asKey(key))) { > // `v` can be empty after removing the natural key and we can > remove it from > // `parentToChildrenMap`. However, `parentToChildrenMap` is a > ConcurrentMap and such > // checking and deleting can be slow. > // This method is to delete one object with certain key, let's > make it simple here. > break; > } > } > } > }{code} > In “if (v.remove(asKey(key)))”, if the key is not contained in v, > "v.remove(asKey(key))" will return null, and java will throw a > NullPointerException when executing "if (null)". > An exception stack trace is attached. 
> This issue can be fixed by updating if statement to > {code:java} > if (v.remove(asKey(key)) != null){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
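The bug class reported here is worth a minimal standalone reproduction. The snippet below is only an illustration of the language behavior, not Spark code: ConcurrentHashMap.remove returns null when the key is absent, and using that Boolean result directly as an if-condition auto-unboxes it, which throws a NullPointerException, exactly the failure mode in deleteParentIndex() and the reason the `!= null` comparison fixes it.

```java
import java.util.concurrent.ConcurrentHashMap;

public class RemoveUnboxDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Boolean> v = new ConcurrentHashMap<>();
        v.put("known", true);

        // Buggy pattern: if (v.remove("missing")) { ... }
        // remove() returns null for an absent key; unboxing null throws NPE.
        boolean threw = false;
        try {
            if (v.remove("missing")) { /* never reached */ }
        } catch (NullPointerException e) {
            threw = true;
        }
        System.out.println("NPE thrown: " + threw); // prints "NPE thrown: true"

        // Fixed pattern from the issue: compare against null, no unboxing.
        if (v.remove("known") != null) {
            System.out.println("removed known key safely");
        }
    }
}
```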
[jira] [Created] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore
Baohe Zhang created SPARK-31584: --- Summary: NullPointerException when parsing event log with InMemoryStore Key: SPARK-31584 URL: https://issues.apache.org/jira/browse/SPARK-31584 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.1 Reporter: Baohe Zhang Fix For: 3.0.1 I compiled with the current branch-3.0 source and tested it on macOS. A java.lang.NullPointerException will be thrown when the conditions below are met: # Using InMemoryStore as the kvstore when parsing the event log file (e.g., when spark.history.store.path is unset). # At least one stage in this event log has more tasks than spark.ui.retainedTasks (10 by default). In this case, the kvstore needs to delete extra task records. # The job has more than one stage, so parentToChildrenMap in InMemoryStore.java will have more than one key. The java.lang.NullPointerException is thrown at InMemoryStore.java:296, in the method deleteParentIndex(). {code:java} private void deleteParentIndex(Object key) { if (hasNaturalParentIndex) { for (NaturalKeys v : parentToChildrenMap.values()) { if (v.remove(asKey(key))) { // `v` can be empty after removing the natural key and we can remove it from // `parentToChildrenMap`. However, `parentToChildrenMap` is a ConcurrentMap and such // checking and deleting can be slow. // This method is to delete one object with certain key, let's make it simple here. break; } } } }{code} In “if (v.remove(asKey(key)))”, if the key is not contained in v, "v.remove(asKey(key))" will return null, and Java will throw a NullPointerException when executing "if (null)". An exception stack trace is attached. This issue can be fixed by updating the if statement to {code:java} if (v.remove(asKey(key)) != null){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31380) Peak Execution Memory Quantile is not displayed in Spark History Server UI
[ https://issues.apache.org/jira/browse/SPARK-31380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084431#comment-17084431 ] Baohe Zhang commented on SPARK-31380: - Did you run your application with Spark 3? I tested it in my local 3.0 build and the Spark history server UI can display non-zero Peak Execution Memory. !image-2020-04-15-18-16-18-254.png! > Peak Execution Memory Quantile is not displayed in Spark History Server UI > -- > > Key: SPARK-31380 > URL: https://issues.apache.org/jira/browse/SPARK-31380 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Major > Attachments: image-2020-04-15-18-16-18-254.png > > > Peak Execution Memory Quantile is displayed in the regular Spark UI > correctly. If the same application is viewed in Spark History Server UI, Peak > Execution Memory is always displayed as zero. > Spark event log for the application seems to contain Peak Execution > Memory (under the tag "internal.metrics.peakExecutionMemory") correctly. > However this is not reflected in the History Server UI. > *Steps to produce non-zero Peak Execution Memory* > spark.range(0, 20).map\{x => (x , x % 20)}.toDF("a", > "b").createOrReplaceTempView("fred") > spark.range(0, 20).map\{x => (x , x + 1)}.toDF("a", > "b").createOrReplaceTempView("phil") > sql("select p.*, f.* from phil p join fred f on f.b = p.b").count > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31380) Peak Execution Memory Quantile is not displayed in Spark History Server UI
[ https://issues.apache.org/jira/browse/SPARK-31380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-31380: Attachment: image-2020-04-15-18-16-18-254.png > Peak Execution Memory Quantile is not displayed in Spark History Server UI > -- > > Key: SPARK-31380 > URL: https://issues.apache.org/jira/browse/SPARK-31380 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Major > Attachments: image-2020-04-15-18-16-18-254.png > > > Peak Execution Memory Quantile is displayed in the regular Spark UI > correctly. If the same application is viewed in Spark History Server UI, Peak > Execution Memory is always displayed as zero. > Spark event log for the application seems to contain Peak Execution > Memory (under the tag "internal.metrics.peakExecutionMemory") correctly. > However this is not reflected in the History Server UI. > *Steps to produce non-zero Peak Execution Memory* > spark.range(0, 20).map\{x => (x , x % 20)}.toDF("a", > "b").createOrReplaceTempView("fred") > spark.range(0, 20).map\{x => (x , x + 1)}.toDF("a", > "b").createOrReplaceTempView("phil") > sql("select p.*, f.* from phil p join fred f on f.b = p.b").count > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org