Could you share your code? Are you sure your Spark 2.4 cluster had
indeed read anything? It looks like the Input Size field is empty under 2.4.
-- ND
On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
I have a large number of JSON files that Spark 2.4 can read in 36 seconds,
but Spark 3.0 takes almost 33 minutes to read the same data. On closer
analysis, it looks like Spark 3.0 is choosing a different DAG than Spark
2.4. Does anyone have any idea what is going on? Is there any
configuration problem with Spark 3.0?
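For reference, a minimal sketch of this kind of read (the actual code is not
shown here; the path, app name, and default options below are assumptions,
with schema inference left to Spark on both clusters):

import org.apache.spark.sql.SparkSession

// Read a directory of JSON files with default options and count the records.
// "/path/to/json/files" is a placeholder; no explicit schema is supplied.
val spark = SparkSession.builder().appName("json-read-sketch").getOrCreate()
val df = spark.read.json("/path/to/json/files")
println(s"records: ${df.count()}")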
Here are the details:
*Spark 2.4*
Summary Metrics for 2203 Completed Tasks
Metric     Min      25th percentile   Median   75th percentile   Max
Duration   0.0 ms   0.0 ms            0.0 ms   1.0 ms            62.0 ms
GC Time    0.0 ms   0.0 ms            0.0 ms   0.0 ms            11.0 ms
Aggregated Metrics by Executor
Executor ID   Address          Task Time   Total Tasks   Failed Tasks   Killed Tasks   Succeeded Tasks   Blacklisted
driver        10.0.0.8:49159   36 s        2203          0              0              2203              false
*Spark 3.0*
Summary Metrics for 8 Completed Tasks
Metric                 Min                25th percentile    Median             75th percentile    Max
Duration               3.8 min            4.0 min            4.1 min            4.4 min            5.0 min
GC Time                3 s                3 s                3 s                4 s                4 s
Input Size / Records   15.6 MiB / 51028   16.2 MiB / 53303   16.8 MiB / 55259   17.8 MiB / 58148   20.2 MiB / 71624
Aggregated Metrics by Executor
Executor ID   Address          Task Time   Total Tasks   Failed Tasks   Killed Tasks   Succeeded Tasks   Blacklisted   Input Size / Records
driver        10.0.0.8:50224   33 min      8             0              0              8                 false         136.1 MiB / 451999
The DAG is also different (a sketch for dumping the plans as text follows
below the screenshots).
Spark 2.4 DAG: Screenshot 2020-06-27 16.30.26.png (attached)
Spark 3.0 DAG: Screenshot 2020-06-27 16.32.32.png (attached)
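To compare the two runs as text rather than screenshots, one option is to dump
the plan and the input partition count on each cluster (a sketch under the same
assumptions as above; the path is again a placeholder):

import org.apache.spark.sql.SparkSession

// Print the plan each Spark version picks for the same read, plus the number
// of input partitions, so the 2.4 and 3.0 DAGs can be diffed directly.
val spark = SparkSession.builder().appName("plan-compare-sketch").getOrCreate()
val df = spark.read.json("/path/to/json/files")
df.explain(true)   // parsed, analyzed, optimized and physical plans
println(s"input partitions: ${df.rdd.getNumPartitions}")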