Hello Sanjeev,

It is hard to troubleshoot the issue without the input files. Could you open a JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples, or code which generates the JSON files)?
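For what it's worth, even a small deterministic generator would do as a reproducer. The sketch below is hypothetical: only the `created` and `id` field names come from the DataFrame schema in your output, and all sizes and values are made up for illustration:

```scala
// Hypothetical reproducer sketch: writes a few JSON-lines part files with a
// schema resembling the reported one (created: bigint, id: string).
// File counts, row counts, and field values are invented for illustration.
import java.nio.file.{Files, Path, Paths}
import scala.util.Random

object GenJson {
  /** Writes `parts` files of `rowsPerPart` JSON records under `dir`; returns the file count. */
  def generate(dir: String, parts: Int, rowsPerPart: Int): Int = {
    val base: Path = Paths.get(dir)
    Files.createDirectories(base)
    val rnd = new Random(42) // fixed seed so the reproducer is deterministic
    for (p <- 0 until parts) {
      val lines = (0 until rowsPerPart).map { _ =>
        val id = rnd.alphanumeric.take(8).mkString
        s"""{"created": ${1590624000L + rnd.nextInt(86400)}, "id": "$id"}"""
      }
      Files.write(base.resolve(f"part-$p%05d.json"), lines.mkString("\n").getBytes("UTF-8"))
    }
    parts
  }

  def main(args: Array[String]): Unit =
    println(generate(if (args.nonEmpty) args(0) else "sample_json", 4, 1000))
}
```

A generator like this (scaled up to your real file and record counts) attached to the JIRA ticket would let us run the 2.4-vs-3.0 comparison locally.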
Maxim Gekk
Software Engineer
Databricks, Inc.

On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <sanjeev.mis...@gmail.com> wrote:
> It has read everything. As you can see, the timing of count() is still
> smaller in Spark 2.4.
>
> Spark 2.4
>
> scala> spark.time(spark.read.json("/data/20200528"))
> Time taken: 19691 ms
> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
>
> scala> spark.time(res61.count())
> Time taken: 7113 ms
> res64: Long = 2605349
>
> Spark 3.0
>
> scala> spark.time(spark.read.json("/data/20200528"))
> 20/06/29 08:06:53 WARN package: Truncated the string representation of a plan
> since it was too large. This behavior can be adjusted by setting
> 'spark.sql.debug.maxToStringFields'.
> Time taken: 849652 ms
> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
>
> scala> spark.time(res0.count())
> Time taken: 8201 ms
> res2: Long = 2605349
>
> On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <arte...@dtechspace.com> wrote:
>> Could you share your code? Are you sure your Spark 2.4 cluster had indeed
>> read anything? It looks like the Input Size field is empty under 2.4.
>>
>> -- ND
>>
>> On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>>
>> I have a large number of JSON files that Spark 2.4 can read in 36 seconds,
>> but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis,
>> it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does
>> anyone have any idea what is going on? Is there a configuration problem
>> with Spark 3.0?
>>
>> Here are the details:
>>
>> *Spark 2.4*
>>
>> Summary Metrics for 2203 Completed Tasks
>> <http://10.0.0.8:4040/stages/stage/?id=0&attempt=0#tasksTitle>
>>
>> Metric     Min     25th percentile  Median  75th percentile  Max
>> Duration   0.0 ms  0.0 ms           0.0 ms  1.0 ms           62.0 ms
>> GC Time    0.0 ms  0.0 ms           0.0 ms  0.0 ms           11.0 ms
>>
>> Aggregated Metrics by Executor
>>
>> Executor ID  Address         Task Time  Total Tasks  Failed Tasks  Killed Tasks  Succeeded Tasks  Blacklisted
>> driver       10.0.0.8:49159  36 s       2203         0             0             2203             false
>>
>> *Spark 3.0*
>>
>> Summary Metrics for 8 Completed Tasks
>> <http://10.0.0.8:4040/stages/stage/?id=1&attempt=0&task.eventTimelinePageNumber=1&task.eventTimelinePageSize=47#tasksTitle>
>>
>> Metric                Min               25th percentile   Median            75th percentile   Max
>> Duration              3.8 min           4.0 min           4.1 min           4.4 min           5.0 min
>> GC Time               3 s               3 s               3 s               4 s               4 s
>> Input Size / Records  15.6 MiB / 51028  16.2 MiB / 53303  16.8 MiB / 55259  17.8 MiB / 58148  20.2 MiB / 71624
>>
>> Aggregated Metrics by Executor
>>
>> Executor ID  Address         Task Time  Total Tasks  Failed Tasks  Killed Tasks  Succeeded Tasks  Blacklisted  Input Size / Records
>> driver       10.0.0.8:50224  33 min     8            0             0             8                false        136.1 MiB / 451999
>>
>> The DAG is also different.
>>
>> Spark 2.4 DAG
>>
>> [image: Screenshot 2020-06-27 16.30.26.png]
>>
>> Spark 3.0 DAG
>>
>> [image: Screenshot 2020-06-27 16.32.32.png]
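One quick experiment worth trying on your side: since the slowdown shows up in spark.read.json itself (not in the count), it may be dominated by schema inference, which scans the data before the real load. Passing an explicit schema skips that scan entirely. This is only a sketch to run in a spark-shell against your data; it lists just the two fields visible in your res61 output, and the remaining five ("... 5 more fields") would have to be filled in from your actual data:

```scala
// Sketch: time spark.read.json with a user-supplied schema to bypass inference.
// Only the two fields visible in the quoted output are listed; the other five
// must be added from the real data before this reflects your workload.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("json-timing").getOrCreate()

val schema = StructType(Seq(
  StructField("created", LongType),
  StructField("id", StringType)
  // ...add the remaining five fields here
))

// With an explicit schema, the JSON files are not scanned up front to infer
// types, so any inference-related regression should disappear from this timing.
spark.time(spark.read.schema(schema).json("/data/20200528"))
```

If the explicit-schema read is fast on 3.0, that would narrow the regression to inference; timing the results of that comparison would also be useful material for the JIRA ticket.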