Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
Let me share the IPython notebook. On Tue, Jun 30, 2020 at 11:18 AM Gourav Sengupta wrote: > Hi, > > I think that the notebook clearly demonstrates that setting the > inferTimestamp option to False does not really help. > > Is it really impossible for you to show how your own data can be

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi, I think that the notebook clearly demonstrates that setting the inferTimestamp option to False does not really help. Is it really impossible for you to show how your own data can be loaded? It should be simple, just open the notebook and see why the exact code you have given does not work,

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
Hi Gourav, Please check the comments on the ticket. It looks like the performance degradation is attributed to the inferTimestamp option, which is true by default (I have no idea why) in Spark 3.0. This forces Spark to scan the entire text, hence the poor performance. Regards Sanjeev > On Jun 30, 2020,
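
For reference, a minimal sketch of the workaround discussed in the ticket comments: turning off the JSON reader's inferTimestamp option for a single read. The path is the one used elsewhere in this thread. The object name, application name, and local master setting are assumptions made only so the example is self-contained. Whether this restores Spark 2.4 timings for a given dataset is not something the sketch demonstrates.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; in spark-shell only the read itself is needed.
object ReadJsonWithoutTimestampInference {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only to keep the sketch self-contained.
    val spark = SparkSession.builder()
      .appName("json-infer-timestamp-sketch")
      .master("local[*]")
      .getOrCreate()

    // Path taken from the thread; replace it with the directory holding the .json.gz files.
    val path = "/data/small-anon/"

    // Disable timestamp inference during schema inference, then time the read.
    val df = spark.time {
      spark.read
        .option("inferTimestamp", false)
        .json(path)
    }

    // Print the inferred schema so it can be compared across Spark versions.
    df.printSchema()

    spark.stop()
  }
}
```

In spark-shell the equivalent one-liner would be spark.time(spark.read.option("inferTimestamp", false).json("/data/small-anon/")).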

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi Sanjeev, I think that I did precisely that. Can you please download my IPython notebook, have a look, and let me know where I am going wrong? It's attached to the JIRA ticket. Regards, Gourav Sengupta On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra wrote: > There are a total of 11 files as

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
There are a total of 11 files in the tar. You will have to untar it to get to the actual files (.json.gz). No, I am getting Count: 33447 spark.time(spark.read.json("/data/small-anon/")) Time taken: 431 ms res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 more fields]

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi Sanjeev, that just gives 11 records from the sample that you have uploaded to the JIRA ticket, is that correct? Regards, Gourav Sengupta On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra wrote: > There is not much code, I am just using spark-shell and reading the data > like so > >

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
There is not much code, I am just using spark-shell and reading the data like so spark.time(spark.read.json("/data/small-anon/")) > On Jun 30, 2020, at 3:53 AM, Gourav Sengupta > wrote: > > Hi Sanjeev, > > can you share the exact code that you are using to read the JSON files? > Currently I

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
Done. https://issues.apache.org/jira/browse/SPARK-32130 On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk wrote: > Hello Sanjeev, > > It is hard to troubleshoot the issue without input files. Could you open > a JIRA ticket at https://issues.apache.org/jira/projects/SPARK and > attach the JSON files

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Maxim Gekk
Hello Sanjeev, It is hard to troubleshoot the issue without input files. Could you open a JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? Maxim Gekk Software Engineer Databricks, Inc. On Mon, Jun

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
It has read everything. As you can see, the count still takes less time in Spark 2.4. Spark 2.4: scala> spark.time(spark.read.json("/data/20200528")) Time taken: 19691 ms res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields] scala> spark.time(res61.count())

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread ArtemisDev
Could you share your code? Are you sure your Spark 2.4 cluster had indeed read anything? It looks like the Input size field is empty under 2.4. -- ND On 6/27/20 7:58 PM, Sanjeev Mishra wrote: I have a large amount of JSON files that Spark can read in 36 seconds but Spark 3.0 takes almost 33

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
There is not much code, I am using the spark-shell provided by Spark 2.4 and Spark 3. val dp = spark.read.json("/Users//data/dailyparams/20200528") On Mon, Jun 29, 2020 at 2:25 AM Gourav Sengupta wrote: > Hi, > > can you please share the Spark code? > > > > Regards, > Gourav > > On Sun, Jun 28,

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Gourav Sengupta
Hi, can you please share the Spark code? Regards, Gourav On Sun, Jun 28, 2020 at 12:58 AM Sanjeev Mishra wrote: > > I have a large amount of JSON files that Spark can read in 36 seconds but > Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, > it looks like Spark 3.0 is

Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-27 Thread Sanjeev Mishra
I have a large amount of JSON files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark