Let me share the Ipython notebook. On Tue, Jun 30, 2020 at 11:18 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> Hi, > > I think that the notebook clearly demonstrates that setting the > inferTimestamp option to False does not really help. > > Is it really impossible for you to show how your own data can be loaded? > It should be simple, just open the notebook and see why the exact code you > have given does not work, and shows only 11 records. > > > Regards, > Gourav Sengupta > > On Tue, Jun 30, 2020 at 4:15 PM Sanjeev Mishra <sanjeev.mis...@gmail.com> > wrote: > >> Hi Gourav, >> >> Please check the comments of the ticket, looks like the performance >> degradation is attributed to inferTimestamp option that is true by default >> (I have no idea why) in Spark 3.0. This forces Spark to scan entire text >> and so the poor performance. >> >> Regards >> Sanjeev >> >> On Jun 30, 2020, at 8:12 AM, Gourav Sengupta <gourav.sengu...@gmail.com> >> wrote: >> >> Hi, Sanjeev, >> >> I think that I did precisely that, can you please download my ipython >> notebook and have a look, and let me know where I am going wrong. its >> attached with the JIRA ticket. >> >> >> Regards, >> Gourav Sengupta >> >> On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra <sanjeev.mis...@gmail.com> >> wrote: >> >>> There are total 11 files as part of tar. You will have to untar it to >>> get to actual files (.json.gz) >>> >>> No, I am getting >>> >>> Count: 33447 >>> >>> spark.time(spark.read.json(“/data/small-anon/")) >>> Time taken: 431 ms >>> res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... >>> 2 more fields] >>> >>> scala> res73.count() >>> res74: Long = 33447 >>> >>> ls -ltr >>> total 7592 >>> -rw-r--r-- 1 sanjeevmishra staff 132413 Jun 29 08:40 >>> part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 272767 Jun 29 08:40 >>> part-00009-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 272314 Jun 29 08:40 >>> part-00008-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 277158 Jun 29 08:40 >>> part-00007-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 321451 Jun 29 08:40 >>> part-00006-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 331419 Jun 29 08:40 >>> part-00005-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 337195 Jun 29 08:40 >>> part-00004-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 366346 Jun 29 08:40 >>> part-00003-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 423154 Jun 29 08:40 >>> part-00002-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 458187 Jun 29 08:40 >>> part-00000-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 673836 Jun 29 08:40 >>> part-00001-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz >>> -rw-r--r-- 1 sanjeevmishra staff 0 Jun 29 08:40 _SUCCESS >>> >>> On Jun 30, 2020, at 5:37 AM, Gourav Sengupta <gourav.sengu...@gmail.com> >>> wrote: >>> >>> Hi Sanjeev, >>> that just gives 11 records from the sample that you have loaded to the >>> JIRA tickets is it correct? >>> >>> >>> Regards, >>> Gourav Sengupta >>> >>> On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra <sanjeev.mis...@gmail.com> >>> wrote: >>> >>>> There is not much code, I am just using spark-shell and reading the >>>> data like so >>>> >>>> spark.time(spark.read.json("/data/small-anon/")) >>>> >>>> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta <gourav.sengu...@gmail.com> >>>> wrote: >>>> >>>> Hi Sanjeev, >>>> >>>> can you share the exact code that you are using to read the JSON files? >>>> Currently I am getting only 11 records from the tar file that you have >>>> attached with JIRA. >>>> >>>> Regards, >>>> Gourav Sengupta >>>> >>>> On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <arte...@dtechspace.com> >>>> wrote: >>>> >>>>> According to the spec, in addition to the line breaks, you should also >>>>> put the nested object values in arrays instead of dictionaries. You may >>>>> want to give a try and see if this would give you a better performance. >>>>> >>>>> Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms >>>>> 3.0. Hope the Databricks engineers will find an answer or bug fix soon. >>>>> >>>>> -- ND >>>>> On 6/29/20 12:27 PM, Sanjeev Mishra wrote: >>>>> >>>>> The tar file that I have attached has bunch of json.zip files and this >>>>> is the file that is being processed. Each line is self contained JSON as >>>>> shown below >>>>> >>>>> zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | >>>>>>> head -3 >>>>>> >>>>>> {"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS >>>>>>> HNC IGD","Annex F >>>>>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports >>>>>>> Arris FastPath Speed >>>>>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG"," >>>>>>> Arris.NVG4xx.Missing.CA >>>>>>> <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS >>>>>>> HNC IGD >>>>>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service >>>>>>> Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First >>>>>>> Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260} >>>>>> >>>>>> {"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS >>>>>>> HNC IGD","Annex F >>>>>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports >>>>>>> Arris FastPath Speed >>>>>>> Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG"," >>>>>>> Arris.NVG4xx.Missing.CA >>>>>>> <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS >>>>>>> HNC IGD >>>>>>> EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total >>>>>>> Control","GPON_100M_100M","Self-Service >>>>>>> Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253} >>>>>> >>>>>> {"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS >>>>>>> HNC IGD","Annex F >>>>>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports >>>>>>> Arris FastPath Speed >>>>>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC"," >>>>>>> Arris.NVG4xx.Missing.CA >>>>>>> <http://arris.nvg4xx.missing.ca/>","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS >>>>>>> HNC IGD >>>>>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total >>>>>>> Control","HSI","Self-Service >>>>>>> Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx >>>>>>> - >>>>>>> First Contact","GPON_200M_200M","TCW >>>>>>> Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415} >>>>>> >>>>>> >>>>> On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <arte...@dtechspace.com> >>>>> wrote: >>>>> >>>>>> Could you please share your input file instead of output files on >>>>>> that ticket? Not sure if you were following the specific file format >>>>>> requirement for JSON in Spark. The following is a snippet from the Spark >>>>>> online doc: >>>>>> >>>>>> Note that the file that is offered as *a json file* is not a typical >>>>>> JSON file. Each line must contain a separate, self-contained valid JSON >>>>>> object. For more information, please see JSON Lines text format, >>>>>> also called newline-delimited JSON <http://jsonlines.org/>. >>>>>> >>>>>> For a regular multi-line JSON file, set the multiLine option to true. >>>>>> >>>>>> -- ND >>>>>> On 6/29/20 11:55 AM, Sanjeev Mishra wrote: >>>>>> >>>>>> Done. https://issues.apache.org/jira/browse/SPARK-32130 >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <maxim.g...@databricks.com> >>>>>> wrote: >>>>>> >>>>>>> Hello Sanjeev, >>>>>>> >>>>>>> It is hard to troubleshoot the issue without input files. Could you >>>>>>> open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and >>>>>>> attach the JSON files there (or samples or code which generates JSON >>>>>>> files)? >>>>>>> >>>>>>> Maxim Gekk >>>>>>> Software Engineer >>>>>>> Databricks, Inc. >>>>>>> >>>>>>> >>>>>>> On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra < >>>>>>> sanjeev.mis...@gmail.com> wrote: >>>>>>> >>>>>>>> It has read everything. As you notice the timing of count is still >>>>>>>> smaller in Spark 2.4 >>>>>>>> >>>>>>>> Spark 2.4 >>>>>>>> >>>>>>>> scala> spark.time(spark.read.json("/data/20200528")) >>>>>>>> Time taken: 19691 ms >>>>>>>> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: >>>>>>>> string ... 5 more fields] >>>>>>>> >>>>>>>> scala> spark.time(res61.count()) >>>>>>>> Time taken: 7113 ms >>>>>>>> res64: Long = 2605349 >>>>>>>> >>>>>>>> Spark 3.0 >>>>>>>> scala> spark.time(spark.read.json("/data/20200528")) >>>>>>>> 20/06/29 08:06:53 WARN package: Truncated the string representation >>>>>>>> of a plan since it was too large. This behavior can be adjusted by >>>>>>>> setting >>>>>>>> 'spark.sql.debug.maxToStringFields'. >>>>>>>> Time taken: 849652 ms >>>>>>>> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string >>>>>>>> ... 5 more fields] >>>>>>>> >>>>>>>> scala> spark.time(res0.count()) >>>>>>>> Time taken: 8201 ms >>>>>>>> res2: Long = 2605349 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <arte...@dtechspace.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Could you share your code? Are you sure you Spark 2.4 cluster had >>>>>>>>> indeed read anything? Looks like the Input size field is empty under >>>>>>>>> 2.4. >>>>>>>>> >>>>>>>>> -- ND >>>>>>>>> On 6/27/20 7:58 PM, Sanjeev Mishra wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> I have large amount of json files that Spark can read in 36 >>>>>>>>> seconds but Spark 3.0 takes almost 33 minutes to read the same. On >>>>>>>>> closer >>>>>>>>> analysis, looks like Spark 3.0 is choosing different DAG than Spark >>>>>>>>> 2.0. >>>>>>>>> Does anyone have any idea what is going on? Is there any configuration >>>>>>>>> problem with Spark 3.0. >>>>>>>>> >>>>>>>>> Here are the details: >>>>>>>>> >>>>>>>>> *Spark 2.4* >>>>>>>>> >>>>>>>>> Summary Metrics for 2203 Completed Tasks >>>>>>>>> <http://10.0.0.8:4040/stages/stage/?id=0&attempt=0#tasksTitle> >>>>>>>>> Metric Min 25th percentile Median 75th percentile Max >>>>>>>>> Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms >>>>>>>>> GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms >>>>>>>>> Showing 1 to 2 of 2 entries >>>>>>>>> Aggregated Metrics by Executor >>>>>>>>> Show 20 40 60 100 All entries >>>>>>>>> Search: >>>>>>>>> Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed >>>>>>>>> Tasks Succeeded Tasks Blacklisted >>>>>>>>> driver >>>>>>>>> 10.0.0.8:49159 36 s 2203 0 0 2203 false >>>>>>>>> >>>>>>>>> >>>>>>>>> *Spark 3.0* >>>>>>>>> >>>>>>>>> Summary Metrics for 8 Completed Tasks >>>>>>>>> <http://10.0.0.8:4040/stages/stage/?id=1&attempt=0&task.eventTimelinePageNumber=1&task.eventTimelinePageSize=47#tasksTitle> >>>>>>>>> Metric Min 25th percentile Median 75th percentile Max >>>>>>>>> Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min >>>>>>>>> GC Time 3 s 3 s 3 s 4 s 4 s >>>>>>>>> Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / >>>>>>>>> 55259 17.8 MiB / 58148 20.2 MiB / 71624 >>>>>>>>> Showing 1 to 3 of 3 entries >>>>>>>>> Aggregated Metrics by Executor >>>>>>>>> Show 20 40 60 100 All entries >>>>>>>>> Search: >>>>>>>>> Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed >>>>>>>>> Tasks Succeeded Tasks Blacklisted Input Size / Records >>>>>>>>> driver >>>>>>>>> 10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999 >>>>>>>>> >>>>>>>>> >>>>>>>>> The DAG is also different >>>>>>>>> Spark 2.0 DAG >>>>>>>>> >>>>>>>>> <Screenshot 2020-06-27 16.30.26.png> >>>>>>>>> >>>>>>>>> Spark 3.0 DAG >>>>>>>>> >>>>>>>>> <Screenshot 2020-06-27 16.32.32.png> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>> >>> >>
load_anon_data.ipynb
Description: Binary data
--------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org