There are total 11 files as part of tar. You will have to untar it to get to 
actual files (.json.gz)

No, I am getting

Count: 33447

spark.time(spark.read.json(“/data/small-anon/"))
Time taken: 431 ms
res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 more 
fields]

scala> res73.count()
res74: Long = 33447

ls -ltr
total 7592
-rw-r--r--  1 sanjeevmishra  staff  132413 Jun 29 08:40 
part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272767 Jun 29 08:40 
part-00009-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272314 Jun 29 08:40 
part-00008-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  277158 Jun 29 08:40 
part-00007-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  321451 Jun 29 08:40 
part-00006-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  331419 Jun 29 08:40 
part-00005-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  337195 Jun 29 08:40 
part-00004-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  366346 Jun 29 08:40 
part-00003-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  423154 Jun 29 08:40 
part-00002-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  458187 Jun 29 08:40 
part-00000-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  673836 Jun 29 08:40 
part-00001-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff       0 Jun 29 08:40 _SUCCESS

> On Jun 30, 2020, at 5:37 AM, Gourav Sengupta <gourav.sengu...@gmail.com> 
> wrote:
> 
> Hi Sanjeev,
> that just gives 11 records from the sample that you have loaded to the JIRA 
> tickets is it correct?
> 
> 
> Regards,
> Gourav Sengupta 
> 
> On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra <sanjeev.mis...@gmail.com 
> <mailto:sanjeev.mis...@gmail.com>> wrote:
> There is not much code, I am just using spark-shell and reading the data like 
> so
> 
> spark.time(spark.read.json("/data/small-anon/"))
> 
>> On Jun 30, 2020, at 3:53 AM, Gourav Sengupta <gourav.sengu...@gmail.com 
>> <mailto:gourav.sengu...@gmail.com>> wrote:
>> 
>> Hi Sanjeev,
>> 
>> can you share the exact code that you are using to read the JSON files? 
>> Currently I am getting only 11 records from the tar file that you have 
>> attached with JIRA.
>> 
>> Regards,
>> Gourav Sengupta
>> 
>> On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <arte...@dtechspace.com 
>> <mailto:arte...@dtechspace.com>> wrote:
>> According to the spec, in addition to the line breaks, you should also put 
>> the nested object values in arrays instead of dictionaries.  You may want to 
>> give a try and see if this would give you a better performance.
>> 
>> Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  
>> Hope the Databricks engineers will find an answer or bug fix soon.
>> 
>> -- ND
>> 
>> On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
>>> The tar file that I have attached has bunch of json.zip files and this is 
>>> the file that is being processed. Each line is self contained JSON as shown 
>>> below
>>> 
>>> zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head 
>>> -3
>>> {"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>>>  HNC IGD","Annex F 
>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>>>  Arris FastPath Speed 
>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA
>>>  
>>> <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>>  HNC IGD 
>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>>>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
>>> Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
>>> {"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
>>>  HNC IGD","Annex F 
>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports
>>>  Arris FastPath Speed 
>>> Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA
>>>  
>>> <http://arris.nvg4xx.missing.ca/>","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>>  HNC IGD 
>>> EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total
>>>  Control","GPON_100M_100M","Self-Service 
>>> Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
>>> {"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
>>>  HNC IGD","Annex F 
>>> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports
>>>  Arris FastPath Speed 
>>> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Arris.NVG4xx.Missing.CA
>>>  
>>> <http://arris.nvg4xx.missing.ca/>","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>>>  HNC IGD 
>>> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total
>>>  Control","HSI","Self-Service 
>>> Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - 
>>> First Contact","GPON_200M_200M","TCW 
>>> Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}
>>> 
>>> On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <arte...@dtechspace.com 
>>> <mailto:arte...@dtechspace.com>> wrote:
>>> Could you please share your input file instead of output files on that 
>>> ticket?   Not sure if you were following the specific file format 
>>> requirement for JSON in Spark.  The following is a snippet from the Spark 
>>> online doc:
>>> 
>>> Note that the file that is offered as a json file is not a typical JSON 
>>> file. Each line must contain a separate, self-contained valid JSON object. 
>>> For more information, please see JSON Lines text format, also called 
>>> newline-delimited JSON <http://jsonlines.org/>.
>>> 
>>> For a regular multi-line JSON file, set the multiLine option to true.
>>> 
>>> -- ND
>>> 
>>> On 6/29/20 11:55 AM, Sanjeev Mishra wrote:
>>>> Done. https://issues.apache.org/jira/browse/SPARK-32130 
>>>> <https://issues.apache.org/jira/browse/SPARK-32130>
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <maxim.g...@databricks.com 
>>>> <mailto:maxim.g...@databricks.com>> wrote:
>>>> Hello Sanjeev,
>>>> 
>>>> It is hard to troubleshoot the issue without input files. Could you open 
>>>> an JIRA ticket at https://issues.apache.org/jira/projects/SPARK 
>>>> <https://issues.apache.org/jira/projects/SPARK> and attach the JSON files 
>>>> there (or samples or code which generates JSON files)? 
>>>> 
>>>> Maxim Gekk
>>>> Software Engineer
>>>> Databricks, Inc.
>>>> 
>>>> 
>>>> On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <sanjeev.mis...@gmail.com 
>>>> <mailto:sanjeev.mis...@gmail.com>> wrote:
>>>> It has read everything. As you notice the timing of count is still smaller 
>>>> in Spark 2.4
>>>> 
>>>> Spark 2.4
>>>> 
>>>> scala> spark.time(spark.read.json("/data/20200528"))
>>>> Time taken: 19691 ms
>>>> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
>>>> more fields]
>>>> 
>>>> scala> spark.time(res61.count())
>>>> Time taken: 7113 ms
>>>> res64: Long = 2605349
>>>> 
>>>> Spark 3.0
>>>> scala> spark.time(spark.read.json("/data/20200528"))
>>>> 20/06/29 08:06:53 WARN package: Truncated the string representation of a 
>>>> plan since it was too large. This behavior can be adjusted by setting 
>>>> 'spark.sql.debug.maxToStringFields'.
>>>> Time taken: 849652 ms
>>>> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
>>>> more fields]
>>>> 
>>>> scala> spark.time(res0.count())
>>>> Time taken: 8201 ms
>>>> res2: Long = 2605349
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <arte...@dtechspace.com 
>>>> <mailto:arte...@dtechspace.com>> wrote:
>>>> Could you share your code?  Are you sure you Spark 2.4 cluster had indeed 
>>>> read anything?  Looks like the Input size field is empty under 2.4.
>>>> 
>>>> -- ND
>>>> 
>>>> On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>>>>> 
>>>>> I have large amount of json files that Spark can read in 36 seconds but 
>>>>> Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, 
>>>>> looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does 
>>>>> anyone have any idea what is going on? Is there any configuration problem 
>>>>> with Spark 3.0.
>>>>> 
>>>>> Here are the details:
>>>>> 
>>>>> Spark 2.4
>>>>> 
>>>>> Summary Metrics for 2203 Completed Tasks 
>>>>> <http://10.0.0.8:4040/stages/stage/?id=0&attempt=0#tasksTitle>
>>>>> Metric    Min     25th percentile Median  75th percentile Max
>>>>> Duration  0.0 ms  0.0 ms  0.0 ms  1.0 ms  62.0 ms
>>>>> GC Time   0.0 ms  0.0 ms  0.0 ms  0.0 ms  11.0 ms
>>>>> Showing 1 to 2 of 2 entries
>>>>>   Aggregated Metrics by Executor   <>
>>>>> Show     entries
>>>>> Search:
>>>>> Executor ID       Logs    Address Task Time       Total Tasks     Failed 
>>>>> Tasks    Killed Tasks    Succeeded Tasks Blacklisted
>>>>> driver    
>>>>> 10.0.0.8:49159 <http://10.0.0.8:49159/>   36 s    2203    0       0       
>>>>> 2203    false
>>>>> 
>>>>> 
>>>>> Spark 3.0
>>>>> 
>>>>> Summary Metrics for 8 Completed Tasks 
>>>>> <http://10.0.0.8:4040/stages/stage/?id=1&attempt=0&task.eventTimelinePageNumber=1&task.eventTimelinePageSize=47#tasksTitle>
>>>>> Metric    Min     25th percentile Median  75th percentile Max
>>>>> Duration  3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
>>>>> GC Time   3 s     3 s     3 s     4 s     4 s
>>>>> Input Size / Records      15.6 MiB / 51028        16.2 MiB / 53303        
>>>>> 16.8 MiB / 55259        17.8 MiB / 58148        20.2 MiB / 71624
>>>>> Showing 1 to 3 of 3 entries
>>>>>   Aggregated Metrics by Executor   <>
>>>>> Show     entries
>>>>> Search:
>>>>> Executor ID       Logs    Address Task Time       Total Tasks     Failed 
>>>>> Tasks    Killed Tasks    Succeeded Tasks Blacklisted     Input Size / 
>>>>> Records
>>>>> driver    
>>>>> 10.0.0.8:50224 <http://10.0.0.8:50224/>   33 min  8       0       0       
>>>>> 8       false   136.1 MiB / 451999
>>>>> 
>>>>> 
>>>>> The DAG is also different
>>>>> Spark 2.0 DAG
>>>>> 
>>>>> <Screenshot 2020-06-27 16.30.26.png>
>>>>> 
>>>>> Spark 3.0 DAG
>>>>> 
>>>>> <Screenshot 2020-06-27 16.32.32.png>
>>>>> 
>>>>> 
> 

Reply via email to