[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149235#comment-17149235 ]
Maxim Gekk commented on SPARK-32130:
------------------------------------
I would like to propose:
# Add the SQL config spark.sql.legacy.json.inferTimestamps.enabled, set to false by default. This will control timestamp inference in JSON globally.
# Keep inferTimestamps as a JSON option (because it has already been released) and set it to false by default.
# If the JSON option inferTimestamps is set to any value, it overrides the SQL config.

If you are OK with the changes, I will open a PR for master and branch-3.0.

cc [~Samwel] [~hyukjin.kwon]

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 > -------------------------------------------------------------------------- > > Key: SPARK-32130 > URL: https://issues.apache.org/jira/browse/SPARK-32130 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 3.0.0 > Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, > sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using > 10.0.0.8 instead (on interface en0) > 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > Spark context Web UI available at http://10.0.0.8:4041 > Spark context available as 'sc' (master = local[*], app id = > local-1593442346864). > Spark session available as 'spark'. 
> Welcome to > ____ __ > / __/__ ___ _____/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_251) > Type in expressions to have them evaluated. > Type :help for more information. > Reporter: Sanjeev Mishra > Priority: Critical > Attachments: SPARK 32130 - replication and findings.ipynb, > small-anon.tar > > > We are planning to move to Spark 3, but the read performance of our JSON files > is unacceptable. The following are the performance numbers compared to Spark > 2.4 > > Spark 2.4 > scala> spark.time(spark.read.json("/data/20200528")) > Time taken: {color:#ff0000}19691 ms{color} > res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res61.count()) > Time taken: {color:#0000ff}7113 ms{color} > res64: Long = 2605349 > Spark 3.0 > scala> spark.time(spark.read.json("/data/20200528")) > 20/06/29 08:06:53 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > Time taken: {color:#ff0000}849652 ms{color} > res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res0.count()) > Time taken: {color:#0000ff}8201 ms{color} > res2: Long = 2605349 > > > I am attaching sample data (please delete it once you are able to > reproduce the issue) that is much smaller than the actual data set, but the > performance comparison can still be verified. 
> The sample tar contains a bunch of json.gz files; each line of a file is a > self-contained JSON doc, as shown below. > To reproduce the issue, please untar the attachment - it will have multiple > .json.gz files whose contents will look similar to the following: > > {quote}{color:#0000ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS > HNC IGD","Annex F > Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports > Arris FastPath Speed > Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS > HNC IGD > EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service > Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First > 
Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\\{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}{color} > {quote} > {quote}{color:#741b47}{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":\{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS > HNC IGD","Annex F > Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports > Arris FastPath Speed > Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS > HNC IGD > 
EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total > Control","GPON_100M_100M","Self-Service > Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\\{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}{color} > {quote} > {quote}{color:#0000ff}{"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":\{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS > HNC IGD","Annex F > 
Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports > Arris FastPath Speed > Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS > HNC IGD > EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total > Control","HSI","Self-Service > Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - > First Contact","GPON_200M_200M","TCW > 
Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\\{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}{color} > {quote}
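
For illustration only, the precedence rule proposed in the comment at the top of this message (the JSON option inferTimestamps, when set, overrides the SQL config; otherwise the SQL config applies; both default to false) could be sketched in plain Scala as below. The object InferTimestampsPrecedence and its resolve method are hypothetical names invented for this sketch, and plain Maps stand in for the reader options and the session config. This is not Spark's actual implementation, and the SQL config name is only the one proposed in the comment.

```scala
// Hypothetical sketch of the proposed precedence; not Spark's real code.
object InferTimestampsPrecedence {
  // SQL config name as proposed in the comment (assumed, default false).
  val ProposedConfKey = "spark.sql.legacy.json.inferTimestamps.enabled"
  // Name of the already released JSON option.
  val OptionKey = "inferTimestamps"

  // jsonOptions: per-read options; sqlConf: session-wide settings.
  def resolve(jsonOptions: Map[String, String],
              sqlConf: Map[String, String]): Boolean = {
    // Global setting: false unless the SQL config is explicitly set.
    val globalDefault =
      sqlConf.get(ProposedConfKey).map(_.toBoolean).getOrElse(false)
    // A per-read option, if set to any value, wins over the SQL config.
    jsonOptions.get(OptionKey).map(_.toBoolean).getOrElse(globalDefault)
  }
}
```

Under this sketch, a read that passes inferTimestamps=false skips inference even when the global config is true, and a read that passes nothing follows the global config. In spark-shell the released option itself would presumably be passed as spark.read.option("inferTimestamps", "false").json("/data/20200528"); whether the SQL config ships under the proposed name depends on the PR.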