[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sanjeev Mishra updated SPARK-32130:
-----------------------------------
    Attachment: small-anon.tar

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --------------------------------------------------------------------------
>
>                 Key: SPARK-32130
>                 URL: https://issues.apache.org/jira/browse/SPARK-32130
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.0.0
>         Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>       /_/
>
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
>            Reporter: Sanjeev Mishra
>            Priority: Critical
>         Attachments: small-anon.tar
>
>
> We are planning to move to Spark 3, but the read performance of our json files is unacceptable.
> Following are the performance numbers when compared to Spark 2.4:
>
> Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
> Time taken: 19691 ms
> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
> scala> spark.time(res61.count())
> Time taken: 7113 ms
> res64: Long = 2605349
>
> Spark 3.0
> scala> spark.time(spark.read.json("/data/20200528"))
> 20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> Time taken: 849652 ms
> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
> scala> spark.time(res0.count())
> Time taken: 8201 ms
> res2: Long = 2605349
>
> I am attaching a sample of the data (please delete it once you are able to reproduce the issue). It is much smaller than the actual dataset, but the performance comparison can still be verified with it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
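
[Editor's note: the transcripts above spend their time in `spark.read.json(...)`, which includes Spark's schema-inference scan over the files. A common way to sidestep that scan, shown here as a hedged sketch and not as a fix confirmed by this ticket, is to pass an explicit schema to the reader. Only the first two fields (`created`, `id`) are visible in the log output; the remaining five are elided there, so the schema below is deliberately incomplete.]

```scala
import org.apache.spark.sql.types._

// Explicit schema: the first two fields match the DataFrame shown in the
// transcript ([created: bigint, id: string ... 5 more fields]); the other
// five fields are not named in the log and would need to be filled in.
val schema = StructType(Seq(
  StructField("created", LongType),
  StructField("id", StringType)
  // ... the remaining five fields of the real dataset go here
))

// With a user-supplied schema, spark.read.json does not scan the input
// to infer one, so the expensive inference pass is skipped entirely.
val df = spark.read.schema(schema).json("/data/20200528")
```

This only avoids the inference cost; it does not explain the 2.4-vs-3.0 regression itself, which is what the ticket is about.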