[ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanjeev Mishra updated SPARK-32130:
-----------------------------------
    Attachment: small-anon.tar

> Spark 3.0 json load performance is unacceptable in comparison to Spark 2.4
> --------------------------------------------------------------------------
>
>                 Key: SPARK-32130
>                 URL: https://issues.apache.org/jira/browse/SPARK-32130
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.0.0
>         Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>       /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>            Reporter: Sanjeev Mishra
>            Priority: Critical
>         Attachments: small-anon.tar
>
>
> We are planning to move to Spark 3, but the read performance of our JSON 
> files is unacceptable. The following are the performance numbers compared 
> to Spark 2.4:
>  
> Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
> Time taken: {color:#ff0000}19691 ms{color}
> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
> scala> spark.time(res61.count())
> Time taken: {color:#0000ff}7113 ms{color}
> res64: Long = 2605349
> Spark 3.0
> scala> spark.time(spark.read.json("/data/20200528"))
> 20/06/29 08:06:53 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> Time taken: {color:#ff0000}849652 ms{color}
> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
> scala> spark.time(res0.count())
> Time taken: {color:#0000ff}8201 ms{color}
> res2: Long = 2605349
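>
> As a quick check (a minimal sketch, assuming spark-shell against the same 
> /data/20200528 directory): the count() timings above are close (7113 ms vs 
> 8201 ms) while the spark.read.json() timings differ by roughly 40x, which 
> suggests the extra time is spent in schema inference rather than in the scan 
> itself. Supplying an explicit schema skips inference; only the two columns 
> visible above are listed, since the remaining five fields are not shown.
> {code:scala}
> // Run in spark-shell, where `spark` is already defined.
> import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
>
> // Columns taken from the inferred schema printed above; the other five
> // fields are not listed here and are left out of this check.
> val partialSchema = StructType(Seq(
>   StructField("created", LongType),
>   StructField("id", StringType)
> ))
>
> // With an explicit schema the input is not scanned for inference, so this
> // read should be fast on both 2.4 and 3.0 if inference is the culprit.
> spark.time(spark.read.schema(partialSchema).json("/data/20200528"))
> {code}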
>  
>  
> I am attaching sample data (please delete it once you are able to reproduce 
> the issue) that is much smaller than the actual data set, but the performance 
> difference can still be verified with it.
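>
> As a possible interim mitigation (a sketch only; it limits the inference work 
> rather than addressing the root cause), the JSON reader's samplingRatio option 
> restricts how much of the input is used for schema inference:
> {code:scala}
> // Run in spark-shell. Infer the schema from roughly 10% of the input
> // instead of the full data set; fields that appear only in unsampled
> // records may be missing from the inferred schema.
> spark.time(spark.read.option("samplingRatio", "0.1").json("/data/20200528"))
> {code}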


