[ 
https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Nguyen updated SPARK-13804:
-----------------------------------
    Environment: 
- 3-node Spark cluster: 1 master node and 2 slave nodes
- Each node is an EC2 c3.4xlarge instance
- Each node has 16 cores and 30GB of RAM
    Description: 
Spark SQL is used to load CSV files via com.databricks.spark.csv and then run 
dataFrame.count(), roughly as in the sketch below.
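A minimal sketch of that load-then-count path, assuming Spark 1.6 with the 
spark-csv package; the input path and the header/inferSchema options are 
illustrative assumptions, not details taken from this report:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CsvCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CsvCount"))
    val sqlContext = new SQLContext(sc)

    // Load the CSV files through the spark-csv data source.
    val dataFrame = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")      // assumption: the files have a header row
      .option("inferSchema", "true") // note: inference scans the data during load()
      .load("/data/table16m/*.csv")  // hypothetical input path

    // Time the count the way the numbers below were presumably measured.
    val start = System.nanoTime()
    val rows = dataFrame.count()
    println(s"count = $rows in ${(System.nanoTime() - start) / 1e9} seconds")
  }
}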

In the same environment with plenty of CPU and RAM, Spark SQL takes 
- 18.25 seconds to load a table with 4 million rows, vs.
- 346.624 seconds (5.77 minutes) to load a table with 16 million rows.

Even though the number of rows increases by 4 times, the time it takes Spark 
SQL to run dataFrame.count() increases by 19.22 times. So the performance of 
dataFrame.count() diverges drastically from linear scaling.

1. Why is Spark SQL's performance not proportional to the number of rows 
when there is plenty of CPU and RAM (it uses only 10GB out of 30GB of RAM)?

2. What can be done to fix this performance issue?

  was:
1. HiveThriftServer2 was started with startWithContext (a sketch of steps 1-4 
follows after these steps).

2. Multiple temp tables were loaded and registered via registerTempTable.

3. HiveThriftServer2 was accessed via JDBC to query those tables.

4. Some temp tables were dropped via 
hiveContext.dropTempTable(registerTableName) and reloaded to refresh their 
data. These tables hold 1 to 7 million rows.

5. The same queries run in step 3 were re-run over the existing JDBC 
connection. This time HiveThriftServer2 received those queries, but at times 
it hung and did not return the results. CPU utilization on both the Spark 
driver and the child nodes was around 1%; 10GB of RAM out of 30GB was used on 
the driver, and 3GB of RAM out of 30GB was used on the child nodes. So there 
was no resource starvation.

6. After waiting about 5 minutes and rerunning the same queries from step 5, 
HiveThriftServer2 returned the results of those queries fine.

This issue occurs intermittently when steps 1-5 are repeated, so it may 
take several attempts to reproduce it.
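For context, a minimal sketch of steps 1-4 using the Spark 1.6 APIs named 
above; the table name and the Parquet source are hypothetical placeholders, 
not details from the report:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ThriftServerRepro"))
    val hiveContext = new HiveContext(sc)

    // Step 1: start the Thrift server against this context, so JDBC clients
    // can see the temp tables registered below.
    HiveThriftServer2.startWithContext(hiveContext)

    // Step 2: load and register a temp table (1 to 7 million rows per the report).
    val registerTableName = "events" // hypothetical table name
    hiveContext.read.parquet("/data/events").registerTempTable(registerTableName)

    // Step 3 happens externally: a JDBC client (e.g. beeline) queries the table.

    // Step 4: drop the temp table, then reload it to refresh its data.
    hiveContext.dropTempTable(registerTableName)
    hiveContext.read.parquet("/data/events").registerTempTable(registerTableName)
  }
}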


        Summary: Spark SQL's DataFrame.count() Major Divergent (Non-Linear) 
Performance Slowdown going from 4 million rows to 16+ million rows  (was: 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs intermittently)

> Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance 
> Slowdown going from 4 million rows to 16+ million rows
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13804
>                 URL: https://issues.apache.org/jira/browse/SPARK-13804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: - 3-node Spark cluster: 1 master node and 2 slave nodes
> - Each node is an EC2 c3.4xlarge instance
> - Each node has 16 cores and 30GB of RAM
>            Reporter: Michael Nguyen
>
> Spark SQL is used to load CSV files via com.databricks.spark.csv and then run 
> dataFrame.count().
> In the same environment with plenty of CPU and RAM, Spark SQL takes 
> - 18.25 seconds to load a table with 4 million rows, vs.
> - 346.624 seconds (5.77 minutes) to load a table with 16 million rows.
> Even though the number of rows increases by 4 times, the time it takes Spark 
> SQL to run dataFrame.count() increases by 19.22 times. So the performance of 
> dataFrame.count() diverges drastically from linear scaling.
> 1. Why is Spark SQL's performance not proportional to the number of rows 
> when there is plenty of CPU and RAM (it uses only 10GB out of 30GB of RAM)?
> 2. What can be done to fix this performance issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
