[ https://issues.apache.org/jira/browse/SPARK-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530636#comment-14530636 ]

Dennis Proppe commented on SPARK-7393:
--------------------------------------

Hi, Liang Lee,

without more information (HDD <> SSD?), it is quite hard to reproduce this. Do 
you cache the DataFrame before querying it? Otherwise, you'd be reading the 
Parquet files from disk every time you query it. In that case, 3 s for a select 
on a table of 61 million rows sounds impressive.

In our org, we found that working along the lines of:

df = sqlContext.load("file.parquet")   # Parquet is the default data source
df.cache().count()                     # count() materializes the cache
selection = df.where("foo = bar")

works really fast and may deliver good response times in your case. 
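
If you then register the cached DataFrame as a temp table, the query from your
issue runs against memory instead of HDFS. A minimal sketch of how you could
time it (table and column names taken from your example; the literal values
are placeholders):

import time

df = sqlContext.load("file.parquet")
df.cache().count()                      # materialize the cache first
df.registerTempTable("DBA")

start = time.time()
rows = sqlContext.sql(
    "SELECT * FROM DBA WHERE COLA = 1 AND COLB = 2").collect()
print("query took %.0f ms" % ((time.time() - start) * 1000))

Only the first scan pays the Parquet read cost; subsequent queries scan the
in-memory columnar cache.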

> How to improve Spark SQL performance?
> -------------------------------------
>
>                 Key: SPARK-7393
>                 URL: https://issues.apache.org/jira/browse/SPARK-7393
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Liang Lee
>
> We want to use Spark SQL in our project, but we found that Spark SQL 
> performance is not as good as we expected. The details are as follows:
>  1. We save data as Parquet files on HDFS.
>  2. We select just one or a few rows from the Parquet file using Spark SQL.
>  3. When the total record count is 61 million, it takes about 3 seconds to 
> get the result, which is unacceptably long for our scenario. 
>  4. When the total record count is 2 million, it takes about 93 ms to get the 
> result, which is still a little long for us.
>  5. The query statement is like: SELECT * FROM DBA WHERE COLA=? AND COLB=? 
> The table is not complex: it has fewer than 10 columns, and the content of 
> each column is less than 100 bytes.
>  6. Does anyone know how to improve the performance or have any other ideas?
>  7. Can Spark SQL support microsecond-level response times? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
