[jira] [Commented] (SPARK-14037) count(df) is very slow for dataframe constructed using SparkR::createDataFrame

Sun Rui (JIRA) Mon, 21 Mar 2016 23:54:07 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205906#comment-15205906
 ]


Sun Rui commented on SPARK-14037:
---------------------------------

Currently, createDataFrame does not allow specifying the number of 
partitions; otherwise you could try increasing it to get more parallelism.

Worker metrics for the SparkR shell are available in the Spark UI.
Can you find the "executor summary" there?
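
To see where the time goes while watching the UI, the two paths from the 
report can be timed side by side. This is only a minimal sketch, assuming a 
SparkR 1.6 shell where sqlContext is already defined and the file paths from 
the report exist; elapsed_seconds is a hypothetical helper, not part of 
SparkR.

```r
# Hypothetical helper (not part of SparkR): wall-clock seconds taken by f().
elapsed_seconds <- function(f) {
  unname(system.time(f())["elapsed"])
}

# The Spark part runs only where `sqlContext` exists (assumed: SparkR 1.6 shell).
if (exists("sqlContext")) {
  # Path 1: local R data.frame converted with createDataFrame.
  r_df <- read.csv(file = "r_df.csv", header = TRUE, sep = ",")
  sp_df <- createDataFrame(sqlContext, r_df)
  t_created <- elapsed_seconds(function() count(sp_df))

  # Path 2: the CSV loaded directly through spark-csv, as in the report.
  direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv",
                       source = "com.databricks.spark.csv",
                       inferSchema = "false", header = "true")
  t_direct <- elapsed_seconds(function() count(direct_df))

  cat("count on createDataFrame result:", t_created, "s\n")
  cat("count on read.df result:", t_direct, "s\n")
}
```

Running each count while the Spark UI is open should make it easier to match 
the slow job to the entries in the executor summary.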

> count(df) is very slow for dataframe constructed using SparkR::createDataFrame
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-14037
>                 URL: https://issues.apache.org/jira/browse/SPARK-14037
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 12.04
> RAM : 6 GB
> Spark 1.6.1 Standalone
>            Reporter: Samuel Alexander
>              Labels: performance, sparkR
>
> Any operation on a dataframe created using SparkR::createDataFrame is very 
> slow.
> I have a CSV file of about 6 MB. Sample content is below:
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, 
> sep=",") and then converted it into a Spark dataframe using sp_df <- 
> createDataFrame(sqlContext, r_df).
> Now count(sp_df) takes more than 30 seconds.
> When I load the same CSV using spark-csv, as in direct_df <- 
> read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = 
> "com.databricks.spark.csv", inferSchema = "false", header="true"),
> count(direct_df) takes less than 1 second.
> I know createDataFrame performance was improved in Spark 1.6, but 
> other operations such as count() are very slow.
> How can I get rid of this performance issue?



