[ https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248416#comment-15248416 ]
Shivaram Venkataraman commented on SPARK-14037:
-----------------------------------------------

Thanks [~samalexg] and [~sunrui] for investigating this issue. One thing we could add to our profiling is the size of the data being written out of the RRDD. My guess is that the overhead is due to (a) serialization of strings being slow in SparkR, and (b) string serialization in SparkR increasing the size of the data being written out; since /tmp might not be mounted in memory here, we may be running into disk overheads.

> count(df) is very slow for dataframes constructed using SparkR::createDataFrame
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-14037
>                 URL: https://issues.apache.org/jira/browse/SPARK-14037
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.6.1
>        Environment: Ubuntu 12.04
>                     RAM: 6 GB
>                     Spark 1.6.1 Standalone
>            Reporter: Samuel Alexander
>              Labels: performance, sparkR
>        Attachments: console.log, spark_ui.png, spark_ui_ray.png
>
> Any operation on a dataframe created using SparkR::createDataFrame is very slow.
>
> I have a CSV of size ~6 MB. Below is a sample of its content:
>
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
>
> I created an R data.frame using:
>
>     r_df <- read.csv(file = "r_df.csv", head = TRUE, sep = ",")
> and then converted it into a Spark dataframe using:
>
>     sp_df <- createDataFrame(sqlContext, r_df)
>
> Now count(sp_df) takes more than 30 seconds.
>
> When I load the same CSV using spark-csv, like:
>
>     direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv",
>                          source = "com.databricks.spark.csv",
>                          inferSchema = "false", header = "true")
>
> count(direct_df) takes below 1 second.
>
> I know createDataFrame performance was improved in Spark 1.6, but other operations such as count() are still very slow. How can I get rid of this performance issue?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
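The serialization-size hypothesis in the comment above can be probed without a Spark cluster. The following base-R sketch (illustrative only, not from the thread; the names num_col and chr_col are made up) compares the serialized size of a numeric column against the same values rendered as character strings, the kind of measurement that could be added to the profiling:

    # Hypothetical sketch: does representing values as strings inflate
    # the bytes that would be shipped out of an RRDD?
    set.seed(42)
    n <- 100000
    num_col <- runif(n) * 1e5                    # numeric column, like the value field in the CSV
    chr_col <- format(num_col, nsmall = 2)       # the same values as character strings

    size_num <- length(serialize(num_col, NULL)) # bytes for the native numeric vector
    size_chr <- length(serialize(chr_col, NULL)) # bytes for the string representation

    cat("numeric:", size_num, "bytes; character:", size_chr, "bytes\n")

If size_chr is substantially larger than size_num, that would support guess (b): string serialization increases the volume written to /tmp, turning the problem into disk I/O overhead on machines where /tmp is not memory-backed.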