You may want to look at this tool to help identify performance issues and bottlenecks:
https://github.com/kayousterhout/trace-analysis

I believe this is slated to become part of the web UI in the 1.4 release. In fact, based on the status of the JIRA, https://issues.apache.org/jira/browse/SPARK-6418, it looks like it is complete.

On Tue, May 19, 2015 at 3:56 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Hi Oded,
>
> If you open the driver UI (running on port 4040) you can see the stages
> and the tasks inside each stage. The best way to identify the bottleneck
> for a stage is to check whether any time is being spent on GC, and how
> many tasks there are per stage (the count should be > the total # of
> cores to achieve max parallelism). You can also look at how long each
> task takes.
>
> Thanks
> Best Regards
>
> On Tue, May 19, 2015 at 12:58 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>
>> I am running Spark over Cassandra to process a single table.
>>
>> My task reads a single day's worth of data from the table and performs
>> 50 group-by and distinct operations, counting distinct userIds by
>> different grouping keys.
>>
>> My code looks like this:
>>
>> JavaRDD<Row> rdd = sc.parallelize().mapPartitions().cache(); // reads
>> the data from the table
>>
>> for each groupingKey {
>>
>>     JavaPairRDD<GroupingKey, UserId> groupByRdd = rdd.mapToPair();
>>
>>     JavaPairRDD<GroupingKey, Long> countRdd =
>> groupByRdd.distinct().mapToPair().reduceByKey(); // counts distinct
>> values per grouping key
>>
>> }
>>
>> The distinct() stage takes about 2 minutes for every grouping key, and
>> my task takes well over an hour to complete.
>>
>> My cluster has 4 nodes with 30 GB of RAM per Spark process; the table
>> size is 4 GB.
>>
>> How can I identify the bottleneck more accurately? Is it caused by
>> shuffling data?
>>
>> How can I improve the performance?
>>
>> Thanks,
>> Oded
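For what it's worth, here is a minimal sketch of the partitions-vs-cores check Akhil describes, in Spark's Java API. The helper name is hypothetical, and using defaultParallelism as a rough stand-in for the total core count is an assumption for the example:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical helper: make sure there is at least one partition per core
// before running many shuffle stages over the same cached RDD.
static <T> JavaRDD<T> widenIfNeeded(JavaSparkContext sc, JavaRDD<T> rdd) {
    int parallelism = sc.defaultParallelism(); // rough proxy for total # of cores
    int partitions = rdd.partitions().size();  // = tasks per stage for this RDD
    return (partitions < parallelism) ? rdd.repartition(parallelism) : rdd;
}

Called once on the input, e.g. rdd = widenIfNeeded(sc, rdd).cache(), this pays for one extra shuffle up front but lets all 50 loop iterations run with full parallelism.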
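On the shuffle question: the loop above runs two shuffle stages per grouping key (distinct(), then reduceByKey()), so around 100 shuffles for 50 keys. One way to cut that in half is to fold the distinct set and the count into a single aggregateByKey() pass. A sketch, assuming the pairs are (String groupingKey, String userId) and that `pairs` stands in for the JavaPairRDD built by rdd.mapToPair() in the original loop:

import java.util.HashSet;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<String, Long> distinctCounts = pairs
    .aggregateByKey(
        new HashSet<String>(),
        (set, userId) -> { set.add(userId); return set; }, // add a userId to the partition-local set
        (a, b) -> { a.addAll(b); return a; })              // merge partial sets across partitions
    .mapValues(set -> (long) set.size());                  // # of distinct userIds per key

aggregateByKey combines on the map side, so each userId crosses the shuffle at most once per key, in one stage instead of two. Whether this actually wins depends on how many distinct userIds each key has; very large per-key sets can put memory pressure on the executors.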