I am running Spark over Cassandra to process a single table.
My job reads a single day's worth of data from the table and performs 50 group-by and distinct operations, counting distinct userIds by different grouping keys.
My code looks like this:

   JavaRDD<Row> rdd = sc.parallelize().mapPartitions().cache(); // reads the data from the table
   for each groupingKey {
      JavaPairRDD<GroupingKey, UserId> groupByRdd = rdd.mapToPair();
      JavaPairRDD<GroupingKey, Long> countRdd =
         groupByRdd.distinct().mapToPair().reduceByKey(); // counts distinct userIds per grouping key
   }
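
In case the pseudocode above is too condensed, here is a minimal, self-contained sketch of the same pattern. The local parallelize() of string arrays only stands in for the Cassandra read, and the column indices, field names, and class name are made up for illustration, not my real schema:

   import java.util.Arrays;
   import java.util.List;
   import org.apache.spark.api.java.JavaPairRDD;
   import org.apache.spark.api.java.JavaRDD;
   import org.apache.spark.api.java.JavaSparkContext;
   import scala.Tuple2;

   public class DistinctCountSketch {
      public static void main(String[] args) {
         JavaSparkContext sc = new JavaSparkContext("local[*]", "distinct-count-sketch");

         // Stand-in for the Cassandra read: each row is (userId, fieldA, fieldB).
         JavaRDD<String[]> rdd = sc.parallelize(Arrays.asList(
               new String[]{"u1", "a1", "b1"},
               new String[]{"u2", "a1", "b2"},
               new String[]{"u1", "a2", "b1"})).cache();

         // One pass per grouping key, mirroring the loop above.
         List<Integer> groupingColumns = Arrays.asList(1, 2); // indices of fieldA, fieldB
         for (final Integer col : groupingColumns) {
            JavaPairRDD<String, String> keyToUser =
                  rdd.mapToPair(row -> new Tuple2<>(row[col], row[0]));

            JavaPairRDD<String, Long> distinctCounts = keyToUser
                  .distinct()                               // unique (groupingKey, userId) pairs
                  .mapToPair(t -> new Tuple2<>(t._1(), 1L)) // one count per unique pair
                  .reduceByKey(Long::sum);                  // distinct userIds per grouping key

            distinctCounts.collect().forEach(t ->
                  System.out.println(col + " -> " + t._1() + ": " + t._2()));
         }
         sc.stop();
      }
   }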

The distinct() stage takes about 2 minutes for every grouping key, so the whole job takes well over an hour to complete.
My cluster has 4 nodes with 30 GB of RAM per Spark process; the table size is 4 GB.

How can I identify the bottleneck more accurately? Is it caused by shuffling the data?
How can I improve the performance?

Thanks,
Oded
