I am running Spark over Cassandra to process a single table. My task reads a single day's worth of data from the table and performs 50 group-by and distinct operations, counting distinct userIds by different grouping keys. My code looks like this:
JavaRDD<Row> rdd = sc.parallelize().mapPartitions().cache(); // reads the data from the table

for each groupingKey {
    JavaPairRDD<GroupingKey, UserId> groupByRdd = rdd.mapToPair();
    JavaPairRDD<GroupingKey, Long> countRdd = groupByRdd.distinct().mapToPair().reduceByKey(); // counts distinct userIds per grouping key
}

The distinct() stage takes about 2 minutes for each groupingKey, so the whole job takes well over an hour to complete. My cluster has 4 nodes and 30 GB of RAM per Spark process, and the table size is 4 GB. How can I identify the bottleneck more accurately? Is it caused by shuffling data? How can I improve the performance?

Thanks,
Oded
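[Editor's note: for readers, here is one way the loop body described above could look when written out in full. This is only a sketch: the Event row type, its dimensions map, the groupingKeyName parameter, and the collectAsMap() at the end are assumptions added for illustration; only the mapToPair()/distinct()/reduceByKey() shape comes from the pseudocode above.]

import java.io.Serializable;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class DistinctUserCounts {

    // Hypothetical row type standing in for the rows read from the Cassandra table.
    public static class Event implements Serializable {
        public final String userId;
        public final Map<String, String> dimensions; // groupingKeyName -> grouping key value

        public Event(String userId, Map<String, String> dimensions) {
            this.userId = userId;
            this.dimensions = dimensions;
        }
    }

    // Counts distinct userIds per value of one grouping key, mirroring the loop body above.
    public static Map<String, Long> countDistinctUsers(JavaRDD<Event> rows, String groupingKeyName) {
        // Project each row to (groupingKeyValue, userId).
        JavaPairRDD<String, String> keyToUser = rows.mapToPair(
                row -> new Tuple2<>(row.dimensions.get(groupingKeyName), row.userId));

        // distinct() drops duplicate (key, userId) pairs (the shuffle-heavy stage),
        // then each surviving pair contributes 1 to its key's count.
        JavaPairRDD<String, Long> counts = keyToUser
                .distinct()
                .mapToPair(pair -> new Tuple2<>(pair._1(), 1L))
                .reduceByKey(Long::sum);

        return counts.collectAsMap();
    }
}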