You may want to look at this tool, which can help identify performance
issues and bottlenecks:

https://github.com/kayousterhout/trace-analysis

I believe this is slated to become part of the web UI in the 1.4 release;
in fact, based on the status of the JIRA
(https://issues.apache.org/jira/browse/SPARK-6418), it looks like the work is
already complete.
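
If you want to point the trace-analysis scripts at your own job, they work
off Spark's event logs (as far as I understand), so event logging needs to be
enabled first. A minimal sketch in Java - the app name and log directory are
just example values:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf()
        .setAppName("distinct-user-counts")               // example app name
        .set("spark.eventLog.enabled", "true")            // write JSON event logs
        .set("spark.eventLog.dir", "hdfs:///spark-logs"); // example path only
    JavaSparkContext sc = new JavaSparkContext(conf);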


On Tue, May 19, 2015 at 3:56 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Hi Peer,
>
> If you open the driver UI (running on port 4040) you can see the stages
> and the tasks running inside them. The best way to identify the bottleneck
> for a stage is to check whether much of its time is being spent on GC, and
> how many tasks the stage has (the count should be greater than the total
> number of cores to achieve maximum parallelism). You can also look at how
> long each individual task takes.
>
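> For example, something along these lines - just a rough sketch, and the 120
> below is an arbitrary placeholder (pick a small multiple of your total core
> count):
>
>     // how many tasks each stage over this RDD will run
>     System.out.println("partitions = " + rdd.partitions().size()
>         + ", default parallelism = " + sc.defaultParallelism());
>
>     // if the partition count is lower than the total number of cores,
>     // repartitioning spreads the work across more tasks
>     JavaRDD<Row> repartitioned = rdd.repartition(120);
>
> If the GC Time column in the UI isn't detailed enough, you can also turn on
> executor GC logging with spark.executor.extraJavaOptions=-verbose:gc.
>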
> Thanks
> Best Regards
>
> On Tue, May 19, 2015 at 12:58 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>
>>  I am running Spark over Cassandra to process a single table.
>>
>> My task reads a single day's worth of data from the table and performs 50
>> group-by and distinct operations, counting distinct userIds by different
>> grouping keys.
>>
>> My code looks like this:
>>
>>
>>
>>    JavaRDD<Row> rdd = sc.parallelize().mapPartitions().cache(); // reads the data from the table
>>
>>    for each groupingKey {
>>       JavaPairRDD<GroupingKey, UserId> groupByRdd = rdd.mapToPair();
>>       JavaPairRDD<GroupingKey, Long> countRdd = groupByRdd.distinct().mapToPair().reduceByKey(); // counts distinct values per grouping key
>>    }
>>
>>
>>
>> The distinct() stage takes about 2 minutes for every grouping key, and my
>> task takes well over an hour to complete.
>>
>> My cluster has 4 nodes with 30 GB of RAM per Spark process, and the table
>> size is 4 GB.
>>
>>
>>
>> How can I identify the bottleneck more accurately? Is it caused by
>> shuffling data?
>>
>> How can I improve the performance?
>>
>>
>>
>> Thanks,
>>
>> Oded
>>
>
>
