Hi Peer,
If you open the driver UI (running on port 4040) you can see the stages and
the tasks happening inside it. Best way to identify the bottleneck for a
stage is to see if there's any time spending on GC, and how many tasks are
there per stage (it should be a number total # cores to achieve
I am running Spark over Cassandra to process a single table.
My task reads a single days' worth of data from the table and performs 50 group
by and distinct operations, counting distinct userIds by different grouping
keys.
My code looks like this:
JavaRddRow rdd =
You may want to look at this tooling for helping identify performance
issues and bottlenecks:
https://github.com/kayousterhout/trace-analysis
I believe this is slated to become part of the web ui in the 1.4 release,
in fact based on the status of the JIRA,