To answer my own question... I didn't realize that I was responsible for telling Spark how much parallelism I wanted for my job. I figured that between Spark and YARN they'd figure it out for themselves.

Adding --executor-memory 3G --num-executors 24 to my spark-submit command took the query time down to 30s from 18 minutes, and I'm seeing much better utilization of my Accumulo tablet servers.
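For anyone who finds this thread later, the full invocation ended up looking roughly like the sketch below. The jar name and main class are placeholders for my actual job, and 24 executors at 3G each is just what happened to fit my cluster, not a general recommendation:

    # Spark 1.0.x on YARN in client mode; jar and class names are placeholders
    spark-submit \
      --master yarn-client \
      --num-executors 24 \
      --executor-memory 3G \
      --class com.example.AuditLogCount \
      audit-log-count.jar

If I understand the docs correctly, Spark on YARN defaults to only 2 executors when --num-executors isn't set, which would explain why I never saw more than 2 tablet servers with active scans.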
-Russ

On Tue, Sep 9, 2014 at 5:13 PM, Russ Weeks <rwe...@newbrightidea.com> wrote:
> Hi,
>
> I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat.
> Not sure if I should be asking on the Spark list or the Accumulo list, but
> I'll try here. The problem is that the workload to process SQL queries
> doesn't seem to be distributed across my cluster very well.
>
> My Spark SQL app is running in yarn-client mode. The query I'm running is
> "select count(*) from audit_log" (or a similarly simple query), where my
> audit_log table has 14.3M rows and 504M key-value pairs spread fairly
> evenly across 8 tablet servers. Looking at the Accumulo monitor app, I
> only ever see a maximum of 2 tablet servers with active scans. Since the
> data is spread across all the tablet servers, I hoped to see 8!
>
> I realize there are a lot of moving parts here, but I'd appreciate any
> advice about where to start looking.
>
> Using Spark 1.0.1 with Accumulo 1.6.
>
> Thanks!
> -Russ