To answer my own question... I didn't realize that I was responsible for
telling Spark how much parallelism I wanted for my job. I figured that
between Spark and YARN they'd figure it out for themselves.

Adding --executor-memory 3G --num-executors 24 to my spark-submit command
took the query time down from 18 minutes to 30 seconds, and I'm seeing much
better utilization of my Accumulo tablet servers.
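For anyone who finds this later, the full command ended up looking roughly
like this (the class and jar names below are just placeholders for my app,
not anything standard):

  spark-submit \
    --master yarn-client \
    --num-executors 24 \
    --executor-memory 3G \
    --class com.example.AuditLogQuery \
    audit-log-query.jar

With 24 executors available, Spark can run up to that many scan tasks at
once (one per executor core), which seems to be what finally spread the
load across all 8 tablet servers instead of just 2.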

-Russ

On Tue, Sep 9, 2014 at 5:13 PM, Russ Weeks <rwe...@newbrightidea.com> wrote:

> Hi,
>
> I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat.
> Not sure if I should be asking on the Spark list or the Accumulo list, but
> I'll try here. The problem is that the workload to process SQL queries
> doesn't seem to be distributed across my cluster very well.
>
> My Spark SQL app is running in yarn-client mode. The query I'm running is
> "select count(*) from audit_log" (or a similarly simple query) where my
> audit_log table has 14.3M rows, 504M key-value pairs spread fairly evenly
> across 8 tablet servers. Looking at the Accumulo monitor app, I only ever
> see a maximum of 2 tablet servers with active scans. Since the data is
> spread across all the tablet servers, I hoped to see 8!
>
> I realize there are a lot of moving parts here, but I'd appreciate any
> advice about where to start looking.
>
> Using Spark 1.0.1 with Accumulo 1.6.
>
> Thanks!
> -Russ
>
