Hi James,

There are a few configurations that you can try:
https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options

From my experience, codegen really speeds things up. Just run
sqlContext.sql("SET spark.sql.codegen=true") before you execute your query.
But keep in mind that codegen is sometimes buggy (depending on your query),
so compare the results against a run without it to be sure.
You can also try changing the default number of shuffle partitions
(spark.sql.shuffle.partitions).
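Something like this works in 1.3/1.4 (just a sketch; the value 64 is only an
illustration, tune it to your cluster, and note the default is 200):

    // same effect as running SET statements through sqlContext.sql(...)
    sqlContext.setConf("spark.sql.codegen", "true")
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")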

You can also use DataFrames (available since 1.3). I'm not sure they beat a
plain SQL query in 1.3, but Spark 1.4 should bring a substantial
performance improvement for DataFrames.
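For reference, your query would look roughly like this with the DataFrame
API (an untested sketch, assuming usage_events is registered as a table;
everything used here exists in the 1.3 Scala API):

    import org.apache.spark.sql.functions._

    val events = sqlContext.table("usage_events")

    val result = events
      // same predicate as the SQL version, passed as an expression string
      .filter("from_unixtime(cast(timestamp_millis / 1000 as bigint)) " +
        "between '2015-06-09' and '2015-06-16'")
      .agg(
        count(lit(1)).as("uses"),                            // total rows
        countDistinct(col("id").cast("string")).as("users")  // distinct ids
      )

    result.show()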

Lior

On Mon, Jun 22, 2015 at 6:28 PM, James Aley <james.a...@swiftkey.com> wrote:

> Hello,
>
> A colleague of mine ran the following Spark SQL query:
>
> select
>   count(*) as uses,
>   count (distinct cast(id as string)) as users
> from usage_events
> where
>   from_unixtime(cast(timestamp_millis/1000 as bigint))
> between '2015-06-09' and '2015-06-16'
>
> The table contains billions of rows, but totals only 64GB of data across
> ~30 separate files, which are stored as Parquet with LZO compression in S3.
>
> Of the referenced columns:
>
> * id is Binary, which we cast to a String so that we can DISTINCT by it.
> (I was already told this will improve in a later release, in a separate
> thread.)
> * timestamp_millis is a long, containing a unix timestamp with
> millisecond resolution
>
> This took nearly 2 hours to run on a 5-node cluster of r3.xlarge EC2
> instances, using 20 executors, each with 4GB of memory. I can see from
> monitoring tools that CPU usage is at 100% on all nodes, but incoming
> network traffic seems a bit low at 2.5MB/s, suggesting to me that this is
> CPU-bound.
>
> Does that seem slow? Can anyone offer any ideas by glancing at the query
> as to why this might be slow? We'll profile it meanwhile and post back if
> we find anything ourselves.
>
> A side issue - I've found that this query, and others, sometimes completes
> but doesn't return any results. There appears to be no error that I can see
> in the logs, and Spark reports the job as successful, but the connected
> JDBC client (SQLWorkbenchJ in this case) just sits there waiting forever.
> I did a quick Google search and couldn't find anyone else having similar
> issues.
>
>
> Many thanks,
>
> James.
>
