Generally (not only in Spark SQL), you should avoid applying casts or functions to the columns you filter on in the WHERE clause of a SQL query: the expression has to be evaluated for every row, and wrapping the column in a function can also prevent the filter from being pushed down to the data source. It is also unnecessary in your case, since the date range can be checked directly against the raw millisecond values. Getting rid of the casts in the rest of the query should help as well.
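For example, something along these lines should be equivalent (just a sketch: the epoch-millisecond literals assume your date boundaries are midnight UTC, and I've kept the cast on id since you mentioned DISTINCT on a Binary column still needs it in your version):

    select
      count(*) as uses,
      count(distinct cast(id as string)) as users
    from usage_events
    where
      timestamp_millis >= 1433808000000      -- 2015-06-09 00:00:00 UTC
      and timestamp_millis < 1434412800000   -- 2015-06-16 00:00:00 UTC

A plain numeric range comparison on the long column avoids calling from_unixtime and cast per row, and a simple predicate on the untransformed column is also the form that may get pushed down to the Parquet reader.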
On Mon, Jun 22, 2015 at 17:29, James Aley <james.a...@swiftkey.com> wrote:
> Hello,
>
> A colleague of mine ran the following Spark SQL query:
>
> select
>   count(*) as uses,
>   count(distinct cast(id as string)) as users
> from usage_events
> where
>   from_unixtime(cast(timestamp_millis/1000 as bigint))
>   between '2015-06-09' and '2015-06-16'
>
> The table contains billions of rows, but totals only 64GB of data across
> ~30 separate files, which are stored as Parquet with LZO compression in S3.
>
> From the referenced columns:
>
> * id is Binary, which we cast to a String so that we can DISTINCT by it.
> (I was already told this will improve in a later release, in a separate
> thread.)
> * timestamp_millis is a long, containing a unix timestamp with
> millisecond resolution
>
> This took nearly 2 hours to run on a 5-node cluster of r3.xlarge EC2
> instances, using 20 executors, each with 4GB memory. I can see from
> monitoring tools that the CPU usage is at 100% on all nodes, but incoming
> network seems a bit low at 2.5MB/s, suggesting to me that this is CPU-bound.
>
> Does that seem slow? Can anyone offer any ideas by glancing at the query
> as to why this might be slow? We'll profile it meanwhile and post back if
> we find anything ourselves.
>
> A side issue - I've found that this query, and others, sometimes completes
> but doesn't return any results. There appears to be no error that I can see
> in the logs, and Spark reports the job as successful, but the connected
> JDBC client (SQLWorkbenchJ in this case) just sits there forever waiting.
> I did a quick Google and couldn't find anyone else having similar issues.
>
> Many thanks,
>
> James.