Hi James,

Maybe it's the DISTINCT causing the issue.
I rewrote the query as follows. Maybe this one can finish faster.

select
  sum(cnt) as uses,
  count(id) as users
from (
  select
    count(*) as cnt,
    cast(id as string) as id
  from usage_events
  where
    from_unixtime(cast(timestamp_millis/1000 as bigint))
      between '2015-06-09' and '2015-06-16'
  group by cast(id as string)
) tmp

Thanks,

Yin

On Mon, Jun 22, 2015 at 12:55 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Generally (not only Spark SQL specific), you should not cast in the WHERE
> clause of a SQL query. It is also not necessary in your case. Getting rid
> of the casts in the whole query will also be beneficial.
>
> Le lun. 22 juin 2015 à 17:29, James Aley <james.a...@swiftkey.com> a
> écrit :
>
>> Hello,
>>
>> A colleague of mine ran the following Spark SQL query:
>>
>> select
>>   count(*) as uses,
>>   count(distinct cast(id as string)) as users
>> from usage_events
>> where
>>   from_unixtime(cast(timestamp_millis/1000 as bigint))
>>     between '2015-06-09' and '2015-06-16'
>>
>> The table contains billions of rows, but totals only 64GB of data across
>> ~30 separate files, which are stored as Parquet with LZO compression in
>> S3.
>>
>> From the referenced columns:
>>
>> * id is Binary, which we cast to a String so that we can DISTINCT by it.
>>   (I was already told in a separate thread that this will improve in a
>>   later release.)
>> * timestamp_millis is a long containing a unix timestamp with
>>   millisecond resolution.
>>
>> This took nearly 2 hours to run on a 5-node cluster of r3.xlarge EC2
>> instances, using 20 executors, each with 4GB of memory. I can see from
>> monitoring tools that CPU usage is at 100% on all nodes, but incoming
>> network traffic seems low at 2.5MB/s, which suggests to me that this job
>> is CPU-bound.
>>
>> Does that seem slow? Can anyone offer any ideas, just by glancing at the
>> query, as to why this might be? We'll profile it in the meantime and
>> post back if we find anything ourselves.
>>
>> A side issue: I've found that this query, and others, sometimes
>> completes but doesn't return any results. There appears to be no error
>> that I can see in the logs, and Spark reports the job as successful, but
>> the connected JDBC client (SQLWorkbenchJ in this case) just sits there
>> waiting forever. I did a quick Google search and couldn't find anyone
>> else reporting similar issues.
>>
>> Many thanks,
>>
>> James.
>>
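Jörn's advice about avoiding casts in the WHERE clause amounts to precomputing the date window as raw epoch milliseconds, so the filter compares the long `timestamp_millis` column directly instead of calling `from_unixtime` per row. A minimal sketch of building such a predicate (the helper name `millis_bounds` is hypothetical, and the bounds assume UTC date boundaries, which may not match the original query's session timezone):

```python
from datetime import datetime, timezone

def millis_bounds(start_date: str, end_date: str):
    """Convert 'YYYY-MM-DD' date strings to UTC epoch-millisecond bounds,
    so the SQL WHERE clause can compare the raw timestamp_millis column
    directly rather than converting every row with from_unixtime()."""
    fmt = "%Y-%m-%d"
    lo = datetime.strptime(start_date, fmt).replace(tzinfo=timezone.utc)
    hi = datetime.strptime(end_date, fmt).replace(tzinfo=timezone.utc)
    return int(lo.timestamp() * 1000), int(hi.timestamp() * 1000)

lo, hi = millis_bounds("2015-06-09", "2015-06-16")

# The filter now touches only the raw long column; no per-row cast or
# time conversion is needed, and Parquet min/max statistics can be used.
query = (
    "select count(*) as uses, count(distinct cast(id as string)) as users "
    "from usage_events "
    f"where timestamp_millis between {lo} and {hi}"
)
print(query)
```

The DISTINCT-side cast is kept as in the original thread; only the time filter changes here.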