Have you test this on a smaller set to verify that the query is correct?
On Mon, Jun 22, 2015 at 2:59 PM, ayan guha guha.a...@gmail.com wrote:
You may also want to change count(*) to specific column.
On 23 Jun 2015 01:29, James Aley james.a...@swiftkey.com wrote:
Hello,
A colleague of mine ran the following Spark SQL query:
select
count(*) as uses,
count (distinct cast(id as string)) as users
from usage_events
where
from_unixtime(cast(timestamp_millis/1000 as bigint))
between '2015-06-09' and '2015-06-16'
The table contains billions of rows, but totals only 64GB of data across
~30 separate files, which are stored as Parquet with LZO compression in S3.
From the referenced columns:
* id is Binary, which we cast to a String so that we can DISTINCT by it.
(I was already told this will improve in a later release, in a separate
thread.)
* timestamp_millis is a long, containing a unix timestamp with
millisecond resolution
This took nearly 2 hours to run on a 5 node cluster of r3.xlarge EC2
instances, using 20 executors, each with 4GB memory. I can see from
monitoring tools that the CPU usage is at 100% on all nodes, but incoming
network seems a bit low at 2.5MB/s, suggesting to me that this is CPU-bound.
Does that seem slow? Can anyone offer any ideas by glancing at the query
as to why this might be slow? We'll profile it meanwhile and post back if
we find anything ourselves.
A side issue - I've found that this query, and others, sometimes
completes but doesn't return any results. There appears to be no error that
I can see in the logs, and Spark reports the job as successful, but the
connected JDBC client (SQLWorkbenchJ in this case), just sits there forever
waiting. I did a quick Google and couldn't find anyone else having similar
issues.
Many thanks,
James.