Have you tested this on a smaller set to verify that the query is correct?
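
For example, narrowing the date window to a single day (or running against a
small sample of the table) should make it quick to sanity-check that the
counts come out as expected. This is just a sketch, reusing the table and
column names from your query, with the end date pulled in to one day:

  select
    count(*) as uses,
    count(distinct cast(id as string)) as users
  from usage_events
  where
    from_unixtime(cast(timestamp_millis/1000 as bigint))
      between '2015-06-09' and '2015-06-10'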

On Mon, Jun 22, 2015 at 2:59 PM, ayan guha <guha.a...@gmail.com> wrote:

> You may also want to change count(*) to a specific column.
> On 23 Jun 2015 01:29, "James Aley" <james.a...@swiftkey.com> wrote:
>
>> Hello,
>>
>> A colleague of mine ran the following Spark SQL query:
>>
>> select
>>   count(*) as uses,
>>   count (distinct cast(id as string)) as users
>> from usage_events
>> where
>>   from_unixtime(cast(timestamp_millis/1000 as bigint))
>>     between '2015-06-09' and '2015-06-16'
>>
>> The table contains billions of rows, but totals only 64GB of data across
>> ~30 separate files, which are stored as Parquet with LZO compression in S3.
>>
>> Of the referenced columns:
>>
>> * id is Binary, which we cast to a String so that we can DISTINCT by it.
>> (I was already told in a separate thread that this will improve in a later
>> release.)
>> * timestamp_millis is a long containing a unix timestamp with
>> millisecond resolution.
>>
>> This took nearly 2 hours to run on a 5-node cluster of r3.xlarge EC2
>> instances, using 20 executors, each with 4GB of memory. I can see from
>> monitoring tools that CPU usage is at 100% on all nodes, but incoming
>> network traffic seems a bit low at 2.5MB/s, suggesting to me that this is
>> CPU-bound.
>>
>> Does that seem slow? Can anyone glancing at the query offer any ideas as
>> to why it might be slow? We'll profile it in the meantime and post back if
>> we find anything ourselves.
>>
>> A side issue: I've found that this query (and others) sometimes completes
>> but doesn't return any results. There appears to be no error that I can
>> see in the logs, and Spark reports the job as successful, but the connected
>> JDBC client (SQLWorkbenchJ in this case) just sits there waiting forever.
>> A quick Google search didn't turn up anyone else having similar issues.
>>
>>
>> Many thanks,
>>
>> James.
>>
>
