Generally (not only in Spark SQL), you should avoid applying casts or functions to the columns you filter on in the WHERE clause of a SQL query: the expression has to be evaluated for every row, and wrapping the column in a function can also prevent the filter from being pushed down to the data source. It is also unnecessary in your case, since the date range can be checked directly against the raw millisecond values. Getting rid of the casts in the rest of the query should help as well.
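For example, something along these lines should be equivalent (just a sketch: the epoch-millisecond literals assume your date boundaries are midnight UTC, and I've kept the cast on id since you mentioned DISTINCT on a Binary column still needs it in your version):

    select
      count(*) as uses,
      count(distinct cast(id as string)) as users
    from usage_events
    where
      timestamp_millis >= 1433808000000      -- 2015-06-09 00:00:00 UTC
      and timestamp_millis < 1434412800000   -- 2015-06-16 00:00:00 UTC

A plain numeric range comparison on the long column avoids calling from_unixtime and cast per row, and a simple predicate on the untransformed column is also the form that may get pushed down to the Parquet reader.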
On Mon, Jun 22, 2015 at 17:29, James Aley <james.a...@swiftkey.com> wrote:
> Hello,
>
> A colleague of mine ran the following Spark SQL query:
>
> select
>   count(*) as uses,
>   count(distinct cast(id as string)) as users
> from usage_events
> where
>   from_unixtime(cast(timestamp_millis/1000 as bigint))
>   between '2015-06-09' and '2015-06-16'
>
> The table contains billions of rows, but totals only 64GB of data across
> ~30 separate files, which are stored as Parquet with LZO compression in S3.
>
> From the referenced columns:
>
> * id is Binary, which we cast to a String so that we can DISTINCT by it.
> (I was already told this will improve in a later release, in a separate
> thread.)
> * timestamp_millis is a long, containing a unix timestamp with
> millisecond resolution
>
> This took nearly 2 hours to run on a 5-node cluster of r3.xlarge EC2
> instances, using 20 executors, each with 4GB memory. I can see from
> monitoring tools that the CPU usage is at 100% on all nodes, but incoming
> network seems a bit low at 2.5MB/s, suggesting to me that this is CPU-bound.
>
> Does that seem slow? Can anyone offer any ideas by glancing at the query
> as to why this might be slow? We'll profile it meanwhile and post back if
> we find anything ourselves.
>
> A side issue - I've found that this query, and others, sometimes completes
> but doesn't return any results. There appears to be no error that I can see
> in the logs, and Spark reports the job as successful, but the connected
> JDBC client (SQLWorkbenchJ in this case) just sits there forever waiting.
> I did a quick Google and couldn't find anyone else having similar issues.
>
> Many thanks,
>
> James.