Re: Help optimising Spark SQL query

2015-06-30 Thread James Aley
Thanks everybody for the advice on this. I attached YourKit and found that the CPU time split was about 70% in Parquet/LZO reading and 30% applying the filter predicate. I guess those are reasonable things for it to be spending time on, and so it really could just be a case of needing more

Re: Help optimising Spark SQL query

2015-06-23 Thread Sabarish Sasidharan
64GB in Parquet could be many billions of rows because of the columnar compression. And count distinct is by itself an expensive operation; this is not just Spark, even on Presto/Impala you would see a performance dip with count distincts. And the cluster is not that powerful either. The one
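If an exact user count is not required, an approximate count distinct is much cheaper. Spark 1.x's plain SQLContext parser (not the HiveQL one) accepted an APPROXIMATE keyword for this; a sketch, worth verifying against your version:

    select count(*) as uses,
           approximate count(distinct cast(id as string)) as users
    from usage_events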

Re: Help optimising Spark SQL query

2015-06-23 Thread James Aley
Thanks for the suggestions everyone, appreciate the advice. I tried replacing DISTINCT with the nested GROUP BY, running on 1.4 instead of 1.3, replacing the date casts with a BETWEEN operation on the corresponding long constants, and changing COUNT(*) to COUNT(1). None of these seem to
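Combined, the rewrites described above look roughly like this (a sketch; the epoch-millisecond constants stand in for 2015-06-09 and 2015-06-16 00:00 UTC and are illustrative, not taken from the thread):

    select sum(cnt) as uses, count(id) as users
    from (
      select count(1) as cnt, cast(id as string) as id
      from usage_events
      where timestamp_millis between 1433808000000 and 1434412800000
      group by cast(id as string)
    ) t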

Help optimising Spark SQL query

2015-06-22 Thread James Aley
Hello, A colleague of mine ran the following Spark SQL query: select count(*) as uses, count (distinct cast(id as string)) as users from usage_events where from_unixtime(cast(timestamp_millis/1000 as bigint)) between '2015-06-09' and '2015-06-16' The table contains billions of rows, but
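Formatted for readability, the query is:

    select count(*) as uses,
           count(distinct cast(id as string)) as users
    from usage_events
    where from_unixtime(cast(timestamp_millis/1000 as bigint))
          between '2015-06-09' and '2015-06-16'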

RE: Help optimising Spark SQL query

2015-06-22 Thread Matthew Johnson
this issue, so it might be worth upgrading if you are not already on 1.4. Cheers, Matthew *From:* Lior Chaga [mailto:lio...@taboola.com] *Sent:* 22 June 2015 17:24 *To:* James Aley *Cc:* user *Subject:* Re: Help optimising Spark SQL query Hi James, There are a few configurations that you

Re: Help optimising Spark SQL query

2015-06-22 Thread Lior Chaga
Hi James, There are a few configurations that you can try: https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options From my experience, the codegen really boosts things up. Just run sqlContext.sql("SET spark.sql.codegen=true") before you execute your query. But keep
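In Spark 1.x that flag is issued as a SQL statement against the same SQLContext before running the query:

    SET spark.sql.codegen=true

Codegen compiles each query's expressions to JVM bytecode at runtime, so short queries pay an up-front compilation cost that long-running ones amortise.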

Re: Help optimising Spark SQL query

2015-06-22 Thread James Aley
are not already on 1.4. Cheers, Matthew *From:* Lior Chaga [mailto:lio...@taboola.com] *Sent:* 22 June 2015 17:24 *To:* James Aley *Cc:* user *Subject:* Re: Help optimising Spark SQL query Hi James, There are a few configurations that you can try: https://spark.apache.org/docs/latest

Re: Help optimising Spark SQL query

2015-06-22 Thread Ntale Lukama
Have you tested this on a smaller set to verify that the query is correct? On Mon, Jun 22, 2015 at 2:59 PM, ayan guha guha.a...@gmail.com wrote: You may also want to change count(*) to a specific column. On 23 Jun 2015 01:29, James Aley james.a...@swiftkey.com wrote: Hello, A colleague of mine
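One way to run that check, as a sketch (the LIMIT size and the sample_events alias are arbitrary choices, and a LIMIT sample is not random, so this verifies the query shape rather than the numbers):

    select count(*) as uses,
           count(distinct cast(id as string)) as users
    from (select * from usage_events limit 100000) sample_events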

Re: Help optimising Spark SQL query

2015-06-22 Thread Yin Huai
Hi James, Maybe it's the DISTINCT causing the issue. I rewrote the query as follows. Maybe this one can finish faster. select sum(cnt) as uses, count(id) as users from ( select count(*) cnt, cast(id as string) as id from usage_events where
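The snippet is cut off; restoring the WHERE clause from the original query and the GROUP BY that the outer aggregates imply, the full rewrite presumably reads (the subquery alias t is added for parsers that require one):

    select sum(cnt) as uses, count(id) as users
    from (
      select count(*) as cnt, cast(id as string) as id
      from usage_events
      where from_unixtime(cast(timestamp_millis/1000 as bigint))
            between '2015-06-09' and '2015-06-16'
      group by cast(id as string)
    ) t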

Re: Help optimising Spark SQL query

2015-06-22 Thread ayan guha
You may also want to change count(*) to a specific column. On 23 Jun 2015 01:29, James Aley james.a...@swiftkey.com wrote: Hello, A colleague of mine ran the following Spark SQL query: select count(*) as uses, count (distinct cast(id as string)) as users from usage_events where

Re: Help optimising Spark SQL query

2015-06-22 Thread Jörn Franke
Generally (not only Spark SQL specific) you should not cast in the WHERE part of a SQL query. It is also not necessary in your case. Getting rid of casts in the whole query will also be beneficial. On Mon, 22 June 2015 at 17:29, James Aley james.a...@swiftkey.com wrote: Hello, A colleague
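Applied to this query, that means precomputing the boundary timestamps once and comparing the stored column directly, instead of converting timestamp_millis on every row. A sketch (the constants correspond to 2015-06-09 and 2015-06-16 00:00 UTC and are illustrative):

    -- before: converts every row before comparing
    where from_unixtime(cast(timestamp_millis/1000 as bigint))
          between '2015-06-09' and '2015-06-16'

    -- after: compares the raw column against constants
    where timestamp_millis >= 1433808000000
      and timestamp_millis <  1434412800000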