Thanks for the responses, guys! Sorry, I forgot to mention that I'm using Spark 1.3.0, but I'll test with 1.4.0 and try the codegen suggestion, then report back. I've sketched below what I'm planning to run, in case anyone spots a problem with it.
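Here's roughly what I intend to try in the 1.4.0 spark-shell, completely untested so far. I'm using setConf, which I believe is equivalent to the SET statement Lior mentioned, and the shuffle partition count is just a first guess that I'll need to tune for our cluster:

    // Spark 1.4 spark-shell sketch; sqlContext is pre-defined in the shell
    // (otherwise: val sqlContext = new org.apache.spark.sql.SQLContext(sc))

    // Enable runtime code generation, as Lior suggested
    sqlContext.setConf("spark.sql.codegen", "true")

    // Try reducing shuffle partitions from the default of 200, since the
    // aggregated output here is tiny (a guess; needs tuning)
    sqlContext.setConf("spark.sql.shuffle.partitions", "40")

    val withCodegen = sqlContext.sql("""
      select
        count(*) as uses,
        count(distinct cast(id as string)) as users
      from usage_events
      where
        from_unixtime(cast(timestamp_millis/1000 as bigint))
          between '2015-06-09' and '2015-06-16'
    """)
    withCodegen.show()

Given the caveat about codegen being buggy for some queries, I'll run the same query with it switched off as well and compare the results.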
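While I'm at it, I'll also try the DataFrame API per Lior's suggestion. This is the rough translation I have in mind (again untested). Since we look CPU-bound, I'm going to try filtering on the raw long directly rather than calling from_unixtime per row; note that from_unixtime uses the server's timezone, so the millisecond boundaries below may not line up exactly with the original query:

    import java.text.SimpleDateFormat
    import java.util.TimeZone
    import org.apache.spark.sql.functions._

    // Millisecond boundaries for the date range. UTC is an assumption on
    // my part; from_unixtime would have used the server's local timezone.
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    val startMillis = fmt.parse("2015-06-09").getTime
    val endMillis = fmt.parse("2015-06-16").getTime

    val events = sqlContext.table("usage_events")

    val counts = events
      .where(col("timestamp_millis") >= startMillis &&
             col("timestamp_millis") <= endMillis)
      .agg(count(lit(1)).as("uses"),                        // i.e. count(*)
           countDistinct(col("id").cast("string")).as("users"))
    counts.show()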
On 22 June 2015 at 12:37, Matthew Johnson <matt.john...@algomi.com> wrote:

> Hi James,
>
> What version of Spark are you using? In Spark 1.2.2 I had an issue where
> Spark would report a job as complete but I couldn't find my results
> anywhere; I just assumed it was me doing something wrong, as I am still
> quite new to Spark. However, since upgrading to 1.4.0 I have not seen this
> issue, so it might be worth upgrading if you are not already on 1.4.
>
> Cheers,
> Matthew
>
> *From:* Lior Chaga [mailto:lio...@taboola.com]
> *Sent:* 22 June 2015 17:24
> *To:* James Aley
> *Cc:* user
> *Subject:* Re: Help optimising Spark SQL query
>
> Hi James,
>
> There are a few configuration options that you can try:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
>
> From my experience, codegen really boosts things up. Just run
> sqlContext.sql("SET spark.sql.codegen=true") before you execute your
> query. But keep in mind that this is sometimes buggy (depending on your
> query), so compare the results to a run without codegen to be sure.
>
> Also, you can try changing the default number of partitions.
>
> You can also use DataFrames (since 1.3). I'm not sure they are better than
> specifying the query in 1.3, but with Spark 1.4 there should be an
> enormous performance improvement in DataFrames.
>
> Lior
>
> On Mon, Jun 22, 2015 at 6:28 PM, James Aley <james.a...@swiftkey.com>
> wrote:
>
> Hello,
>
> A colleague of mine ran the following Spark SQL query:
>
> select
>   count(*) as uses,
>   count(distinct cast(id as string)) as users
> from usage_events
> where
>   from_unixtime(cast(timestamp_millis/1000 as bigint))
>     between '2015-06-09' and '2015-06-16'
>
> The table contains billions of rows but totals only 64GB of data across
> ~30 separate files, which are stored as Parquet with LZO compression in S3.
>
> Of the referenced columns:
>
> * id is Binary, which we cast to a String so that we can DISTINCT by it.
>   (I was already told this will improve in a later release, in a separate
>   thread.)
> * timestamp_millis is a long containing a Unix timestamp with millisecond
>   resolution.
>
> This took nearly 2 hours to run on a 5-node cluster of r3.xlarge EC2
> instances, using 20 executors, each with 4GB memory. I can see from
> monitoring tools that CPU usage is at 100% on all nodes, but incoming
> network traffic seems quite low at 2.5MB/s, suggesting to me that the job
> is CPU-bound.
>
> Does that seem slow? Can anyone offer any ideas, just from glancing at the
> query, as to why this might be slow? We'll profile it in the meantime and
> post back if we find anything ourselves.
>
> A side issue: I've found that this query, and others, sometimes completes
> but doesn't return any results. There appears to be no error that I can
> see in the logs, and Spark reports the job as successful, but the
> connected JDBC client (SQLWorkbenchJ in this case) just sits there waiting
> forever. I did a quick Google search and couldn't find anyone else having
> similar issues.
>
> Many thanks,
>
> James.