Thanks for the responses, guys! Sorry, I forgot to mention that I'm using Spark 1.3.0, but I'll test with 1.4.0 and try the codegen suggestion, then report back. I've sketched below what I'm planning to run, in case anyone spots a problem with it.
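Here's roughly what I intend to try in the 1.4.0 spark-shell, completely untested so far. I'm using setConf, which I believe is equivalent to the SET statement Lior mentioned, and the shuffle partition count is just a first guess that I'll need to tune for our cluster:

    // Spark 1.4 spark-shell sketch; sqlContext is pre-defined in the shell
    // (otherwise: val sqlContext = new org.apache.spark.sql.SQLContext(sc))

    // Enable runtime code generation, as Lior suggested
    sqlContext.setConf("spark.sql.codegen", "true")

    // Try reducing shuffle partitions from the default of 200, since the
    // aggregated output here is tiny (a guess; needs tuning)
    sqlContext.setConf("spark.sql.shuffle.partitions", "40")

    val withCodegen = sqlContext.sql("""
      select
        count(*) as uses,
        count(distinct cast(id as string)) as users
      from usage_events
      where
        from_unixtime(cast(timestamp_millis/1000 as bigint))
          between '2015-06-09' and '2015-06-16'
    """)
    withCodegen.show()

Given the caveat about codegen being buggy for some queries, I'll run the same query with it switched off as well and compare the results.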
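While I'm at it, I'll also try the DataFrame API per Lior's suggestion. This is the rough translation I have in mind (again untested). Since we look CPU-bound, I'm going to try filtering on the raw long directly rather than calling from_unixtime per row; note that from_unixtime uses the server's timezone, so the millisecond boundaries below may not line up exactly with the original query:

    import java.text.SimpleDateFormat
    import java.util.TimeZone
    import org.apache.spark.sql.functions._

    // Millisecond boundaries for the date range. UTC is an assumption on
    // my part; from_unixtime would have used the server's local timezone.
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    val startMillis = fmt.parse("2015-06-09").getTime
    val endMillis = fmt.parse("2015-06-16").getTime

    val events = sqlContext.table("usage_events")

    val counts = events
      .where(col("timestamp_millis") >= startMillis &&
             col("timestamp_millis") <= endMillis)
      .agg(count(lit(1)).as("uses"),                        // i.e. count(*)
           countDistinct(col("id").cast("string")).as("users"))
    counts.show()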
On 22 June 2015 at 12:37, Matthew Johnson <matt.john...@algomi.com> wrote:

> Hi James,
>
> What version of Spark are you using? In Spark 1.2.2 I had an issue where
> Spark would report a job as complete but I couldn't find my results
> anywhere; I just assumed it was me doing something wrong, as I am still
> quite new to Spark. However, since upgrading to 1.4.0 I have not seen this
> issue, so it might be worth upgrading if you are not already on 1.4.
>
> Cheers,
> Matthew
>
> *From:* Lior Chaga [mailto:lio...@taboola.com]
> *Sent:* 22 June 2015 17:24
> *To:* James Aley
> *Cc:* user
> *Subject:* Re: Help optimising Spark SQL query
>
> Hi James,
>
> There are a few configuration options that you can try:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
>
> From my experience, codegen really boosts things up. Just run
> sqlContext.sql("SET spark.sql.codegen=true") before you execute your
> query. But keep in mind that this is sometimes buggy (depending on your
> query), so compare the results to a run without codegen to be sure.
>
> Also, you can try changing the default number of partitions.
>
> You can also use DataFrames (since 1.3). I'm not sure they are better than
> specifying the query in 1.3, but with Spark 1.4 there should be an
> enormous performance improvement in DataFrames.
>
> Lior
>
> On Mon, Jun 22, 2015 at 6:28 PM, James Aley <james.a...@swiftkey.com>
> wrote:
>
> Hello,
>
> A colleague of mine ran the following Spark SQL query:
>
> select
>   count(*) as uses,
>   count(distinct cast(id as string)) as users
> from usage_events
> where
>   from_unixtime(cast(timestamp_millis/1000 as bigint))
>     between '2015-06-09' and '2015-06-16'
>
> The table contains billions of rows but totals only 64GB of data across
> ~30 separate files, which are stored as Parquet with LZO compression in S3.
>
> Of the referenced columns:
>
> * id is Binary, which we cast to a String so that we can DISTINCT by it.
>   (I was already told this will improve in a later release, in a separate
>   thread.)
> * timestamp_millis is a long containing a Unix timestamp with millisecond
>   resolution.
>
> This took nearly 2 hours to run on a 5-node cluster of r3.xlarge EC2
> instances, using 20 executors, each with 4GB memory. I can see from
> monitoring tools that CPU usage is at 100% on all nodes, but incoming
> network traffic seems quite low at 2.5MB/s, suggesting to me that the job
> is CPU-bound.
>
> Does that seem slow? Can anyone offer any ideas, just from glancing at the
> query, as to why this might be slow? We'll profile it in the meantime and
> post back if we find anything ourselves.
>
> A side issue: I've found that this query, and others, sometimes completes
> but doesn't return any results. There appears to be no error that I can
> see in the logs, and Spark reports the job as successful, but the
> connected JDBC client (SQLWorkbenchJ in this case) just sits there waiting
> forever. I did a quick Google search and couldn't find anyone else having
> similar issues.
>
> Many thanks,
>
> James.