Thanks everybody for the advice on this.
I attached YourKit and found that the CPU time split was about 70% in
Parquet/LZO reading and 30% applying the filter predicate. Those seem like
reasonable things for it to be spending time on, so it may really just be a
case of needing more hardware.
64GB in parquet could be many billions of rows because of the columnar
compression. And count distinct by itself is an expensive operation. This
is not just on Spark, even on Presto/Impala, you would see performance dip
with count distincts. And the cluster is not that powerful either.
Thanks for the suggestions everyone, appreciate the advice.
I tried replacing the DISTINCT with a nested GROUP BY, running on 1.4 instead
of 1.3, replacing the date casts with a between operation on the
corresponding long constants, and changing COUNT(*) to COUNT(1).
None of these seem to make any noticeable difference.
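For anyone trying the same thing, the between-on-long-constants rewrite
mentioned above can be derived like this (a sketch; the column and table
names come from the query quoted later in the thread, and UTC is assumed
for the date boundaries):

```python
import calendar

# Convert the date strings from the original query into epoch-millisecond
# bounds, so the filter can compare the raw timestamp_millis column
# directly instead of calling from_unixtime() on every row.
def date_to_millis(y, m, d):
    return calendar.timegm((y, m, d, 0, 0, 0)) * 1000

lo = date_to_millis(2015, 6, 9)
hi = date_to_millis(2015, 6, 16)
print(lo, hi)  # 1433808000000 1434412800000

# The roughly equivalent filter, with the casts moved off the rows and
# onto the constants:
query = (
    "select count(*) as uses, count(distinct cast(id as string)) as users "
    "from usage_events "
    "where timestamp_millis between %d and %d" % (lo, hi)
)
```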
Hello,
A colleague of mine ran the following Spark SQL query:
select
  count(*) as uses,
  count(distinct cast(id as string)) as users
from usage_events
where
  from_unixtime(cast(timestamp_millis/1000 as bigint))
    between '2015-06-09' and '2015-06-16'
The table contains billions of rows, but this query runs far slower than we
would expect.
Spark 1.4 may have fixed this issue, so it might be worth upgrading if you
are not already on 1.4.
Cheers,
Matthew
*From:* Lior Chaga [mailto:lio...@taboola.com]
*Sent:* 22 June 2015 17:24
*To:* James Aley
*Cc:* user
*Subject:* Re: Help optimising Spark SQL query
Hi James,
There are a few configurations that you can try:
https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
From my experience, the codegen really boosts things up. Just run
sqlContext.sql("SET spark.sql.codegen=true") before you execute your query.
But keep in mind that codegen is still marked experimental in these versions.
Have you tested this on a smaller set to verify that the query is correct?
On Mon, Jun 22, 2015 at 2:59 PM, ayan guha guha.a...@gmail.com wrote:
You may also want to change count(*) to a specific column.
On 23 Jun 2015 01:29, James Aley james.a...@swiftkey.com wrote:
Hi James,
Maybe it's the DISTINCT causing the issue.
I rewrote the query as follows. Maybe this one can finish faster.
select
  sum(cnt) as uses,
  count(id) as users
from (
  select
    cast(id as string) as id,
    count(*) as cnt
  from usage_events
  where
    from_unixtime(cast(timestamp_millis/1000 as bigint))
      between '2015-06-09' and '2015-06-16'
  group by cast(id as string)
) t
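The equivalence this rewrite relies on can be checked on a tiny example in
plain Python (hypothetical data, just to show the algebra): summing the
per-id counts recovers the total row count, and the number of groups is the
distinct-id count.

```python
from collections import Counter

events = ["a", "b", "a", "c", "b", "a"]

# Original query: COUNT(*) and COUNT(DISTINCT id) directly.
uses = len(events)
users = len(set(events))

# Rewritten query: GROUP BY id first, then aggregate the groups.
per_id = Counter(events)          # inner query: id -> cnt
uses2 = sum(per_id.values())      # sum(cnt)
users2 = len(per_id)              # count(id)

assert (uses, users) == (uses2, users2)
print(uses, users)  # 6 3
```

The point of the rewrite is that the GROUP BY can be done with partial
aggregation on each node, rather than collecting all distinct keys in one
place.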
Generally (this is not Spark SQL specific) you should not cast in the WHERE
part of a SQL query. It is also not necessary in your case. Getting rid of
the casts in the whole query will also be beneficial.
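The advice above can be illustrated in plain Python (a hypothetical sketch,
not Spark code): converting every row before comparing does per-row work,
while casting the two constants once lets you compare the raw long values
directly, and both filters select the same rows.

```python
from datetime import datetime, timezone

def ts_to_date_str(millis):
    # Per-row conversion, analogous to from_unixtime(cast(... as bigint)).
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).strftime("%Y-%m-%d")

rows = [1433800000000, 1433900000000, 1434500000000]  # made-up timestamps

# Cast inside the filter: every row is converted before the comparison.
a = [r for r in rows if "2015-06-09" <= ts_to_date_str(r) <= "2015-06-16"]

# Cast the constants once instead, then compare raw longs
# (end bound is the start of the next day, exclusive).
lo = int(datetime(2015, 6, 9, tzinfo=timezone.utc).timestamp() * 1000)
hi = int(datetime(2015, 6, 17, tzinfo=timezone.utc).timestamp() * 1000)
b = [r for r in rows if lo <= r < hi]

assert a == b
```

Keeping the filter on the raw column also gives the execution engine a
chance to use the column's min/max statistics, which a per-row function
call defeats.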
On Mon, 22 Jun 2015 at 17:29, James Aley james.a...@swiftkey.com wrote: