Hi Scott,

Thanks for your email.  Your inputs are certainly useful.

On 22/7/2015 9:54 PM, Scott Ribe wrote:
On Jul 21, 2015, at 8:30 AM, Mohit Sindhwani <mo_m...@onghu.com> wrote:
select count(*), count(distinct group_id) from
(select group_id, recorded_on from data_store_v2 order by id DESC limit 900000) 
td
where recorded_on >= (now()  - '24 hr'::INTERVAL);
Some suggestions in a slightly different direction:

1) rewrite without that completely unnecessary subquery, and see if the query 
time improves ;-)

We have tried this and the query is quite a bit slower. Filtering to the last 900k records before doing the recorded_on part helped speed it up.

2) check that appropriate indexes exist and are being used by the optimizer

3) really analyze the query execution plan, and look for more advanced 
opportunities, for instance, order by recorded_on instead of id, since, presumably, 
the query will already access the rows by an index on ordered_on; consider dropping 
the limit & order altogether; take that query to a PostgreSQL list...

Your email got me going back to look at all the parts again since obviously the query should be using the index and it was still slow. Further search last night made me realize that it's not the indexes that are a problem. The problem is the count(distinct group_id) part which seems to be quite slow in PostgreSQL. This is a lot faster:
select count(*) from
(select distinct group_id from
data_store_v2 where recorded_on >= '2015-06-06') td;
than:
select count(distinct group_id) from
data_store_v2 where recorded_on >= '2015-06-06';

as explained here: https://www.periscope.io/blog/use-subqueries-to-count-distinct-50x-faster.html

So, I guess the real problem was being masked by something else and an incorrect assumption on my part :)

4) put some effort into learning SQL better; OK, we all make mistakes sometimes 
and maybe this is just that; but it sure looks to me like someone who doesn't 
really understand SQL struggling at throwing together various clauses until the 
correct answer pops out (note that in addition to the issue I pointed out in 1, 
the subquery is selecting a column which is completely unused--probably doesn't 
affect anything, but just another sign that the person writing the query did 
not understand it).

Thanks for the analysis :D
I do understand SQL and I thought I'm not throwing things together... for my understanding, which column was unnecessary? I thought we needed all:
> group_id for counting the distinct group_id
> recorded_on for the subsequent query on it
> id only for getting the most recent records

Most of my personal work is with SQLite3 on embedded platforms, but this discussion resolves the problem for now. It now moves away from being a Rails issue to being a PostgreSQL issue.

Best Regards,
Mohit.

--
You received this message because you are subscribed to the Google Groups "Ruby on 
Rails: Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to rubyonrails-talk+unsubscr...@googlegroups.com.
To post to this group, send email to rubyonrails-talk@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/rubyonrails-talk/55B05B1B.9010904%40onghu.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to