Re: [WellRailed] Removing duplicates in query

Will Bryant Wed, 25 Aug 2010 06:38:48 -0700

On Wed, Aug 25, 2010 at 4:02 PM, Richard Hulse
<[email protected]> wrote:
> Based on what I thought I understood about group I am not sure why
> that should work, but hey, it does!


>From that I would infer that you're using mysql?  It doesn't require
that you to specify all your primitive fields in the group by.  In
other words, it lets you do something invalid, which will work ok in
this case, but only on mysql, and with unpredictable results if you
ever change the query to return any of the fields in the joined row
(since it is somewhat random which would be selected).  So fine in
this case, but not really correct.

Because you don't actually want any rows from the joined table, you
needn't use a cartesian join.  These days I would suggest writing your
query using an EXISTS condition, which has the side benefit of being
executed (even by mysql) explicitly as an index probe rather than a
row seek.  If you don't select any columns from the joined table, then
the database should have been doing an index lookup without row seek
anyway, but it doesn't hurt to be sure, and using EXISTS means it will
stop as soon as it finds any matching row in the second table, rather
than looking for all matching rows (which would then be ignored).

For example, I would normally recommend that:

SELECT DISTINCT('news.*') FROM `news` INNER JOIN
`news_categories` ON `news_categories`.`news_id` = `news`.`id` WHERE
(news_categories.id != 5) AND (news_categories.id != 2) ORDER BY
sort_order desc LIMIT 100

should instead be written as something like:

SELECT * FROM `news` WHERE EXISTS (SELECT 1 FROM `news_categories`
WHERE `news_categories`.`news_id` = `news`.`id` AND
(news_categories.id != 5) AND (news_categories.id != 2)) ORDER BY
sort_order desc LIMIT 100

With mysql it is however always advisable to check that your query
will run the way that you expect, so do an explain on a populated
table, and check that it uses indexes (and the right indexes).  There
are cases where sort of rewritten query would perform worse, if it
does not realise that it can walk down indices and instead executes
the dependent subquery poorly, but they're rare - I don't remember
ever having problems in practice.  Sometimes mysql doesn't do the
dependent subquery as well as I want, but I don't think I've seen it
do it worse than the original join was.  Check though if your tables
are big, or particularly if you're using an older version of mysql
which are awful at subqueries.

To answer your question more directly though, I think you are probably
looking for the :uniq option.  I haven't tested to see where this is
available on arel-based queries, but it does what you want on
ActiveRecord 2.x's associations - return only one of any given record
even if it appeared multiple times in the query resultset.  (Kind of a
hack though in this case, the ORM layer shouldn't have to throw away
excess query results.)

-- 
You received this message because you are subscribed to the Google Groups 
"WellRailed" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/wellrailed?hl=en.

Re: [WellRailed] Removing duplicates in query

Reply via email to