On Saturday, 8 November, 2014 06:56, Tristan Van Berkom 
<tris...@upstairslabs.com> said:

>On Sat, 2014-11-08 at 06:23 -0700, Keith Medcalf wrote:
>> How about the direct approach:
>>
>> SELECT uid
>>   FROM resource
>>  WHERE uid NOT IN (SELECT resource_uid
>>                      FROM event_participant, event
>>                     WHERE event_participant.event_uid = event.uid
>>                       AND event.shift_uid = :shift_uid
>>                       AND event.date = :event_date)
>>
>> or perhaps
>>
>> SELECT uid
>>   FROM resource
>>  WHERE NOT EXISTS (SELECT 1
>>                      FROM event_participant, event
>>                     WHERE event_participant.event_uid = event.uid
>>                       AND resource_uid = resource.uid
>>                       AND event.shift_uid = :shift_uid
>>                       AND event.date = :event_date)
>>
>> Is not the "right way to do it" the one that obtains the result
>required, not the one that uses a "checkbox" implementation?
>
>I've presented 2 queries which both obtain the required result, and I
>would have to say both of them have varying levels of correctness.
>
>If obtaining the correct result was all that was required, then I could
>simply issue multiple queries and balloon my memory with result UIDs
>in between each of them, all for the sake of better readability (but
>then, I'm not sure why I would need a nice tool like SQLite to do it
>in the first place).
>
>So I would have to say, the "right way to do it" is the most efficient
>way, the one which provides SQLite with the best indications of how
>to plot an efficient query plan.
>
>The heuristics of the table is generally that resources are rather
>finite, while events grow over time (possibly 10 events each with
>10 participants for 3 shifts per day, meaning after a year that
>table will be large, and we'll still want to find the available
>resources hopefully in under 30ms, ideally 10ms).
>
>So yeah, speed at query time (not insert/update/delete time), is very
>important.

While either of the above are direct translations, if you wish to minimize 
materialization of intermediate data then the second would be preferable since 
it uses only inner joins.  However, the first may be far more efficient 
depending on the shape of your data and the number of rows, how much memory is 
available for SQLite to use (Bytes, KBytes, MBytes), and the speed of your I/O 
devices.

Given that selecting all the resources based on your stated heuristics of 
growth, having proper indexes, and the constraints given in the query, I would 
say that the first NOT IN will be more efficient since it is only building a 
list of a (couple) hundred or so exclusions, and NOT IN is very efficient.  The 
second form will use a few thousand bytes less memory but perform significantly 
more I/O.

Theoretically, of course, the execution plan of a query which obtains a 
specified result should not depend on the phrasing of the query.  They should 
all result in the same execution plan and the same result.  However, this is 
not true in practice because of limitations in optimization.  Some optimizers 
need a bit of help in figuring out how to generate an optimum plan, and some 
others will generate overly complicated and inefficient plans simply because 
they prefer to use newly added or nifty features only available in that 
particular database engine rather than a simpler, faster, more direct solution. 
 

I don't know why an outer join would be preferable in any case -- you are going 
to use extra CPU and I/O to perform a join on which a significant proportion of 
the results will be discarded -- or depend on the optimizer to notice this and 
"optimize" away the outer join thus ending up with a "more direct" phrasing 
which could have been written in the first place (and is far more 
understandable for future maintainers).

Many optimizers can only optimize within specific constraints and specifying a 
cute complicated query will result in the most efficient cute and complicated 
solution it can muster (ie, a local optimum, not a global optimum).

Your statement 1 implements NOT IN indirectly by doing a LEFT JOIN then 
discarding the intersection set.  This requires an extra join and discard over 
a more direct NOT IN phrasing.  It is also more difficult to understand.  It 
could be optimized into the NOT IN or NOT EXISTS form by the optimizer if you 
are lucky (presently it will not).

Your statement 2 implement the correlated subquery NOT EXISTS by an indirect 
expression, again using a LEFT JOIN then discarding the intersection results.  
Same comments regarding extra operations and optimizations that cannot 
(presently) be done apply.

If you prefer the ON syntax, then you may of course use it (though, except in 
the case of a LEFT OUTER JOIN, it is merely syntactic sugar for a list of 
tables to join and where conditions.

SELECT uid
  FROM resource
 WHERE uid NOT IN (SELECT resource_uid
                     FROM event_participant 
                     JOIN event
                       ON event_participant.event_uid = event.uid
                    WHERE event.shift_uid = :shift_uid
                      AND event.date = :event_date)

SELECT uid
  FROM resource
 WHERE NOT EXISTS (SELECT 1
                     FROM event_participant
                     JOIN event
                       ON event_participant.event_uid = event.uid
                    WHERE resource_uid = resource.uid
                      AND event.shift_uid = :shift_uid
                      AND event.date = :event_date)




_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to