Hi,
   I don't have many years of experience with SQL, and I've
cooked up some fairly complex queries which will run in
production environments. I just want to confirm with you that
the assumptions I've made are true (I do have a lot of unit
tests which confirm that my code works, as far as I can see).

Some minimum context is required first:

  o My software is a C library which stores an addressbook
    of vCards

  o Most of the searchable contact fields are stored in
    one table, with a PRIMARY KEY column 'uid'

  o As the vCard spec allows many values to be listed for
    a given key, for instance "TEL" or "EMAIL", these are
    stored on separate tables for the purpose of quick
    searches (one separate table per multi-valued datatype;
    see the schema sketch after this list).

    Each of these tables has a 'uid' column which is
    created with:
      'uid TEXT NOT NULL REFERENCES the_main_table (uid)'

    So in the table which stores emails, I have one row
    per email address for each contact (multiple rows can
    exist with the same 'uid').

    To speed up searches, I have indexes created on some
    of these auxiliary tables.

  o I need to honor a predetermined contact search API,
    searches can come in for various contact fields at
    a time, for instance:

      "Give me all the contacts who's family name starts
       with J, and only if they have an email address
       ending with .com"

    which is represented as:
        (and (beginswith "family_name" "J")
             (endswith "email" ".com"))
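
For reference, here is a minimal sketch of the schema I am
describing (only the columns used in the queries further down
are shown; the real main table carries more searchable fields,
and the phone list table looks just like the email list one):

    CREATE TABLE 'folder_id' (
        uid         TEXT PRIMARY KEY NOT NULL,
        vcard       TEXT,    -- the full vCard data
        family_name TEXT     -- one of the searchable fields
    );

    CREATE TABLE 'folder_id_email_list' (
        uid   TEXT NOT NULL REFERENCES folder_id (uid),
        value TEXT           -- one row per email address
    );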

Now the fun starts: I've recently written a query optimizer
of sorts, which reorganizes how I will form my query before
generating the SQL to execute.

So let's consider the query input:

   (and (endswith "tel" "0505") (beginswith "email" "eddie"))

Note that in this case, the three tables in context are:
   o folder_id              The main table with unique 'uid'
   o folder_id_phone_list   The table containing all phone numbers
   o folder_id_email_list   The table containing all email addresses

Before the query optimization, my query would be:
================================================
SELECT DISTINCT summary.uid, summary.vcard FROM 'folder_id' AS summary
LEFT OUTER JOIN 'folder_id_phone_list' AS phone_list 
    ON phone_list.uid = summary.uid
LEFT OUTER JOIN 'folder_id_email_list' AS email_list 
    ON email_list.uid = summary.uid
WHERE (phone_list.value IS NOT NULL AND phone_list.value LIKE '%0505')
  AND (email_list.value IS NOT NULL AND email_list.value LIKE 'eddie%')

After optimization, the query will be generated like this instead:
================================================
SELECT DISTINCT summary.uid, summary.vcard FROM (
 SELECT DISTINCT phone_list.uid FROM 'folder_id_phone_list' 
 AS phone_list
 WHERE (phone_list.value IS NOT NULL AND 
        phone_list.value LIKE '%0505') 
) AS phone_results
LEFT OUTER JOIN (
 SELECT DISTINCT email_list.uid FROM 'folder_id_email_list'
 AS email_list
 WHERE (email_list.value IS NOT NULL AND
       email_list.value LIKE 'eddie%') 
) AS email_results ON email_results.uid = phone_results.uid 
LEFT OUTER JOIN 'folder_id' AS summary 
  ON summary.uid = email_results.uid WHERE summary.uid IS NOT NULL

Performing the nested selects achieves two things:

   o Reduce the number of rows considered in the JOIN statements

   o Leverage the index which I've created on 'folder_id_email_list'
     (I am using case-insensitive LIKE statements, so the index
      can be used for that constraint; see the sketch below).
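
For reference, the index on the email table is created along
these lines (a rough sketch, exact index name aside; my
assumption is that indexing 'value' with the NOCASE collation
is what allows the default case-insensitive LIKE to use the
index for prefix patterns like 'eddie%'):

    -- NOCASE so that the (case-insensitive) LIKE optimization
    -- can use this index for patterns with a literal prefix.
    CREATE INDEX IF NOT EXISTS folder_id_email_list_value_index
        ON 'folder_id_email_list' (value COLLATE NOCASE);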

The main assumption that I'm making is that the toplevel
logical AND statements (i.e. <email constraint> AND <phone
constraint>) can be transformed into nested SELECTs, with a
LEFT OUTER JOIN each time "ON this_select.uid =
previous_select.uid", effectively filtering out any rows
which fail either of the constraints.
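
In other words, my mental picture is that the LEFT OUTER JOIN
chain, together with the final "summary.uid IS NOT NULL" check,
behaves just like a plain inner join between the filtered uid
sets, roughly as below (the IS NOT NULL checks on 'value' are
omitted for brevity). That equivalence is exactly what I would
like to have confirmed:

    SELECT DISTINCT summary.uid, summary.vcard
    FROM 'folder_id' AS summary
    JOIN (SELECT DISTINCT uid FROM 'folder_id_phone_list'
          WHERE value LIKE '%0505') AS phone_results
      ON phone_results.uid = summary.uid
    JOIN (SELECT DISTINCT uid FROM 'folder_id_email_list'
          WHERE value LIKE 'eddie%') AS email_results
      ON email_results.uid = summary.uid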

As far as I can tell, and I've been trying to absorb these
JOIN concepts to the best of my ability, the above two
statements will always report the same results, i.e. they
are equivalent for all practical purposes, only the second
one is much faster than the first.

And while my test cases tell me so far that I am correct
in my assumptions, I'd really like to have some insight
from a third party authority on the matter.

Can someone please either confirm that my logic is sound in
this assumption, or point out some case where it might break
down?

Please excuse the long, detailed email; I don't think I could
have explained this accurately without providing enough
context.

Kind Regards,

    -Tristan

