Hi, I don't have many years experience with the SQL language and I've cooked up some pretty complex stuff which will run in production environments, I just want to confirm with you that the assumptions I've made are true (I do have a lot of unit tests which confirm that my code works as far as I can see).
Some minimum context is required first: o My software is a C library which stores an addressbook of vCards o Most of the searchable contact fields are stored in one table, with a PRIMARY KEY column 'uid' o As the vCard spec specifies many values can be listed for a given key, for instance "TEL" or "EMAIL", these are stored for the purpose of quick searches on a separate table (one separate table per mult-valued datatype). Each of these tables has a 'uid' column which is created with: 'uid TEXT NOT NULL REFERENCES the_main_table (uid)' So for the table which stores emails, I have one row per email address for each contact (multiple rows can exist with the same 'uid'); To speed up searches, I have indexes created on some of these auxiliary tables. o I need to honor a predetermined contact search API, searches can come in for various contact fields at a time, for instance: "Give me all the contacts who's family name starts with J, and only if they have an email address ending with .com" Is represented as: (and (beginswith "family_name" "J") (endswith "email" ".com")) Now the fun starts, I've recently written a query optimizer of sorts which reorganizes how I will form my query before generating the SQL to execute. So let's consider the query input: (and (endswith "tel" "0505") (beginswith "email" "eddie")) Note that in this case, the three tables in context are: o folder_id The main table with unique 'uid' o folder_id_phone_list The table containing all phone numbers o folder_id_email_list The table containing all phone numbers Before the query optimization, my query would be: ================================================ SELECT DISTINCT summary.uid, summary.vcard FROM 'folder_id' AS summary LEFT OUTER JOIN 'folder_id_phone_list' AS phone_list ON phone_list.uid = summary.uid LEFT OUTER JOIN 'folder_id_email_list' AS email_list ON email_list.uid = summary.uid WHERE (phone_list.value IS NOT NULL AND phone_list.value LIKE '%0505') AND (email_list.value IS NOT NULL AND email_list.value LIKE 'eddie%') After optimization, will be generated instead like so: ================================================ SELECT DISTINCT summary.uid, summary.vcard FROM ( SELECT DISTINCT phone_list.uid FROM 'folder_id_phone_list' AS phone_list WHERE (phone_list.value IS NOT NULL AND phone_list.value LIKE '%0505') ) AS phone_results LEFT OUTER JOIN ( SELECT DISTINCT email_list.uid FROM 'folder_id_email_list' AS email_list WHERE (email_list.value IS NOT NULL AND email_list.value LIKE 'eddie%') ) AS email_list_results ON email_results.uid = phone_results.uid LEFT OUTER JOIN 'folder_id' AS summary ON summary.uid = email_results.uid WHERE summary.uid IS NOT NULL Performing the nested selects achieves two things: o Reduce the number of rows considered in the JOIN statements o Leverage the index which I've created on 'folder_id_email_list' (I am using case insensitive LIKE statements so the indexes work in that statement). The main assumption that I'm making, is the toplevel logical AND statements (i.e. <email constraint> AND <phone constraint>), can be transformed into a nested SELECT with the LEFT OUTER JOIN each time "ON this_select.uid = previous_select.uid", effectively filtering out any rows which didn't match either of the constraints. As far as I can tell, and I've been trying to absorb these JOIN concepts to the best of my ability... the above two statements will always report the same results, i.e. they are equal statements for all practical purposes, only the second one is much faster than the first one. And while my test cases tell me so far that I am correct in my assumptions, I'd really like to have some insight from a third party authority on the matter. Can someone please either confirm that my logic is sound in this assumption ? Or point out to me perhaps some case where my logic might break down ? Please excuse the long detailed email, I don't think I could have explained this accurately without providing enough context. Kind Regards, -Tristan _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users