Re: [sqlite] why is count of rows in LEFT JOIN more than the rows inthe left table
On Feb 20, 2012, at 12:46 PM, Jay A. Kreibich wrote: > On Mon, Feb 20, 2012 at 05:01:17PM +, Simon Slavin scratched on the wall: >> >> On 20 Feb 2012, at 4:11pm, Puneet Kishorwrote: >> >>> is there are way to force the query to return only the "first" match >>> so the resulting query has only as many rows as the left table? >> >> You have two problems: >> >> A) You are JOINing two columns which contain some null values. null is >> interpreted as 'I don't know', so the joins tend to match many >> different rows. > > No, that's not how it works. If you have a join condition like > (A.col == B.col), and either A.col or B.col (or both) is NULL, then > the result of the comparison itself is NULL ("Are these values > equal?" "I don't know."). JOIN operation will only match rows where > the join condition is TRUE. NULL is not "good enough" to match rows. > > Think about NATURAL JOINs where the two columns are reduced to only > one column. This reduction can only take place if the values are > exactly the same. > >> B) You are using a LEFT JOIN which is explicitly defined to include >> many matches, not just one match. Instead you might try NATURAL JOIN >> or CROSS JOIN to see what difference they make. > > That's not the issue. Or rather, it is, but that isn't going to show > us anything we don't already know. NATURAL JOIN should return the > exact same results as example #5: 329743 (assuming "col" is the only > column name shared between tables), except it will merge the two > "col" columns into a single column. > > A CROSS JOIN has no join condition and is going to return > 2,066,471,848 rows. That tells us nothing. > >> And you have a meta-problem too: the dataset you're experimenting with >> is so big you can't see why it's doing what it's doing. > > Databases are designed to work with sets. If you want to debug the > issue, you need to think in sets as well. Use the database to answer > your questions, rather than just scanning the results and doing it in > your head. > > The statement... > >SELECT COUNT(*) as ct, col FROM a GROUP BY 2 HAVING ct > 1 > > ...will tell you which rows have duplicate columns. You can run this > on either table. > > If you want to have unique values in the JOIN, the best approach is > to put a unique index across each "col" column, so that the columns > have unique values. > > Failing that, replace the table reference in example #6 with > something like "(SELECT * FROM a GROUP BY col) as a". That's going > to give you random rows (there is no concept of a "first" row), but > it will get you the row count you're looking for. Thanks Jay... brilliant explanation. I ended up finding the dupes in table "b" and re-encoding the column so it had only unique values. You are right, "first" doesn't mean much in this case... I was referring to "the first matching row, and to hell with the subsequent matches." Fortunately, the problem was not too laborious to correct manually. > > You might need to do it with both tables, depending on which table(s) > has the duplicate "col" values. Yup, fortunately I know that; it was table "b". Table "a" has geographic features, and neither do I want to "create new" nor "lose existing" features. Table "b" has attributes to categorize features in table "a". Cleaning up table "b" cleared the problem. -- Puneet Kishor ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] why is count of rows in LEFT JOIN more than the rows inthe left table
On Mon, Feb 20, 2012 at 05:01:17PM +, Simon Slavin scratched on the wall: > > On 20 Feb 2012, at 4:11pm, Puneet Kishorwrote: > > > is there are way to force the query to return only the "first" match > > so the resulting query has only as many rows as the left table? > > You have two problems: > > A) You are JOINing two columns which contain some null values. null is > interpreted as 'I don't know', so the joins tend to match many > different rows. No, that's not how it works. If you have a join condition like (A.col == B.col), and either A.col or B.col (or both) is NULL, then the result of the comparison itself is NULL ("Are these values equal?" "I don't know."). JOIN operation will only match rows where the join condition is TRUE. NULL is not "good enough" to match rows. Think about NATURAL JOINs where the two columns are reduced to only one column. This reduction can only take place if the values are exactly the same. > B) You are using a LEFT JOIN which is explicitly defined to include > many matches, not just one match. Instead you might try NATURAL JOIN > or CROSS JOIN to see what difference they make. That's not the issue. Or rather, it is, but that isn't going to show us anything we don't already know. NATURAL JOIN should return the exact same results as example #5: 329743 (assuming "col" is the only column name shared between tables), except it will merge the two "col" columns into a single column. A CROSS JOIN has no join condition and is going to return 2,066,471,848 rows. That tells us nothing. > And you have a meta-problem too: the dataset you're experimenting with > is so big you can't see why it's doing what it's doing. Databases are designed to work with sets. If you want to debug the issue, you need to think in sets as well. Use the database to answer your questions, rather than just scanning the results and doing it in your head. The statement... SELECT COUNT(*) as ct, col FROM a GROUP BY 2 HAVING ct > 1 ...will tell you which rows have duplicate columns. You can run this on either table. If you want to have unique values in the JOIN, the best approach is to put a unique index across each "col" column, so that the columns have unique values. Failing that, replace the table reference in example #6 with something like "(SELECT * FROM a GROUP BY col) as a". That's going to give you random rows (there is no concept of a "first" row), but it will get you the row count you're looking for. You might need to do it with both tables, depending on which table(s) has the duplicate "col" values. -j -- Jay A. Kreibich < J A Y @ K R E I B I.C H > "Intelligence is like underwear: it is important that you have it, but showing it to the wrong people has the tendency to make them feel uncomfortable." -- Angela Johnson ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] why is count of rows in LEFT JOIN more than the rows inthe left table
On 20 Feb 2012, at 4:11pm, Puneet Kishorwrote: > is there are way to force the query to return only the "first" match so the > resulting query has only as many rows as the left table? You have two problems: A) You are JOINing two columns which contain some null values. null is interpreted as 'I don't know', so the joins tend to match many different rows. B) You are using a LEFT JOIN which is explicitly defined to include many matches, not just one match. Instead you might try NATURAL JOIN or CROSS JOIN to see what difference they make. And you have a meta-problem too: the dataset you're experimenting with is so big you can't see why it's doing what it's doing. It might be useful for you to make up a little test database with perhaps 5 rows in each table, just so you can look through every row and see what's going on. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] why is count of rows in LEFT JOIN more than the rows inthe left table
On Feb 20, 2012, at 9:34 AM, Igor Tandetnik wrote: > Puneet Kishorwrote: >> But, why is #5 329743 and not 329686? Where are the extra 57 rows coming >> from? I thought a LEFT JOIN was supposed to include >> *all* the rows from the left table with NULLs for the columns of the right >> table where there was no match. So, at most, the >> resulting table would have as many rows as the rows in the left table. >> Certainly not more. > > The conclusion doesn't follow from the premise. Yes, a row from the left > table is included in the resultset even if there are no matching rows in the > right table. Which means that the resultset would contain *at least* as many > rows as there are in the left table - but it certainly may contain more. For > each row in the left table that has N matching rows in the right table (with > N > 0), there will be N rows in the resultset. Thanks for the great explanation. Now, the follow up question -- is there are way to force the query to return only the "first" match so the resulting query has only as many rows as the left table? -- Puneet Kishor ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] why is count of rows in LEFT JOIN more than the rows inthe left table
Puneet Kishorwrote: > But, why is #5 329743 and not 329686? Where are the extra 57 rows coming > from? I thought a LEFT JOIN was supposed to include > *all* the rows from the left table with NULLs for the columns of the right > table where there was no match. So, at most, the > resulting table would have as many rows as the rows in the left table. > Certainly not more. The conclusion doesn't follow from the premise. Yes, a row from the left table is included in the resultset even if there are no matching rows in the right table. Which means that the resultset would contain *at least* as many rows as there are in the left table - but it certainly may contain more. For each row in the left table that has N matching rows in the right table (with N > 0), there will be N rows in the resultset. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users