Re: [sqlite] why is count of rows in LEFT JOIN more than the rows inthe left table

2012-02-20 Thread Puneet Kishor

On Feb 20, 2012, at 12:46 PM, Jay A. Kreibich wrote:

> On Mon, Feb 20, 2012 at 05:01:17PM +, Simon Slavin scratched on the wall:
>> 
>> On 20 Feb 2012, at 4:11pm, Puneet Kishor  wrote:
>> 
>>> is there are way to force the query to return only the "first" match
>>> so the resulting query has only as many rows as the left table?
>> 
>> You have two problems:
>> 
>> A) You are JOINing two columns which contain some null values.  null is
>> interpreted as 'I don't know', so the joins tend to match many
>> different rows.
> 
>  No, that's not how it works.  If you have a join condition like
>  (A.col == B.col), and either A.col or B.col (or both) is NULL, then
>  the result of the comparison itself is NULL ("Are these values
>  equal?"  "I don't know.").  JOIN operation will only match rows where
>  the join condition is TRUE.  NULL is not "good enough" to match rows.
> 
>  Think about NATURAL JOINs where the two columns are reduced to only
>  one column.  This reduction can only take place if the values are
>  exactly the same.
> 
>> B) You are using a LEFT JOIN which is explicitly defined to include
>> many matches, not just one match.  Instead you might try NATURAL JOIN
>> or CROSS JOIN to see what difference they make.
> 
>  That's not the issue.  Or rather, it is, but that isn't going to show
>  us anything we don't already know.  NATURAL JOIN should return the
>  exact same results as example #5: 329743  (assuming "col" is the only
>  column name shared between tables), except it will merge the two
>  "col" columns into a single column.
> 
>  A CROSS JOIN has no join condition and is going to return
>  2,066,471,848 rows.  That tells us nothing.
> 
>> And you have a meta-problem too: the dataset you're experimenting with
>> is so big you can't see why it's doing what it's doing.
> 
>  Databases are designed to work with sets.  If you want to debug the
>  issue, you need to think in sets as well.  Use the database to answer
>  your questions, rather than just scanning the results and doing it in
>  your head.
> 
>  The statement...
> 
>SELECT COUNT(*) as ct, col FROM a GROUP BY 2 HAVING ct > 1
> 
>  ...will tell you which rows have duplicate columns.  You can run this
>  on either table.
> 
>  If you want to have unique values in the JOIN, the best approach is
>  to put a unique index across each "col" column, so that the columns
>  have unique values.
> 
>  Failing that, replace the table reference in example #6 with
>  something like "(SELECT * FROM a GROUP BY col) as a".  That's going
>  to give you random rows (there is no concept of a "first" row), but
>  it will get you the row count you're looking for.


Thanks Jay... brilliant explanation. I ended up finding the dupes in table "b" 
and re-encoding the column so it had only unique values. You are right, "first" 
doesn't mean much in this case... I was referring to "the first matching row, 
and to hell with the subsequent matches." Fortunately, the problem was not too 
laborious to correct manually.


> 
>  You might need to do it with both tables, depending on which table(s)
>  has the duplicate "col" values.

Yup, fortunately I know that; it was table "b". Table "a" has geographic 
features, and neither do I want to "create new" nor "lose existing" features. 
Table "b" has attributes to categorize features in table "a". Cleaning up table 
"b" cleared the problem.



--
Puneet Kishor

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] why is count of rows in LEFT JOIN more than the rows inthe left table

2012-02-20 Thread Jay A. Kreibich
On Mon, Feb 20, 2012 at 05:01:17PM +, Simon Slavin scratched on the wall:
> 
> On 20 Feb 2012, at 4:11pm, Puneet Kishor  wrote:
> 
> > is there are way to force the query to return only the "first" match
> > so the resulting query has only as many rows as the left table?
> 
> You have two problems:
> 
> A) You are JOINing two columns which contain some null values.  null is
> interpreted as 'I don't know', so the joins tend to match many
> different rows.

  No, that's not how it works.  If you have a join condition like
  (A.col == B.col), and either A.col or B.col (or both) is NULL, then
  the result of the comparison itself is NULL ("Are these values
  equal?"  "I don't know.").  JOIN operation will only match rows where
  the join condition is TRUE.  NULL is not "good enough" to match rows.

  Think about NATURAL JOINs where the two columns are reduced to only
  one column.  This reduction can only take place if the values are
  exactly the same.

> B) You are using a LEFT JOIN which is explicitly defined to include
> many matches, not just one match.  Instead you might try NATURAL JOIN
> or CROSS JOIN to see what difference they make.

  That's not the issue.  Or rather, it is, but that isn't going to show
  us anything we don't already know.  NATURAL JOIN should return the
  exact same results as example #5: 329743  (assuming "col" is the only
  column name shared between tables), except it will merge the two
  "col" columns into a single column.

  A CROSS JOIN has no join condition and is going to return
  2,066,471,848 rows.  That tells us nothing.

> And you have a meta-problem too: the dataset you're experimenting with
> is so big you can't see why it's doing what it's doing.

  Databases are designed to work with sets.  If you want to debug the
  issue, you need to think in sets as well.  Use the database to answer
  your questions, rather than just scanning the results and doing it in
  your head.

  The statement...

SELECT COUNT(*) as ct, col FROM a GROUP BY 2 HAVING ct > 1

  ...will tell you which rows have duplicate columns.  You can run this
  on either table.

  If you want to have unique values in the JOIN, the best approach is
  to put a unique index across each "col" column, so that the columns
  have unique values.

  Failing that, replace the table reference in example #6 with
  something like "(SELECT * FROM a GROUP BY col) as a".  That's going
  to give you random rows (there is no concept of a "first" row), but
  it will get you the row count you're looking for.

  You might need to do it with both tables, depending on which table(s)
  has the duplicate "col" values.


   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
 but showing it to the wrong people has the tendency to make them
 feel uncomfortable." -- Angela Johnson
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] why is count of rows in LEFT JOIN more than the rows inthe left table

2012-02-20 Thread Simon Slavin

On 20 Feb 2012, at 4:11pm, Puneet Kishor  wrote:

> is there are way to force the query to return only the "first" match so the 
> resulting query has only as many rows as the left table?

You have two problems:

A) You are JOINing two columns which contain some null values.  null is 
interpreted as 'I don't know', so the joins tend to match many different rows.

B) You are using a LEFT JOIN which is explicitly defined to include many 
matches, not just one match.  Instead you might try NATURAL JOIN or CROSS JOIN 
to see what difference they make.

And you have a meta-problem too: the dataset you're experimenting with is so 
big you can't see why it's doing what it's doing.  It might be useful for you 
to make up a little test database with perhaps 5 rows in each table, just so 
you can look through every row and see what's going on.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] why is count of rows in LEFT JOIN more than the rows inthe left table

2012-02-20 Thread Puneet Kishor

On Feb 20, 2012, at 9:34 AM, Igor Tandetnik wrote:

> Puneet Kishor  wrote:
>> But, why is #5 329743 and not 329686? Where are the extra 57 rows coming 
>> from? I thought a LEFT JOIN was supposed to include
>> *all* the rows from the left table with NULLs for the columns of the right 
>> table where there was no match. So, at most, the
>> resulting table would have as many rows as the rows in the left table. 
>> Certainly not more.  
> 
> The conclusion doesn't follow from the premise. Yes, a row from the left 
> table is included in the resultset even if there are no matching rows in the 
> right table. Which means that the resultset would contain *at least* as many 
> rows as there are in the left table - but it certainly may contain more. For 
> each row in the left table that has N matching rows in the right table (with 
> N > 0), there will be N rows in the resultset.

Thanks for the great explanation. Now, the follow up question --

is there are way to force the query to return only the "first" match so the 
resulting query has only as many rows as the left table?



--
Puneet Kishor

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] why is count of rows in LEFT JOIN more than the rows inthe left table

2012-02-20 Thread Igor Tandetnik
Puneet Kishor  wrote:
> But, why is #5 329743 and not 329686? Where are the extra 57 rows coming 
> from? I thought a LEFT JOIN was supposed to include
> *all* the rows from the left table with NULLs for the columns of the right 
> table where there was no match. So, at most, the
> resulting table would have as many rows as the rows in the left table. 
> Certainly not more.  

The conclusion doesn't follow from the premise. Yes, a row from the left table 
is included in the resultset even if there are no matching rows in the right 
table. Which means that the resultset would contain *at least* as many rows as 
there are in the left table - but it certainly may contain more. For each row 
in the left table that has N matching rows in the right table (with N > 0), 
there will be N rows in the resultset.
-- 
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users