Lew E. Lefton wrote:
> Hi,
>
> I hope this is an appropriate place to ask this question, if you think
> it is better suited for another list/forum, please let me know.
>
> I have a table that looks like this:
>
> mysql>  select polynomial_id, term_id from polynomial;
> +---------------+---------+
> | polynomial_id | term_id |
> +---------------+---------+
> |             1 |       1 |
> |             1 |       2 |
> |             1 |       3 |
> |             1 |       4 |
> |             2 |       1 |
> |             2 |       2 |
> |             2 |       3 |
> |             2 |       4 |
> |             2 |       5 |
> |             3 |       1 |
> |             3 |       2 |
> |             3 |       3 |
> |             3 |       5 |
> +---------------+---------+
>
> which represents, say, three polynomials,
>
>   the first is a sum of 4 terms (term1 + term2 + term3 + term4),
>   the second is a sum of 5 terms (term1 + term2 + term3 + term4 + term5),
>   the third is a sum of 4 terms (term1 + term2 + term3 + term5),
>   etc.
>
> I am storing the polynomials in this way because I may need to store
> very large polynomials.  The table may grow to millions of rows before
> I'm done, with potentially many of the same terms appearing in many
> different polynomials.  Thus I have the terms stored in a separate
> table. Also, this method of storage makes the table easily searched
> (e.g. "find all polynomials which have term 2").
>
> If I have another polynomial, say the sum of terms 1,3,4, and 5, how can
> I quickly search this database to see if it's already been stored?
> Actually, I would eventually like to have a function (using appropriate
> API) which, when given a list of terms, returns the polynomial_id
> regardless of whether it is the result of a new insert or a successful
> lookup.
>
> I tried variations of this
>    SELECT DISTINCT polynomial_id FROM polynomial
>     WHERE term_id in ('1','2','3','4')
> but I get
>
> +---------------+
> | polynomial_id |
> +---------------+
> |             1 |
> |             2 |
> |             3 |
> +---------------+
>
> when I really just wanted '1'. I suspect some subquery magic (e.g.
> restricting to only those polynomials with exactly a count of 4 terms)
> would give me a nice efficient solution, but I am not an SQL expert and
> I have not been succesful in crafting the appropriate query.
>
> Thanks in advance for your help.  I am happy to provide more details if
> necessary, but I have tried to distill the essence of the problem by this
> simple example.
>
> Cheers,
> Lew Lefton


Peter Brawley wrote:
SELECT DISTINCT polynomial_id
FROM polynomial p1
INNER JOIN polynomial p2 ON p1.term_id=1 AND p2.term_id=3
INNER JOIN polynomial p3 ON p2.term_id=3 AND p3.term_id=4
INNER JOIN polynomial p4 ON p3.term_id=4 AND p4.term_id=5

That won't work, as each join is missing a crucial condition. Also, as each joined table has a polynomial_id, you need to specify which one you want. Finally, if we get the joins right, "DISTINCT" shouldn't be needed. I think you meant to say

#1
  SELECT p1.polynomial_id
  FROM polynomial p1
  JOIN polynomial p2 ON p1.polynomial_id = p2.polynomial_id AND p2.term_id=3
  JOIN polynomial p3 ON p2.polynomial_id = p3.polynomial_id AND p3.term_id=4
  JOIN polynomial p4 ON p3.polynomial_id = p4.polynomial_id AND p4.term_id=5
  WHERE p1.term_id=1;

I expect that this query, and most of the ones given below, would definitely benefit from a multi-column index on (polynomial_id, term_id). Even then, however, I don't think this query will scale very well. Adding a join per term rapidly gets expensive. Here's an equivalent query without joins:

#2
  SELECT polynomial_id
  FROM polynomial
  WHERE term_id IN (1,3,4,5)
  GROUP BY polynomial_id
  HAVING COUNT(*) = 4;

It should scale much better (and it's easier to write).

Unfortunately, neither query is quite right. Both will return all polynomials which *contain* terms 1, 3, 4, and 5, not the one polynomial (if it exists) consisting of precisely those terms and no others. You would have to do a subsequent check of the returned polynomials to see if they had any more terms.

In other words, there are actually two conditions: The polynomial must have all the required terms, and it must have the right number of terms. We need to modify the query to take both conditions into account. This means we cannot put the matching terms condition into the WHERE clause, as I did above, because then we won't see any other terms. Instead, we must count total terms and matching terms, like this:

#3
  SELECT polynomial_id,
         COUNT(*) AS terms,
         SUM(term_id IN (1,3,4,5)) as matched_terms
  FROM polynomial
  GROUP BY polynomial_id
  HAVING terms = 4 AND matched_terms = 4;

I wrote it that way to make it clear what I'm doing. If you prefer not to see the counts, you can do it this way:

#4
  SELECT polynomial_id
  FROM polynomial
  GROUP BY polynomial_id
  HAVING COUNT(*) = 4 AND SUM(term_id IN (1,3,4,5)) = 4;

Now, that's almost certainly a full table scan (or full index scan). Depending on your data and indices, it may be faster to store the results of my first query (#2 above) into a temporary table of candidates, then join that to the polynomial table and look for polynomial_ids with 4 rows.

  CREATE TEMPORARY TABLE candidates
  SELECT polynomial_id ...          <= query #2 from above

  SELECT c.polynomial_id
  FROM candidates c
  JOIN polynomial p ON c.polynomial_id = p.polynomial_id
  GROUP BY c.polynomial_id
  HAVING COUNT(*) = 4;

  DROP TABLE candidates;    #clean up

Michael





--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to