Robert DiFalco wrote:
> In a previous database engine I was using an IN was more optimal than a
> <>. So, for example:
>
>     SELECT * FROM table WHERE table.type IN (1,2,3);
>
> Where the possible values of type are 0-3, was appreciably faster than:
>
>     SELECT * FROM table WHERE table.type <> 0;
>
> I've been playing with the Query Browser and checking out the
> optimization documents and haven't been able to make a clear call on
> whether or not this is also the case with MySQL/InnoDB.

Mladen Adamovic wrote:
> IN should be faster implemented with both hash tables and BTREE's so
> nowadays it should also be faster than <> as all MySQL implementation
> AFAIK use those well known data structures for indexes.

[EMAIL PROTECTED] wrote:
> YES, YES, YES! This is definitely an optimization.
>
> When you say IN or =, you are asking for "matching values". Matches can
> come from indexes. When you say <> or NOT IN, you are asking for
> everything BUT matches.  In order to evaluate a negative, the database
> engine (and this is usually true regardless of database server) almost
> always performs a full table scan to test every row to make sure it is
> either <> or NOT IN. At the very best, they have to perform a full index
> scan which is still less efficient than  ranged or values-based lookups.
>
> It's when you get into the situation where you are matching against dozens
> of IN-clause items that you may run into slowdowns again. Until you reach
> 2 or 3 dozen terms (depending on your hardware) you should be faster with
> an IN comparison than a <> or a NOT IN comparison. An optimization to
> search for BUT a term or two is to create a temporary table of all of your
> terms and delete the exact ones you want to exclude. Put an index on your
> temp table then JOIN that back into your query again (replacing the huge
> IN clause).  The database will match index to index and things will get
> fast again. This technique can scale up to some really big queries.
>
> Always try to code for the affirmative tests. Your users will thank you.

Implicit in Mladen and Shawn's answers, but never actually mentioned in the original post, is the presence of an index on the type column. This is probably obvious to all concerned, but I mention it for completeness: without an index on type, there is no difference between "type IN (1,2,3)" and "type != 0". That is, the question is not whether IN is better than !=, but rather which will allow the optimizer to make good use of the index on type.

I find mysql's optimizer is pretty good with well-written queries, as long as subqueries aren't involved, so my initial reaction was to expect no difference. After all, as the optimizer considers the WHERE conditions and the available indexes, it is certainly possible, at least theoretically, for it to notice that "type IN (1,2,3)" and "type != 0" are identical conditions. That is, a clever optimizer could treat them identically. Shawn's and Mladen's answers gave me pause, however, and aroused my curiosity, so I decided to test:

  SELECT VERSION();
  +-----------+
  | VERSION() |
  +-----------+
  | 4.1.15    |
  +-----------+

  SELECT cat, COUNT(*) FROM inits GROUP BY cat;
  +------+----------+
  | type | COUNT(*) |
  +------+----------+
  |    0 |    44224 |
  |    1 |     1919 |
  |    2 |     1931 |
  |    3 |     1926 |
  +------+----------+

  mysql> EXPLAIN SELECT * FROM inits WHERE cat IN (1,2,3) \G
  *************************** 1. row ***************************
             id: 1
    select_type: SIMPLE
          table: inits
           type: range
  possible_keys: cat_idx
            key: cat_idx
        key_len: 5
            ref: NULL
           rows: 8117
          Extra: Using where

  mysql> EXPLAIN SELECT * FROM inits WHERE cat != 0 \G
  *************************** 1. row ***************************
             id: 1
    select_type: SIMPLE
          table: inits
           type: range
  possible_keys: cat_idx
            key: cat_idx
        key_len: 5
            ref: NULL
           rows: 8120
          Extra: Using where

As you can see, the optimizer plans to use the index in both cases, examining 8117 rows in the IN case and 8120 rows in the != case, to get 5776 matching rows out of 50,000 (about 11.55%).

On the other hand, it makes a difference how many rows will match. What is the distribution of values of type? If the number of matching rows is more than about 30% of the table, the optimizer won't use an available index in any case. For example,

mysql> EXPLAIN SELECT * FROM inits WHERE cat IN (0,1,2) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: inits
         type: ALL
possible_keys: cat_idx
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 50000
        Extra: Using where

mysql> EXPLAIN SELECT * FROM inits WHERE cat !=3 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: inits
         type: ALL
possible_keys: cat_idx
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 50000
        Extra: Using where

Here I've gone to the other extreme, selecting about 96.15% of the rows in the table, so the optimizer prefers not to use the index in either case.

Of course, I've just done one simple test with arbitrary data on an unloaded dev machine. My advice is to try your query both ways with EXPLAIN in front so you can see what choices the optimizer makes with your actual query and data, rather than relying on theory or one test case. If there's a difference, pick whichever is better.

Michael

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to