Re: Join with OR-condition and Indexes

Tobias Schultze Wed, 06 Jan 2010 00:23:00 -0800

Thanks for your answers.

I found out, that MySQL behaves very strange in this situation.
I think this is a bug or important missing feature.


I would like to see how other DBMS behave in this situation, which I would
think is a common problem - whenever you want to join one column of a table
with several columns of another table.

MySQL uses an index_merge when I query for one specific athlete:

SELECT COUNT(*)
FROM matches m
WHERE (m.team1_player_id = 808884 OR m.team1_partner_id = 808884 OR
m.team2_player_id = 808884 OR m.team2_partner_id = 808884)

id      select_type     table   type    possible_keys   key     key_len
ref     rows    Extra
1       SIMPLE  m       index_merge
team1_player_id_idx,team1_partner_id_idx,team2_player_id_idx,team2_partner_i
d_idx
team1_player_id_idx,team1_partner_id_idx,team2_player_id_idx,team2_partner_i
d_idx   5,5,5,5         NULL    153     Using
union(team1_player_id_idx,team1_partner_id_idx,team2_player_id_idx,team2_par
tner_id_idx); Using where


But it doesn't use the index merge when joining like I originally tried,
although logically it should be able to use it, shouldn't it?
Of course, adding USE INDEX FOR JOIN (team1_player_id_idx,
team1_partner_id_idx, team2_player_id_idx, team2_partner_id_idx) cannot
convince MySQL to use the indexes.

SELECT a.id, COUNT(*)
FROM athletes a
FROM athletes a
LEFT JOIN matches m USE INDEX FOR JOIN (team1_player_id_idx,
team1_partner_id_idx, team2_player_id_idx, team2_partner_id_idx) ON (
    m.team1_player_id = a.id OR
    m.team1_partner_id = a.id OR
    m.team2_player_id = a.id OR
    m.team2_partner_id = a.id
)
WHERE a.gender = 'female'
GROUP BY a.id

Then Michael Dykman said, MySQL is restricted to one index per table per
query. So I thought, maybe I can help MySQL when I add a compound index for
all players. So it could use the one index to resolve the join.
ALTER TABLE `matches` ADD INDEX `players_idx` ( `team1_player_id` ,
`team1_partner_id` , `team2_player_id` , `team2_partner_id` ) ;

Explain now gives me:
id      select_type     table   type    possible_keys   key     key_len
ref     rows    Extra
1       SIMPLE  a       ref     gender_index_idx        gender_index_idx
1       const   2193    Using where; Using temporary; Using filesort
1       SIMPLE  m       index
team1_player_id_idx,team1_partner_id_idx,team2_pla...   players_idx     20
NULL    46664   Using index

Hm, what does this mean? It uses the index (players_idx) but still evaluates
all 46664 rows? This does not make sense to me. 
Anyways, the query is still extremely slow (2400 rows in 3 min 5 sec). But
with IGNORE INDEX (players_idx) the query seems to be infinite (over 15
min).

So now it seems that UNIONS are the only options. But they are also much
slower than it should be. You version took 10 seconds.
My modified version with UNION ALL instead of DISTINCT takes 5.2 seconds -
only showing athletes with min. 1 match.

SELECT *, COUNT(am.id) AS number_matches
FROM (
    SELECT a.id AS athlete_id, a.first_name, a.last_name, a.gender, m.*
    FROM athletes a
    INNER JOIN matches m ON (m.team1_player_id = a.id)
        UNION ALL
    SELECT a.id AS athlete_id, a.first_name, a.last_name, a.gender, m.*
    FROM athletes a
    INNER JOIN matches m ON (m.team1_partner_id = a.id)
        UNION ALL
    SELECT a.id AS athlete_id, a.first_name, a.last_name, a.gender, m.*
    FROM athletes a
    INNER JOIN matches m ON (m.team2_player_id = a.id)
        UNION ALL
    SELECT a.id AS athlete_id, a.first_name, a.last_name, a.gender, m.*
    FROM athletes a
    INNER JOIN matches m ON (m.team2_partner_id = a.id)
) AS am
GROUP BY am.athlete_id
ORDER BY number_matches DESC

But then I could also use the idea I had in the first post, which still
needs 3.8 seconds.

SELECT *, COUNT(am.id) AS number_matches
FROM (
(
    SELECT team1_player_id AS player_id, teammatch_id, match_type,
team1_score, team2_score, team1_points, team2_points, no_fight 
    FROM matches
    WHERE team1_player_id IS NOT NULL  # reduced the query from 4.1 to 3.8
sec
) UNION ALL (
    SELECT team1_partner_id, teammatch_id, match_type, team1_score,
team2_score, team1_points, team2_points, no_fight 
    FROM matches
    WHERE team1_partner_id IS NOT NULL
) UNION ALL (
    SELECT team2_player_id, teammatch_id, match_type, team1_score,
team2_score, team1_points, team2_points, no_fight 
    FROM matches
    WHERE team2_player_id IS NOT NULL
) UNION ALL (
    SELECT team2_partner_id, teammatch_id, match_type, team1_score,
team2_score, team1_points, team2_points, no_fight 
    FROM matches
    WHERE team2_partner_id IS NOT NULL
)
) AS am
INNER JOIN athletes a ON (am.player_id = a.id)
GROUP BY a.id
ORDER BY number_matches DESC


Changing the database schema doesn't seem to be usefull. Regarding my schema
is already normalized and over-normalization generally decreases
performances. I guess I would then run into other problems.


Greetings
Tobias


-----Ursprüngliche Nachricht-----
Von: Michael Dykman [mailto:[email protected]] 
Gesendet: Dienstag, 5. Januar 2010 17:14
An: [email protected]
Cc: Tobias Schultze
Betreff: Re: Join with OR-condition and Indexes

What it comes down to is that MySQL can only use 1 index per table per
query. The moment your query includes OR examining different columns,
a full table scan is the only option.

One typical way to implement this is to use UNIONS as Mr. Green suggested:

SELECT aid, count(*) FROM
(
SELECT a.id
FROM athletes a
LEFT JOIN matches m ON (m.team1_player_id = a.id)
    UNION
SELECT a.id
FROM athletes a
LEFT JOIN matches m ON (m.team1_partner_id = a.id)
    UNION
SELECT a.id
FROM athletes a
LEFT JOIN matches m ON (m.team2_player_id = a.id)
    UNION
SELECT a.id
FROM athletes a
LEFT JOIN matches m ON (m.team2_partner_id = a.id)
) AS tmp GROUP BY aid


On Tue, Jan 5, 2010 at 9:18 AM, Shawn Green <[email protected]> wrote:
> Tobias Schultze wrote:
>>
>> Hello,
>>
>>
>> I'm working on an application for my bachelor thesis.
>>
>> I'm having a performance problem with a SQL-Query in MySQL5.
>>
>> I hoped someone can easily enlighten me in this issue.
>>
>>
>> The schema:
>>
>>
>> CREATE TABLE IF NOT EXISTS `athletes` (
>>
>>  `id` int(11) NOT NULL AUTO_INCREMENT,
>>
>>  `last_name` varchar(20) NOT NULL,
>>
>>  `first_name` varchar(20) NOT NULL,
>>
>>  `gender` enum('male','female') NOT NULL,
>>
>>  `birthday` date NOT NULL,
>>
>>  `country` char(2) NOT NULL,
>>
>>  `club_id` int(11) NOT NULL,
>>
>>  `is_active` tinyint(1) NOT NULL,
>>
>>  PRIMARY KEY (`id`),
>>
>>  KEY `gender_index_idx` (`gender`),
>>
>>  KEY `is_active_index_idx` (`is_active`),
>>
>>  KEY `club_id_idx` (`club_id`)
>>
>> ) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COMMENT='Sportler bzw Mitglieder
des
>> Verbandes';
>>
>>
>> -- --------------------------------------------------------
>>
>>
>> CREATE TABLE IF NOT EXISTS `matches` (
>>
>>  `id` int(11) NOT NULL AUTO_INCREMENT,
>>
>>  `teammatch_id` int(11) NOT NULL,
>>
>>  `match_type` varchar(10) NOT NULL,
>>
>>  `team1_player_id` int(11) DEFAULT NULL,
>>
>>  `team1_partner_id` int(11) DEFAULT NULL,
>>
>>  `team2_player_id` int(11) DEFAULT NULL,
>>
>>  `team2_partner_id` int(11) DEFAULT NULL,
>>
>>  `team1_score` tinyint(3) unsigned DEFAULT NULL,
>>
>>  `team2_score` tinyint(3) unsigned DEFAULT NULL,
>>
>>  PRIMARY KEY (`id`),
>>
>>  KEY `teammatch_id_idx` (`teammatch_id`),
>>
>>  KEY `team1_player_id_idx` (`team1_player_id`),
>>
>>  KEY `team1_partner_id_idx` (`team1_partner_id`),
>>
>>  KEY `team2_player_id_idx` (`team2_player_id`),
>>
>>  KEY `team2_partner_id_idx` (`team2_partner_id`)
>>
>> ) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COMMENT='Spiele zwischen zwei oder
>> vier Sportlern (Einzel und Doppel)' AUTO_INCREMENT=46665 ;
>>
>>
>>
>> I want to get all matches for each athlete and calculate statistics such
>> as
>> number of matches etc.
>>
>> The basic very simplified query is like
>>
>>
>> SELECT a.id, COUNT(*)
>>
>> FROM athletes a
>>
>> LEFT JOIN matches m ON (
>>            m.team1_player_id = a.id OR
>>            m.team1_partner_id = a.id OR
>>            m.team2_player_id = a.id OR
>>            m.team2_partner_id = a.id
>>
>> )
>>
>> WHERE a.gender = 'female'
>>
>> GROUP BY a.id
>>
>>
>>
>> Now the problem is, that mysql uses a full table scan to retrieve the
>> matches for an athlete, so the execution takes many seconds or even
worse.
>>
>> An athlete can be referenced in any of the m.team1_player_id OR
>> m.team1_partner_id OR m.team2_player_id OR m.team2_partner_id. (That
>> allows
>> doubles matches.)
>>
>> Why is an full table scan necessary although there is an index on each of
>> these fields? So an index exists for each OR-part of the join
condition...
>>
>>
>> Here the execution plan:
>>
>>
>>
>> id
>> select_type
>> table
>> type
>> possible_keys
>> key
>> key_len
>> ref
>> rows
>> Extra
>>
>> 1
>>
>> SIMPLE
>>
>> a
>>
>> ref
>>
>> gender_index_idx
>>
>> gender_index_idx
>>
>> 1
>>
>> const
>>
>> 2193
>>
>> Using where; Using temporary; Using filesort
>>
>>
>> 1
>>
>> SIMPLE
>>
>> m
>>
>> ALL
>>
>> team1_player_id_idx,team1_partner_id_idx,team2_pla...
>>
>> NULL
>>
>> NULL
>>
>> NULL
>>
>> 46664
>>
>>
>>
>>
>> Joining on each fields like the following is very fast and uses the index
>> but of course doesn't give me the expected result.
>>
>>
>> FROM athletes a
>>
>> LEFT JOIN matches m ON (a.id = m.team1_player_id)
>>
>> LEFT JOIN matches m2 ON (a.id = m2.team2_player_id)
>>
>>
>>
>> Maybe I need to do a workaround using a UNION?
>>
>> But this doesn't help either: (It takes 76 seconds)
>>
>>
>> FROM athletes a
>>
>> LEFT JOIN (
>>
>> (
>>
>> SELECT team1_player_id AS player_id, teammatch_id, match_type,
>> team1_score,
>> team2_score, team1_points, team2_points, no_fight
>> FROM matches
>>
>> )
>> UNION
>> (
>>
>> SELECT team1_partner_id, teammatch_id, match_type, team1_score,
>> team2_score,
>> team1_points, team2_points, no_fight
>> FROM matches
>>
>> )
>>
>> UNION
>> (
>>
>> SELECT team2_player_id, teammatch_id, match_type, team1_score,
>> team2_score,
>> team1_points, team2_points, no_fight
>> FROM matches
>>
>> )
>>
>> UNION
>> (
>>
>> SELECT team2_partner_id, teammatch_id, match_type, team1_score,
>> team2_score,
>> team1_points, team2_points, no_fight
>> FROM matches
>>
>> )
>>
>> ) m ON (a.id = m.player_id)
>>
>>
>>
>>
>> I hope someone can help me with this.
>>
>> Thanks in advance.
>>
>>
>
> I think the problem here is that you have an ordered vector table (main1,
> partner1, main2, partner2) which requires you to optionally match on 4
> separate columns to make your JOIN condition because the athlete you want
to
> locate could be in any of those 4 places in the match record.
>
> What would be faster is to normalize your match data into two tables. The
> first table would be just match information (id, date, location, etc) and
> another table of match-athlete pairs (match_id, athlete_id)
>
> alternatively you would have to use a query like
> (
> select...
> FROM athletes
> INNER JOIN matches
>  ON ... matches.member1
> ) UNION (
> select...
> FROM athletes
> INNER JOIN matches
>  ON ... matches.partner1
> ) UNION (
> select...
> FROM athletes
> INNER JOIN matches
>  ON ... matches.member2
> ) UNION (
> select...
> FROM athletes
> INNER JOIN matches
>  ON ... matches.partner2
> )
>
> in order to use the indexes for the JOINS.
>
> --
> Shawn Green, MySQL Senior Support Engineer
> Sun Microsystems, Inc.
> Office: Blountville, TN
>
>
>
> --
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe:    http://lists.mysql.com/[email protected]
>
>



-- 
 - michael dykman
 - [email protected]

 May the Source be with you.


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[email protected]

Re: Join with OR-condition and Indexes

Reply via email to