I should also say that table tweet has more than 400 million rows and table tweet_topic has an estimated more than 800 million rows.
Thanks again,
Caio

On Mon, Nov 4, 2013 at 6:44 PM, Caio Casimiro <casimiro.lis...@gmail.com> wrote:

> Thank you very much for your answers guys!
>
> On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>> On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro <casimiro.lis...@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I have one query running at ~7 seconds and I would like to know if it's
>>> possible to make it run faster, since this query runs many times in my
>>> experiment.
>>
>> Do you mean you want it to be fast because it runs many times, or that
>> you want it to become fast after it runs many times (i.e. once the data
>> is fully cached)? The plan you show takes 24 seconds, not 7 seconds.
>
> I want it to be fast because it runs many times. I have an experiment that
> evaluates recommendation algorithms for a set of Twitter users. This query
> returns recommendation candidates, so it is called many times for
> different users and time intervals.
>
>>> Basically, the query returns the topics of tweets published by users
>>> that user N follows and that were published between D1 and D2.
>>>
>>> Query:
>>>
>>> SELECT tt.tweet_id, tt.topic, tt.topic_value
>>> FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id = t.id
>>> WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id IN
>>>     (SELECT followed_id FROM relationship WHERE follower_id = N)
>>> ORDER BY tt.tweet_id;
>>
>> I don't know if this affects the plan at all, but it is silly to do a
>> left join to "tweet" when the WHERE clause has conditions that can't be
>> satisfied by a null row. Also, you could try changing the IN-list to an
>> EXISTS subquery.
>
> I'm sorry for the ignorance, but I don't understand the issue with the
> left join; could you explain more?
>
>> Is there some pattern to D1 and D2 that could help the caching? For
>> example, are they both usually in the just-recent past?
> The only pattern is that it is always a one-day interval, e.g. D1 =
> '2013-05-01' and D2 = '2013-05-02'.
>
>>> Indexes:
>>>     "tweet_plk" PRIMARY KEY, btree (id) CLUSTER
>>>     "tweet_creation_time_index" btree (creation_time)
>>>     "tweet_id_index" hash (id)
>>>     "tweet_ios_index" btree (id, user_id, creation_time)
>>>     "tweet_retweeted_idx" hash (retweeted)
>>>     "tweet_user_id_creation_time_index" btree (creation_time, user_id)
>>>     "tweet_user_id_index" hash (user_id)
>>
>> Are all of those indexes important? If your table is heavily
>> updated/inserted, which I assume it is, maintaining those indexes is
>> going to take up precious RAM that could probably be better used
>> elsewhere.
>
> Probably not. But since this database is read-only, the number of indexes
> grew along with my desperation. =)
>
>> Cheers,
>>
>> Jeff
>
> Thank you very much again!
> Caio
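
A sketch of the rewrite Jeff suggests above -- a plain inner join plus an
EXISTS subquery -- using the table and column names from the thread (D1,
D2, and N remain placeholders as in the original):

    SELECT tt.tweet_id, tt.topic, tt.topic_value
    FROM tweet_topic AS tt
    JOIN tweet AS t ON tt.tweet_id = t.id   -- inner join: the WHERE clause
                                            -- rejects null-extended rows anyway
    WHERE t.creation_time BETWEEN 'D1' AND 'D2'
      AND EXISTS (SELECT 1
                  FROM relationship r
                  WHERE r.follower_id = N
                    AND r.followed_id = t.user_id)  -- EXISTS instead of IN
    ORDER BY tt.tweet_id;

Whether this actually changes the plan would need to be checked with
EXPLAIN ANALYZE against the real data.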
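
On the index question, one possible cleanup -- assuming no other query in
the workload depends on the dropped indexes -- is to remove the ones that
duplicate information already covered by the primary key or by a
composite index:

    -- The primary key "tweet_plk" (btree on id) already serves id lookups:
    DROP INDEX tweet_id_index;             -- hash (id)
    -- A prefix of "tweet_user_id_creation_time_index" (creation_time, user_id):
    DROP INDEX tweet_creation_time_index;  -- btree (creation_time)

Each DROP should be verified against the full workload first; an index
that is redundant for this query may still serve another.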