I should also say that table tweet has more than 400 million rows and
table tweet_topic is estimated to have more than 800 million rows.

Thanks again,
Caio


On Mon, Nov 4, 2013 at 6:44 PM, Caio Casimiro <casimiro.lis...@gmail.com> wrote:

> Thank you very much for your answers guys!
>
>
> On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>> On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro <casimiro.lis...@gmail.com>
>>  wrote:
>>
>>> Hello all,
>>>
>>> I have one query running at ~7 seconds and I would like to know if it's
>>> possible to make it run faster, since this query runs many times in my
>>> experiment.
>>>
>>
>>
>> Do you mean you want it to be fast because it runs many times, or that
>> you want it to become fast after it runs many times (i.e. once the data is
>> fully cached)?  The plan you show takes 24 seconds, not 7 seconds.
>>
>
> I want it to be fast because it runs many times. I have an experiment that
> evaluates recommendation algorithms for a set of Twitter users. This query
> returns recommendation candidates, so it is called many times for
> different users and time intervals.
>
>
>>
>>
>>>
>>> Basically the query return the topics of tweets published by users that
>>> the user N follows and that are published between D1 and D2.
>>>
>>> *Query*:
>>>
>>> SELECT tt.tweet_id, tt.topic, tt.topic_value
>>>             FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id =
>>> t.id
>>>             WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in
>>>             (SELECT followed_id FROM relationship WHERE follower_id = N)
>>> ORDER BY tt.tweet_id;
>>>
>>
>>
>> I don't know if this affects the plan at all, but it is silly to do a
>> left join to "tweet" when the WHERE clause has conditions that can't be
>> satisfied with a null row.  Also, you could try changing the IN-list to an
>> EXISTS subquery.
>>
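[For reference, combining both of Jeff's suggestions (inner join instead of
left join, and an EXISTS subquery instead of the IN-list) might look roughly
like the following. This is only a sketch based on the table and column names
shown in the original query; N and the date bounds are left as placeholders.]

SELECT tt.tweet_id, tt.topic, tt.topic_value
            FROM tweet_topic AS tt JOIN tweet AS t ON tt.tweet_id = t.id
            WHERE t.creation_time BETWEEN 'D1' AND 'D2'
            AND EXISTS (SELECT 1 FROM relationship r
                        WHERE r.follower_id = N
                          AND r.followed_id = t.user_id)
ORDER BY tt.tweet_id;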
>
> Sorry for my ignorance, but I don't understand the issue with the left
> join. Could you explain more?
>
>
>> Is there some patterns to D1 and D2 that could help the caching?  For
>> example, are they both usually in the just-recent past?
>>
> The only pattern is that it is always a one-day interval, e.g. D1 =
> '2013-05-01' and D2 = '2013-05-02'.
>
>>
>>
>> Indexes:
>>>     "tweet_plk" PRIMARY KEY, btree (id) CLUSTER
>>>     "tweet_creation_time_index" btree (creation_time)
>>>     "tweet_id_index" hash (id)
>>>     "tweet_ios_index" btree (id, user_id, creation_time)
>>>     "tweet_retweeted_idx" hash (retweeted)
>>>     "tweet_user_id_creation_time_index" btree (creation_time, user_id)
>>>     "tweet_user_id_index" hash (user_id)
>>>
>>
>>
>> Are all of those indexes important?  If your table is heavily
>> updated/inserted, which I assume it is, maintaining those indexes is going
>> to take up precious RAM that could probably be better used elsewhere.
>>
>
> Probably not. But since this database is read-only, the number of indexes
> grew along with my desperation. =)
>
>>
>> Cheers,
>>
>> Jeff
>>
>
> Thank you very much again!
> Caio
>
