Hi all! I track Twitter followers in my database. I have the following table:
# \d persona_followers
Table "public.persona_followers"
Column | Type | Modifiers
-------------+-----------------------------+-----------
service_id | bigint | not null
follower_id | bigint | not null
valid_at | timestamp without time zone |
Indexes:
"persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id)
The table IS NOT partitioned.
I have a list of Twitter people I follow more - brands, actors, those kinds of
Twitter accounts. They often have thousands, if not hundreds of thousands, of
followers. I fetch the followers of these accounts about once a day. When it's
time to insert into the database, I use the following algorithm:
CREATE TEMP TABLE import( service_id bigint, follower_id bigint );
COPY INTO import FROM STDIN;
...
\N
INSERT INTO persona_followers(service_id, follower_id, valid_at)
SELECT service_id, follower_id, NOW()
FROM (SELECT DISTINCT service_id, follower_id FROM import) AS import
WHERE NOT EXISTS(SELECT * FROM persona_followers WHERE import.service_id =
persona_followers.service_id AND import.follower_id =
persona_followers.follower_id);
I currently have 660 million rows in persona_followers (47 GB). A test import
is 13.5 million rows (571 MB). The real daily import will be at least 10x more.
In a 24 hour period, I will have at most a few thousand *new* rows - the rest
will already exist in persona_followers. How do I most efficiently eliminate
the duplicates? Should I delete the duplicates in import? Or should I bite the
bullet and EXCEPT the final table? Should I insert much smaller batches? Or is
the above already the most efficient way? What other completely different data
structure could I use to achieve my goal? I truly need the exhaustive list of
followers because we do reach calculations (number of unique accounts which
received a particular tweet).
The true answer is probably "benchmark on your own servers", but I'm looking
for guidelines, people with the same kind of experience.
Thanks!
François
smime.p7s
Description: S/MIME cryptographic signature
