I have a database with this (partial) schema:

    CREATE TABLE origins (
      id    INTEGER PRIMARY KEY,
      label TEXT NOT NULL UNIQUE
    );
    CREATE TABLE url_strings (
      id  INTEGER PRIMARY KEY,
      url TEXT NOT NULL UNIQUE
    );
    CREATE TABLE canon_statuses (
      id     INTEGER PRIMARY KEY,
      status TEXT NOT NULL UNIQUE
    );
    CREATE TABLE urls (
      origin    INTEGER NOT NULL REFERENCES origins(id),
      origin_id INTEGER NOT NULL,
      url       INTEGER NOT NULL REFERENCES url_strings(id),
      UNIQUE (origin, origin_id)
    );
    CREATE TABLE canon_urls (
      url    INTEGER PRIMARY KEY REFERENCES url_strings(id),
      canon  INTEGER REFERENCES url_strings(id),
      status INTEGER REFERENCES canon_statuses(id)
    );

The 'urls' and 'canon_urls' tables are separate because the same URL may
appear many times (with different origin+origin_id pairs) in the former,
but only needs to appear once in the latter. There are no explicit
indexes, only the ones implied by the UNIQUE and PRIMARY KEY constraints.
There are currently about 4,000,000 rows in each of 'url_strings',
'urls', and 'canon_urls', and about 200 rows in 'canon_statuses'.

Several programs make use of this database, but the one that's relevant
right now is called "canonicalize". In outline, it scans the 'urls' table
for any new URLs that aren't already in the 'canon_urls' table, adds
them, and then does some complicated processing on each entry in
'canon_urls' for which both 'canon' and 'status' are currently NULL.
(The short version: it chases HTTP redirects until there aren't any
more.) Because the complicated-processing stage is slow and independent
for each URL, it is farmed out to a pool of worker threads, but (at
present) only one thread talks to the database. In pseudocode + SQL, the
database thread's logic is as follows:

    INSERT INTO canon_urls
      SELECT DISTINCT url, NULL, NULL FROM urls
       WHERE url NOT IN (SELECT url FROM canon_urls)

    work_queue = Queue()
    result_queue = Queue()

    for (id, url) in (SELECT u.url, v.url
                        FROM canon_urls AS u
                        LEFT JOIN url_strings AS v ON u.url = v.id
                       WHERE u.canon IS NULL AND u.status IS NULL):
        work_queue.put( (id, url) )

    repeat n_workers times:
        spawn_worker(work_queue, result_queue)

    while (url_id, new_canon, new_status) = result_queue.get():
        # 'new_canon' can be NULL at this point; 'new_status' can't.
        if new_canon == NULL:
            canon_id = NULL
        else:
            canon_id = (SELECT id FROM url_strings WHERE url = new_canon)
            if canon_id == NULL:
                INSERT INTO url_strings VALUES(NULL, new_canon)
                canon_id = last_row_id

        status_id = (SELECT id FROM canon_statuses WHERE status = new_status)
        if status_id == NULL:
            INSERT INTO canon_statuses VALUES(NULL, new_status)
            status_id = last_row_id

        UPDATE canon_urls SET canon = canon_id, status = status_id
         WHERE url = url_id

This works correctly but is inefficient. The work_queue - which, as you
can see, is initialized with every yet-to-be-done row from 'canon_urls' -
occupies too much memory: when the job is started from scratch, it takes
approximately half of all available RAM on the rather small virtual
machine where this needs to run.

The obvious fix is to feed the workers directly from the SELECT query
that currently loads up the work queue; then the set of jobs-to-do need
not sit in RAM. (There would still be a short, bounded work queue, to
keep all database access on one thread.) But this means simultaneously
reading and writing the same table. If I hold a read cursor open on
'canon_urls', it is my understanding that the UPDATEs will fail; and if
I restart the read cursor after every UPDATE, I will be rescanning the
table for each new job, making the overall algorithm quadratic. Can
anyone suggest a better way?
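For concreteness, here is roughly the shape of that "obvious fix", as an
untested Python sketch. chase_redirects() and record_result() are
stand-ins for the HTTP logic and for the lookup-or-insert + UPDATE logic
shown above; the marked line is where I expect the read/write conflict:

    import queue
    import threading

    def chase_redirects(url):
        # Stand-in for the real HTTP redirect-chasing. Returns
        # (canon_url, status); canon_url may be None, status never is.
        raise NotImplementedError

    def record_result(db, url_id, canon_url, status):
        # Stand-in for the lookup-or-insert + UPDATE logic above.
        raise NotImplementedError

    def worker(work_queue, result_queue):
        while True:
            job = work_queue.get()
            if job is None:                  # sentinel: no more work
                return
            url_id, url = job
            canon_url, status = chase_redirects(url)
            result_queue.put((url_id, canon_url, status))

    def drain_results(db, result_queue):
        # Apply whatever results have come back so far, without blocking.
        while True:
            try:
                url_id, canon_url, status = result_queue.get_nowait()
            except queue.Empty:
                return
            record_result(db, url_id, canon_url, status)

    def run(db, n_workers):
        # db is the one sqlite3 connection; only this thread touches it.
        work_queue = queue.Queue(maxsize=2 * n_workers)  # bounded
        result_queue = queue.Queue()
        threads = [threading.Thread(target=worker,
                                    args=(work_queue, result_queue))
                   for _ in range(n_workers)]
        for t in threads:
            t.start()

        cur = db.cursor()
        cur.execute("SELECT u.url, v.url FROM canon_urls AS u"
                    " LEFT JOIN url_strings AS v ON u.url = v.id"
                    " WHERE u.canon IS NULL AND u.status IS NULL")
        for job in cur:
            work_queue.put(job)              # blocks while workers are busy
            drain_results(db, result_queue)  # <-- UPDATEs while cur is open

        for _ in threads:
            work_queue.put(None)             # tell the workers to exit
        for t in threads:
            t.join()
        drain_results(db, result_queue)      # apply the stragglers

The Queue(maxsize=...) bound is what keeps the job list out of RAM; the
part I don't know how to do safely is the drain_results() call in the
middle of iterating over 'cur'.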
Note that the overall job may be interrupted at any time, so writing
completed rows to a scratch table and then copying them over at the end
is impractical.

Also, though much less important: the initial INSERT takes tens of
seconds even when it has nothing to do; if there's a more efficient way
to write it, I'd like to know about that too. (I also tried

    INSERT INTO canon_urls
      SELECT DISTINCT url, NULL, NULL FROM urls
      EXCEPT
      SELECT url, NULL, NULL FROM canon_urls

but that was even slower.)

Thanks,
zw