I have a database with this (partial) schema:

  CREATE TABLE origins (
    id         INTEGER PRIMARY KEY,
    label      TEXT NOT NULL UNIQUE
  );

  CREATE TABLE url_strings (
    id         INTEGER PRIMARY KEY,
    url        TEXT NOT NULL UNIQUE
  );

  CREATE TABLE canon_statuses (
    id         INTEGER PRIMARY KEY,
    status     TEXT NOT NULL UNIQUE
  );

  CREATE TABLE urls (
    origin     INTEGER NOT NULL REFERENCES origins(id),
    origin_id  INTEGER NOT NULL,
    url        INTEGER NOT NULL REFERENCES url_strings(id),
    UNIQUE (origin, origin_id)
  );

  CREATE TABLE canon_urls (
    url        INTEGER PRIMARY KEY REFERENCES url_strings(id),
    canon      INTEGER REFERENCES url_strings(id),
    status     INTEGER REFERENCES canon_statuses(id)
  );

The 'urls' and 'canon_urls' tables are separate because the same URL
may appear many times (with different origin+origin_id pairs) in the
former table, but only needs to appear once in the latter.  There are
no explicit indexes, only the ones implied by the UNIQUEs and PRIMARY
KEYs.  There are currently about 4,000,000 rows in 'url_strings',
'urls', and 'canon_urls', and about 200 rows in 'canon_statuses'.
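
To spell out what "implied" means: SQLite makes one automatic index per
UNIQUE constraint, and 'canon_urls' gets none beyond the rowid, since
INTEGER PRIMARY KEY is just an alias for it.  A quick way to list them
(sketch only; the filename is a stand-in):

    import sqlite3
    db = sqlite3.connect("urls.db")   # stand-in filename
    rows = db.execute(
        "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'")
    for name, tbl in rows:
        print(name, tbl)
    # -> sqlite_autoindex_origins_1, sqlite_autoindex_url_strings_1,
    #    sqlite_autoindex_canon_statuses_1, sqlite_autoindex_urls_1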

There are several programs that make use of this database, but the one
that's relevant right now is called "canonicalize", and what it does,
in outline, is scan the 'urls' table for any new URLs that aren't
already in the 'canon_urls' table, add them, then do some complicated
processing on each entry in 'canon_urls' for which both 'canon' and
'status' are currently NULL.  (The short version is it chases HTTP
redirects until there aren't any more.)  Because the 'complicated
processing' stage is slow and independent for each URL, it gets farmed
out to a pool of worker threads, but (at present) there is only one
thread that talks to the database.  In pseudocode + SQL, the database
thread's logic is as follows:

    INSERT INTO canon_urls
      SELECT DISTINCT url, NULL, NULL FROM urls
        WHERE url NOT IN (SELECT url FROM canon_urls)

    work_queue = Queue()
    result_queue = Queue()
    for (id, url) in (SELECT u.url, v.url
                        FROM canon_urls AS u
                        LEFT JOIN url_strings AS v ON u.url = v.id
                        WHERE u.canon IS NULL AND u.status IS NULL):
      work_queue.put( (id, url) )

    repeat n_workers times:
      spawn_worker(work_queue, result_queue)

    while result = result_queue.get():

      # result carries (url_id, canon_url, status); 'canon_url' can be
      # NULL at this point, 'status' can't.
      if result.canon_url == NULL:
        canon_id = NULL
      else:
        canon_id = (SELECT id FROM url_strings WHERE url = result.canon_url)
        if canon_id == NULL:
          INSERT INTO url_strings VALUES(NULL, result.canon_url)
          canon_id = last_row_id

      status_id = (SELECT id FROM canon_statuses WHERE status = result.status)
      if status_id == NULL:
        INSERT INTO canon_statuses VALUES(NULL, result.status)
        status_id = last_row_id

      UPDATE canon_urls SET canon = canon_id, status = status_id
        WHERE url = result.url_id
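
For context, here is roughly what each worker thread spawned by
spawn_worker does with an (id, url) pair it pulls off the queue.  This
is a simplified sketch, not the real code: the 'requests' library, the
hop limit, and the status strings are stand-ins.

    import requests
    from urllib.parse import urljoin

    def worker_loop(work_queue, result_queue):
        while True:
            item = work_queue.get()
            if item is None:          # sentinel meaning "no more work"
                break
            url_id, url = item
            canon, status = url, "ok"
            try:
                for _ in range(20):   # arbitrary redirect-hop limit
                    r = requests.head(canon, timeout=30,
                                      allow_redirects=False)
                    loc = r.headers.get("Location")
                    if r.status_code in (301, 302, 303, 307, 308) and loc:
                        canon = urljoin(canon, loc)   # follow one hop
                    else:
                        break
            except requests.RequestException as exc:
                canon, status = None, type(exc).__name__
            result_queue.put((url_id, canon, status))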

This works correctly but is inefficient.  The work_queue - which, as
you can see, is initialized to every yet-to-be-done row from
'canon_urls' - occupies too much memory: when the job is started from
scratch, it consumes approximately half of all available RAM on the
rather small virtual machine where this needs to run.  The obvious fix
is to feed
the workers directly from the SELECT query that currently loads up the
work queue; then the set of jobs-to-do need not sit in RAM.  (There
would still be a short work queue to keep all database accesses on one
thread.)  But this would mean simultaneously reading and writing the
same table.  If I hold a read cursor open on 'canon_urls', it is my
understanding that the UPDATEs will fail; if I restart the read cursor
after every UPDATE, I will be rescanning the table for each new job
and the overall algorithm will become quadratic.  Can anyone suggest a
better way?  Note that the overall job may be interrupted at any time,
so writing completed rows to a scratch table and then copying them over
at the end is impractical.
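
To make the "obvious fix" concrete, the shape I have in mind is roughly
the sketch below (not working code; handle_result() is a stand-in for
the canon_id/status_id bookkeeping in the pseudocode above, and the
backlog bound is arbitrary).  The trouble spot is the marked UPDATE,
which writes the very table the still-open cursor is reading:

    def feed_and_drain(db, work_queue, result_queue, n_workers):
        cur = db.execute(
            "SELECT u.url, v.url FROM canon_urls AS u"
            " LEFT JOIN url_strings AS v ON u.url = v.id"
            " WHERE u.canon IS NULL AND u.status IS NULL")
        in_flight = 0
        for row in cur:                    # read cursor held open here ...
            work_queue.put(row)
            in_flight += 1
            if in_flight > 2 * n_workers:  # keep only a small backlog queued
                # handle_result() stands in for the canon_id/status_id
                # lookups shown in the pseudocode above
                url_id, canon_id, status_id = handle_result(
                    db, result_queue.get())
                # ... while this writes the table the cursor is scanning
                db.execute("UPDATE canon_urls SET canon = ?, status = ?"
                           " WHERE url = ?", (canon_id, status_id, url_id))
                in_flight -= 1
        # ... then drain the remaining in-flight results the same way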

Also, but much less important, the initial INSERT operation takes tens
of seconds even when it has nothing to do; if there's a more efficient
way to write it, I would like to know about it.  (I also tried

   INSERT INTO canon_urls
     SELECT DISTINCT url, NULL, NULL FROM urls
     EXCEPT SELECT url, NULL, NULL FROM canon_urls

but that is even slower.)
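
(For concreteness, a third formulation would be the anti-join below; I
have not tried it and don't know whether SQLite plans it any better
than the NOT IN version.

   INSERT INTO canon_urls (url, canon, status)
     SELECT DISTINCT u.url, NULL, NULL
       FROM urls AS u
       LEFT JOIN canon_urls AS c ON c.url = u.url
      WHERE c.url IS NULL
)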

Thanks,
zw