I have code which looks basically like this:

import os
from datetime import datetime, timedelta

from django.conf import settings
# archivedEmail is the model shown below

now        = datetime.today()
beginning  = datetime.fromtimestamp(0)
end        = now - timedelta(days=settings.DAYSTOKEEP)

def purgedb():
    """Delete archivedEmail objects from the beginning of time until
    DAYSTOKEEP days in the past."""
    queryset = archivedEmail.objects.all()
    purgeset = queryset.filter(received__range=(beginning, end))
    for email in purgeset:
        print(email)
        try:
            os.unlink(settings.REAVER_CACHE + "texts/%s"     % email.cacheID)
            os.unlink(settings.REAVER_CACHE + "prob_good/%s" % email.cacheID)
            os.unlink(settings.REAVER_CACHE + "prob_spam/%s" % email.cacheID)
        except OSError:
            pass
    purgeset.delete()

if __name__ == '__main__':
    purgedb()
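One thing I'm not sure about: now/beginning/end are computed at import time, which is fine for a one-shot script but would go stale if purgedb() were ever called from a long-running process. A minimal sketch of computing the cutoff at call time instead (function name is hypothetical, not part of my actual code):

```python
from datetime import datetime, timedelta

def purge_cutoff(days_to_keep, now=None):
    """Return the newest 'received' timestamp that should be purged.

    'now' can be injected for testing; it defaults to the current time.
    """
    if now is None:
        now = datetime.today()
    return now - timedelta(days=days_to_keep)

# e.g. queryset.filter(received__lte=purge_cutoff(settings.DAYSTOKEEP))
```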

The idea is that we are stuffing a bunch of emails in a database for
customer service purposes. I want to clear out anything older than
DAYSTOKEEP. The model looks like this:

class archivedEmail(models.Model):
    subject     = models.CharField(blank=True, max_length=512, null=True)
    toAddress   = models.CharField(blank=True, max_length=128, db_index=True)
    fromAddress = models.CharField(blank=True, max_length=128, db_index=True)
    date        = models.DateTimeField()
    received    = models.DateTimeField(db_index=True)
    crmScore    = models.FloatField()
    spamStatus  = models.CharField(max_length=6, choices=spamStatusChoices,
                                   db_index=True)
    cacheHost   = models.CharField(max_length=24)
    cacheID     = models.CharField(max_length=31, primary_key=True)

    class Meta:
        ordering = ('-received',)

But when purgedb() runs, it deletes emails 100 at a time (which takes
forever), and after running for a couple of hours it is using a gig and
a half of RAM. If I let it continue, after a number of hours it runs
the machine out of RAM/swap.

Am I doing something unidiomatic or misusing the ORM somehow? My
understanding is that querysets are lazy, so calling objects.all() and
then narrowing it down with filter() to build purgeset should be fine,
right? What can I do to make this run in reasonable time and memory?
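For what it's worth, one workaround I've been sketching is to walk the primary keys in fixed-size batches instead of materializing the whole queryset at once. The batching helper is plain Python (its name is my own invention); the commented-out lines show how I'd wire it to the ORM with values_list() and iterator(), which I haven't verified against my data yet:

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most 'size' items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Sketch of wiring this to the ORM (untested):
#   ids = queryset.filter(...).values_list('cacheID', flat=True).iterator()
#   for batch in batched(ids, 500):
#       archivedEmail.objects.filter(cacheID__in=batch).delete()
```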

PS: I used to have ordering set to -date in the Meta class, but that
made the db put an ORDER BY date on every SELECT, which was unnecessary
here and took ages sorting a couple million rows, since there is no
index on date (nor did there need to be, I thought, since we never
select on it). Changing it to received makes no difference to my app
and avoids creating another index. Django's is the first ORM I have
ever used, and these sneaky performance issues are making me wonder...

-- 
Tracy Reed
http://tracyreed.org
