A case for CASE expressions and bulk_update

michal . modzelewski Mon, 17 Nov 2014 11:41:16 -0800

I've been working on a bulk_update method for a project to allow saving 
many instances of a Model in a single call/query, an analog to the existing 
bulk_create method.
There seems to be some interest in this as evidenced by a library 
django-bulk-update <https://github.com/aykut/django-bulk-update> (which 
isn't working for me in tests with python3 and SQLite) and a recent ticket 
<https://code.djangoproject.com/ticket/23646>.


The ticket was closed as wontfix with an invitation for discussion, 
although the author doesn't seem to have pursued it. However, Shai Berger 
did comment positively on the ticket and proposed a different API, similar 
to the one in django-bulk-update and my own prototype:

Book.objects.update_many(books, 'price')

My own attempt was on Django 1.7 using a subclass of ExpressionNode passed 
as a value into UpdateQuery.add_update_values which allows Django to 
generate most of the UPDATE SQL, while the expression subclass was only 
responsible for generating an SQL CASE expression. This quickly generalized 
into a Case expression class which could be passed as a value into 
QuerySet.update.

But now that Query Expressions have been merged, I've done a quick port and 
the Case expression now works with updates, annotations and aggregates, and 
is used to power my bulk_update method.

A simple performance test shows that this works quite well. I tested the
following methods of updating many database rows, using an in-memory SQLite
database:

# loop_save
for o in objects:
    o.save(update_fields=['field'])

# bulk_update
MyModel.objects.bulk_update(objects, update_fields=['field'])

# many_updates
q = Q(condition__lt=test_value)
MyModel.objects.filter(q).update(field=value1)
MyModel.objects.filter(~q).update(field=value2)

# case_update
q = Q(condition__lt=test_value)
MyModel.objects.update(field=Case([(q, value1)], default=value2,
                                  output_field=MyModel._meta.get_field('field')))

With 10000 objects in the database, and updating all of them, I got these
run times using clock():

loop_save:    3.022845626653664
bulk_update:  0.1785595393377175
many_updates: 0.009768530320993563
case_update:  0.009343744353718986

We can see that bulk_update outperforms looping by an order of magnitude,
despite working by generating a CASE expression with a WHEN clause for
every object, conditioned on its pk.
Using Case in an update gives no significant performance improvement over
running multiple queries. It may arguably allow for more readable code when
using many conditions, since the semantics of CASE are that conditions are
evaluated in order, like a Python if ... elif ... else. Compare this:

MyModel.objects.filter(Q(condition__lt=test_value1)).update(field=value1)
MyModel.objects.filter(Q(condition__gte=test_value1,
                         condition__lt=test_value2)).update(field=value2)
MyModel.objects.filter(Q(condition__gte=test_value2,
                         condition__lt=test_value3)).update(field=value3)
MyModel.objects.filter(Q(condition__gte=test_value3,
                         condition__lt=test_value4)).update(field=value4)
MyModel.objects.filter(Q(condition__gte=test_value4)).update(field=value5)

as opposed to:

MyModel.objects.update(field=Case([(Q(condition__lt=test_value1), value1),
                                   (Q(condition__lt=test_value2), value2),
                                   (Q(condition__lt=test_value3), value3),
                                   (Q(condition__lt=test_value4), value4)],
                                  default=value5,
                                  output_field=MyModel._meta.get_field('field')))
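
The in-order semantics can be checked outside Django. What follows is a
minimal sketch using only Python's standard sqlite3 module. The table t, its
columns, the thresholds and the values are all made up for illustration, and
the SQL is only the shape that a Case expression would presumably produce,
not output from the Case class described in this post.

```python
import sqlite3

# Hypothetical thresholds and values standing in for test_value1..4
# and value1..5 in the comparison above.
THRESHOLDS = [10, 20, 30, 40]
VALUES = ["v1", "v2", "v3", "v4", "v5"]


def make_db():
    """Create an in-memory table with 50 rows where cond == id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, cond INTEGER, field TEXT)"
    )
    conn.executemany(
        "INSERT INTO t (id, cond) VALUES (?, ?)", [(i, i) for i in range(50)]
    )
    return conn


def many_updates(conn):
    """One UPDATE per range, each with explicit lower and upper bounds."""
    conn.execute(
        "UPDATE t SET field = ? WHERE cond < ?", (VALUES[0], THRESHOLDS[0])
    )
    for lo, hi, value in zip(THRESHOLDS, THRESHOLDS[1:], VALUES[1:]):
        conn.execute(
            "UPDATE t SET field = ? WHERE cond >= ? AND cond < ?",
            (value, lo, hi),
        )
    conn.execute(
        "UPDATE t SET field = ? WHERE cond >= ?", (VALUES[-1], THRESHOLDS[-1])
    )


def case_update(conn):
    """A single UPDATE; the first matching WHEN wins, so no lower bounds."""
    whens = " ".join("WHEN cond < ? THEN ?" for _ in THRESHOLDS)
    params = [p for pair in zip(THRESHOLDS, VALUES) for p in pair]
    conn.execute(
        f"UPDATE t SET field = CASE {whens} ELSE ? END",
        params + [VALUES[-1]],
    )


def rows(conn):
    return conn.execute("SELECT id, field FROM t ORDER BY id").fetchall()


a, b = make_db(), make_db()
many_updates(a)
case_update(b)
same = rows(a) == rows(b)  # both approaches leave identical data
```

Because the WHEN clauses are tried in order, each one only needs an upper
bound, which is where the shorter Case version gets its readability.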

With only 10 objects in the database, and updating all of them, the times
are:

loop_save:    0.004131221428270882
bulk_update:  0.0006249582329893796
many_updates: 0.0006562061446388481
case_update:  0.00036259952923938313

So the bulk_update performance gains are still there, and the update using a
Case expression is faster because SQL generation time dominates in this
example, and only one query needs to be generated and executed.

Case expressions can also be used in other ways (although these examples 
use a subclass SimpleCase, which is described in a little more detail 
below, for simpler syntax):

# annotation
MyModel.objects.annotate(status_text=SimpleCase(
    'status',
    [('S', 'Started'), ('R', 'Running'), ('F', 'Finished')],
    default='Unknown', output_field=CharField()))

# aggregation
MyModel.objects.aggregate(
    started=Sum(SimpleCase('status', [('S', 1)], default=0,
                           output_field=IntegerField())),
    running=Sum(SimpleCase('status', [('R', 1)], default=0,
                           output_field=IntegerField())),
    finished=Sum(SimpleCase('status', [('F', 1)], default=0,
                            output_field=IntegerField())))

In these examples, SimpleCase is a subclass of Case that generates the SQL
simple CASE expression, which is an equality test against a field.
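
For readers who want to see what the annotation and aggregation examples
amount to in SQL, here is a rough sketch run directly against SQLite with
the standard sqlite3 module. The job table and its rows are hypothetical,
and the queries are hand-written approximations of what SimpleCase might
generate, not actual output from the prototype.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT)")
# Seven hypothetical rows: two started, one running, three finished,
# and one with a status code that no WHEN clause matches.
conn.executemany(
    "INSERT INTO job (status) VALUES (?)", [(s,) for s in "SSRFFFX"]
)

# Annotation: map each status code to a label with a simple CASE.
label_sql = """
    SELECT id,
           CASE status WHEN 'S' THEN 'Started'
                       WHEN 'R' THEN 'Running'
                       WHEN 'F' THEN 'Finished'
                       ELSE 'Unknown' END AS status_text
    FROM job ORDER BY id
"""
labels = [text for _id, text in conn.execute(label_sql)]

# Aggregation: count rows per status in one query by summing 1 or 0.
count_sql = """
    SELECT SUM(CASE status WHEN 'S' THEN 1 ELSE 0 END),
           SUM(CASE status WHEN 'R' THEN 1 ELSE 0 END),
           SUM(CASE status WHEN 'F' THEN 1 ELSE 0 END)
    FROM job
"""
started, running, finished = conn.execute(count_sql).fetchone()
```

The aggregation form computes all three counts in a single pass over the
table, instead of one filtered COUNT query per status.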

So is this something that would be a worthwhile addition to core?
It could work now as a separate module, but the update example requires
setting output_field. My original Django 1.7 code abused the
prepare_database_save method that SQLUpdateCompiler called on an
ExpressionNode to get the field that the expression would be assigned to
automatically. After the merge of Query Expressions, SQLUpdateCompiler calls
either resolve_expression or prepare_database_save, so I can no longer use
this. A change in SQLUpdateCompiler would be required, with a new API for
setting output_field automatically on ExpressionNodes used as update
values.

If this were considered for core, I would appreciate input on the API. I 
currently have:

Case(list_of_case_tuples, default, output_field) - A general CASE 
expression called a searched case in the SQL spec. list_of_case_tuples is 
an iterable of tuples of the form (Q_object, value), and default is the 
value for the ELSE clause. This generates SQL like:

CASE WHEN n > 0 THEN 'positive' WHEN n < 0 THEN 'negative' ELSE 'zero' END

SimpleCase(fieldname, list_of_simple_case_tuples, default, output_field) - 
A simple case expression in the SQL spec. fieldname is a field identifier 
like in an F expression, list_of_simple_case_tuples is an iterable of 
tuples of the form (condition_value, result_value). This generates SQL like:

CASE n WHEN 1 THEN 'one' WHEN 2 THEN 'two' ELSE 'other' END

Model.objects.bulk_update(list_of_instances, update_fields) - An analog to
the bulk_create method that saves changes in many model instances.
list_of_instances is an iterable of model instances, update_fields is the
same as the argument to Model.save of the same name.
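
To make the "one WHEN clause per object, conditioned on its pk" idea
concrete, here is a minimal sketch of that query shape using the standard
sqlite3 module rather than the ORM. The helper name bulk_update, the book
table and the pk-to-value mapping are assumptions made for illustration;
the prototype's actual SQL may differ.

```python
import sqlite3


def bulk_update(conn, table, field, new_values):
    """Set `field` for many rows in a single UPDATE statement.

    `new_values` maps primary key -> new value. `table` and `field` are
    interpolated into the SQL, so they must be trusted identifiers; only
    the keys and values are passed as bound parameters.
    """
    if not new_values:
        return
    # One "WHEN id = ? THEN ?" clause per row to be updated.
    whens = " ".join("WHEN id = ? THEN ?" for _ in new_values)
    marks = ", ".join("?" for _ in new_values)
    # The WHERE clause keeps rows outside the mapping untouched; without
    # it, the CASE would evaluate to NULL for every unmatched row.
    sql = (
        f"UPDATE {table} SET {field} = CASE {whens} END "
        f"WHERE id IN ({marks})"
    )
    params = [p for item in new_values.items() for p in item]
    params += list(new_values)
    conn.execute(sql, params)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, price INTEGER)")
conn.executemany(
    "INSERT INTO book VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)]
)
bulk_update(conn, "book", "price", {1: 11, 3: 33})
prices = dict(conn.execute("SELECT id, price FROM book"))
```

Each updated row costs three bound parameters in this sketch, and SQLite
caps the number of parameters per statement, so a real implementation would
presumably need to split large batches into several queries.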

Of course I will put up a branch on github if there is interest in this 
proposal. Thanks for your time and sorry for the wall of text.

- Michael
