#22251: Dumpdata improvement suggestions
--------------------------------------------+--------------------
     Reporter:  Gwildor                     |      Owner:  nobody
         Type:  Cleanup/optimization        |     Status:  new
    Component:  Core (Management commands)  |    Version:  master
     Severity:  Normal                      |   Keywords:
 Triage Stage:  Unreviewed                  |  Has patch:  0
Easy pickings:  0                           |      UI/UX:  0
--------------------------------------------+--------------------
 The dumpdata and loaddata commands are the standard built-in management
 commands for dumping and loading a database to something other than a big
 unparsable SQL file. Perhaps using them in their current form is not the
 best idea for big databases, and a dedicated separate package should be
 used for that, but the fact is that these commands are currently part of
 Django, while their usability, from a user's standpoint, is lacking at
 best. While they do their job, they have a few big shortcomings that make
 them hard to use in their current form. These problems could be tackled
 with a few big tweaks, which would make these commands worthy again of
 being in Django natively.

 The problems I describe come from using the dumpdata command quite
 intensively over the course of the past two months, with the resulting
 unindented JSON dumps ranging in the 300-400MB area. Instead of using the
 loaddata command, I built a custom compatibility parser for our project,
 but I reckon these problems hold true for the loaddata command as well.

 In my opinion, the current usability problems with the commands are:

 === Complete lack of verbosity
 In its current form, you have absolutely no clue whether the command is
 still running properly (useful when deciding whether it has failed and you
 should kill it and try again) or what its progress is. Because its default
 behaviour is to write the serialization result to the console, the command
 output is usually redirected to a separate file, which makes this even
 worse. Usually, the only feedback you get from the command is it finally
 exiting and giving you back control over your console.

 Hence, something like this is extremely common:
 {{{
 $ ./manage.py dumpdata app1 app2 ... --format=json > dump.json
 $
 }}}
 So, to clarify: between these two lines in the console, an indefinite
 amount of time can pass. With the dumps I mentioned, the command could run
 for up to two hours, giving no indication at all about its state or
 progress. This is of course logical, because you are redirecting the
 output, but to me this is a major usability flaw. The only way to check
 whether the command is still running correctly is to monitor the process
 and see if its memory usage is still increasing; see the sketch below for
 what progress reporting could look like instead. Which actually brings me
 to my second point...
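
 To illustrate, here is a minimal sketch of what I mean. This is not meant
 as an actual patch and is not how dumpdata currently works; the function
 name and structure are invented. The idea is simply to write per-model
 progress to stderr, so that redirecting stdout to a file keeps working:
 {{{
 import sys

 from django.apps import apps
 from django.core import serializers


 def dump_with_progress(app_labels, out=sys.stdout, log=sys.stderr):
     for label in app_labels:
         for model in apps.get_app_config(label).get_models():
             qs = model._default_manager.all()
             log.write("Serializing %s.%s (%d rows)\n"
                       % (label, model.__name__, qs.count()))
             # Note: this writes one JSON document per model instead of
             # dumpdata's single list, so it is illustration only.
             serializers.serialize("json", qs.iterator(), stream=out)
 }}}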

 === Keeping everything in memory

 During serialization, the final result is built up in memory and returned
 as the final step of the command. To me, this is bad for a few reasons:
  * When the command stops unexpectedly, you are left with nothing.
  * It slows down your computer when you don't have enough memory.
  * It slows down the command itself when you don't have enough memory.
  * Possibly a lot of other things happen when you don't have enough memory
 (such as the process being killed by the OS? I'm not sure whether this
 ever happens with Python, but I see it a lot when trying to run big
 misconfigured Java programs).

 This is especially annoying when you get the dreaded "Unable to serialize
 database" error and the command just stops right there (which could be a
 ticket of its own), without returning the result it has accumulated up to
 that point. Combined with the above-mentioned lack of verbosity, this
 makes the command very annoying to use in some circumstances, depending of
 course on the size and state of your database.
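
 If I'm not mistaken, Django's serializers can already write to a stream
 object by object, so something along these lines seems possible. A rough
 sketch only, assuming a single queryset; the function name and file
 handling are my own invention:
 {{{
 from django.core import serializers


 def dump_to_file(queryset, path):
     with open(path, "w") as out:
         # With ``stream=``, the JSON serializer writes each object as it
         # goes instead of building one huge string in memory first, so a
         # crash still leaves everything serialized so far on disk.
         serializers.serialize("json", queryset.iterator(), stream=out)
 }}}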

 == Possible improvements to address these issues

 Off the top of my head, I've come up with these suggestions for
 improvement. Of course, diagnosing the problem correctly is the first and
 most important step, so I'm mainly listing these to get the discussion
 going and to end the ticket on a more optimistic note.

  * add a mandatory argument to dumpdata for a filename to write the result
 to, or generate one automatically if it's not given (such as
 "dump_20130311_1337.json"). This makes redirecting the output of the
 command unnecessary, and opens up the ability to add verbosity.
  * collect the number of models (and perhaps rows of data?) to dump, tell
 this to the user and give progress updates in between. This would fully
 eliminate the "is it stuck?" question I often have when running the
 command.
  * write each row of data as one JSON object on one line in the file; see
 the sketch after this list. Perhaps this could be added as a flag on both
 commands, just like indent already is? This would make it possible to read
 and write the dump file in a buffered manner, eliminating the need to load
 the entire result into memory. This is a tough one, though, because it's
 backwards incompatible if it's not added as a flag, and it requires
 rewriting the loaddata command as well.
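
 To make that last suggestion concrete, here is a rough sketch of what a
 one-object-per-line format could look like on both the dumping and the
 loading side (all names are invented; this is not an actual patch):
 {{{
 import json

 from django.core import serializers
 from django.core.serializers.json import DjangoJSONEncoder


 def dump_jsonl(queryset, fileobj):
     # One JSON object per line; only a single row is held in memory.
     for obj in queryset.iterator():
         data = serializers.serialize("python", [obj])[0]
         fileobj.write(json.dumps(data, cls=DjangoJSONEncoder) + "\n")


 def load_jsonl(fileobj):
     # Reading line by line keeps the loading side's memory use flat too.
     for line in fileobj:
         for deserialized in serializers.deserialize("python",
                                                     [json.loads(line)]):
             deserialized.save()
 }}}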

 As said, these are just some pointers on how the problems mentioned could
 be addressed, but I reckon everyone has their own views on it. Thanks in
 advance for your time and effort.

 This is the first ticket I've created, so apologies in advance for any
 shortcomings.
