#22251: Dumpdata improvement suggestions
--------------------------------------------+--------------------
     Reporter:  Gwildor                     |      Owner:  nobody
         Type:  Cleanup/optimization        |     Status:  new
    Component:  Core (Management commands)  |    Version:  master
     Severity:  Normal                      |   Keywords:
 Triage Stage:  Unreviewed                  |  Has patch:  0
Easy pickings:  0                           |      UI/UX:  0
--------------------------------------------+--------------------
 The dumpdata and loaddata commands are the standard built-in management
 commands for dumping and loading a database to something other than a big
 unparsable SQL file. Perhaps using them in their current form is not the
 best idea for big databases, and a dedicated separate package should be
 used for that, but the fact is that these commands are currently part of
 Django, while their usability, from a user's standpoint, is lacking at
 best. While they do their job, they have a few big shortcomings that make
 them hard to use in their current form. These problems could be tackled
 with a few big tweaks, which would make these commands worthy again of
 being in Django natively.

 The problems I describe come from using the dumpdata command quite
 intensively over the course of the past two months, with the resulting
 unindented JSON dumps ranging in the 300-400MB area. Instead of using the
 loaddata command, I built a custom compatibility parser for our project,
 but I reckon these problems hold true for the loaddata command as well.

 In my opinion, the current usability problems with the commands are:

 === Complete lack of verbosity
 In its current form, you have absolutely no clue whether the command is
 still running properly (useful when deciding whether it has failed and you
 should kill it and try again) or what its progress is. Because its default
 behaviour is to write the serialization result to the console, the command
 output is usually redirected to a separate file, which makes this even
 worse. Usually, the only feedback you get from the command is it finally
 exiting and giving you back control over your console.

 Hence, something like this is extremely common:
 {{{
 $ ./manage.py dumpdata app1 app2 ... --format=json > dump.json
 $
 }}}
 So, to clarify: between these two lines in the console, an indefinite
 amount of time can pass. With the dumps I mentioned, the command could run
 for up to two hours, giving no indication at all about its state or
 progress. This is of course logical, because you are redirecting the
 output, but to me this is a major usability flaw. The only way to check
 whether the command is still running correctly is to monitor the process
 and see if its memory usage is still increasing; see the sketch below for
 what progress reporting could look like instead. Which actually brings me
 to my second point...
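
 To illustrate, here is a minimal sketch of what I mean. This is not meant
 as an actual patch and is not how dumpdata currently works; the function
 name and structure are invented. The idea is simply to write per-model
 progress to stderr, so that redirecting stdout to a file keeps working:
 {{{
 import sys

 from django.apps import apps
 from django.core import serializers


 def dump_with_progress(app_labels, out=sys.stdout, log=sys.stderr):
     for label in app_labels:
         for model in apps.get_app_config(label).get_models():
             qs = model._default_manager.all()
             log.write("Serializing %s.%s (%d rows)\n"
                       % (label, model.__name__, qs.count()))
             # Note: this writes one JSON document per model instead of
             # dumpdata's single list, so it is illustration only.
             serializers.serialize("json", qs.iterator(), stream=out)
 }}}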

 === Keeping everything in memory

 During serialization, the final result is built up in memory and returned
 as the final step of the command. To me, this is bad for a few reasons:
  * When the command stops unexpectedly, you are left with nothing.
  * It slows down your computer when you don't have enough memory.
  * It slows down the command itself when you don't have enough memory.
  * Possibly a lot of other things happen when you don't have enough memory
 (such as the process being killed by the OS? I'm not sure whether this
 ever happens with Python, but I see it a lot when trying to run big
 misconfigured Java programs).

 This is especially annoying when you get the dreaded "Unable to serialize
 database" error and the command just stops right there (which could be a
 ticket of its own), without returning the result it has accumulated up to
 that point. Combined with the above-mentioned lack of verbosity, this
 makes the command very annoying to use in some circumstances, depending of
 course on the size and state of your database.
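
 If I'm not mistaken, Django's serializers can already write to a stream
 object by object, so something along these lines seems possible. A rough
 sketch only, assuming a single queryset; the function name and file
 handling are my own invention:
 {{{
 from django.core import serializers


 def dump_to_file(queryset, path):
     with open(path, "w") as out:
         # With ``stream=``, the JSON serializer writes each object as it
         # goes instead of building one huge string in memory first, so a
         # crash still leaves everything serialized so far on disk.
         serializers.serialize("json", queryset.iterator(), stream=out)
 }}}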

 == Possible improvements to address these issues

 Off the top of my head, I've come up with these suggestions for
 improvement. Of course, diagnosing the problem correctly is the first and
 most important step, so I'm mainly listing these to get the discussion
 going and to end the ticket on a more optimistic note.

  * add a mandatory argument to dumpdata for a filename to write the result
 to, or generate one automatically if it's not given (such as
 "dump_20130311_1337.json"). This makes redirecting the output of the
 command unnecessary, and opens up the ability to add verbosity.
  * collect the number of models (and perhaps rows of data?) to dump, tell
 this to the user and give progress updates in between. This would fully
 eliminate the "is it stuck?" question I often have when running the
 command.
  * write each row of data as one JSON object on one line in the file; see
 the sketch after this list. Perhaps this could be added as a flag on both
 commands, just like indent already is? This would make it possible to read
 and write the dump file in a buffered manner, eliminating the need to load
 the entire result into memory. This is a tough one, though, because it's
 backwards incompatible if it's not added as a flag, and it requires
 rewriting the loaddata command as well.
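
 To make that last suggestion concrete, here is a rough sketch of what a
 one-object-per-line format could look like on both the dumping and the
 loading side (all names are invented; this is not an actual patch):
 {{{
 import json

 from django.core import serializers
 from django.core.serializers.json import DjangoJSONEncoder


 def dump_jsonl(queryset, fileobj):
     # One JSON object per line; only a single row is held in memory.
     for obj in queryset.iterator():
         data = serializers.serialize("python", [obj])[0]
         fileobj.write(json.dumps(data, cls=DjangoJSONEncoder) + "\n")


 def load_jsonl(fileobj):
     # Reading line by line keeps the loading side's memory use flat too.
     for line in fileobj:
         for deserialized in serializers.deserialize("python",
                                                     [json.loads(line)]):
             deserialized.save()
 }}}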

 As said, these are just some pointers on how the problems mentioned could
 be addressed, but I reckon everyone has their own views on it. Thanks in
 advance for your time and effort.

 This is the first ticket I've created, so apologies in advance for any
 shortcomings.
