On Sun, Jul 04, 2010 at 09:27:33PM +0200, Remy Blank wrote:

> I'm sure your input would be highly appreciated. None of the current
> developers is a DB expert either (AFAIK, I'm certainly not one, quite
> the opposite), so any help is welcome.

Sounds fun, but to clarify: I'm just a hobbyist too, though I have a 
book collection on the topic, and love to play around when time permits, 
but that's about it. You get what you pay for, I guess! ;)

>  - Even if a complete migration requires a complete rewrite, is it
> possible to migrate progressively from the current to the new structure?

Tough question. My gut feeling is no, because I think it will add a 
large amount of work overall. This change will touch virtually all 
aspects of the system. We have to rethink a lot of the elementary units,
layers and interfaces throughout, and continually evolving this in an 
active branch will be tough.

IMO it deserves a completely new, isolated research codebase. The new 
model and core middleware architecture are laid out first, and at some 
point the reusable parts of the existing codebase are ported, based on 
decisions made along the way. The first alpha is built from that 
foundation, and it is marketed as a new product with a (partial) 
upgrade path (or we gather enough information to decide on gradual 
integration into the main code).

> I don't believe in big, monolithic rewrites, as they are a sure way to
> stall the project for a long time. [...]

Without a doubt, this is true. But also consider that the existing system 
has been gold-plated for some time. Working with the old stuff, that has
a very different approach and lots of quirks in general, may inhibit 
freedom or progress in the new design, because we have to support a lot 
of legacy hacks along the way.

It will *not* be an easy task either way, and it is evident that a
complete rewrite will take a long time to reach release. I think the 
relative simplicity of the feature set, compared to the size of the 
codebase, is an argument against the dual/integrated approach: much of 
the current code, including some plugins, will practically "rewrite 
itself" directly on top of the new model, and doubtless some code can be 
ported or reused with minimal changes, like the wiki and other data 
processors (but I don't think it makes sense to evolve them together).

Ultimately the maintainers must make the call, of course. 

>  - I have read a few times that although a normalized database schema
> makes it very flexible and general, it also has a performance impact.
> From a layman point of view, having to join multiple tables has to be
> slower than not having to join tables, hasn't it? Again, I'm a noob in
> DB technology, so don't laugh too much if the question is stupid.

That's a great question, though much more complex than I bargained for ;-) 

Your concern is that of premature optimization, the root of all evil.
Performance is measured in many ways, and it's not about raw I/O speed 
at all (any database will pale in comparison to the filesystem).

When you express a query to the database, there is an optimizer that
figures out the best way to answer. It considers tables, indexes, joins, 
the order of comparisons and all other aspects of your query and how it 
applies to the model. It then makes a plan for what parts of what tables 
must be loaded from disk to RAM in what order, and returns the result of 
the query in the most efficient way possible. It also has some *very* 
sophisticated caching mechanisms to stay fast at runtime, in case you
were to ask similar questions again soon. This has been tweaked for ages, 
and to take advantage of it, you must play by the rules of relational 
databases. Normalization is a set of "best practices" for how to correctly 
model various types of information, so that it stays intact and can be 
efficiently queried, cached and updated without side effects. The point 
of a normalized model is that you can ask *any sort* of question about 
the data you have, combining as many or as few elements as you need for 
the job at hand, and still get the right answer in the most efficient way. 
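To put that in code: here is a toy sketch with Python's built-in sqlite3, 
using a made-up, much-simplified milestone/ticket schema (nothing like 
Trac's actual one). The point is only that one normalized model answers 
several different questions without any schema change or hand-written 
plumbing:

```python
import sqlite3

# A tiny normalized model: milestones and tickets live in their own
# tables, related by a foreign key (hypothetical, simplified schema).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE milestone (name TEXT PRIMARY KEY, due INTEGER);
    CREATE TABLE ticket (
        id INTEGER PRIMARY KEY,
        summary TEXT,
        status TEXT,
        milestone TEXT REFERENCES milestone(name)
    );
""")
db.executemany("INSERT INTO milestone VALUES (?, ?)",
               [("0.12", 100), ("0.13", 200)])
db.executemany("INSERT INTO ticket VALUES (?, ?, ?, ?)",
               [(1, "fix wiki", "closed", "0.12"),
                (2, "db rework", "new", "0.13"),
                (3, "ui polish", "new", "0.13")])

# Question one: open tickets per milestone, in due-date order.
for row in db.execute("""
        SELECT m.name, COUNT(t.id)
        FROM milestone m LEFT JOIN ticket t
          ON t.milestone = m.name AND t.status <> 'closed'
        GROUP BY m.name ORDER BY m.due"""):
    print(row)   # ('0.12', 0) then ('0.13', 2)

# Question two, same data, no schema change: all closed tickets.
print(db.execute(
    "SELECT id FROM ticket WHERE status = 'closed'").fetchall())  # [(1,)]
```

How the engine fetches the rows is the optimizer's problem, not ours.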

IMPORTANT NOTE: The following sections may not be entirely accurate as 
to the exact inner workings of a database system, or in any other regard. 

So option one is to send a query to the database, asking for the information
you need in a well-defined and well-understood language, SQL. The (well-
tested) optimizer is technically speaking a million wizards who joyously 
compute answers with all sorts of deep magic, and give it right back to 
you with a huge smile. You can easily change the question, and ask other 
things about the same data, adding or removing elements, criteria, 
comparing various aspects etc. The wizards will whip the answer right up, 
in the most effective way possible, as long as you have a normalized model
and ask a sane question. 
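You can actually watch the wizards at work. SQLite (the DB I happen to 
have at hand) exposes its plan via EXPLAIN QUERY PLAN; the table and 
index names below are invented, and the exact plan wording varies 
between SQLite versions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ticket (id INTEGER PRIMARY KEY, "
           "status TEXT, milestone TEXT)")

# Without an index, the planner can only scan the whole table.
plan = db.execute("EXPLAIN QUERY PLAN "
                  "SELECT id FROM ticket WHERE milestone = '0.13'").fetchall()
print(plan[0][-1])   # e.g. 'SCAN ticket' (wording varies by version)

# Give the wizards an index and the plan changes on its own;
# the query text stays exactly the same.
db.execute("CREATE INDEX idx_ticket_milestone ON ticket (milestone)")
plan = db.execute("EXPLAIN QUERY PLAN "
                  "SELECT id FROM ticket WHERE milestone = '0.13'").fetchall()
print(plan[0][-1])   # e.g. 'SEARCH ... USING ... idx_ticket_milestone ...'
```

Same question, better answer path, zero application code changed.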

Option two is to ignore this altogether, and pretend the database is a
filesystem without wizards. You tell them to cast an ancient fscanf()-spell
to load a giant bunch of key-value pairs - one wizard casts the spell, then 
they all get depressed and go to the basement to smoke a ton of heroin you
have bought for them (with your own time maintaining this model). Ancient 
spells still work, so you get your list of key-value pairs, and proceed to
JOIN it *manually* in a *nonstandard* way *at runtime* (by instantiating 
objects, appending to lists, or writing some other code that will be very 
hard to get right, since you are reinventing SQL). Now you need some code 
to enforce 
relationships between entities, to denote how a milestone object relates 
to specific ticket objects and all sorts of other business rules to handle 
mastertickets or whatever. By now, in an early prototype of the system, we 
are approaching the stage where we can answer very simple questions about 
the data using a very complex process, but *not* the other way around! 
We have successfully rescued a single brain-damaged wizard from the 
heroin hell downstairs, up to the middleware.
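For the skeptical, here is roughly what that hand-rolled join looks like 
in Python once the store hands back flat records. The data shape is 
invented; the point is that every piece of this is code we now own, 
test and maintain ourselves:

```python
# Suppose the store hands back flat key-value records, one dict per row
# (a made-up shape; no relationships exist until we write them).
tickets = [
    {"id": 1, "summary": "fix wiki", "milestone": "0.12"},
    {"id": 2, "summary": "db rework", "milestone": "0.13"},
]
milestones = [
    {"name": "0.12", "due": 100},
    {"name": "0.13", "due": 200},
]

# Reinventing JOIN by hand: build a lookup table, then stitch the
# records together in application code.
by_name = {m["name"]: m for m in milestones}
joined = [
    {**t, "due": by_name[t["milestone"]]["due"]}
    for t in tickets
    if t["milestone"] in by_name  # hand-rolled referential integrity, too
]
print(joined[0]["due"])   # 100
```

And every *different* question (group by status? order by due date?) means 
another hand-written routine like this one, where in SQL it is a one-line 
change to the query text.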

When you want to ask a slightly different question later, you are in deep 
shit because the other 999999 wizards have all become highly addicted 
hippies and by now have taken up ritual murder of cute kittens as their 
primary hobby, casting an occasional fscanf() for their dose. The one 
wizard you've rescued has no capacity to learn, so you'll spend ages 
getting just one more of them clean and re-purposed for work. When you 
do, he will inevitably be brain-damaged, and you have to spend eternity
checking if he's doing his job or needs fixing. 

First you have to design this system, then implement, test and maintain 
it, and at several points later invent various caching layers or other 
optimizations that the wizards knew perfectly well before you drugged them. 

All the shit happens simply because smoking heroin turns out to have a 
detrimental effect on wizards! Hope That Helps.

It is true, there are special cases when a denormalized model is best, but 
that's not an argument for giving all the wizards heroin when you first 
meet them. It's much better to give one of them a little bit of cocaine to 
get past some particularly mundane part of your model (say a consolidated 
Work Efforts or the custom field system that both *must* largely be 
denormalized and thus handled in code). Without fail, a denormalized part 
in the model *requires* that you train and babysit some drugged-down 
hippie fucker crawling around in the wastelands of your code. In exchange 
for that, you can have flexibility, speed, time to market or other wins.
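To show what that cocaine-wizard looks like in practice: a custom field 
system is the classic case, because the set of fields is open-ended, so 
it ends up as a name/value (EAV-style) table and the pivot back to "one 
object per ticket" has to happen in code. The schema below is a 
simplified assumption for illustration, not Trac's exact one:

```python
import sqlite3

# An EAV-style table: entity, attribute name, attribute value.
# Simplified, assumed schema, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ticket_custom "
           "(ticket INTEGER, name TEXT, value TEXT)")
db.executemany("INSERT INTO ticket_custom VALUES (?, ?, ?)", [
    (1, "effort", "3"), (1, "reviewer", "terje"),
    (2, "effort", "5"),
])

# The pivot happens in application code: this loop is the drugged-down
# wizard we must train and babysit ourselves.
fields = {}
for ticket, name, value in db.execute(
        "SELECT ticket, name, value FROM ticket_custom"):
    fields.setdefault(ticket, {})[name] = value
print(fields[1])   # {'effort': '3', 'reviewer': 'terje'}
```

Note that typing, validation and per-field querying all fall on this code 
too; that is the babysitting cost we accept in exchange for flexibility.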

And so, I think it will be easier to get rid of all the addicted wizards than
to re-integrate them with society in a 12-step program, and normalization 
is an argument *for* performance on all levels, not against it ;-)

As possible evidence for some aspects of all of the above, I submit this
link and a two-beer bet that straight wizards would outperform it:

http://FIXME.edgewall.org/timeline?daysback=30

Raise, anyone? ;-)

Terje

-- 
You received this message because you are subscribed to the Google Groups "Trac 
Development" group.
To post to this group, send email to [email protected].