On Mon, Jul 27, 2009 at 02:44:08PM -0700, CityDev scratched on the wall:

> More recently the term 'denormalise' has been used instead. This is
> where you repeat foreign data in a table to avoid the overhead of
> joins at runtime.
> Over the intervening years I can't ever remember denormalising data
> (even when dealing with eg 13 million insurance customers in a table).
> Is it OK nowadays to say always aim to be fully normalised - modern
> RDBMSs are usually powerful enough to cope with most anything?

Normalization is a fundamental part of the Relational Model. The first three Normal Forms were defined as part of the original research into the Relational Model.

As you might remember, the basic idea behind normalization is to eliminate all duplicate data. This is critical to data integrity, which is critical to the assumptions made by the Relational Model. A big part of normalization is making sure that each table (relation) is used to represent one and only one "thing." A side effect of this is that when you normalize your design there is only one copy of any given atomic data unit. If there is only one copy, it is impossible for that data to get out of sync, further supporting the goals of data integrity and consistency.

It also tends to INCREASE performance for everything but SELECT statements. In general, normalized databases tend to be smaller, thanks to all the duplicate elimination. That means less I/O, which is a big performance cost for many database systems. If a smaller database can hold the same represented data, it also means a higher percentage of the unique data can fit into a given size cache or memory pool, further reducing the need for I/O. It also means that if you need to change something (insert/update/delete), those changes tend to be extremely localized, making the change smaller and faster.

The one trick is that if you split your data up into a bunch of tables, you need to spend time gathering it back up for most queries. That means JOINs. There are two problems with JOINs. First, unless you're careful about your index placement and how the query is set up, they can become very expensive.
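To make that concrete, here's a minimal sketch using Python's built-in sqlite3 module (the schema, names, and data are all invented for illustration): a normalized pair of tables, the JOIN needed to gather the data back up, and what a single well-placed index does to the query plan.

```python
import sqlite3

# Hypothetical normalized schema: each customer's name lives in
# exactly one place, so there is only one copy to keep consistent.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL    NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders    VALUES (10, 1, 19.95), (11, 1, 5.00),
                                 (12, 2, 7.50);
""")

# Gathering the data back up requires a JOIN:
query = """
    SELECT o.amount
      FROM customers AS c
      JOIN orders    AS o ON o.customer_id = c.id
     WHERE c.id = 1
"""
amounts = [row[0] for row in db.execute(query)]
print(amounts)        # -> [19.95, 5.0]

def plan(sql):
    # Column 3 of EXPLAIN QUERY PLAN output is the human-readable
    # detail (exact wording varies between SQLite versions).
    return [row[3] for row in db.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan(query)  # with no index, expect a full scan of orders
db.execute("CREATE INDEX orders_customer ON orders(customer_id)")
after = plan(query)   # now expect a search using orders_customer
print(before)
print(after)
```

Renaming a customer is then a single-row UPDATE on customers rather than one write per order row, which is the "extremely localized" change described above.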
Normally they're fairly efficient, but it only takes a misplaced index to really screw things up. So you have to understand what you're doing and pay attention. Not a real big deal, but you need to know what is going on.

The other issue (which plays strongly into the first) is that a lot of people fundamentally cannot wrap their heads around JOINs. If you're uncomfortable with JOINs, you tend to design databases that don't require them. And that's really too bad, as the JOIN is, in many ways, *the* fundamental Relational operator.

So, in general, I would say yes... the goal should be 3NF (in nearly all cases, if you get to 3NF you get to 5NF... the last two forms cover a few somewhat unusual edge cases). The Normal Forms really buy you a lot of practical benefits, and are a lot more than just a bunch of stuffy academic design rules.

I think it is fair to view denormalization as a last-resort performance optimization, but it should be seen as an optimization, and it should be very nearly a last resort. It shouldn't be done until you can prove the performance isn't there, that there are no other practical ways of getting it back (e.g. better index placement), and that you've measured and profiled (and NOT guessed) that the JOINs (or whatever the denormalization is going to get rid of) are the actual cause of the performance issues.

The big exception to this, and where the most blatant denormalization takes place, is in data warehouses. That's a pretty special case, however. First, the data is basically write-once, read-many. Once the data is imported, it is usually treated as read-only. The lack of updates means there is less chance of inconsistencies being introduced-- assuming the original data is good. Second, the data is usually dumped from another database that has higher normalization. That is, the data is "known good" because it came from what was, in theory, a properly normalized and well-formed database.
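As a sketch of what that warehouse-style dump can look like (again with invented table and column names), a denormalized result is simply selected out of the normalized source into a real table and rebuilt wholesale on each batch import:

```python
import sqlite3

# Hypothetical normalized "source" database, as it might exist upstream.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER NOT NULL
                             REFERENCES customers(id),
                         amount REAL NOT NULL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 19.95), (11, 1, 5.00),
                              (12, 2, 7.50);
""")

def rebuild_totals(db):
    # Re-form the denormalized table from scratch on each import;
    # the JOIN is paid once here instead of on every read.
    db.executescript("""
        DROP TABLE IF EXISTS customer_totals;
        CREATE TABLE customer_totals AS
            SELECT c.name        AS name,
                   SUM(o.amount) AS total
              FROM customers AS c
              JOIN orders    AS o ON o.customer_id = c.id
             GROUP BY c.id;
    """)

rebuild_totals(db)
totals = db.execute(
    "SELECT name, total FROM customer_totals ORDER BY name").fetchall()
print(totals)
```

Because the data is read-only between imports, dropping and rebuilding the copy can't leave it out of sync with anything.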
This essentially turns the data warehouse into a series of "instanced views," where useful sets of data are "pre-queried": the common sub-parts of queries that would normally be joined together from several tables for nearly every query you might want to run. Essentially exactly the kind of thing you might use a view for, only you select it into a table. Since the database lacks updates except, typically, for nightly batch imports, this works out just fine as long as you remember to re-form the "views" each time you import new data.

On the other hand, with the advent of the data-driven website, many databases see a far greater number of reads than writes, so it makes some sense to push read performance over other issues. This is a tricky balance. It exposes a lot of data management problems and pushes more application logic and validation up into the database access layer. On the other hand, if the database access layer is a website (and not a desktop application) with one code base (fewer deployment problems) controlled by the same folks that run the database, there is at least some hope of keeping all the Is dotted and Ts crossed.

On the gripping hand, memory and processors are really cheap (fast storage, not so much) compared to programmer time. Unless the database has hundreds of thousands of rows, if you're running on a modern system chances are none of it matters. You'll generally do fine and it is doubtful the question will ever come up. Those running on a phone, perhaps not, but a server with a pile of RAM should let you do almost anything (just remember to adjust things so that said resources can be effectively used).

Overall, I'd say that normalization is like many "rules." There are good reasons for normalization, and the "always normalize" rule should be strongly respected. *Respecting* a rule doesn't always mean *following* a rule, however.
If you understand the rule, why it is there, the thoughts and ideas behind it, and the full consequences of breaking it, then you're allowed to make a conscious decision to cross that line and accept the consequences. Breaking a rule by default or out of ignorance is usually bad news, however.

"Normalize until it hurts, denormalize until it works."

  -j

-- 
Jay A. Kreibich < J A Y @ K R E I B I.C H >

"Our opponent is an alien starship packed with atomic bombs. We have
 a protractor." "I'll go home and see if I can scrounge up a ruler
 and a piece of string." --from Anathem by Neal Stephenson

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users