On Mon, Jul 27, 2009 at 02:44:08PM -0700, CityDev scratched on the wall:

> More recently the term 'denormalise' has been used instead. This is
> where you repeat foreign data in a table to avoid the overhead of
> joins at runtime.
> Over the intervening years I can't ever remember denormalising data
> (even when dealing with eg 13 million insurance customers in a table).
> Is it OK nowadays to say always aim to be fully normalised - modern
> RDBMSs are usually powerful enough to cope with most anything?

Normalization is a fundamental part of the Relational Model. The first three Normal Forms were defined as part of the original research into the Relational Model.

As you might remember, the basic idea behind normalization is to eliminate all duplicate data. This is critical to data integrity, which is critical to the assumptions made by the Relational Model. A big part of normalization is making sure that each table (relation) is used to represent one and only one "thing." A side effect of this is that when you normalize your design there is only one copy of any given atomic data unit. If there is only one copy, it is impossible for that data to get out of sync, further supporting the goals of data integrity and consistency.

It also tends to INCREASE performance for everything but SELECT statements. In general, normalized databases tend to be smaller, thanks to all the duplicate elimination. That means less I/O, which is a big performance cost for many database systems. If a smaller database can hold the same represented data, it also means a higher percentage of the unique data can fit into a given size cache or memory pool, further reducing the need for I/O. It also means that if you need to change something (insert/update/delete), those changes tend to be extremely localized, making the change smaller and faster.

The one trick is that if you split your data up into a bunch of tables, you need to spend time gathering it back up for most queries. That means JOINs. There are two problems with JOINs. First, unless you're careful about your index placement and how the query is set up, they can become very expensive.
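To make that concrete, here's a minimal sketch using Python's built-in sqlite3 module (the schema, names, and data are all invented for illustration): a normalized pair of tables, the JOIN needed to gather the data back up, and what a single well-placed index does to the query plan.

```python
import sqlite3

# Hypothetical normalized schema: each customer's name lives in
# exactly one place, so there is only one copy to keep consistent.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL    NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders    VALUES (10, 1, 19.95), (11, 1, 5.00),
                                 (12, 2, 7.50);
""")

# Gathering the data back up requires a JOIN:
query = """
    SELECT o.amount
      FROM customers AS c
      JOIN orders    AS o ON o.customer_id = c.id
     WHERE c.id = 1
"""
amounts = [row[0] for row in db.execute(query)]
print(amounts)        # -> [19.95, 5.0]

def plan(sql):
    # Column 3 of EXPLAIN QUERY PLAN output is the human-readable
    # detail (exact wording varies between SQLite versions).
    return [row[3] for row in db.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan(query)  # with no index, expect a full scan of orders
db.execute("CREATE INDEX orders_customer ON orders(customer_id)")
after = plan(query)   # now expect a search using orders_customer
print(before)
print(after)
```

Renaming a customer is then a single-row UPDATE on customers rather than one write per order row, which is the "extremely localized" change described above.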
Normally they're fairly efficient, but it only takes a misplaced index to really screw things up. So you have to understand what you're doing and pay attention. Not a real big deal, but you need to know what is going on.

The other issue (which plays strongly into the first) is that a lot of people fundamentally cannot wrap their heads around JOINs. If you're uncomfortable with JOINs, you tend to design databases that don't require them. And that's really too bad, as the JOIN is, in many ways, *the* fundamental Relational operator.

So, in general, I would say yes... the goal should be 3NF (in nearly all cases, if you get to 3NF you get to 5NF... the last two forms cover a few somewhat unusual edge cases). The Normal Forms really buy you a lot of practical benefits, and are a lot more than just a bunch of stuffy academic design rules.

I think it is fair to view denormalization as a last-resort performance optimization, but it should be seen as an optimization, and it should be very nearly a last resort. It shouldn't be done until you can prove the performance isn't there, that there are no other practical ways of getting it back (e.g. better index placement), and that you've measured and profiled (and NOT guessed) that the JOINs (or whatever the denormalization is going to get rid of) are the actual cause of the performance issues.

The big exception to this, and where the most blatant denormalization takes place, is in data warehouses. That's a pretty special case, however. First, the data is basically write-once, read-many. Once the data is imported, it is usually treated as read-only. The lack of updates means there is less chance of inconsistencies being introduced-- assuming the original data is good. Second, the data is usually dumped from another database that has higher normalization. That is, the data is "known good" because it came from what was, in theory, a properly normalized and well-formed database.
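As a sketch of what that warehouse-style dump can look like (again with invented table and column names), a denormalized result is simply selected out of the normalized source into a real table and rebuilt wholesale on each batch import:

```python
import sqlite3

# Hypothetical normalized "source" database, as it might exist upstream.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER NOT NULL
                             REFERENCES customers(id),
                         amount REAL NOT NULL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 19.95), (11, 1, 5.00),
                              (12, 2, 7.50);
""")

def rebuild_totals(db):
    # Re-form the denormalized table from scratch on each import;
    # the JOIN is paid once here instead of on every read.
    db.executescript("""
        DROP TABLE IF EXISTS customer_totals;
        CREATE TABLE customer_totals AS
            SELECT c.name        AS name,
                   SUM(o.amount) AS total
              FROM customers AS c
              JOIN orders    AS o ON o.customer_id = c.id
             GROUP BY c.id;
    """)

rebuild_totals(db)
totals = db.execute(
    "SELECT name, total FROM customer_totals ORDER BY name").fetchall()
print(totals)
```

Because the data is read-only between imports, dropping and rebuilding the copy can't leave it out of sync with anything.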
This essentially turns the data warehouse into a series of "instanced views," where useful sets of data are "pre-queried": the common sub-parts of queries that would normally be joined together from several tables for nearly every query you might want to run. Essentially exactly the kind of thing you might use a view for, only you select it into a table. Since the database lacks updates except, typically, for nightly batch imports, this works out just fine as long as you remember to re-form the "views" each time you import new data.

On the other hand, with the advent of the data-driven website, many databases see a far greater number of reads than writes, so it makes some sense to push read performance over other issues. This is a tricky balance. It exposes a lot of data management problems and pushes more application logic and validation up into the database access layer. On the other hand, if the database access layer is a website (and not a desktop application) with one code base (fewer deployment problems) controlled by the same folks that run the database, there is at least some hope of keeping all the Is dotted and Ts crossed.

On the gripping hand, memory and processors are really cheap (fast storage, not so much) compared to programmer time. Unless the database has hundreds of thousands of rows, if you're running on a modern system chances are none of it matters. You'll generally do fine and it is doubtful the question will ever come up. Those running on a phone, perhaps not, but a server with a pile of RAM should let you do almost anything (just remember to adjust things so that said resources can be effectively used).

Overall, I'd say that normalization is like many "rules." There are good reasons for normalization, and the "always normalize" rule should be strongly respected. *Respecting* a rule doesn't always mean *following* a rule, however.
If you understand the rule, why it is there, the thoughts and ideas behind it, and the full consequences of breaking it, then you're allowed to make a conscious decision to cross that line and accept the consequences. Breaking a rule by default or out of ignorance is usually bad news, however.

"Normalize until it hurts, denormalize until it works."

  -j

-- 
Jay A. Kreibich < J A Y @ K R E I B I.C H >

"Our opponent is an alien starship packed with atomic bombs. We have
 a protractor." "I'll go home and see if I can scrounge up a ruler
 and a piece of string." --from Anathem by Neal Stephenson

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users