Re: Accountability with MySQL

Stephen Cook Wed, 15 Mar 2006 16:48:57 -0800

A commonly used example (at least at my last 3 jobs) would be a table ofdemographics for people (whether they be employees, clients, whatever).

You can have one table and allow NULLs for some of the fields (id,LastName, FirstName, MiddleInitial, Title, NameSuffix, AddressLine1,AddressLine2, City, State, Zip, Phone1, Phone2, Fax, Pager), or you canhave a separate table for each of the nullable fields (I count 8 inaddition to the base demographics table).

I am not a master DBA but it seems kind of brain damaged to insist onusing 9 tables left joined together every time I need a line ofdemographics.



Martijn Tonies wrote:

Hello,

(please read all before replying)

Basics of database design: store what you know.

Given that NULLs basically means the absence of a value
(eg: unknown), you shouldn't be storing NULLs.


Nonsense!!


That's a bold statement ...

That's simply wrong. A null means "unknown or not applicable" and is

Ah, so now things become clear, a NULL actually can mean
two things? So much for clarity then...


Come on; that's not fair. "Unknown" and "not applicable" are more like
different senses of the same thing, not two opposite things.


Read the literature on how to design databases. What you do
is storing "true propositions". That is, each attribute defines a
certain "true proposition". For example:

Employee "Martijn" has Employee# 14.

You should be able to derive these sentences from every row.
Putting a NULL or N/A in there fails to meet this requirement.


Perhaps my theory is a bit rusty but I have never heard of this "true
propositions" business before. Do you have a citation where I can read


about

this?



I would certainly recommend this one:
http://www.amazon.com/exec/obidos/ASIN/0321197844/databasede095-20/103-0880452-0854264?creative=327641&camp=14573&adid=0VQ1KDVJ56MV7389RPEQ&link_code=as1

Frankly, I'm skeptical about your interpretation of this idea. While you
clearly don't want false information in a database, it isn't false to say
that student such-and-such's exam mark or such-and-such an employee's
termination date is unknown or not applicable. It _would_ be false to


store

a grade of 0 for a student who didn't take the exam because it implies


that

he got every question wrong when in fact he didn't write the test at all;



Agreed.

that would be an example of 0 having two contradicatory meanings. Given a
phobia about nulls, it is a reasonable design to put students who don't


take

the test in a separate table but it's not the ONLY way to handle the
situation: a null to indicate a student who didn't take the test is
perfectly reasonable too.



Disagreed. Let me explain further:
In this case, a "null" means (according to you and in your design)
that the student did not take the test.

It could also mean that we lots it's grade for now, and that we
want to fill it in later, as in: "unknown", which is what NULL also
means (your own words).

In another case, null would be the employee isn't terminated yet,
but could be in the future.

That's the problems with nulls -> there's no actual defined
meaning. You can never say:

"This table holds test results, a row will mean:
Student with StudentID 123 made test #12 and scored a 4"

Cause you can have rows that can mean this:
Student with StudentID 123 made test #12 and scored a
yet unknown score

or

Student with StudentID 123 made test #12 and did not make
the test

or
Student with StudentID 123 and test #12 doesn't apply to him


In short: NULL is ambigious.

perfectly valid value to use in many, many situations. Two classic


examples:

Let me first state that there's a difference between storing NULLs
and handling NULLs in your result.


Yes, of  course there is: so what? You said NULLs should never be


_stored_;

that's what I'm responding to. Using NULLs is not very _difficult_
anyway,
just a bit tedious because it is another case to handle.


There's no point in storing what you don't know.


But you _do_ know something: that the student didn't take the exam. That

is

a piece of true knowledge all by itself.



But that's not what the rows in the table would mean - see above.
In the relational model, the idea is to store true propositions - that is,
each row means the same thing, but with different attributes. See above.

So, let's debunk these two classic examples...


Please note that I did not say that you MUST use NULLs, just that they


were

perfectly valid to use in a design.


Depends on who you're asking ...

A few years ago, I was working at a company that had developed
an application with an Oracle database with about 400 tables, nothing
too large.

Plenty of tables had NULLs, not because of missing business data,
but rather used internally in the system.


I'm not sure what you mean by that phrase "used internally in the system".



Missing business data, eg: no middle name. Some people would
put in a NULL in there as well, while NULL can also mean: I do
not KNOW this persons middle name. Ambiguity, see above.

The number of problems we had with them, I cannot count on my
two hands, nor can I count them on the hands of all other developers
(5) for that project.


How about enumerating a few of them here? I've dealt with relational
databases in several capacities for over 20 years and I've never found


nulls

to be a big problem.



Not big, but a problem... :-)

Do you often go around looking for (little) problems while you
can easily avoid them?

Just last week, I spoke to a guy who used to be my collegue there
and we discussed the design a bit (he still does database design and
development for a large company, on Sybase with over 4000
deployed systems) and we both agreed that using NULLs in most
cases as part of the initial design was causing us more problems
then helping us.


Again, how about listing some of these problems?



Well, for one, we had a table that stored all kinds of different
actions that should take place. Each action could be part of
a list of actions.

Someone decided that actions with a NULL for the "belongs
to list of actions #" would be an action "by itself".

Business wise, one could create an order for a product and that
product had to be created by a list of certain actions. The order
creator would then use any pre-defined list of actions, or create
such a list himself by selecting the different actions.

Asking for stand alone actions would then (obviously simplified):
select * from bewerking where list is null

This is a slow query. Period. One that could have easily been
avoided by not storing NULLs, but using a separate table.
Why a table? Because "standalone actions" do not equal
"actions that belong to a list".

Okay, that might be acceptable, if it doesn't cause you to lose track

of

the

student altogether. But if this table was the only one that even


recorded

the _existence_ of the student, you'd have a problem;


Indeed, then I would have a problem, cause you cannot derive from
the "test results" table that a student exists. This is a rather silly
statement.
If the "test results" table should ALSO store the existence of a


student,

you're design is wrong :-)

if someone tried to
verify that the student had attended this school, you wouldn't know


that

they had.


In your own words: nonsense.

Each table should store what is was designed to store. If I would
have to know if a student was enlisted in a certain course or would
be attending this school, I would not be using the "test results" table.


Look, I agree that this scenario is not particularly likely to happen but

it

COULD happen.



It's not me who made up the example :-)

Suppose you run a small company that runs certification exams:
your only contact with the students is that they show up to take an exam

and

you record the mark and send it to whoever awards the certification. In


that

case, you likely wouldn't have an elaborate set of tables containing a
variety of student information the way you would if you were a university.
In that case, it's quite reasonable to believe that the only table in the
system is the one that records the test marks. That table may include the
student name, address, test name, score, etc. etc. That table may only be
populated as the student comes in the door: the 'greeter' welcomes the
student to the test center, asks his name and other details, points him to
the exam room, and records the test result when the student hands in the
exam. In that scenario, if student Joe Blow does not show up for the test
because his car broke down, he may never be entered in the system at all

and

you have therefore lost sight of the fact that he ever existed. Again,


this

is not a very likely scenario but it is not an impossible one and that's

the

key point: if it COULD happen, you will have a problem.



See above. There's no point in taking exams from random people.

You are trying to come up with all sorts of unlikely scenarios to
justify a solution that's, IMO, faulty.

Or in a more probable case, if that was the only test for that
course and the student missed it and then had no row in the table, you


might

not have any way of knowing that they took the course!


See above.


Dismissing the less probable case does not dismiss this more probable
scenario.



Wrong problem description for task at hand. See above (my previous
reply, not this one). Attack each problem with the right solution, don't
try to find problems in solutions that aren't meant to solve the problem
you just found.

And if they later
wanted to write the exam, having recovered from their illness, your


query

might have the effect of keeping them from taking the second exam: your
query would report that they had never taken the first exam so an
adminstrator might refuse to let him/her take the second exam because
they
(apparently) had never been scheduled to take the first one.


Different problem, see above.


Again, you can't dismiss the second scenario by dismissing the first one;
the scenarios are different and the second one is more probable than the
first.



See above. The "Different problem" remark still holds.

Now, you could have a second table to record people who were scheduled

to

take tests but failed to take them to cover that situation but I think

it

Sounds like a decent design to me.


Not surprisingly since it is the one you suggested as the "correct" way to
handle the situation. But the design that uses nulls is NOT invalid and

you

have not demonstrated that it is.



What I DID demonstrate, however, is that it's _very_ easy to
create a design without NULLs, that is much clearer to read,
doesn't have to handle NULLs, avoids NULLs in general and
make it easy to understand what problems are solved.

Considering maintainability, ability to refactor and without having
to think up missing data, I prefer my design.

Do note I don't claim to have the key to every problem, but I
would like to know people that design WITH NULLs doesn't
have to be the standard although SQL provides them.

would be easier to record all students in one table and then simply


store

null for any test that they fail to take and a zero for every case


where

a
student got every answer wrong. When you compute the class average, the
avg() function would ensure that the students who got every question
wrong
would pull down the class average but that students who failed to write

the

test at all would NOT skew the average because the avg() function


ignores

nulls.


Ah, an excellent example of why NULLs are tedious: they are ignored
by some functions, but not by others.


_ALL_ of the column functions ignore nulls: it is NOT inconsistent. In


fact,

it is the only logical thing that these functions can do, which you'd
realize if you gave it a bit of open-minded thought.



Don't accuse me of not being open minded. I did develop systems
with NULLs before. I became a database developer back then
from practicing. However, I'm beginning to understand the theory
more and more and dislike NULLs more and more because of it.

Suppose you had a table with a primary key of employee number and a column
containing bonuses. Some of the rows contain nulls in the bonus column
because the employee didn't earn a bonus yet. Now, you need to answer


these

questions: what is the largest bonus amount? What is the smallest bonus
amount? What is the sum of the bonuses? What is the average bonus? Since a
null, by definition, is unknown, it isn't bigger or smaller or equal to

any

other value, including another null so it can't qualify as the biggest


value

or the smallest value. A number plus (or minus, or times, or divided by) a
null is undefined so it makes no sense to add it to the sum of the


bonuses.

A null is NOT a zero so treating it as a zero in an average would unduly
skew the result. So, in each case, the function does the logical thing and
IGNORES the null.

Without having the record there in the first place, there would have
to be no rule of AVG ignoring NULLs. Problem solved.


That is the other way of the solving the problem. But it is NOT the ONLY
way.

Why store a date column if you don't know?

Why not use:

EMPLOYEES
EmployeeID int,
StartingDate Date,
...

TERMINATED_EMPLOYEES (albeit a bit agressive ;) )
EmployeeID
TerminationDate

Once more: perfectly valid design.


Yes, that is also a valid design but it means you have to have yet
another
table that you could have avoided simply by permitting a null in the
employee table. Hey, if you really want your tables to proliferate like
this, that's up to you.


It seems you have a fear of creating new tables ;-)


No, I don't. It seems to me that you have an undue fear of nulls.

This is what database systems are designed to do...


Database systems are designed to accomodate large quantities of tables but
any designer with any real world experience will advise you not to create
tables simply for the sake of creating tables: that is an unnecessary
proliferation of tables. Look at normalization. As you probably know, each
phase of normalization tends to increase the number of tables in the
database. I know for a fact that the theorists have identified at least up
to 17NF (17th Normal Form) but how many levels of normalization do
businesses typically use in their normalizations? Answer: Three.
(Occasionally four). And I think you'll find that the main reason is that
the benefits of normalizing beyond 3NF are outweighed by the increasing
number of tables to manage.

The key point is that NULLs _do_ work and are a legitimate design


decision.

You don't _have_ to use them but they can save you some work and reduce

the

number of tables you need. (They can also be a bit more work on the
programming side.)

I took your remarks to mean that NULLs were always a bad idea and were

symptom of bad design and I strongly disagree with that.


So I've noticed.


I'm trying to be gracious here and accept that I may just have


misunderstood

the emphasis in what you are saying.



See above.

The message I'm trying to get across is that:
1 - according to (proper) design literature, you should not use NULLs
if you don't have to


Perhaps you can cite some of this "proper" literature. I'm interested in
seeing their definition of what an appropriate time to use nulls is.

2 - NULLs can cause you more problems than you can think of


I'd like to hear what these problems are. I've never had any big problems
because of nulls.



Ah well, SQL (these days) provides plenty of functions to avoid
the problems. Coalesce being one of them...

How often do people write:

(probably not valid MySQL)
select firstname || coalesce(middlename || ' ', ' ') || lastname
from ...

3 - its sometimes much easier to avoid storing NULLs AND to be
able to refactor your database because of it


I agree that nulls can frequently be avoided and that avoiding them will
simplify _some_ situations.

4 - the meaning of NULL can change, so why store it in the first place


Simple: because it happens to be true when it is stored. In the case of

the

hypothetical employee, I store a null termination date when I hire him
because I don't know when he is going to leave.



Why store something that you -don't- know.

If he laters gives his
notice, then I know when he is leaving and can store that date for his
termination date instead of a null; then his row of the table is true


again,

based on the new facts.

If you are simply saying that you don't like them and prefer to use
different designs to avoid them, then I don't have any problem with


that.

I have seen that when I avoid storing NULLs, my applications
became more clear and easier to understand.


Beauty is in the eye of the beholder as we all know. I have no problem


with

having nulls in my tables and consider that a better design than a


separate

table for special cases most of the time.

Again, if you are saying that you don't like nulls and prefer to avoid


them

in your designs, I have no problem with that; that's just your personal
preference. It's the same as if we sat down to eat a meal and you asked

for

chocolate ice cream for dessert and I had vanilla; neither choice is
"wrong", they are just personal preference.



Well, to that we agree and I would order sugared whipped cream with it ;-)

But if you are stating categorically that nulls are always "bad" or


"wrong",

I disagree strongly. That's like saying only chocolate ice cream is
acceptable and that all other flavours are evil. That's just wrong.



We are not going to agree, that much is clear.

I think we raised some very interesting points though (which is exactly
why I started this in the first place).

I would avoid NULLs if I could. It makes - very often - more sense
to avoid them than to include them.

You say that it's not "bad" to have NULLs every now and then, a
statement to which I can agree up to a certain height.

I would also like to state, that I've seen design where NULLs is
grossly overused for all sorts of situations. Your simple examples
can be among those. IF you're using NULLs, you'd better be aware
of what you're doing. Just don't use NULL for the sole purpose of
avoiding a table or relation.

I'd still say that in general, the "true proposition" remark holds and
is very very valid. With that as the basis, design your database.
I think I've also described why using NULLs in a "true proposition"
makes them invalid. I would advice against that.


This topic will pop up every now and then. If there's one thing that
I would like, is that people stop using NULLs way too often, I think
we can agree on that.

I for one had many big and small problems with NULLs stored in
the database. That made me appreciate more how easy it is to avoid
them and how to handle situations differently.

Martijn Tonies
Database Workbench - development tool for MySQL, and more!
Upscene Productions
http://www.upscene.com
My thoughts:
http://blog.upscene.com/martijn/
Database development questions? Check the forum!
http://www.databasedevelopmentforum.com


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: Accountability with MySQL

Reply via email to