[RBASE-L] - Re: How do I make NULL truly NULL? - A Word of Caution With NULLs in Research & Analysis ...

Wills, Steve Tue, 01 Apr 2008 09:23:14 -0700

********************************************
*** Don't Get Burned By NULL-Conclusions ***
********************************************
Now, most of y'all probably already know something about this - hopefully, 
gained by observation of the bitter experience of others - but, given a couple 
of phrases I saw in one each of Rachael's and Alastair's messages, I would like 
to caution that one must be EXTREMELY cautious about NULL's versus 0's [zeroes] 
in the area of research and/or analysis.  In simple terms, a NULL value should 
(almost?) ALWAYS be excluded from any sort of numeric calculation or the 
results might cost someone their job, literally ... well, unless their 
superiors and/or clients never bother to check the data.


RBase, TTBOMK, has always handled NULL-related calculations appropriately by 
default.  In a GROUP BY, for example, NULLs will be GROUPed, which is 
appropriate.  However, if NULLs are present in a column on which a calculation 
is being performed, such as COUNT(), SUM(), AVG(), etc, they are EXCLUDED from 
the calculation.

(I think I even recall from way back in RB System V[?] two variants of either 
SELECT ... and/or COMPUTE ..., one of which included NULLs and one of which did 
not.  The utility, at the time, seems to have been related to knowing how many 
records were actually in the table versus how many of them had values.  At this 
point in time, my memory may fail me on this, but I'm sure that someone else 
can elaborate further, if they feel it helpful and/or have too much time on 
their hands ... but I digress.)
 
*******************************************
*** NULL <> AnyThing [including 0/Zero] ***
*******************************************
Simon and Garfunkel sang, "I'd rather be a hammer than a nail."  Well, had they 
been db folk, they might well have said, "I'd rather be a zero than a NULL."  
Again, most especially from a research/analytical perspective, NULL and 0 
[zero] are two DISTINCTLY DIFFERENT things.

Let's say that you are a marine biologist/environmental engineer and you are 
collecting samples in sample tubes of the floodwaters after Hurricane Katrina 
(a good friend of mine was the Principal Investigator leading this research 
effort).  The purpose was to assess the water for the presence of chemicals, 
their amounts, toxicity levels, etc.  Now, let's say that 1000 samples of 100ml 
each were collected.  Then, let's say that 100 of the sample tubes could not be 
properly analyzed for various, albeit valid, reasons and let's also say that 
200 samples showed absolutely no presence of some dangerous chemical, which 
we'll call "X".  

Those 100 un-assayable samples are, of course, NULLs in the data.  Samples were 
collected, but they produced no data.  As such, they cannot be included in any 
calculations or conclusions about the nature of the post-Katrina floodwaters as 
they tell us nothing about it.  Conversely, the 200 samples indicating an 
absence of chemical "X" are NOT NULLs.  What they tell us is that there is a 
PRESENCE of 0/zero quantity of chemical "X" in these samples.  We don't need to 
discuss a geographical grid-array for relative differences, water flow, 
potential sources, or anything more sophisticated to illustrate from these 
samples that NULL and 0/zero are different.  The 100 samples that told us 
nothing remain "nothings", i.e. NULLs in the data.  Those 200 samples which 
told us that there was 0 chemical "X" will have a 0 in the rows for 
observations of chemical "X" for each of those samples.

This should seem a pretty basic understanding.  However, distinguishing between 
what is a NULL and what is a 0/zero can be a bit more challenging as the 
difference may be subtle, but certainly discrete.
 
So, let's also use an example from sports, say baseball.  Let's think about 
pitching records.  The last Pittsburgh Pirate - your home team, RBTI - to 
garner the Cy Young Award was Doug Drabek, in 1990.  That season, he pitched 22 
Victories and 6 Losses.  (We will avoid ERA, although it's a great metric for 
pitchers, as it fails to illustrate my point.)  Let's calculate a simple 
winning percentage of W%=((Games_Won/(Games_Won + Games_Lost)) * 100).  
Drabek's W% for 1990 was 78.57=((22/(22 + 6)) * 100).  

However, his team played 162 regular games that season.  

It would probably be unfair and would certainly be inaccurate to factor into 
the calculation of his W% those games in which he did not pitch.  Still, he was 
on the roster for each of those games.  However, were we to unload 
TeamRoster_Game data for all 162 games, an un-normalizedby but not atypical 
extract, and, in so doing, treated Drabek's "NULL-appearances" as 0/Zeroes.  
Without some degree of expertise, both in the subject matter and the nature of 
the extract data, his W% could appear as "whacky" as a wild-pitch from a 
knuckleballer.  You see, it all depends on what is desired of the data and how 
the data is handled or treated.  


******************************************************
*** MISSING DATA VALUES, "NULLs, with personality" ***
******************************************************
Now, if NULLs are conceptually troublesome, one may and should, according to 
good research practices, create a set of "Missing Data Values" to use rather 
than just a NULL.  In fact, this topic appears to be an area of great interest 
among researchers/statisticians, including how to impute those values ... but 
that's WAY BEYOND my expertise.

(Razzak, as I recall, you do||did possess that background, so feel free to 
chime in, in the case that I haven't managed yet to induce an extreme soporific 
state among the audience.)

If you're still awake here, Missing Data Values are, typically, ultimately 
cast||stored as some sort of numeric value, just like valid data collected from 
samples.  They also have some sort of descriptive text label associated with 
them.  These then, even though they are NOT observational or sample data, 
become valid data points, i.e. part of the solution domain, rather than 
unexplained NULLs.

Possible MISSING_DATA_VALUE records:

SubjectMatterArea  MissingDataValue  ValueLabel
-----------------  ----------------  
----------------------------------------------
Biology                         -11  Sample Contaminated
Biology                         -33  Sample Lost - Trap Failed
Biology                         -34  Sample Lost - Trap Thawed
Biology                         -35  Sample Lost - Trap Missing
Biology                        -999  Instrument Failure
Survey Research                 -11  Respondent Refused To Answer (Telephone)
Survey Research                 -12  Respondent Failed To Answer (Paper)
Survey Research                 -22  Respondent Does Not Know
Survey Research                 -33  Response Not Intelligible 
(Writing/Recording)
Survey Research                -888  Did Not Ask - Respondent Broke Off 
Interview
Survey Research                -999  Did Not Ask - Programmatic SKIP


Obviously, these are not NULLs, although they do represent what could have been 
NULLs.  Of course, within any research topic, they cannot intersect with any 
"Existing Values" in the solution domain.  

So, if we're looking at a population count in a trap that is an integer ranging 
from 0..N, we would have to use some negative integer value for any Missing 
Data Values we define.  Then, we can exclude these Missing Data Values in our 
calculations by :

  "SELECT ... FROM DATA_POINTS WHERE DataValue >= 0 ...".  

Of course, if the domain of valid data values includes negative numbers, we 
must devise other values for Missing Data and, if this is the case, probably 
utilize the ValueTypeCode :

  "SELECT ... FROM DATA_POINTS WHERE ValueTypeCode <> 0 ..." 
  -- Assume that ValueTypeCode is a column in DATA_POINTS and 0 means 
'MissingDataValue'


All this effort is really to help the researcher and/or research administrator, 
primarily in two general areas.  One is to help the researcher do a little 
"CYA", as NULLs, especially a lot of NULLs - I know, this fundamentally 
illogical, as a bunch of nothin' is still a bunch of nothin' - among 
researchers, can automatically engender doubt about the research among 
peers/reviewers, whether it's about process or outcome.  The other benefit, in 
the area of research management, is to aid in devising improvements in the 
process, such as sample improvement, better data collection, improved 
instrumentation, etc.

Hopefully, if you made it this far, the potential impact of mishandling NULLs 
is apparent.  If not, try to think about some examples in your data and outputs 
that might be adversely impacted from a lack of awareness about NULLs and how 
to treat them.  Despite the best efforts at modeling a database structure, 
predicated and integrated with good business (data) anlaysis, NULLs can and do 
exist in the real world and can therefore exist in the virtual world.  However, 
they can also be meaningfully "morphed" by employing Missing Data Values, as 
the appropriate for the business or subject matter context.

This can be a very important aspect of research (data) and, you might have 
guessed, something with which I've had some degree of experience, vis-à-vis the 
data-side.  I have seen NULLs treated as 0's/zeroes and, amazingly, 0's/zeroes 
treated as NULLs.  There are - fortunately, not often - researchers who are 
sometimes careless in their understanding and/or treatment of their own data.  
Believe me, research-support level dba better have his/her ducks in a row when 
he/she makes the decision to point out such an error, even when the objective 
is to protect from disrepute the research organization, its work-product, and 
even that erring researcher.  

I will add that, IMNSHO, this was another good deed that did not go unpunished 
...

Well, let us calculate the value of my usual $0.02 by dividing it with the sum 
of my word-count ... well, thanks for listening and I hope this somehow helped 
someone.


NULL-ified,
Steve in Memphis






From: [email protected] [mailto:[EMAIL PROTECTED] On Behalf Of Alastair Burr
Sent: Monday, March 31, 2008 6:55 PM
To: RBASE-L Mailing List
Subject: [RBASE-L] - Re: How do I make NULL truly NULL?

Way back when I first started my database in R:Base one of _my_ first rules was 
don't have any nulls under any circumstances whatsoever!
 
It's been a good rule but, of course, I don't keep to it 100%
 
N/A for text, -1, 0 or 9999 for integers has served me well over all these 
years. In my experience, if you find you need a lot of nulls then you probably 
need a separate table. It's easy to convert your favourite representation into 
a meaningful comment in forms and reports: (IFEQ(Field, 9999, '(Unknown)', 
(CTXT(Field)) )).
 
A bit more thinking might be needed for exporting but it's easy enough to get 
R:base to do what you want or to convert in the receiving program - although 
R:Base is (usually) easier.
 
To really stick my head above the parapet I have never understood how anything 
can be null - if a value is not known then show that it is not known. To not 
know whether it is known or not known seems to me to be a failure -but take 
note of my second paragraph above!
 
Regards,
Alastair.
 
 
 
----- Original Message ----- 
From: Dennis McGrath 
To: RBASE-L Mailing List 
Sent: Monday, March 31, 2008 7:10 PM
Subject: [RBASE-L] - Re: How do I make NULL truly NULL?

Matt,

Have you tried exporting as ASCII when NULL = ' ' (a space)
Try it and look at the generated file.
There will be no spaces between the commas for a NULL value.
This is exactly what you need.

Here are my rules for the use of the NULL setting:
ALWAYS have NULL='-0-'  EXCEPT:
When exporting data as ASCII, CSV, etc.
When editing with a form
When printing a report

If you do not have NULL='-0-' and you edit tables manually, you will never get 
a null entered in text fields.
Try as you might, the value will always be a space.

The exact opposite is true in forms.  If null='-0-' not only will your forms 
look ugly, but you won't know the difference between NULL and '-0-'

When Unloading data and structure from RBASE, ALWAYS set NULL='-0-'

RBASE has worked like this as long as I can remember. 

Dennis McGrath

________________________________________
From: [email protected] [mailto:[EMAIL PROTECTED] On Behalf Of Matthew Burke
Sent: Monday, March 31, 2008 12:46 PM
To: RBASE-L Mailing List
Subject: [RBASE-L] - Re: How do I make NULL truly NULL?

       Well here's my problem with how it seems to work.  I understand the need 
to visually separate null from a blank space within R:Base.  So sure, when I 
open a table within R:Base show me some kind of value, -0- will do just fine.  
However when I export data from my table to say... and basic csv text file. 
That data should not be represented by a -0-, but nothing at all, in my 
opinion.  That way if I want to use the data in another database, or as in this 
case a third party wants to use it in another database, they can do so without 
the other database having to understand that -0- = null.  I thought about 
setting NULL = ' ' but even that wouldn't work, because even then the third 
party database would still have to know that null = ' '.  I'm certainly no 
database guru and I really appreciate all the help I receive from you guys and 
gals.  My only other database experience lies with MySQL and it's always been 
my understanding that null = nothing at all.

       So I guess ultimately I don't really need to know why R:Base uses text 
representation for null, but I need to know if there is a solution I can use to 
make null = nothing at all or empty.  Basically if 'empty' was or possibly is a 
valid command, the logic I want to accomplish is R> set null empty.  Is this 
possible?

To further explain what I want, here's an example.

Let's say we have a table with one row containing three fields of data; a, b, 
and c.
Field b is null in this case.  Something like this.

a
b
c
1
text
-0-
text

As my R:Base is setup now, if I were to export this data to a text file (csv) I 
get:
"text","-0- ","text"

The end result text file (csv) I want to achieve is:
"text","","text"

Thanks again for all your help with my novice problems/questions.

Matthew Burke
Pioneer Bank - IT Department
Computer Systems Technician
505-627-4415

This email contains PRIVILEGED AND CONFIDENTIAL information intended only for 
the use of the addressee(s) named above. If you are not the intended recipient 
of this email, or the employee or agent responsible for delivering it to the 
intended recipient, you are hereby notified that any dissemination or copying 
of this email is strictly prohibited. If you have received this email in error, 
please notify us by replying to this email and delete this email from your 
records. Thank you for your cooperation.

________________________________________
No virus found in this incoming message.
Checked by AVG. 
Version: 7.5.519 / Virus Database: 269.22.1/1350 - Release Date: 30/03/2008 
12:32

[RBASE-L] - Re: How do I make NULL truly NULL? - A Word of Caution With NULLs in Research & Analysis ...

Reply via email to