[RBASE-L] - Re: How do I make NULL truly NULL? - A Word of Caution With NULLs in Research & Analysis ...

Wills, Steve Wed, 02 Apr 2008 13:15:04 -0700

Actually Alastair, it was something you said that touched on the concept of 
Missing Data Values.


>>Way back when I first started my database in R:Base one of _my_ first rules 
>>was don't have any nulls under any circumstances whatsoever!
>> ...
>>N/A for text, -1, 0 or 9999 for integers has served me well over all these 
>>years.

It seems that you have developed your own domain of Missing Data Values, even 
if you don't explicitly call them that.  I'm sure there are others, but your 
(good) thinking is part of what prompted me to wax so verbose yesterday ... but 
let no person, including myself, try to blame you for my long message. ;-)

It's just that when I saw that you had already arrived at something that I 
never really considered until I worked in a university research group, I 
thought I would attempt to elaborate on what I believe you were sharing with 
Rachael.

Hope you didn't take anything I wrote as being any sort of correction for you.

Also, I agree with you about the presence of too many NULLs extant in the data 
being a potential indicator to re-visit the design/data model, but I'm not 
gonna' go into that at all as I think yesterday's missive was enuf' from me for 
now ...  :-)

Sincerely,
Steve in Memphis

 
-----Original Message-----
From: [email protected] [mailto:[EMAIL PROTECTED] On Behalf Of Alastair Burr
Sent: Wednesday, April 02, 2008 1:33 PM
To: RBASE-L Mailing List
Subject: [RBASE-L] - Re: How do I make NULL truly NULL? - A Word of Caution 
With NULLs in Research & Analysis ...

I'm not sure which particular phrase of mine caught your eye, Steve, but 
just in case there was any misunderstanding:

I was NOT saying that I use anything other than -0- to represent a null.

What I was saying is that I always have something to represent a positive 
entry.
In a text field that will usually be 'N/A' or '(unknown)'.
In a number (integer/numeric/double) it will be a number that cannot be part 
of the expected results.
Unknown time is 0:00:00 - my time values are timings (length) rather than 
clock time.
Unknown date is 01/01/1900 - dates haven't reached that far back.

With, I think, only one exception I never have to change my null setting 
from -0- to anything else because there is always data.

Instead of a where clause "where xyz is NOT NULL" I would use "where xyz < 
9999" or "where xyz <> 'N/A' "

Why no nulls? Because I've seen too many errors if something can be left 
"empty". Because I've had people say in the days when computers weren't as 
reliable as now "the machine's lost the data" or "what idiot didn't fill 
that in?"

Yes, R:Base handles nulls just fine and always has - I'd just rather not 
have any if it can be avoided.

As a very quick example I have a menu that is used in various places:

Excellent (returns an integer value of 5)
Good (returns an integer value of 4)
Average (returns an integer value of 3)
Low (returns an integer value of 2)
Poor (returns an integer value of 1)

The column will not accept nulls but will accept those integer values along 
with:
(Unknown) (returns an integer value of 9)
which does not appear on the displayed menu/combobox

I choose to enter dummy rows and then edit the default data however my forms 
will not allow a change _to_ a value of 9 for (Unknown) - the value is 9 to 
start with but it has to be updated on row edit.
A report would show the text values including '(Unknown)' if there were any.

As ever, it works for me but that doesn't mean that I'm right or it's the 
only way.
Not-null or, indeed, more than zero regards,
Alastair.


----- Original Message ----- 
From: "Wills, Steve" <[EMAIL PROTECTED]>
To: "RBASE-L Mailing List" <[email protected]>
Sent: Tuesday, April 01, 2008 5:20 PM
Subject: [RBASE-L] - Re: How do I make NULL truly NULL? - A Word of Caution 
With NULLs in Research & Analysis ...


********************************************
*** Don't Get Burned By NULL-Conclusions ***
********************************************
Now, most of y'all probably already know something about this - hopefully, 
gained by observation of the bitter experience of others - but, given a 
couple of phrases I saw in one each of Rachael's and Alastair's messages, I 
would like to caution that one must be EXTREMELY cautious about NULL's 
versus 0's [zeroes] in the area of research and/or analysis.  In simple 
terms, a NULL value should (almost?) ALWAYS be excluded from any sort of 
numeric calculation or the results might cost someone their job, literally 
... well, unless their superiors and/or clients never bother to check the 
data.

RBase, TTBOMK, has always handled NULL-related calculations appropriately by 
default.  In a GROUP BY, for example, NULLs will be GROUPed, which is 
appropriate.  However, if NULLs are present in a column on which a 
calculation is being performed, such as COUNT(), SUM(), AVG(), etc, they are 
EXCLUDED from the calculation.

(I think I even recall from way back in RB System V[?] two variants of 
either SELECT ... and/or COMPUTE ..., one of which included NULLs and one of 
which did not.  The utility, at the time, seems to have been related to 
knowing how many records were actually in the table versus how many of them 
had values.  At this point in time, my memory may fail me on this, but I'm 
sure that someone else can elaborate further, if they feel it helpful and/or 
have too much time on their hands ... but I digress.)

*******************************************
*** NULL <> AnyThing [including 0/Zero] ***
*******************************************
Simon and Garfunkel sang, "I'd rather be a hammer than a nail."  Well, had 
they been db folk, they might well have said, "I'd rather be a zero than a 
NULL."  Again, most especially from a research/analytical perspective, NULL 
and 0 [zero] are two DISTINCTLY DIFFERENT things.

Let's say that you are a marine biologist/environmental engineer and you are 
collecting samples in sample tubes of the floodwaters after Hurricane 
Katrina (a good friend of mine was the Principal Investigator leading this 
research effort).  The purpose was to assess the water for the presence of 
chemicals, their amounts, toxicity levels, etc.  Now, let's say that 1000 
samples of 100ml each were collected.  Then, let's say that 100 of the 
sample tubes could not be properly analyzed for various, albeit valid, 
reasons and let's also say that 200 samples showed absolutely no presence of 
some dangerous chemical, which we'll call "X".

Those 100 un-assayable samples are, of course, NULLs in the data.  Samples 
were collected, but they produced no data.  As such, they cannot be included 
in any calculations or conclusions about the nature of the post-Katrina 
floodwaters as they tell us nothing about it.  Conversely, the 200 samples 
indicating an absence of chemical "X" are NOT NULLs.  What they tell us is 
that there is a PRESENCE of 0/zero quantity of chemical "X" in these 
samples.  We don't need to discuss a geographical grid-array for relative 
differences, water flow, potential sources, or anything more sophisticated 
to illustrate from these samples that NULL and 0/zero are different.  The 
100 samples that told us nothing remain "nothings", i.e. NULLs in the data. 
Those 200 samples which told us that there was 0 chemical "X" will have a 0 
in the rows for observations of chemical "X" for each of those samples.

This should seem a pretty basic understanding.  However, distinguishing 
between what is a NULL and what is a 0/zero can be a bit more challenging as 
the difference may be subtle, but certainly discrete.

So, let's also use an example from sports, say baseball.  Let's think about 
pitching records.  The last Pittsburgh Pirate - your home team, RBTI - to 
garner the Cy Young Award was Doug Drabek, in 1990.  That season, he pitched 
22 Victories and 6 Losses.  (We will avoid ERA, although it's a great metric 
for pitchers, as it fails to illustrate my point.)  Let's calculate a simple 
winning percentage of W%=((Games_Won/(Games_Won + Games_Lost)) * 100). 
Drabek's W% for 1990 was 78.57=((22/(22 + 6)) * 100).

However, his team played 162 regular games that season.

It would probably be unfair and would certainly be inaccurate to factor into 
the calculation of his W% those games in which he did not pitch.  Still, he 
was on the roster for each of those games.  However, were we to unload 
TeamRoster_Game data for all 162 games, an un-normalizedby but not atypical 
extract, and, in so doing, treated Drabek's "NULL-appearances" as 0/Zeroes. 
Without some degree of expertise, both in the subject matter and the nature 
of the extract data, his W% could appear as "whacky" as a wild-pitch from a 
knuckleballer.  You see, it all depends on what is desired of the data and 
how the data is handled or treated.


******************************************************
*** MISSING DATA VALUES, "NULLs, with personality" ***
******************************************************
Now, if NULLs are conceptually troublesome, one may and should, according to 
good research practices, create a set of "Missing Data Values" to use rather 
than just a NULL.  In fact, this topic appears to be an area of great 
interest among researchers/statisticians, including how to impute those 
values ... but that's WAY BEYOND my expertise.

(Razzak, as I recall, you do||did possess that background, so feel free to 
chime in, in the case that I haven't managed yet to induce an extreme 
soporific state among the audience.)

If you're still awake here, Missing Data Values are, typically, ultimately 
cast||stored as some sort of numeric value, just like valid data collected 
from samples.  They also have some sort of descriptive text label associated 
with them.  These then, even though they are NOT observational or sample 
data, become valid data points, i.e. part of the solution domain, rather 
than unexplained NULLs.

Possible MISSING_DATA_VALUE records:

SubjectMatterArea  MissingDataValue  ValueLabel
-----------------  ----------------  
----------------------------------------------
Biology                         -11  Sample Contaminated
Biology                         -33  Sample Lost - Trap Failed
Biology                         -34  Sample Lost - Trap Thawed
Biology                         -35  Sample Lost - Trap Missing
Biology                        -999  Instrument Failure
Survey Research                 -11  Respondent Refused To Answer 
(Telephone)
Survey Research                 -12  Respondent Failed To Answer (Paper)
Survey Research                 -22  Respondent Does Not Know
Survey Research                 -33  Response Not Intelligible 
(Writing/Recording)
Survey Research                -888  Did Not Ask - Respondent Broke Off 
Interview
Survey Research                -999  Did Not Ask - Programmatic SKIP


Obviously, these are not NULLs, although they do represent what could have 
been NULLs.  Of course, within any research topic, they cannot intersect 
with any "Existing Values" in the solution domain.

So, if we're looking at a population count in a trap that is an integer 
ranging from 0..N, we would have to use some negative integer value for any 
Missing Data Values we define.  Then, we can exclude these Missing Data 
Values in our calculations by :

  "SELECT ... FROM DATA_POINTS WHERE DataValue >= 0 ...".

Of course, if the domain of valid data values includes negative numbers, we 
must devise other values for Missing Data and, if this is the case, probably 
utilize the ValueTypeCode :

  "SELECT ... FROM DATA_POINTS WHERE ValueTypeCode <> 0 ..."
  -- Assume that ValueTypeCode is a column in DATA_POINTS and 0 means 
'MissingDataValue'


All this effort is really to help the researcher and/or research 
administrator, primarily in two general areas.  One is to help the 
researcher do a little "CYA", as NULLs, especially a lot of NULLs - I know, 
this fundamentally illogical, as a bunch of nothin' is still a bunch of 
nothin' - among researchers, can automatically engender doubt about the 
research among peers/reviewers, whether it's about process or outcome.  The 
other benefit, in the area of research management, is to aid in devising 
improvements in the process, such as sample improvement, better data 
collection, improved instrumentation, etc.

Hopefully, if you made it this far, the potential impact of mishandling 
NULLs is apparent.  If not, try to think about some examples in your data 
and outputs that might be adversely impacted from a lack of awareness about 
NULLs and how to treat them.  Despite the best efforts at modeling a 
database structure, predicated and integrated with good business (data) 
anlaysis, NULLs can and do exist in the real world and can therefore exist 
in the virtual world.  However, they can also be meaningfully "morphed" by 
employing Missing Data Values, as the appropriate for the business or 
subject matter context.

This can be a very important aspect of research (data) and, you might have 
guessed, something with which I've had some degree of experience, vis-à-vis 
the data-side.  I have seen NULLs treated as 0's/zeroes and, amazingly, 
0's/zeroes treated as NULLs.  There are - fortunately, not often - 
researchers who are sometimes careless in their understanding and/or 
treatment of their own data.  Believe me, research-support level dba better 
have his/her ducks in a row when he/she makes the decision to point out such 
an error, even when the objective is to protect from disrepute the research 
organization, its work-product, and even that erring researcher.

I will add that, IMNSHO, this was another good deed that did not go 
unpunished ...

Well, let us calculate the value of my usual $0.02 by dividing it with the 
sum of my word-count ... well, thanks for listening and I hope this somehow 
helped someone.


NULL-ified,
Steve in Memphis






From: [email protected] [mailto:[EMAIL PROTECTED] On Behalf Of Alastair 
Burr
Sent: Monday, March 31, 2008 6:55 PM
To: RBASE-L Mailing List
Subject: [RBASE-L] - Re: How do I make NULL truly NULL?

Way back when I first started my database in R:Base one of _my_ first rules 
was don't have any nulls under any circumstances whatsoever!

It's been a good rule but, of course, I don't keep to it 100%

N/A for text, -1, 0 or 9999 for integers has served me well over all these 
years. In my experience, if you find you need a lot of nulls then you 
probably need a separate table. It's easy to convert your favourite 
representation into a meaningful comment in forms and reports: (IFEQ(Field, 
9999, '(Unknown)', (CTXT(Field)) )).

A bit more thinking might be needed for exporting but it's easy enough to 
get R:base to do what you want or to convert in the receiving program - 
although R:Base is (usually) easier.

To really stick my head above the parapet I have never understood how 
anything can be null - if a value is not known then show that it is not 
known. To not know whether it is known or not known seems to me to be a 
failure -but take note of my second paragraph above!

Regards,
Alastair.



----- Original Message ----- 
From: Dennis McGrath
To: RBASE-L Mailing List
Sent: Monday, March 31, 2008 7:10 PM
Subject: [RBASE-L] - Re: How do I make NULL truly NULL?

Matt,

Have you tried exporting as ASCII when NULL = ' ' (a space)
Try it and look at the generated file.
There will be no spaces between the commas for a NULL value.
This is exactly what you need.

Here are my rules for the use of the NULL setting:
ALWAYS have NULL='-0-' EXCEPT:
When exporting data as ASCII, CSV, etc.
When editing with a form
When printing a report

If you do not have NULL='-0-' and you edit tables manually, you will never 
get a null entered in text fields.
Try as you might, the value will always be a space.

The exact opposite is true in forms. If null='-0-' not only will your forms 
look ugly, but you won't know the difference between NULL and '-0-'

When Unloading data and structure from RBASE, ALWAYS set NULL='-0-'

RBASE has worked like this as long as I can remember.

Dennis McGrath

________________________________________
From: [email protected] [mailto:[EMAIL PROTECTED] On Behalf Of Matthew 
Burke
Sent: Monday, March 31, 2008 12:46 PM
To: RBASE-L Mailing List
Subject: [RBASE-L] - Re: How do I make NULL truly NULL?

       Well here's my problem with how it seems to work. I understand the 
need to visually separate null from a blank space within R:Base. So sure, 
when I open a table within R:Base show me some kind of value, -0- will do 
just fine. However when I export data from my table to say... and basic csv 
text file. That data should not be represented by a -0-, but nothing at all, 
in my opinion. That way if I want to use the data in another database, or as 
in this case a third party wants to use it in another database, they can do 
so without the other database having to understand that -0- = null. I 
thought about setting NULL = ' ' but even that wouldn't work, because even 
then the third party database would still have to know that null = ' '. I'm 
certainly no database guru and I really appreciate all the help I receive 
from you guys and gals. My only other database experience lies with MySQL 
and it's always been my understanding that null = nothing at all.

       So I guess ultimately I don't really need to know why R:Base uses 
text representation for null, but I need to know if there is a solution I 
can use to make null = nothing at all or empty. Basically if 'empty' was or 
possibly is a valid command, the logic I want to accomplish is R> set null 
empty. Is this possible?

To further explain what I want, here's an example.

Let's say we have a table with one row containing three fields of data; a, 
b, and c.
Field b is null in this case. Something like this.

a
b
c
1
text
-0-
text

As my R:Base is setup now, if I were to export this data to a text file 
(csv) I get:
"text","-0- ","text"

The end result text file (csv) I want to achieve is:
"text","","text"

Thanks again for all your help with my novice problems/questions.

Matthew Burke
Pioneer Bank - IT Department
Computer Systems Technician
505-627-4415

This email contains PRIVILEGED AND CONFIDENTIAL information intended only 
for the use of the addressee(s) named above. If you are not the intended 
recipient of this email, or the employee or agent responsible for delivering 
it to the intended recipient, you are hereby notified that any dissemination 
or copying of this email is strictly prohibited. If you have received this 
email in error, please notify us by replying to this email and delete this 
email from your records. Thank you for your cooperation.

________________________________________
No virus found in this incoming message.
Checked by AVG.
Version: 7.5.519 / Virus Database: 269.22.1/1350 - Release Date: 30/03/2008 
12:32




-- 
No virus found in this incoming message.
Checked by AVG.
Version: 7.5.519 / Virus Database: 269.22.3/1354 - Release Date: 01/04/2008 
05:38

[RBASE-L] - Re: How do I make NULL truly NULL? - A Word of Caution With NULLs in Research & Analysis ...

Reply via email to