******************************************** *** Don't Get Burned By NULL-Conclusions *** ******************************************** Now, most of y'all probably already know something about this - hopefully, gained by observation of the bitter experience of others - but, given a couple of phrases I saw in one each of Rachael's and Alastair's messages, I would like to caution that one must be EXTREMELY cautious about NULL's versus 0's [zeroes] in the area of research and/or analysis. In simple terms, a NULL value should (almost?) ALWAYS be excluded from any sort of numeric calculation or the results might cost someone their job, literally ... well, unless their superiors and/or clients never bother to check the data.
RBase, TTBOMK, has always handled NULL-related calculations appropriately by default. In a GROUP BY, for example, NULLs will be GROUPed, which is appropriate. However, if NULLs are present in a column on which a calculation is being performed, such as COUNT(), SUM(), AVG(), etc, they are EXCLUDED from the calculation. (I think I even recall from way back in RB System V[?] two variants of either SELECT ... and/or COMPUTE ..., one of which included NULLs and one of which did not. The utility, at the time, seems to have been related to knowing how many records were actually in the table versus how many of them had values. At this point in time, my memory may fail me on this, but I'm sure that someone else can elaborate further, if they feel it helpful and/or have too much time on their hands ... but I digress.) ******************************************* *** NULL <> AnyThing [including 0/Zero] *** ******************************************* Simon and Garfunkel sang, "I'd rather be a hammer than a nail." Well, had they been db folk, they might well have said, "I'd rather be a zero than a NULL." Again, most especially from a research/analytical perspective, NULL and 0 [zero] are two DISTINCTLY DIFFERENT things. Let's say that you are a marine biologist/environmental engineer and you are collecting samples in sample tubes of the floodwaters after Hurricane Katrina (a good friend of mine was the Principal Investigator leading this research effort). The purpose was to assess the water for the presence of chemicals, their amounts, toxicity levels, etc. Now, let's say that 1000 samples of 100ml each were collected. Then, let's say that 100 of the sample tubes could not be properly analyzed for various, albeit valid, reasons and let's also say that 200 samples showed absolutely no presence of some dangerous chemical, which we'll call "X". Those 100 un-assayable samples are, of course, NULLs in the data. Samples were collected, but they produced no data. As such, they cannot be included in any calculations or conclusions about the nature of the post-Katrina floodwaters as they tell us nothing about it. Conversely, the 200 samples indicating an absence of chemical "X" are NOT NULLs. What they tell us is that there is a PRESENCE of 0/zero quantity of chemical "X" in these samples. We don't need to discuss a geographical grid-array for relative differences, water flow, potential sources, or anything more sophisticated to illustrate from these samples that NULL and 0/zero are different. The 100 samples that told us nothing remain "nothings", i.e. NULLs in the data. Those 200 samples which told us that there was 0 chemical "X" will have a 0 in the rows for observations of chemical "X" for each of those samples. This should seem a pretty basic understanding. However, distinguishing between what is a NULL and what is a 0/zero can be a bit more challenging as the difference may be subtle, but certainly discrete. So, let's also use an example from sports, say baseball. Let's think about pitching records. The last Pittsburgh Pirate - your home team, RBTI - to garner the Cy Young Award was Doug Drabek, in 1990. That season, he pitched 22 Victories and 6 Losses. (We will avoid ERA, although it's a great metric for pitchers, as it fails to illustrate my point.) Let's calculate a simple winning percentage of W%=((Games_Won/(Games_Won + Games_Lost)) * 100). Drabek's W% for 1990 was 78.57=((22/(22 + 6)) * 100). However, his team played 162 regular games that season. It would probably be unfair and would certainly be inaccurate to factor into the calculation of his W% those games in which he did not pitch. Still, he was on the roster for each of those games. However, were we to unload TeamRoster_Game data for all 162 games, an un-normalizedby but not atypical extract, and, in so doing, treated Drabek's "NULL-appearances" as 0/Zeroes. Without some degree of expertise, both in the subject matter and the nature of the extract data, his W% could appear as "whacky" as a wild-pitch from a knuckleballer. You see, it all depends on what is desired of the data and how the data is handled or treated. ****************************************************** *** MISSING DATA VALUES, "NULLs, with personality" *** ****************************************************** Now, if NULLs are conceptually troublesome, one may and should, according to good research practices, create a set of "Missing Data Values" to use rather than just a NULL. In fact, this topic appears to be an area of great interest among researchers/statisticians, including how to impute those values ... but that's WAY BEYOND my expertise. (Razzak, as I recall, you do||did possess that background, so feel free to chime in, in the case that I haven't managed yet to induce an extreme soporific state among the audience.) If you're still awake here, Missing Data Values are, typically, ultimately cast||stored as some sort of numeric value, just like valid data collected from samples. They also have some sort of descriptive text label associated with them. These then, even though they are NOT observational or sample data, become valid data points, i.e. part of the solution domain, rather than unexplained NULLs. Possible MISSING_DATA_VALUE records: SubjectMatterArea MissingDataValue ValueLabel ----------------- ---------------- ---------------------------------------------- Biology -11 Sample Contaminated Biology -33 Sample Lost - Trap Failed Biology -34 Sample Lost - Trap Thawed Biology -35 Sample Lost - Trap Missing Biology -999 Instrument Failure Survey Research -11 Respondent Refused To Answer (Telephone) Survey Research -12 Respondent Failed To Answer (Paper) Survey Research -22 Respondent Does Not Know Survey Research -33 Response Not Intelligible (Writing/Recording) Survey Research -888 Did Not Ask - Respondent Broke Off Interview Survey Research -999 Did Not Ask - Programmatic SKIP Obviously, these are not NULLs, although they do represent what could have been NULLs. Of course, within any research topic, they cannot intersect with any "Existing Values" in the solution domain. So, if we're looking at a population count in a trap that is an integer ranging from 0..N, we would have to use some negative integer value for any Missing Data Values we define. Then, we can exclude these Missing Data Values in our calculations by : "SELECT ... FROM DATA_POINTS WHERE DataValue >= 0 ...". Of course, if the domain of valid data values includes negative numbers, we must devise other values for Missing Data and, if this is the case, probably utilize the ValueTypeCode : "SELECT ... FROM DATA_POINTS WHERE ValueTypeCode <> 0 ..." -- Assume that ValueTypeCode is a column in DATA_POINTS and 0 means 'MissingDataValue' All this effort is really to help the researcher and/or research administrator, primarily in two general areas. One is to help the researcher do a little "CYA", as NULLs, especially a lot of NULLs - I know, this fundamentally illogical, as a bunch of nothin' is still a bunch of nothin' - among researchers, can automatically engender doubt about the research among peers/reviewers, whether it's about process or outcome. The other benefit, in the area of research management, is to aid in devising improvements in the process, such as sample improvement, better data collection, improved instrumentation, etc. Hopefully, if you made it this far, the potential impact of mishandling NULLs is apparent. If not, try to think about some examples in your data and outputs that might be adversely impacted from a lack of awareness about NULLs and how to treat them. Despite the best efforts at modeling a database structure, predicated and integrated with good business (data) anlaysis, NULLs can and do exist in the real world and can therefore exist in the virtual world. However, they can also be meaningfully "morphed" by employing Missing Data Values, as the appropriate for the business or subject matter context. This can be a very important aspect of research (data) and, you might have guessed, something with which I've had some degree of experience, vis-à-vis the data-side. I have seen NULLs treated as 0's/zeroes and, amazingly, 0's/zeroes treated as NULLs. There are - fortunately, not often - researchers who are sometimes careless in their understanding and/or treatment of their own data. Believe me, research-support level dba better have his/her ducks in a row when he/she makes the decision to point out such an error, even when the objective is to protect from disrepute the research organization, its work-product, and even that erring researcher. I will add that, IMNSHO, this was another good deed that did not go unpunished ... Well, let us calculate the value of my usual $0.02 by dividing it with the sum of my word-count ... well, thanks for listening and I hope this somehow helped someone. NULL-ified, Steve in Memphis From: [email protected] [mailto:[EMAIL PROTECTED] On Behalf Of Alastair Burr Sent: Monday, March 31, 2008 6:55 PM To: RBASE-L Mailing List Subject: [RBASE-L] - Re: How do I make NULL truly NULL? Way back when I first started my database in R:Base one of _my_ first rules was don't have any nulls under any circumstances whatsoever! It's been a good rule but, of course, I don't keep to it 100% N/A for text, -1, 0 or 9999 for integers has served me well over all these years. In my experience, if you find you need a lot of nulls then you probably need a separate table. It's easy to convert your favourite representation into a meaningful comment in forms and reports: (IFEQ(Field, 9999, '(Unknown)', (CTXT(Field)) )). A bit more thinking might be needed for exporting but it's easy enough to get R:base to do what you want or to convert in the receiving program - although R:Base is (usually) easier. To really stick my head above the parapet I have never understood how anything can be null - if a value is not known then show that it is not known. To not know whether it is known or not known seems to me to be a failure -but take note of my second paragraph above! Regards, Alastair. ----- Original Message ----- From: Dennis McGrath To: RBASE-L Mailing List Sent: Monday, March 31, 2008 7:10 PM Subject: [RBASE-L] - Re: How do I make NULL truly NULL? Matt, Have you tried exporting as ASCII when NULL = ' ' (a space) Try it and look at the generated file. There will be no spaces between the commas for a NULL value. This is exactly what you need. Here are my rules for the use of the NULL setting: ALWAYS have NULL='-0-' EXCEPT: When exporting data as ASCII, CSV, etc. When editing with a form When printing a report If you do not have NULL='-0-' and you edit tables manually, you will never get a null entered in text fields. Try as you might, the value will always be a space. The exact opposite is true in forms. If null='-0-' not only will your forms look ugly, but you won't know the difference between NULL and '-0-' When Unloading data and structure from RBASE, ALWAYS set NULL='-0-' RBASE has worked like this as long as I can remember. Dennis McGrath ________________________________________ From: [email protected] [mailto:[EMAIL PROTECTED] On Behalf Of Matthew Burke Sent: Monday, March 31, 2008 12:46 PM To: RBASE-L Mailing List Subject: [RBASE-L] - Re: How do I make NULL truly NULL? Well here's my problem with how it seems to work. I understand the need to visually separate null from a blank space within R:Base. So sure, when I open a table within R:Base show me some kind of value, -0- will do just fine. However when I export data from my table to say... and basic csv text file. That data should not be represented by a -0-, but nothing at all, in my opinion. That way if I want to use the data in another database, or as in this case a third party wants to use it in another database, they can do so without the other database having to understand that -0- = null. I thought about setting NULL = ' ' but even that wouldn't work, because even then the third party database would still have to know that null = ' '. I'm certainly no database guru and I really appreciate all the help I receive from you guys and gals. My only other database experience lies with MySQL and it's always been my understanding that null = nothing at all. So I guess ultimately I don't really need to know why R:Base uses text representation for null, but I need to know if there is a solution I can use to make null = nothing at all or empty. Basically if 'empty' was or possibly is a valid command, the logic I want to accomplish is R> set null empty. Is this possible? To further explain what I want, here's an example. Let's say we have a table with one row containing three fields of data; a, b, and c. Field b is null in this case. Something like this. a b c 1 text -0- text As my R:Base is setup now, if I were to export this data to a text file (csv) I get: "text","-0- ","text" The end result text file (csv) I want to achieve is: "text","","text" Thanks again for all your help with my novice problems/questions. Matthew Burke Pioneer Bank - IT Department Computer Systems Technician 505-627-4415 This email contains PRIVILEGED AND CONFIDENTIAL information intended only for the use of the addressee(s) named above. If you are not the intended recipient of this email, or the employee or agent responsible for delivering it to the intended recipient, you are hereby notified that any dissemination or copying of this email is strictly prohibited. If you have received this email in error, please notify us by replying to this email and delete this email from your records. Thank you for your cooperation. ________________________________________ No virus found in this incoming message. Checked by AVG. Version: 7.5.519 / Virus Database: 269.22.1/1350 - Release Date: 30/03/2008 12:32

