What factor variables are
A "factor" is a vector whose elements can take on one of a specific set of
values. For example, "Sex" will usually take on only the values "M" or "F,"
whereas "Name" will generally have lots of possibilities. The set of values
that the elements of a factor can take are called its levels. If you want to
add a new level to a factor, you can do that, but you can't just change
elements to have new values that aren't already levels. Here's an example. I'll
start by creating a factor whose values are "a", "b", and "c." The factor()
function will do this, and it will generate the labels automatically.
> a <- factor (c("a", "b", "c", "b", "c", "b", "a", "c")) # create the factor
> a # Print the new variable
[1] a b c b c b a c # You can tell it's not a character vector: no quotes
> levels(a) # Here is the set of levels
[1] "a" "b" "c"
#
# What if I try to change an element to a new value, like "d"?
#
> a[3] <- "d"
Warning messages:
Warning in "[<-.factor"(a, 3, value = "d"): replacement values not all in levels(x): NA's generated
#
# The warning message tells you that some NAs have been generated.
#
> a
[1] a b NA b c b a c
#
# However it's okay to set elements to values that are already levels:
#
> a[3] <- "a"
> a
[1] a b a b c b a c
#
# It's also easy to change levels. Here I'll change the "a"'s to "AA". Notice that
# I don't change the values themselves, just the levels.
#
> levels(a)
[1] "a" "b" "c"
> levels(a)[1] <- "AA"
> a
[1] AA b AA b c b AA c
#
# The general way to convert a factor to character is with as.character():
#
> as.character(a)
[1] "AA" "b" "AA" "b" "c" "b" "AA" "c"
#
# Note: had element 3 still been NA, as.character() would have returned the
# regular string "NA" there, not a missing value.
#
By default the levels are the unique data values sorted alphabetically. This
turns out to matter in some statistical models. You can reorder the levels if
you want.
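The opening paragraph says you can add a new level to a factor but doesn't show how. Here's a minimal sketch of one way to do it: extend the levels first, then assign. I've written it in R, which shares factor() and levels() with S-Plus; I haven't run it in S-Plus, so treat the behavior there as an assumption.

```r
# Sketch in R syntax; assumed (not verified) to carry over to S-Plus.
a <- factor(c("a", "b", "c", "b", "c", "b", "a", "c"))

# Extend the set of levels first...
levels(a) <- c(levels(a), "d")

# ...and then assigning "d" no longer generates an NA.
a[3] <- "d"

print(levels(a))   # the level set now includes "d"
print(a)
```

The order matters: the assignment only works because "d" is already a level by the time it happens.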
Internal Storage and Extra Levels
Factor variables are stored, internally, as
numeric variables together with their levels. The actual values of the numeric
variable are 1, 2, and so on. Not every level has to appear in the vector. In
this example I create a factor variable with four levels, even though I only
actually have data in three of them.
> a <- factor (c(1, 2, 3, 2, 3, 2, 1), levels=1:4,
+     labels=c("Small", "Medium", "Large", "Huge")) # Create it
#
# In this example, the "levels=1:4" is required. Otherwise the mismatch between
# the fact that there are four labels but only three values will get you in
# trouble. Of course the values in "levels" need to match the values in the data.
#
> a
[1] Small Medium Large Medium Large Medium Small
Levels:
[1] "Small" "Medium" "Large" "Huge"
#
# Notice how the levels (including "Huge") print out. In general the levels
# will print out whenever they don't all appear in whatever's being printed.
#
# Take a look at the table of a. The "Huge" level is remembered.
#
> table (a)
 Small Medium Large Huge
     2      3     2    0
This happens in the GUI when you try to change one value to another. If you try
to change "a" to "b" but accidentally type "ab," a level named "ab" is created.
If you then correct the "ab" to "b," the "ab" level remains. By the way, you
can get rid of unused levels when subscripting by using the drop=T argument:
> table (a[,drop=T])
 Small Medium Large
     2      3     2
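To make the internal storage concrete, here's a small sketch that pulls out the integer codes and shows that they are just indices into the levels. It's written in R, not S-Plus. The droplevels() call is an R function that I'm using as a stand-in for the drop=T subscript above; I don't know whether S-Plus has it.

```r
# Sketch in R; droplevels() is R-specific and only stands in for a[,drop=T].
sizes <- factor(c(1, 2, 3, 2, 3, 2, 1), levels = 1:4,
                labels = c("Small", "Medium", "Large", "Huge"))

codes <- as.integer(sizes)          # the internal codes: 1, 2, 3, ...
relabeled <- levels(sizes)[codes]   # the codes index into the levels

trimmed <- droplevels(sizes)        # same data, unused "Huge" level removed

print(codes)
print(relabeled)
print(levels(trimmed))
```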
Missing values in factors
Missing values in factor variables can be a drag.
They're invisible to the table() function, even when you use the exclude=NULL
argument that is supposed to work here. (Reading the help file carefully tells
us that S-Insightful knows this is a problem but hasn't fixed it.) The
na.include() function will add a new level named "NA" to a factor with NAs in
it.
> a[3] <- NA # Make one entry NA
> a # Sure enough
[1] Small Medium NA Medium Large Medium Small
Levels:
[1] "Small" "Medium" "Large" "Huge"
#
# It's still missing from table(), even with exclude=NULL
#
> table (a, exclude=NULL)
 Small Medium Large Huge
     2      3     1    0
> sum (is.na (a)) # How many NAs are there in this vector?
[1] 1 # Answer: 1
#
# Now run na.include(a) and save the result
#
> aa <- na.include(a)
> table (aa)
 Small Medium Large Huge NA
     2      3     1    0  1
> sum (is.na (aa)) # How many NA's here?
[1] 0
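If you're working in R rather than S-Plus, na.include() isn't part of the base system. A rough equivalent, sketched below, is either to ask table() for the NA count with useNA, or to make NA an explicit level with addNA(). Both are R functions; I'm not claiming either exists in S-Plus.

```r
# Sketch in R: counting NAs in a factor without na.include().
a <- factor(c("Small", "Medium", NA, "Medium", "Large", "Medium", "Small"),
            levels = c("Small", "Medium", "Large", "Huge"))

counts <- table(a, useNA = "ifany")   # adds an <NA> cell when NAs are present
aa <- addNA(a)                        # makes NA an explicit extra level

print(counts)
print(levels(aa))
```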
Q: Why is my character variable a factor?
When you construct a data.frame with read.table() or by importing, the default
decision is to turn every character variable into a factor. This may or may not
be a good idea for you (see "When do I want a factor variable?" below). If you
don't want factors, use the as.is argument to read.table(). A single T says
"leave everything as is"; a vector of T's and F's converts to factors only
those columns for which the as.is entry is F. If you're using File | Import
Data, go to the Options tab and uncheck "Strings as factors." If you want some
of your character variables to be characters, and others to be factors, you'll
need to use read.table().
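Here's a small sketch of the vector form of as.is, keeping one column as character and letting the other become a factor. It's written in R; the text= argument (which reads from a string instead of a file) is an R convenience that I'm assuming only for the sake of a self-contained example, and the column names are made up.

```r
# Sketch in R; text= and the column names are illustrative assumptions.
raw <- "name group
ann x
bob y
cat x"

# One as.is entry per column: TRUE = leave as character, FALSE = make a factor.
d <- read.table(text = raw, header = TRUE, as.is = c(TRUE, FALSE))

print(is.character(d$name))   # left as is
print(is.factor(d$group))     # converted to a factor
```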
Q: Why is my numeric variable a factor?
This usually happens when your "numeric" variable actually contains some
non-numeric entries (like "NA" or "Missing" or an empty space). S-Plus sees that
the column is not numeric, so it treats it as if it were character, and
factorizes it (see the preceding paragraph). If you don't mind a few warnings,
you can convert such a column to numeric in the following way. Suppose your
data frame is named Steve and the column is G. Then this line converts the
entries in Steve$G to numeric, where possible. Non-numeric entries in G will be
turned into NAs.
> Steve$G <- as.numeric(levels(Steve$G)[Steve$G]) # or <- as.numeric(as.character(Steve$G))
Remember that, internally, Steve$G is numeric. So indexing something by
Steve$G is certainly possible.
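Here's a sketch that runs both conversions side by side on a toy version of Steve$G, and also shows what a bare as.numeric() returns. It's in R, not S-Plus, and the data values are made up; suppressWarnings() just hides the "NAs introduced" warning mentioned above.

```r
# Sketch in R with made-up data.
Steve <- data.frame(G = factor(c("8", "25", "111", "Missing")))

# Two equivalent ways to recover the numbers; "Missing" becomes NA.
via_levels <- suppressWarnings(as.numeric(levels(Steve$G)[Steve$G]))
via_char   <- suppressWarnings(as.numeric(as.character(Steve$G)))

# A bare as.numeric() returns the internal level codes, not the values.
codes <- as.numeric(Steve$G)

print(via_levels)
print(via_char)
print(codes)
```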
Watch out! If your numeric gets converted to factor, then the levels will be
what you want. The internal representation, the numbers 1, 2, and so on which
S-Plus uses to keep track of things, will generally not be what you want. The
reason is that by default level 1 gets assigned to the first value in
alphabetical order, level 2 to the second value, and so on. So suppose that
your values are 8, 25, 111, and "Missing". When that column gets imported, it
will be recognized as character data. Then it will be converted to a factor,
with levels corresponding to these values sorted alphabetically. Of course the
alphabetic sorting scheme is different from the numeric one. Here's an example:
> factor (c(1, 3, 17, 4, "NA", 5)) # Create and display a factor variable. The whole vector
[1] 1 3 17 4 NA 5                  # is converted to character before being factorized.
> as.numeric (factor (c(1, 3, 17, 4, "NA", 5)))
[1] 1 3 2 4 NA 5
In that example, the character string "17" comes between "1" and "2" (just as
"Ag" comes between "A" and "B") and so the "17" gets level 2. The as.numeric()
function converts the factor into its level numbers. That's probably not what
you wanted.
When do I want a factor variable?
Factor variables are useful in several places. First, some S-Plus functions
that expect factors fail when given a character vector. (However, these are
rare. Generally the modeling functions will convert character vectors to
factors invisibly.) Second, it's sometimes handy to carry the set of levels
around with you. Suppose you have a factor vector with four levels. Then
table() is guaranteed to produce a four-entry table, whether you operate on the
whole vector or on any subset. In contrast, that operation on a character
vector will produce only as many entries in the table as there are unique
elements in the subset. So if you're planning to compare the distribution of
subsets, you'll want a factor. Third, for huge data sets factor variables will
generally be smaller, since each observation is stored as an integer and the
levels are only stored once.
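The second point is easy to see in a small sketch. Here I tabulate the same two-element subset twice, once as a factor and once as a character vector. The code is R and the data are made up.

```r
# Sketch in R with made-up data.
f <- factor(c("a", "b", "c", "d", "a", "b"))
s <- as.character(f)

first_two <- 1:2
tab_factor <- table(f[first_two])   # one cell per level, zeros included
tab_char   <- table(s[first_two])   # only the values actually present

print(tab_factor)
print(tab_char)
```

Because the factor table always has the same cells in the same order, tables from different subsets line up and can be compared directly.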
When are factor variables a big pain?
Factor variables are a pain when you're
cleaning your data because they're hard to update. My approach has always been
to convert the variable to character with as.character(), then handle the
variable as a character vector, and then convert it to factor (using factor()
or as.factor()) at the end.
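That round trip looks something like the sketch below, which fixes a misspelled value. It's written in R, and the variable names and data are made up for illustration.

```r
# Sketch in R with made-up data: clean as character, then re-factor.
colour <- factor(c("red", "gren", "blue", "gren"))

tmp <- as.character(colour)      # 1. convert to character
tmp[tmp == "gren"] <- "green"    # 2. fix values freely, no level bookkeeping
colour <- factor(tmp)            # 3. convert back; levels are rebuilt

print(levels(colour))
print(colour)
```

Rebuilding with factor() at the end also means the misspelled "gren" doesn't linger as an unused level.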
What's one secret to converting factors to character vectors in a data frame?
Here's an interesting fact. Remember how you can refer to columns of a
data frame either in matrix style or in list style? When you use the
matrix-style notation S-Plus will often factorize your character variables
automatically. That's not true for list-style notation, so list-style is often
what you want. Here's an example:
#
# Create a simple data frame.
#
> d <- data.frame (a = c("a", "b", "c", "d"))
#
# The data.frame() function automatically converts characters to factors.
#
> is.factor (d[,1])
[1] T
#
# This should convert it back, but it doesn't.
#
> d[,1] <- as.character(d[,1])
> is.factor (d[,1])
[1] T
#
# This looks the same, but using list notation on the left makes all the
# difference.
#
> d$a <- as.character(d[,1])
> is.factor (d[,1])
[1] F
#
# This would have worked: the I() function ("I" standing for "identity") says
# "leave this just as I give it to you; don't convert it".
#
> d[,1] <- I(as.character(d[,1]))
Reordering the levels of a factor
This question arises in some models. The first level is set to be the baseline
in the usual "treatment contrasts" setup (see the discussion of contrasts).
Sometimes it's desirable to have a different level be the baseline. To do that,
convert the vector to character, then call factor(), passing the new levels in
the levels= argument. The result will look like the original; only the
ordering of the levels will have changed. For example:
> a <- factor (c("a", "b", "c", "b", "c", "b", "a", "c"))
> a # Print a
[1] a b c b c b a c # The table is produced in order of the levels
> table (a)
a b c
2 3 3
#
# Convert a to character, then back to factor with a new vector of levels
#
> a <- factor (as.character(a), levels=c("c", "a", "b"))
> a # a is unchanged
[1] a b c b c b a c
> table (a) # The table is, too, but it's in a different order.
c a b
3 2 3
Ordered Factors and Categories
An "ordered" factor is a factor whose levels have a particular order. Ordered
variables inherit from factors, so anything that you can do to a factor you can
do to an ordered factor. S-Plus models generally ignore ordering even if you
put it in there. The reorder.factor() function can be useful; this creates an
ordered factor by breaking some numeric variable down into subsets by the
different levels of the factor, and then calculating the mean of each subset.
Finally the levels are sorted in increasing order of these means. (You don't
have to use the mean, but that's the default.) This can be useful when
preparing plots.
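Here's a sketch of the idea, ordering the levels by group mean. It uses R's reorder() function as a stand-in for S-Plus's reorder.factor(); I'm assuming the two behave alike here but haven't compared them. The data are made up.

```r
# Sketch in R; reorder() stands in for S-Plus's reorder.factor().
grp <- factor(c("a", "b", "c", "a", "b", "c"))
x   <- c(10, 1, 5, 12, 3, 7)     # group means: a = 11, b = 2, c = 6

# Sort the levels by the mean of x within each level (mean is the default).
# order = TRUE asks R for an ordered factor.
grp_by_mean <- reorder(grp, x, order = TRUE)

print(levels(grp_by_mean))
print(is.ordered(grp_by_mean))
```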
In practice I don't make much use of ordered factors. Before S-Plus had factors
it had something similar called categories. The category object is
"deprecated": that means don't use it. However, the result of calling the cut()
function is still a category so you do run into these from time to time. My
advice is to convert the results of a call to cut() to factor right away.
Otherwise you may find that the small differences between factors and
categories can come back to bite you. Here's an example:
> x <- 1:20 # Set up a vector...
> thing <- cut (x, c(0, 2, 5, 11, 21)) # ...pass it to cut
#
# The result of cut() is a category. This is a numeric vector with levels.
#
> thing
[1] 1 1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4
attr(, "levels"):
[1] " 0+ thru 2" " 2+ thru 5" " 5+ thru 11" "11+ thru 21"
#
# The levels print out, but the vector really is numeric. Notice that
# the first three levels have a leading space.
#
# This looks right when you pass it to factor() or as.character()...
#
> as.character(thing)
 [1] " 0+ thru 2"  " 0+ thru 2"  " 2+ thru 5"  " 2+ thru 5"  " 2+ thru 5"  " 5+ thru 11"
 [7] " 5+ thru 11" " 5+ thru 11" " 5+ thru 11" " 5+ thru 11" " 5+ thru 11" "11+ thru 21"
[13] "11+ thru 21" "11+ thru 21" "11+ thru 21" "11+ thru 21" "11+ thru 21" "11+ thru 21"
[19] "11+ thru 21" "11+ thru 21"
#
# ...but check this out. When you assign any element to anything that isn't
# in the levels, the whole vector is converted to numeric forever.
#
> thing[5] <- NA
> thing
[1] 1 1 2 2 NA 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4
sandeep khokhar
----- Original Message ----
From: ~Rick <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, 1 May, 2008 10:52:23 PM
Subject: Re: [c-prog] need a sample programme
At Wednesday 4/30/2008 06:40 PM, you wrote:
> >I've already written this program, with a few differences. >Here's my
> >output. Can you tell me what grade your instructor will >give me?
> >Please enter a number (0 to quit): 115746.50
> >Your number was 115746.50
> >ONE HUNDRED FIFTEEN THOUSAND SEVEN >HUNDRED FORTY-SIX POINT FIVE ZERO
>maybe converting it to rupies, paes, and lakes helpes mask the project? ;)
>Thanks,
>~~TheCreator~~
I'm not exactly sure how to do Indian currency. I'm guessing 1 Lake =
100,000 Rupees. I'm not going to change my code logic specifically
for the OP. I changed the currency names, though. The OP (i.e.
student) will need to create their own program, of course.
Please enter a number (0 to quit): 1,15,746.50
One Hundred Fifteen Thousand Seven Hundred Forty Six Rupees and Fifty Paise
~Rick
>Visit TDS for quality software and website production
>http://tysdomain.com
>msn: [EMAIL PROTECTED]
>skype: st8amnd127
> ----- Original Message -----
> From: ~Rick
> To: [email protected]
> Sent: Wednesday, April 30, 2008 4:36 PM
> Subject: Re: [c-prog] need a sample programme
>
>
> At Tuesday 4/29/2008 03:54 AM, you wrote:
> >Dear friends have a nice time,
> > i need sample programme in c to convert number into words.
> >
> >like:
> >
> >input : 1,15,746.50
> >output: one lakes fifteen thousand seven hundred and forty six
> >rupees and fifty paise
> >i need your help.
> >thanking you guys.
>
> I've already written this program, with a few differences. Here's my
> output. Can you tell me what grade your instructor will give me?
>
> Please enter a number (0 to quit): 115746.50
>
> Your number was 115746.50
> ONE HUNDRED FIFTEEN THOUSAND SEVEN HUNDRED FORTY-SIX POINT FIVE ZERO
>