Re: [R] R equivalent of proc varclus

Ajay Ohri Fri, 07 Oct 2011 06:53:34 -0700

Dear List

Got the answer, thanks.

http://127.0.0.1:11568/library/Hmisc/html/varclus.html

varclus {Hmisc}    R Documentation

Variable Clustering

Description

Does a hierarchical cluster analysis on variables, using the Hoeffding D
statistic, squared Pearson or Spearman correlations, or proportion of
observations for which two variables are both positive as similarity
measures. Variable clustering is used for assessing collinearity,
redundancy, and for separating variables into clusters that can be scored as
a single variable, thus resulting in data reduction. For computing any of
the three similarity measures, pairwise deletion of NAs is done. The
clustering is done by hclust(). A small function naclus is also provided
which depicts similarities in which observations are missing for variables
in a data frame. The similarity measure is the fraction of NAs in common
between any two variables. The diagonals of this sim matrix are the fraction
of NAs in each variable by itself. naclus also computes na.per.obs, the
number of missing variables in each observation, and mean.na, a vector whose
ith element is the mean number of missing variables other than variable i,
for observations in which variable i is missing. The naplot function makes
several plots (see the which argument).
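For anyone reading this without Hmisc installed, the clustering step described above can be approximated in base R. This is only a rough sketch and not the Hmisc code itself: it assumes squared Spearman correlation as the similarity and one minus the similarity as the distance passed to hclust(). The simulated variables are borrowed from the Examples section.

```r
# Sketch only: base R approximation of clustering variables on squared
# Spearman correlations. Assumption: distance = 1 - similarity.
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x  <- cbind(x1, x2, x3, x4)

# Pairwise deletion of NAs, then square the rank correlations
sim <- cor(x, method = "spearman", use = "pairwise.complete.obs")^2

# Cluster the variables (columns), not the observations
hc <- hclust(as.dist(1 - sim), method = "complete")

print(round(sim, 2))
print(cutree(hc, k = 2))  # cluster membership when the tree is cut in two
```

Calling plot(hc) draws the dendrogram; varclus itself adds its own axis labelling, which this sketch does not try to reproduce.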

So as to not generate too many dummy variables for multi-valued character or
categorical predictors, varclus will automatically combine infrequent cells
of such variables using an auxiliary function combine.levels that is defined
here.
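The pooling rule that the minlev argument describes can be sketched in a few lines of base R. The function name pool_rare_levels is hypothetical, and the comma used to join two level names is an assumption; this follows the documented rule only and is not the Hmisc implementation of combine.levels.

```r
# Sketch only: pool infrequent factor levels following the documented rule.
# pool_rare_levels is a hypothetical name, not an Hmisc function.
pool_rare_levels <- function(x, minlev = 0.05) {
  f    <- factor(x)
  freq <- table(f)
  rare <- names(freq)[freq / sum(freq) < minlev]

  if (length(rare) > 1) {
    # More than one rare cell: pool them all into "OTHER"
    levels(f)[levels(f) %in% rare] <- "OTHER"
  } else if (length(rare) == 1 && nlevels(f) > 1) {
    # Exactly one rare cell: merge it with the next least frequent cell
    rest    <- freq[names(freq) != rare]
    partner <- names(rest)[which.min(rest)]
    # Assumption: the merged level is named by joining the two old names
    levels(f)[levels(f) %in% c(rare, partner)] <- paste(rare, partner, sep = ",")
  }
  f
}

a <- pool_rare_levels(c(1, rep(2, 11), rep(3, 9)))
b <- pool_rare_levels(c(1, 2, rep(3, 20)))
print(table(a))
print(table(b))
```

The two inputs are the ones used for combine.levels in the Examples section, so the results can be compared against the real function.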

plotMultSim plots multiple similarity matrices, with the similarity measure
being on the x-axis of each subplot.

na.pattern prints a frequency table of all combinations of missingness for
multiple variables. If there are 3 variables, a frequency table entry
labeled 110 corresponds to the number of observations for which the first
and second variables were missing but the third variable was not missing.
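The pattern labels described above are easy to reproduce by hand. The following is a minimal base R sketch, assuming each observation is labelled with one digit per variable (1 for missing, 0 for present); the small data frame is made up for illustration and this is not the na.pattern source.

```r
# Sketch only: frequency table of missingness patterns in base R.
# The data frame d3 is invented for illustration.
d3 <- data.frame(x = c(NA, NA, 1, 2, NA),
                 y = c(NA, NA, 3, NA, NA),
                 z = c(1, 2, 3, 4, NA))

m <- 1 * is.na(d3)                             # 1 = missing, 0 = observed
pattern <- apply(m, 1, paste, collapse = "")   # one label per observation
tab <- table(pattern)
print(tab)
```

Here the entry labelled 110 counts the two observations in which x and y are missing but z is not.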
 Usage

varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos","ccbothpos"),
        type=c("data.matrix","similarity.matrix"),
        method=if(.R.)"complete" else "compact",
        data=NULL, subset=NULL, na.action=na.retain, ...)
## S3 method for class 'varclus'
print(x, abbrev=FALSE, ...)
## S3 method for class 'varclus'
plot(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)

naclus(df, method)
naplot(obj, which=c('all','na per var','na per obs','mean na',
                    'na per var vs mean na'), ...)

combine.levels(x, minlev=.05)

plotMultSim(s, x=1:dim(s)[3],
            slim=range(pretty(c(0,max(s,na.rm=TRUE)))),
            slimds=FALSE,
            add=FALSE, lty=par('lty'), col=par('col'),
            lwd=par('lwd'), vname=NULL, h=.5, w=.75, u=.05,
            labelx=TRUE, xspace=.35)

na.pattern(x)

Arguments

x
    a formula, a numeric matrix of predictors, or a similarity matrix. If x
    is a formula, model.matrix is used to convert it to a design matrix. If
    the formula excludes an intercept (e.g., ~ a + b -1), the first
    categorical (factor) variable in the formula will have dummy variables
    generated for all levels instead of omitting one for the first level.
    For combine.levels, x is a character, category, or factor vector (or
    other vector that is converted to factor). For plot and print, x is an
    object created by varclus. For na.pattern, x is a list, data frame, or
    numeric matrix.

    For plotMultSim, x is a numeric vector specifying the ordered unique
    values on the x-axis, corresponding to the third dimension of s.

df
    a data frame

s
    an array of similarity matrices. The third dimension of this array
    corresponds to different computations of similarities. The first two
    dimensions come from a single similarity matrix. This is useful for
    displaying similarity matrices computed by varclus, for example. A use
    for this might be to show pairwise similarities of variables across time
    in a longitudinal study (see the example below). If vname is not given,
    s must have dimnames.

similarity
    the default is to use squared Spearman correlation coefficients, which
    will detect monotonic but nonlinear relationships. You can also specify
    linear correlation or Hoeffding's (1948) D statistic, which has the
    advantage of being sensitive to many types of dependence, including
    highly non-monotonic relationships. For binary data, or data to be made
    binary, similarity="bothpos" uses as a similarity measure the proportion
    of observations for which two variables are both positive.
    similarity="ccbothpos" uses a chance-corrected measure which is the
    proportion of observations for which both variables are positive minus
    the product of the two marginal proportions. This difference is expected
    to be zero under independence. For diagonals, "ccbothpos" still uses the
    proportion of positives for the single variable. So "ccbothpos" is not
    really a similarity measure, and clustering is not done. This measure is
    useful for plotting with plotMultSim (see the last example).

type
    if x is not a formula, it may be a data matrix or a similarity matrix.
    By default, it is assumed to be a data matrix.

method
    see hclust. The default, for both varclus and naclus, is "compact" (for
    R it is "complete").

data, subset, na.action
    These may be specified if x is a formula. The default na.action is
    na.retain, defined by varclus. This causes all observations to be kept
    in the model frame, with later pairwise deletion of NAs.

...
    for varclus these are optional arguments to pass to the dataframeReduce
    <http://127.0.0.1:11568/library/Hmisc/help/dataframeReduce> function.
    Otherwise, passed to plclust (or to dotchart or dotchart2 for naplot).

ylab
    y-axis label. Default is constructed on the basis of similarity.

legend.
    set to TRUE to plot a legend defining the abbreviations

loc
    a list with elements x and y defining coordinates of the upper left
    corner of the legend. Default is locator(1).

maxlen
    if a legend is plotted describing abbreviations, original labels longer
    than maxlen characters are truncated at maxlen.

labels
    a vector of character strings containing labels corresponding to columns
    in the similarity matrix, if the column names of that matrix are not to
    be used

obj
    an object created by naclus

which
    defaults to "all" meaning to have naplot make 4 separate plots. To make
    only one of the plots, use which="na per var" (dot chart of fraction of
    NAs for each variable), "na per obs" (dot chart showing frequency
    distribution of number of variables having NAs in an observation),
    "mean na" (dot chart showing mean number of other variables missing when
    the indicated variable is missing), or "na per var vs mean na", a
    scatterplot showing on the x-axis the fraction of NAs in the variable
    and on the y-axis the mean number of other variables that are NA when
    the indicated variable is NA.

minlev
    the minimum proportion of observations in a cell before that cell is
    combined with one or more cells. If more than one cell has fewer than
    minlev*n observations, all such cells are combined into a new cell
    labeled "OTHER". Otherwise, the lowest frequency cell is combined with
    the next lowest frequency cell, and the level name is the combination of
    the two old levels.

abbrev
    set to TRUE to abbreviate variable names for plotting or printing. Is
    set to TRUE automatically if legend.=TRUE.

slim
    2-vector specifying the range of similarity values for scaling the
    y-axes. By default this is the observed range over all of s.

slimds
    set slimds to TRUE to scale diagonals and off-diagonals separately

add
    set to TRUE to add similarities to an existing plot (usually specifying
    lty or col)

lty, col, lwd
    line type, color, or line thickness for plotMultSim

vname
    optional vector of variable names, in order, used in s

h
    relative height for subplot

w
    relative width for subplot

u
    relative extra height and width to leave unused inside the subplot. Also
    used as the space between y-axis tick mark labels and graph border.

labelx
    set to FALSE to suppress drawing of labels in the x direction

xspace
    amount of space, on a scale of 1:n where n is the number of variables,
    to set aside for y-axis labels

Details

options(contrasts= c("contr.treatment", "contr.poly")) is issued temporarily
by varclus to make sure that ordinary dummy variables are generated for
factor variables. Pass arguments to the dataframeReduce
<http://127.0.0.1:11568/library/Hmisc/help/dataframeReduce> function to
remove problematic variables (especially if analyzing all variables in a
data frame).
 Value

for varclus or naclus, a list of class varclus with elements call (containing
the calling statement), sim (similarity matrix), n (sample size used if x was
not a correlation matrix already - n is a matrix), hclust, the object
created by hclust, similarity, and method. For plot, returns the object
created by plclust. naclus also returns the two vectors listed under
description, and naplot returns an invisible vector that is the frequency
table of the number of missing variables per observation. plotMultSim invisibly
returns the limits of similarities used in constructing the y-axes of each
subplot. For similarity="ccbothpos" the hclust object is NULL.
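The missingness summaries that naclus returns can be approximated in base R. This is a minimal sketch that assumes only the definitions given in the Description (fraction of NAs in common, number of missing variables per observation, and mean number of other missing variables). It does not call Hmisc, and the object names are illustrative. The data frame is the one from the Examples section.

```r
# Sketch only: base R versions of the quantities naclus is described as
# computing. Object names here are illustrative, not Hmisc's.
df <- data.frame(a = c(1, 2, 3), b = c(1, 2, 3), c = c(1, 2, NA),
                 d = c(1, NA, 3), e = c(1, NA, 3), f = c(NA, NA, NA),
                 g = c(NA, 2, 3), h = c(NA, NA, 3))

m <- 1 * is.na(df)                  # 1 = missing, 0 = observed

# Fraction of observations in which both variables are missing;
# the diagonal is the fraction missing for each variable by itself
sim <- crossprod(m) / nrow(m)

# Number of missing variables in each observation
na_per_obs <- rowSums(m)

# For variable i: mean number of OTHER variables missing, taken over the
# observations in which variable i is missing (NaN if i is never missing)
mean_na <- sapply(seq_len(ncol(m)), function(i) {
  miss_i <- m[, i] == 1
  mean(na_per_obs[miss_i] - 1)
})
names(mean_na) <- colnames(m)

print(round(sim, 2))
print(na_per_obs)
print(mean_na)
```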

na.pattern creates an integer vector of frequencies.
Side Effects

plots
Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
f.harr...@vanderbilt.edu
 References

Sarle, WS: The VARCLUS Procedure. SAS/STAT User's Guide, 4th Edition, 1990.
Cary NC: SAS Institute, Inc.

Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat
19:546–57.
 See Also

hclust <http://127.0.0.1:11568/library/Hmisc/help/hclust>,
plclust <http://127.0.0.1:11568/library/Hmisc/help/plclust>,
hoeffd <http://127.0.0.1:11568/library/Hmisc/help/hoeffd>,
rcorr <http://127.0.0.1:11568/library/Hmisc/help/rcorr>,
cor <http://127.0.0.1:11568/library/Hmisc/help/cor>,
model.matrix <http://127.0.0.1:11568/library/Hmisc/help/model.matrix>,
locator <http://127.0.0.1:11568/library/Hmisc/help/locator>,
na.pattern <http://127.0.0.1:11568/library/Hmisc/help/na.pattern>
 Examples

set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1,x2,x3,x4)
v <- varclus(x, similarity="spear")  # spearman is the default anyway
v    # invokes print.varclus
print(round(v$sim,2))
plot(v)

# plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
# the -1 causes k dummies to be generated for k countries
# plot(varclus(~ age + factor(disease.code) - 1))
#
#
# use varclus(~., data= fracmiss= maxlevels= minprev=) to analyze all
# "useful" variables - see dataframeReduce for details about arguments

df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
                 e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
par(mfrow=c(2,2))
# .R. was an Hmisc flag separating R from S-Plus; only the R methods are kept.
# Note that "ward" is called "ward.D" in R >= 3.1.0.
for(m in c("ward","complete","median")) {
  plot(naclus(df, method=m))
  title(m)
}
naplot(naclus(df))
n <- naclus(df)
plot(n); naplot(n)
na.pattern(df)      # builtin function

x <- c(1, rep(2,11), rep(3,9))
combine.levels(x)
x <- c(1, 2, rep(3,20))
combine.levels(x)

# plotMultSim example: Plot proportion of observations
# for which two variables are both positive (diagonals
# show the proportion of observations for which the
# one variable is positive).  Chance-correct the
# off-diagonals by subtracting the product of the
# marginal proportions.  On each subplot the x-axis
# shows month (0, 4, 8, 12) and there is a separate
# curve for females and males
d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                month=sample(c(0,4,8,12),1000,TRUE),
                x1=sample(0:1,1000,TRUE),
                x2=sample(0:1,1000,TRUE),
                x3=sample(0:1,1000,TRUE))
s <- array(NA, c(3,3,4))
opar <- par(mar=c(0,0,4.1,0))  # waste less space
for(sx in c('female','male')) {
  for(i in 1:4) {
    mon <- (i-1)*4
    s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                      subset=d$month==mon & d$sex==sx)$sim
    }
  plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
              add=sx=='male', slimds=TRUE,
              lty=1+(sx=='male'))
  # slimds=TRUE causes separate  scaling for diagonals and
  # off-diagonals
}
par(opar)
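
To see what sim='ccbothpos' puts into each slice of s in the example above, the chance-corrected measure can be computed by hand. This is a minimal base R sketch built from the definition in the Arguments section (proportion both positive minus the product of the marginal proportions, with the plain proportion positive on the diagonal). It assumes complete data with no NAs and is not the Hmisc code; the simulated matrix b is made up.

```r
# Sketch only: chance-corrected "both positive" measure from its definition.
# Assumes no missing values; b is simulated for illustration.
set.seed(2)
b <- matrix(rbinom(300, 1, 0.5), ncol = 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))

pos  <- 1 * (b > 0)                    # indicator of a positive value
both <- crossprod(pos) / nrow(pos)     # proportion with both variables positive
p    <- colMeans(pos)                  # marginal proportion positive

cc <- both - outer(p, p)               # expected to be near 0 under independence
diag(cc) <- p                          # diagonal keeps the plain proportion

print(round(cc, 3))
```

Because the diagonal and off-diagonal entries are on different scales, this matrix is not a similarity matrix, which is consistent with the note that no clustering is done for "ccbothpos".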

------------------------------
[Package *Hmisc* version 3.8-3 Index <http://127.0.0.1:11568/library/Hmisc/html/00Index.html>]

Websites-
http://decisionstats.com

On Fri, Oct 7, 2011 at 18:19, Ajay Ohri <ohri2...@gmail.com> wrote:

>
> Dear List
>
> What is the R package equivalent of Proc Varclus or Information Value? Any
> assistance in determining R equivalents of Oblique Component Analysis
> (PROC VARCLUS), Information Value (IV) and Weight Of Evidence (WOE)
> analysis, and business intelligence
>
> http://www.nesug.org/proceedings/nesug06/an/da23.pdf
>
> Regards,
>
> Ajay
> Websites-
> http://decisionstats.com
>
>
>
>
>


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
