[R] Re: projection pursuit

2004-03-18 Thread Jim_Garrett
Luis,

See the fastICA package, in particular the final example on the help page
for the function fastICA.  This doesn't leave you with density estimates, but
with projection-pursuit directions; you still have to figure out how to fit
a density estimate to the rotated data.  Actually, as I understand it,
XGobi also finds directions but does not fit a density.  However, any
multivariate density estimator ought to be applicable.  I'm not aware at
the moment of the tools R offers for multivariate density estimation, but
I'm sure there are multiple possibilities.

_The Elements of Statistical Learning_ by Hastie, Tibshirani, and Friedman
mentions a "trick" for using classification tools (which model class
probabilities) to estimate a density.  Fundamentally, generate data from a
known reference distribution, pool it with the observed data, and use the
classification tool to estimate the probability that a point is observed
data (as opposed to generated data) as a function of the inputs.  The
resulting probabilities, converted to odds and normalized to integrate to 1,
form a density estimate when the reference distribution is uniform.  Since
there are so very many classification tools available, this trick offers a
lot of flexibility.
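
To make the trick concrete, here is a minimal one-dimensional sketch,
assuming a uniform reference distribution and using plain logistic
regression (with a quadratic term) as the classification tool; any
classifier that estimates class probabilities would do:

set.seed(123)
x.obs <- rnorm(500)                          # stand-in for the rotated data
x.ref <- runif(500, min(x.obs), max(x.obs))  # same-size uniform reference
d <- data.frame(x = c(x.obs, x.ref), real = rep(1:0, each = 500))
fit <- glm(real ~ x + I(x^2), family = binomial, data = d)
## With equal sample sizes, the odds p/(1 - p) estimate f(x)/g(x);
## g is uniform here, so the odds are proportional to f(x).
grid <- seq(min(x.obs), max(x.obs), length = 200)
p <- predict(fit, newdata = data.frame(x = grid), type = "response")
odds <- p / (1 - p)
f.hat <- odds / (sum(odds) * diff(grid)[1])  # normalize to integrate to 1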

Good luck.

Jim Garrett
Baltimore, Maryland, USA




[R] Loading SparseM on Win2K

2004-02-26 Thread Jim_Garrett
I'm having trouble loading the package SparseM in R 1.8.1, OS = Windows
2000.

Installing appeared to go well; I saw no error messages, html documentation
was installed, and "installed.packages()" lists SparseM among the installed
packages.

When I try to load the library, however, I get the following:

> library(SparseM)
Error in slot(mlist, "argument") : Can't get a slot ("argument") from an
object of type "NULL"
Error in print("SparseM library loaded") :
S language method selection got an error when called from internal
dispatch for function "print"
Error in library(SparseM) : .First.lib failed

Does anyone have an idea what could be wrong?  Or what I should do next to
diagnose the problem?

Thanks,

Jim Garrett



[R] Calculate Closest 5 Cases?

2004-02-16 Thread Jim_Garrett
Danny,

Here's another approach that doesn't use sorting.  Instead, after
calculating distances it puts a threshold on distance and counts how many
cases fall within the threshold.  A search over thresholds then finds one
yielding the desired number of cases, and the case numbers satisfying that
threshold are returned.

Why go to this bother to avoid sorting?  The computation required to sort N
distances grows faster than linearly in N, while a single pass through a
vector of distances to count values below a threshold is linear in N, so
for some N this approach ought to be more efficient than one based on
sorting.  Well, maybe, if the number of iterations grows slowly enough with
N.  Call it a conjecture.  On the other hand, the N for which this occurs
depends on the efficiency of the sort algorithm (pretty darned efficient
and using compiled code, I imagine) and the efficiency of the search
algorithm.  Here I use uniroot, which isn't really designed with step
functions in mind, so perhaps there are possible improvements.  So I
couldn't say whether this will be faster than the other suggestions you've
gotten, either on your data or on data one is likely to ever see.

If there are ties for the 6th-nearest case (counting the point itself),
this routine should either include or exclude all the ties, whichever comes
closest to yielding 5 points.

Here's the code:

## Args:
## CaseNo:    Row number of case for which to find nearest neighbors.
## x:         A numeric matrix.
## k:         The desired number of neighbors, not counting the point of
##            interest.
## TempIndex: Index of row numbers.  If you use this function many, many
##            times you might gain an iota of efficiency by creating this
##            vector once and passing it in as an argument with each call.
##            Probably not worth the trouble, but I couldn't help myself.
## verbose:   Tells you how many iterations uniroot used, if you're curious.
## Value:
## A vector of (hopefully) k row numbers indicating the k nearest neighbors.
nearestKNeighbors <- function(CaseNo, x, k, TempIndex = 1:nrow(x),
                              verbose = FALSE) {
  ## make sure x is a matrix
  if(!is.matrix(x))
stop("x must be a matrix.")

  TempVect <- x[CaseNo, ]
  SquaredDistances <- apply(x, 1, function(s) sum((s - TempVect)^2) )

  ## tempFun is positive once threshold h captures more than k cases
  ## (the "+ 1" accounts for the point itself); uniroot hunts for the
  ## crossing point.
  tempFun <- function(h) sum(SquaredDistances < h^2) - (k + 1)
  TempSolution <- uniroot(tempFun, interval = range(SquaredDistances))
  if(verbose)
cat("Required", TempSolution$iter, "iterations.\n")

  sort(setdiff(TempIndex[SquaredDistances < TempSolution$root^2], CaseNo))
}

You would apply this to each row of your dataset using some variety of
"apply" or a loop.  I don't know if memory usage would differ.  In the
event of ties, you may not have 5 neighbors, so I might put the results in
a list, which can accommodate different lengths in each component (indeed,
it can accommodate completely different data structures):

## data in matrix temp.matrix
ListOutput <-
  lapply(as.list(1:nrow(temp.matrix)),
 function(s) nearestKNeighbors(s, temp.matrix, 5))

or

ListOutput <-
  vector(mode = "list", length = nrow(temp.matrix))
for(i in 1:nrow(temp.matrix))
  ListOutput[[i]] <- nearestKNeighbors(i, temp.matrix, 5)

and then you can manipulate the list however you like.  For instance, to
see if all components have length 5,

all(unlist(lapply(ListOutput, length)) == 5)

or, if there are such components, find out which one(s):

(1:length(ListOutput))[unlist(lapply(ListOutput, length)) != 5]
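
For comparison, the sort-based alternative discussed above is essentially a
one-liner (a sketch, using the same squared distances as nearestKNeighbors;
ties at distance zero aside, the point itself sorts first and is dropped):

nearestKNeighborsSort <- function(CaseNo, x, k) {
  SquaredDistances <- apply(x, 1, function(s) sum((s - x[CaseNo, ])^2))
  order(SquaredDistances)[2:(k + 1)]
}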

Welcome to R!

Best,

-Jim Garrett

***

> I've only begun investigating R as a substitute for SPSS.

> I have a need to identify for each CASE the closest (or most similar) 5
> other CASES (not including itself as it is automatically the closest).  I
> have a fairly large matrix (5 cases by 50 vars).  In SPSS, I can use
> Correlate > Distances to generate a matrix of similarity, but only on a
> small sample.  The entire matrix cannot be processed at once due to
> memory limitations.

> The data are all percents, so they are easily comparable.

> Is there any way to do this in R?

> Below is a small sample of the data (from SPSS) and the desired output.

> Thanks,

> Danny




[R] Job opportunity

2004-02-04 Thread Jim_Garrett
I'm pleased to announce that my employer, Becton Dickinson, has an open
position for software implementation in the Matlab and S languages.  If
interested, send a cover letter and resume (or CV) to Dr. Richard Moore at
[EMAIL PROTECTED].  If you have questions, contact me (at
[EMAIL PROTECTED]) or Dr. Moore.

Below find a position description.

Jim Garrett
Becton Dickinson Diagnostics
Baltimore, Maryland, USA

***

BECTON DICKINSON DIAGNOSTIC SYSTEMS
POSITION DESCRIPTION

POSITION TITLE: Senior Engineer, Algorithm Implementation

DEPARTMENT: 10148014, Systems Engineering

REPORTS TO: Richard Moore, Manager Systems Engineering

Date: 1/6/04

The statements below are intended to describe the general nature and
level of work being performed by associates assigned to this job. This
job description is not intended to be an exhaustive list of all
responsibilities, duties, and skills required of associates so
classified.

JOB SUMMARY:

Works closely with senior members of the R&D engineering and
statistics group to implement and optimize algorithms in the areas
of signal processing, classification, and other fields relative to
in vitro diagnostics.

DUTIES AND RESPONSIBILITIES:

* Uses and integrates "high-level" tools and packages to perform data
  analysis and implement algorithms designed by R&D scientists,
  engineers, and statisticians
* Optimizes existing tools and algorithms using low-level languages as
  required
* Applies analyses to large data sets in a database environment
* Modifies databases to match analysis needs
* Adheres to R&D software development, documentation, and validation
  standards

QUALIFICATIONS:

KNOWLEDGE AND SKILLS:
* Experience with Matlab is required
* Experience with C or C++ is required
* Experience with relational database management systems is required
* Experience with the S language (R or S-Plus) is desirable
* Experience with signal processing techniques is desirable
* Experience with object-oriented software design is desirable
* Knowledge of linear algebra and numerical analysis is desirable
* Strong interpersonal skills
* Strong oral and written communication skills
* Strong problem solving skills and attention to detail

EDUCATION AND EXPERIENCE:
* B.S. and 5 years experience or M.S. and 3 years experience in
  Applied Math, Computer Science, Engineering or related field is
  required.




RE: [R] multidimensional function fitting

2003-03-03 Thread Jim_Garrett


Have you considered fitting a neural net using the nnet package?  Since
nets are determined by a functional form that is straightforward
(recursive, but straightforward) and a set of coefficients (weights),
they're fairly easy to program from scratch.  There are no "exotic" (to
some) basis functions to deal with.  You can find descriptions sufficient
to guide coding in Venables and Ripley's _Modern Applied Statistics with
S-Plus_, Ripley's _Pattern Recognition and Neural Networks_, and also _The
Elements of Statistical Learning_ by Hastie, Tibshirani, and Friedman.
I've found each of these to be a good general reference, by the way--if
you're going to be dealing with "curve fitting" on more than this one
occasion, one or all of these would be good to have.  Ripley's pattern
recognition book naturally focuses on classification rather than general
curve-fitting, though many ideas are common to both.

Fitting a net will put more of a burden on you to optimize the model than
mgcv will.  I would suggest the following guidelines, most of which I
learned from Ripley's pattern recognition book (a sketch automating a
couple of them follows the list):

* Any net "fitting" should actually consist of multiple fits from random
  starting points--take the fit that minimizes the fitting criterion.
* Scale predictors if necessary to ensure that predictors are on similar
  scales.
* For curve fitting, use the options "linout = TRUE" and "skip = TRUE"
  (the latter is optional but recommended).
* You must optimize some criterion that keeps you from overfitting.  Two
  techniques I have used successfully are (a) choosing the number of
  hidden units and the weight-decay parameter to optimize cross-validation
  performance, and (b) choosing the number of hidden units to optimize the
  Bayesian Information Criterion (BIC), setting weight decay to zero.  The
  former should give you better performance, but the latter is less work
  and usually works pretty well.  BIC penalizes the number of parameters,
  and here the "number of parameters" is taken to be the number of
  weights.  I could send you R code that automates this.
* Using weight decay > 0 generally improves predictive performance.  Once
  you have enough hidden units, you will not overfit if you use a suitable
  weight-decay parameter.  In fact, a reasonable strategy would be to use
  BIC to select the number of hidden units, then add (say) 2 hidden units
  and find the weight decay that optimizes cross-validation performance.
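
Here is a minimal sketch of the multiple-starts and BIC ideas, assuming a
data frame d with response y and predictors x1 and x2 (placeholder names):

library(nnet)

## Fit from several random starting points and keep the best fit.
## With decay = 0 and linout = TRUE, fit$value is the residual sum of
## squares, so a rough Gaussian BIC is n*log(RSS/n) + (#weights)*log(n).
fitNetBIC <- function(formula, data, size, nStarts = 10) {
  best <- NULL
  for (i in 1:nStarts) {
    fit <- nnet(formula, data = data, size = size, decay = 0,
                linout = TRUE, skip = TRUE, trace = FALSE)
    if (is.null(best) || fit$value < best$value)
      best <- fit
  }
  n <- nrow(data)
  best$bic <- n * log(best$value / n) + length(best$wts) * log(n)
  best
}

## Compare candidate numbers of hidden units by BIC:
## fits <- lapply(1:6, function(h) fitNetBIC(y ~ x1 + x2, d, size = h))
## sapply(fits, function(f) f$bic)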

In short, mgcv makes model optimization convenient but has a price in
implementation, while neural nets make implementation convenient but have a
price in optimization.  I sometimes fit a model which is then implemented
by our Software Engineering group in C, and in that situation I prefer to
pay the price that falls only on me and which I can afford to pay.  Hence I
have used neural nets.  In fact early on I specified for our software group
a "neural net evaluator" C routine that takes some net architectural
parameters (number of inputs, hidden nodes, etc.) and weights in the order
that the nnet package displays them.  Now I can essentially fit a net with
package nnet, send someone the parameters, and they'll have it running in C
in short order.

Of course, GAM's also offer much in the way of model interpretation, and if
the phenomenon at hand obeys additivity, a GAM will outperform a neural
net.  I should add that I use the mgcv package frequently, and like it very
much.

Regarding computational resources, with some datasets I've fitted complex
GAM's, neural nets, and projection pursuit (ppr in package modreg), and
didn't notice much difference in a very rough, qualitative sense.  With
enough data, they all take a while.  I didn't investigate memory usage.  Of
course, gam and ppr are doing more work--gam is optimizing smoothness, and
ppr is fitting a range of models.  If you have so much data that none of
these are feasible,  you could do worse than fitting to a random subset.
How much precision do you need anyway?  And how complex a model do you
anticipate?  1000 cases or fewer is probably more than enough for most
regression problems.  Set aside the rest, and you have a large validation
set!  In fact you might fit multiple models, each to a random subset,
implement all the models, and average their responses.

Further along these lines, you could fit quite a few regression trees using
rpart because it's very fast.  You could even do thorough cross-validation
without great demands in time or memory.  A tree won't offer a smooth
response function, but if you average enough of them, the granularity of
response should be small.  Trees, like nets, are easy to code from scratch.
This just about describes bagging, except that the multiple subsets in
bagging are generated by bootstrap sampling, so are not mutually exclusive,
can contain cases more than once, and are of the same size as the original
data.
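
To make that concrete, here is a bagging-style sketch with rpart, assuming
a data frame d with a numeric response y (placeholder names):

library(rpart)

## Fit trees to bootstrap samples and average their predictions;
## averaging many trees smooths the granularity of any single tree.
fits <- lapply(1:25, function(i) {
  idx <- sample(nrow(d), replace = TRUE)
  rpart(y ~ ., data = d[idx, ])
})
bagged.pred <- rowMeans(sapply(fits, predict, newdata = d))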

Good luck,

Jim Garrett
Becton Dickinson Diagnostic Systems
Baltimore, Maryland, USA

[R] Re: R-help digest, Vol 1 #89 - 53 msgs

2003-02-26 Thread Jim_Garrett


I couldn't resist tossing in my two-cents' worth on this, because R has
some language features that allow you to use efficient optimization
routines in a straightforward, elegant way.  I'm particularly enthusiastic
about this because I have suffered through other languages, which made this
approach either painful or impossible, depending on the problem.  Thanks to
the R development team for their great work!

Jose has offered a "brute force" approach, evaluating distance on a fine
grid of points.  This certainly is very robust.  However, using an
optimization routine can lead to fewer evaluations than evaluating on a
fine grid of points, and can give resolution that is finer than the grid.
You may wish to combine the two, and start the optimization from a grid of
starting points.  This grid need not be very fine, though.  And in the
one-dimensional case, the "optimize" function appears to be very robust
(perhaps it already incorporates this strategy).

For any search to work, you need the curve to be "parameterized" in some
sense, allowing you to identify an infinite number of points lying on the
curve as you vary some underlying parameter.  For instance, perhaps you've
fitted a nonparametric regression model allowing you to make predictions
for any point x.  Or you have some other means to calculate a y value,
given any x, such that (x, y) lies on the curve.

The overall strategy I'm enthusiastic about is to create a loss function
that's specific to the point in question, and then use an optimization
routine to, well, optimize.  And because you'll want to do this again and
again, for convenience you can write a function that makes point-specific
loss functions (this is the part that is so straightforward in R, relative
to many other languages).

Here's an example:

example.data <- data.frame(x = seq(-1, 1, length = 200))
set.seed(123)
example.data$y <- example.data$x^2 + rnorm(200, sd = 0.2)
library(modreg)
example.spline <- smooth.spline(example.data)
plot(example.data)
lines(example.spline)
X1 <- c(0.25, 0.75)
points(X1[1], X1[2], pch = "x")

# What's the point on the curve defined by example.spline
# that's closest to X1?

# Euclidean distance
MyDistance <- function(x1, x2)
  {
sqrt(sum((x2 - x1)^2))
  }

# For convenient re-use, a function that makes point-specific loss
# functions.  (For some model objects, the predict method requires
# something like "predict(CurveObject, newdata = data.frame(x = s))".)
MakeLoss <- function(x, CurveObject)
  {
function(s) MyDistance(x, c(s, predict(CurveObject, s)$y))
  }

# Now make the loss function for X1...
MyLoss <- MakeLoss(X1, example.spline)

# ...and optimize.  For a higher-dimensional problem, use "optim"
# rather than "optimize".
closest.x <- optimize(MyLoss, interval = c(-1, 1))$minimum
lines(c(X1[1], closest.x),
  c(X1[2], predict(example.spline, closest.x)$y))

Good luck!

-Jim Garrett
Becton Dickinson Diagnostic Systems
Baltimore, Maryland, USA




[R] Re: Problems for 13 year old

2003-01-27 Thread Jim_Garrett

How about spam filtering?

Granted, there's some infrastructure involved, which means gratification is
not instant.  But it involves something that most people who use computers
care about:  e-mail, and spam.

I mention this because the following web site sparked some interest in
statistics among some acquaintances who were otherwise very cool to it:

 http://www.paulgraham.com/spam.html

This outlines a "Bayesian" spam filter.  I'm not sure it's wholly Bayesian,
but it comes close, the author's arguments are good, and I hear that it
performs well, in fact better than many commercial spam filters.
Moreover, the web site virtually gushes about the virtues of statistical
methods.  The interesting thing about the filter is that you get to see
what "features" it's discovering.

A quick search also indicated that Mozilla apparently offers a plug-in for
the same spam filter.  That would offer a quick way to get the filter up
and running with real e-mail.  But I don't know if Mozilla offers
interesting diagnostics about which features it's using, which is the
pedagogically interesting part.  Mozilla mentions it here:

 http://www.mozilla.org/mailnews/spam.html

Of course, you can use any number of classification techniques to
distinguish spam from other e-mail; you just need data.  Hastie,
Tibshirani, and Friedman's _The Elements of Statistical Learning_
demonstrates a couple of types of models applied to the spam problem, and
points to data at

 ftp.ics.uci.edu

Ideally, you would do some exploration to design a filter, implement it in
R, and then integrate it with your nephew's e-mail program.  This would be
a long-term project, maybe even a science-fair project, with long-term
benefits (educational and practical).  I know this can be done with Linux,
but I have no idea about Mac OS 9!  It's probably a stretch for typical
13-year-olds, but for the right 13-year-old, it would be a blast.

Good luck!

Jim Garrett
Baltimore, Maryland, USA





[R] Re: pasting "\" into character strings

2002-12-23 Thread Jim_Garrett

Is it really necessary to use "\"?

On my Windows 2000 system here at work, it appears that the DOS shell
interprets "/" as "\", even though "\" is the "official" Windows folder
delimiter.  For instance, I can type "C:/Program
Files/R/rw1061/bin/Rgui.exe" (quotes included) and R opens.  Using "/"
would simplify matters, but are there pitfalls?  Do all Windows systems
exhibit this behavior?
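
For what it's worth, within R itself either separator names the same file
(the backslash just has to be doubled inside an R string literal):

f1 <- "C:\\Program Files\\R"   # escaped backslashes
f2 <- "C:/Program Files/R"     # forward slashes
file.exists(f1)                # same result for both forms
file.exists(f2)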

Just curious.

Jim Garrett
Becton Dickinson Diagnostic Systems
Baltimore, Maryland, USA

