[R] Re: projection pursuit
Luis,

See the fastICA package, in particular the final example on the help page for the function fastICA. This doesn't leave you with density estimates, but with projection-pursuit directions; you still have to figure out how to fit a density estimate to the rotated data. As I understand it, XGobi also finds directions but does not fit a density. However, any multivariate density estimator ought to be applicable. I can't say offhand what tools R offers for multivariate density estimation, but I'm sure there are multiple possibilities.

_The Elements of Statistical Learning_ by Hastie, Tibshirani, and Friedman mentions a "trick" for using classification tools (which model class probabilities) to estimate a density. Fundamentally, you generate data from a reference distribution, then use the classification tool to estimate the probability that a point is observed data (as opposed to generated data) as a function of the inputs. These probabilities, normalized so the resulting estimate integrates to 1, form a density estimate. Since there are so very many classification tools available, this trick offers a lot of flexibility.

Good luck.

Jim Garrett
Baltimore, Maryland, USA

** This message is intended only for the designated recipient(s)...{{dropped}}

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
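A rough one-dimensional sketch of that trick, using logistic regression (glm) and a uniform reference distribution; all names and data here are purely illustrative:

```r
## Sketch: density estimation via a classifier.
set.seed(1)
x.obs <- rnorm(500)                       # "observed" data

## Reference sample, uniform over the observed range:
lo <- min(x.obs); hi <- max(x.obs)
x.ref <- runif(500, lo, hi)

## Stack and label: 1 = observed, 0 = reference.
d <- data.frame(x = c(x.obs, x.ref), y = rep(1:0, each = 500))
fit <- glm(y ~ x + I(x^2), family = binomial, data = d)

## With equal sample sizes, p/(1 - p) estimates f(x)/g(x), where g is
## the uniform reference density 1/(hi - lo); rescale to recover f.
## (Renormalize numerically if the estimate must integrate to 1 exactly.)
p <- predict(fit, newdata = data.frame(x = x.obs), type = "response")
f.hat <- (p / (1 - p)) / (hi - lo)
```

Any classifier that outputs class probabilities could stand in for glm here, which is the source of the flexibility mentioned above.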
[R] Loading SparseM on Win2K
I'm having trouble loading the package SparseM in R 1.8.1 on Windows 2000. Installation appeared to go well: I saw no error messages, the HTML documentation was installed, and installed.packages() lists SparseM among the installed packages. When I try to load the package, however, I get the following:

> library(SparseM)
Error in slot(mlist, "argument") : Can't get a slot ("argument") from an
        object of type "NULL"
Error in print("SparseM library loaded") : S language method selection got
        an error when called from internal dispatch for function "print"
Error in library(SparseM) : .First.lib failed

Does anyone have an idea what could be wrong? Or what I should do next to diagnose the problem?

Thanks,

Jim Garrett
[R] Calculate Closest 5 Cases?
Danny,

Here's another approach that doesn't use sorting. Instead, after calculating distances it considers a threshold on distance and counts how many cases fall within that threshold. A search over thresholds is then conducted to find one yielding the desired number of cases, and the case numbers satisfying the found threshold are returned.

Why go to this bother to avoid sorting? The computation required to sort N distances grows more quickly than the computation required to make a single pass through a vector of distances and count values less than a threshold, so for some N this approach ought to be more efficient than one based on sorting. Well, maybe, if the number of search iterations grows slowly enough with N; call it a conjecture. On the other hand, the N for which this occurs depends on the efficiency of the sort algorithm (pretty darned efficient and using compiled code, I imagine) and the efficiency of the search algorithm. Here I use uniroot, which isn't really designed with step functions in mind, so there may be room for improvement. So I couldn't say whether this will run faster than the other suggestions you've gotten on your data, or even on data one is ever likely to see.

If there are ties for the 6th-nearest case (counting the point itself), this routine should either include or exclude all the ties, whichever comes closest to yielding 5 points.

Here's the code:

## Args:
##   CaseNo:    Row number of the case for which to find nearest neighbors.
##   x:         A numeric matrix.
##   k:         The desired number of neighbors, not counting the point of
##              interest.
##   TempIndex: Index of row numbers. If you call this function many, many
##              times you might gain an iota of efficiency by creating this
##              vector once and passing it in as an argument with each call.
##              Probably not worth the trouble, but I couldn't help myself.
##   verbose:   Tells you how many iterations uniroot used, if you're curious.
## Value:
##   A vector of (hopefully) k row numbers indicating the k nearest neighbors.
nearestKNeighbors <- function(CaseNo, x, k, TempIndex = 1:nrow(x),
                              verbose = FALSE) {
    ## make sure x is a matrix
    if (!is.matrix(x))
        stop("x must be a matrix.")
    TempVect <- x[CaseNo, ]
    SquaredDistances <- apply(x, 1, function(s) sum((s - TempVect)^2))
    tempFun <- function(h) sum(SquaredDistances < h^2) - (k + 1)
    ## tempFun takes a distance, not a squared distance, so search over
    ## the square roots of the squared distances:
    TempSolution <- uniroot(tempFun, interval = sqrt(range(SquaredDistances)))
    if (verbose)
        cat("Required", TempSolution$iter, "iterations.\n")
    sort(setdiff(TempIndex[SquaredDistances < TempSolution$root^2], CaseNo))
}

You would apply this to each row of your dataset using some variety of "apply" or a loop; I don't know whether memory usage would differ. In the event of ties you may not get exactly 5 neighbors, so I might put the results in a list, which can accommodate different lengths in each component (indeed, it can accommodate completely different data structures):

## data in matrix temp.matrix
ListOutput <- lapply(as.list(1:nrow(temp.matrix)),
                     function(s) nearestKNeighbors(s, temp.matrix, 5))

or

ListOutput <- vector(nrow(temp.matrix), mode = "list")
for (i in 1:nrow(temp.matrix))
    ListOutput[[i]] <- nearestKNeighbors(i, temp.matrix, 5)

and then you can manipulate the list however you like. For instance, to see whether all components have length 5:

all(unlist(lapply(ListOutput, length)) == 5)

or, if some don't, to find out which one(s):

(1:length(ListOutput))[unlist(lapply(ListOutput, length)) != 5]

Welcome to R!

Best,

-Jim Garrett

***

> I've only begun investigating R as a substitute for SPSS.
> I have a need to identify for each CASE the closest (or most similar) 5
> other CASES (not including itself as it is automatically the closest). I
> have a fairly large matrix (5 cases by 50 vars). In SPSS, I can use
> Correlate > Distances to generate a matrix of similarity, but only on a
> small sample. The entire matrix can not be processed at once due to memory
> limitations. The data are all percents, so they are easily comparable.
> Is there any way to do this in R?
> Below is a small sample of the data (from SPSS) and the desired output.
> Thanks,
> Danny
[R] Job opportunity
I'm pleased to announce that my employer, Becton Dickinson, has an open position for software implementation in the Matlab and S languages. If interested, send a cover letter and resume (or CV) to Dr. Richard Moore at [EMAIL PROTECTED]. If you have questions, contact me (at [EMAIL PROTECTED]) or Dr. Moore. A position description follows.

Jim Garrett
Becton Dickinson Diagnostics
Baltimore, Maryland, USA

***

BECTON DICKINSON DIAGNOSTIC SYSTEMS
POSITION DESCRIPTION

POSITION TITLE: Senior Engineer, Algorithm Implementation
DEPARTMENT: 10148014, Systems Engineering
REPORTS TO: Richard Moore, Manager, Systems Engineering
Date: 1/6/04

The statements below are intended to describe the general nature and level of work being performed by associates assigned to this job. This job description is not intended to be an exhaustive list of all responsibilities, duties, and skills required of associates so classified.

JOB SUMMARY:
Works closely with senior members of the R&D engineering and statistics group to implement and optimize algorithms in the areas of signal processing, classification, and other fields relevant to in vitro diagnostics.
DUTIES AND RESPONSIBILITIES:
* Uses and integrates "high-level" tools and packages to perform data analysis and implement algorithms designed by R&D scientists, engineers, and statisticians
* Optimizes existing tools and algorithms using low-level languages as required
* Applies analyses to large data sets in a database environment
* Modifies databases to match analysis needs
* Adheres to R&D software development, documentation, and validation standards

QUALIFICATIONS:

KNOWLEDGE AND SKILLS:
* Experience with Matlab is required
* Experience with C or C++ is required
* Experience with relational database management systems is required
* Experience with the S language (R or S-Plus) is desirable
* Experience with signal processing techniques is desirable
* Experience with object-oriented software design is desirable
* Knowledge of linear algebra and numerical analysis is desirable
* Strong interpersonal skills
* Strong oral and written communication skills
* Strong problem-solving skills and attention to detail

EDUCATION AND EXPERIENCE:
* B.S. and 5 years' experience, or M.S. and 3 years' experience, in Applied Math, Computer Science, Engineering, or a related field is required.
RE: [R] multidimensional function fitting
Have you considered fitting a neural net using the nnet package? Since nets are determined by a functional form that is straightforward (recursive, but straightforward) and a set of coefficients (weights), they're fairly easy to program from scratch. There are no "exotic" (to some) basis functions to deal with. You can find descriptions sufficient to guide coding in Venables and Ripley's _Modern Applied Statistics with S-Plus_, Ripley's _Pattern Recognition and Neural Networks_, and _The Elements of Statistical Learning_ by Hastie, Tibshirani, and Friedman. I've found each of these to be a good general reference, by the way; if you're going to be dealing with "curve fitting" on more than this one occasion, one or all of them would be good to have. Ripley's pattern recognition book naturally focuses on classification rather than general curve fitting, though many ideas are common to both.

Fitting a net will put more of a burden on you to optimize the model than mgcv will. I would suggest the following guidelines (most of which I learned from Ripley's pattern recognition book):

* Any net "fitting" should actually consist of multiple fits from random starting points; take the fit that minimizes the fitting criterion.
* Scale predictors if necessary to ensure that they are on similar scales.
* For curve fitting, use the options linout = TRUE and skip = TRUE (the latter is optional but recommended).
* You must optimize some criterion that keeps you from overfitting. Two techniques I have used successfully are (a) choosing the number of hidden units and the weight-decay parameter to optimize cross-validation performance, and (b) choosing the number of hidden units to optimize the Bayesian Information Criterion (BIC), with weight decay set to zero. The former should give you better performance, but the latter is less work and usually works pretty well. BIC penalizes the number of parameters, and here the "number of parameters" is taken to be the number of weights.
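A rough sketch of the multiple-restart and BIC pieces (the function names are just for illustration, not from any package):

```r
library(nnet)

## Fit a net several times from random starting points and keep the fit
## with the smallest value of the fitting criterion; a sketch only.
fitBestNet <- function(x, y, size, nTries = 10, decay = 0) {
    best <- NULL
    for (i in 1:nTries) {
        fit <- nnet(x, y, size = size, linout = TRUE, skip = TRUE,
                    decay = decay, trace = FALSE)
        ## fit$value is the residual sum of squares when decay = 0
        if (is.null(best) || fit$value < best$value)
            best <- fit
    }
    best
}

## BIC with the "number of parameters" taken to be the number of weights
## (fit assumed to come from fitBestNet with decay = 0, so fit$value = RSS):
netBIC <- function(fit, n) {
    n * log(fit$value / n) + log(n) * length(fit$wts)
}
```

You would call fitBestNet over a range of sizes and keep the size minimizing netBIC.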
I could send you R code that automates this.

Using weight decay > 0 generally improves predictive performance. Once you have enough hidden units, you will not overfit if you use a suitable weight-decay parameter. In fact, a reasonable strategy would be to use BIC to select the number of hidden units, then add (say) 2 hidden units and find the weight decay that optimizes cross-validation performance.

In short, mgcv makes model optimization convenient but exacts a price in implementation, while neural nets make implementation convenient but exact a price in optimization. I sometimes fit a model that is then implemented by our Software Engineering group in C, and in that situation I prefer to pay the price that falls only on me and that I can afford to pay; hence I have used neural nets. In fact, early on I specified for our software group a "neural net evaluator" C routine that takes some net architectural parameters (number of inputs, hidden nodes, etc.) and the weights in the order that the nnet package displays them. Now I can essentially fit a net with package nnet, send someone the parameters, and they'll have it running in C in short order.

Of course, GAMs also offer much in the way of model interpretation, and if the phenomenon at hand obeys additivity, a GAM will outperform a neural net. I should add that I use the mgcv package frequently and like it very much.

Regarding computational resources: with some datasets I've fitted complex GAMs, neural nets, and projection pursuit (ppr in package modreg), and didn't notice much difference in a very rough, qualitative sense. With enough data, they all take a while. I didn't investigate memory usage. Of course, gam and ppr are doing more work: gam is optimizing smoothness, and ppr is fitting a range of models.

If you have so much data that none of these is feasible, you could do worse than fitting to a random subset. How much precision do you need anyway? And how complex a model do you anticipate?
1000 cases or fewer is probably more than enough for most regression problems. Set aside the rest, and you have a large validation set! In fact, you might fit multiple models, each to a random subset, implement all the models, and average their responses.

Further along these lines, you could fit quite a few regression trees using rpart, because it's very fast; you could even do thorough cross-validation without great demands on time or memory. A tree won't offer a smooth response function, but if you average enough of them, the granularity of the response should be small. Trees, like nets, are easy to code from scratch. This just about describes bagging, except that the multiple subsets in bagging are generated by bootstrap sampling, so they are not mutually exclusive, can contain cases more than once, and are of the same size as the original data.

Good luck,

Jim Garrett
Becton Dickinson Diagnostic Systems
Baltimore, Maryland
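A sketch of the subset-averaging idea with rpart (hypothetical function names; here the subsets are disjoint random subsets rather than the bootstrap samples bagging would use, and a numeric response named y is assumed):

```r
library(rpart)

## Fit one tree per disjoint random subset of the rows; a sketch only.
fitTreeEnsemble <- function(data, nModels, subsetSize) {
    idx <- sample(nrow(data))        # a random permutation of row numbers
    lapply(1:nModels, function(i) {
        rows <- idx[((i - 1) * subsetSize + 1):(i * subsetSize)]
        rpart(y ~ ., data = data[rows, ])
    })
}

## Average the trees' predictions to smooth out their granularity.
predictEnsemble <- function(models, newdata) {
    rowMeans(sapply(models, predict, newdata = newdata))
}
```

With, say, nModels = 20 and subsetSize = 500, each tree fits quickly and the averaged response is much smoother than any single tree's.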
[R] Re: R-help digest, Vol 1 #89 - 53 msgs
I couldn't resist tossing in my two cents' worth on this, because R has some language features that let you use efficient optimization routines in a straightforward, elegant way. I'm particularly enthusiastic about this because I have suffered through other languages that made this approach either painful or impossible, depending on the problem. Thanks to the R development team for their great work!

Jose has offered a "brute force" approach, evaluating distance on a fine grid of points. This certainly is very robust. However, using an optimization routine can lead to fewer evaluations than evaluating on a fine grid, and can give resolution finer than the grid. You may wish to combine the two and start the optimization from a grid of starting points; this grid need not be very fine, though. And in the one-dimensional case, the "optimize" function appears to be very robust (perhaps it already incorporates this strategy).

For any search to work, you need the curve to be "parameterized" in some sense, allowing you to identify an infinite number of points lying on the curve as you vary some underlying parameter. For instance, perhaps you've fitted a nonparametric regression model that lets you make a prediction for any point x. Or you have some other means to calculate a y value, given any x, such that (x, y) is on the curve.

The overall strategy I'm enthusiastic about is to create a loss function that's specific to the point in question, and then use an optimization routine to, well, optimize. And because you'll want to do this again and again, for convenience you can write a function that makes point-specific loss functions (this is the part that is so straightforward in R, relative to many other languages).
Here's an example:

example.data <- data.frame(x = seq(-1, 1, length = 200))
set.seed(123)
example.data$y <- example.data$x^2 + rnorm(200, sd = 0.2)

library(modreg)
example.spline <- smooth.spline(example.data)

plot(example.data)
lines(example.spline)

X1 <- c(0.25, 0.75)
points(X1[1], X1[2], pch = "x")

# What's the point on the curve defined by example.spline
# that's closest to X1?

# Euclidean distance
MyDistance <- function(x1, x2) {
    sqrt(sum((x2 - x1)^2))
}

# For convenient re-use, a function that makes point-specific loss functions.
# (For some model objects, the predict method requires something
# like "predict(CurveObject, newdata = data.frame(x = s))".)
MakeLoss <- function(x, CurveObject) {
    function(s) MyDistance(x, c(s, predict(CurveObject, s)$y))
}

# Now make the loss function for X1...
MyLoss <- MakeLoss(X1, example.spline)

# ...and optimize. For a higher-dimensional problem, use "optim"
# rather than "optimize".
closest.x <- optimize(MyLoss, interval = c(-1, 1))$minimum

lines(c(X1[1], closest.x), c(X1[2], predict(example.spline, closest.x)$y))

Good luck!

-Jim Garrett
Becton Dickinson Diagnostic Systems
Baltimore, Maryland, USA
[R] Re: Problems for 13 year old
How about spam filtering? Granted, there's some infrastructure involved, which means gratification is not instant. But it involves something that most people who use computers care about: e-mail, and spam. I mention this because the following web site sparked some interest in statistics among acquaintances of mine who were otherwise very cool to it:

http://www.paulgraham.com/spam.html

This outlines a "Bayesian" spam filter. I'm not sure it's wholly Bayesian, but it comes close, the author's arguments are good, and I hear that it performs better than many commercial spam filters. Moreover, the web site virtually gushes about the virtues of statistical methods. The interesting thing about the filter is that you get to see what "features" it's discovering.

A quick search also indicated that Mozilla apparently offers a plug-in for the same spam filter. That would offer a quick way to get the filter up and running with real e-mail, though I don't know whether Mozilla exposes interesting diagnostics about which features it's using, which is the pedagogically interesting part. Mozilla mentions it here:

http://www.mozilla.org/mailnews/spam.html

Of course, you can use any number of classification techniques to distinguish spam from other e-mail; you just need data. _The Elements of Statistical Learning_ by Hastie, Tibshirani, and Friedman demonstrates a couple of types of models applied to the spam problem, and points to data at ftp.ics.uci.edu.

Ideally, you would do some exploration to design a filter, implement it in R, and then integrate it with your nephew's e-mail program. This would be a long-term project, maybe even a science-fair project, with long-term benefits (educational and practical). I know this can be done with Linux, but I have no idea about Mac OS 9! It's probably a stretch for typical 13-year-olds, but for the right 13-year-old, it would be a blast.

Good luck!

Jim Garrett
Baltimore, Maryland, USA
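For flavor, here's a toy sketch in R of the kind of per-token scoring the Graham filter does (the counts and function names are made up for illustration):

```r
## Per-token "spamminess": fraction of spam messages containing the token
## vs. fraction of good messages, clamped away from 0 and 1 so that a
## single token never decides a message by itself.
tokenSpamProb <- function(nSpam, nGood, totalSpam, totalGood) {
    s <- nSpam / totalSpam
    g <- nGood / totalGood
    pmin(0.99, pmax(0.01, s / (s + g)))
}

## Combine a message's most telling token scores, assuming independence
## (naive-Bayes style):
combineProbs <- function(p) prod(p) / (prod(p) + prod(1 - p))

## e.g., three tokens appearing in 80, 5, and 60 of 100 spam messages
## and in 2, 40, and 1 of 100 good messages:
p <- tokenSpamProb(c(80, 5, 60), c(2, 40, 1), 100, 100)
combineProbs(p)
```

The token scores themselves are the "features" mentioned above, and inspecting them is where the pedagogical fun is.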
[R] Re: pasting "\" into character strings
Is it really necessary to use "\"? On my Windows 2000 system here at work, it appears that the DOS shell interprets "/" as "\", even though "\" is the "official" Windows folder delimiter. For instance, I can type "C:/Program Files/R/rw1061/bin/Rgui.exe" (quotes included) and R opens. Using "/" would simplify matters, but are there pitfalls? Do all Windows systems exhibit this behavior?

Just curious.

Jim Garrett
Becton Dickinson Diagnostic Systems
Baltimore, Maryland, USA
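Within R itself on Windows, forward slashes do work in path strings, and they avoid the doubled backslashes that R's string escaping otherwise requires (the paths below are just for illustration):

```r
## These two calls name the same file; paths are illustrative only.
read.csv("C:/temp/data.csv")
read.csv("C:\\temp\\data.csv")   # "\\" is a single backslash in the string
```

A single unescaped "\" inside a quoted string is the real pitfall: R treats it as the start of an escape sequence, not as a folder delimiter.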