Re: [R] CPCA?

2007-11-26 Thread Jari Oksanen
Daniel Berner  mail.mcgill.ca> writes:

> 
> It would be great to know if and where R code for Common Principal
> Component Analysis is available.

The only code I know of is in the IMSL Fortran library, which is proprietary and
expensive. After a quick look, it appears that the IMSL C library does not have
the code, so you would have to go to Fortran. Look for function DKPRIN (KPRIN).
Naturally, that code cannot be ported to R.

Another issue is that the code was obviously written by Bernhard Flury, who
also gives a FORTRAN listing in his book "Common Principal Components and
Related Multivariate Models" (J. Wiley & Sons, 1988). I don't know the licensing
status of that code. I did not find any explicit licensing information in the
book (but may have overlooked something), apart from the usual book copyright
that forbids everything. I assume it is the same code that underlies DKPRIN in
the IMSL FORTRAN library. Using it would mean a lot of typing at minimum, and,
for any sensible implementation, editing the code to use LAPACK wherever possible.

Further, Flury describes the algorithm in his book, and it could be implemented
independently in R. I don't think anybody has done that (yet).

cheers, jari oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Scatterplot Showing All Points

2007-12-18 Thread Jari Oksanen
Wayne Aldo Gavioli  fas.harvard.edu> writes:

> 
> 
> Hello all,
> 
> I'm trying to graph a scatterplot of a large (5,000 x,y coordinates) of data
> with the caveat that many of the data points overlap with each other (share 
> the
> same x AND y coordinates).  In using the usual "plot" command,
> 
> > plot(education, xlab="etc", ylab="etc")
> 
> it seems that the overlap of points is not shown in the graph.  Namely, there
> are 5,000 points that should be plotted, as I mentioned above, but because so
> many of the points overlap with each other exactly, only about 50-60 points 
> are
> actually plotted on the graph.  Thus, there's no indication that Point A 
> shares
> its coordinates with 200 other pieces of data and thus is very common while
> Point B doesn't share its coordinates with any other pieces of data and thus
> isn't common at all.  Is there anyway to indicate the frequency of such points
> on such a graph?  Should I be using a different command than "plot"?
> 
> 
One suggestion still seems to be missing: 'sunflowerplot' in base R. It may look
tacky, though, if you have 200 "petals".
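
A minimal sketch of sunflowerplot and a size-coded alternative, on synthetic
data with heavy overlap (all names here are made up for illustration):

```r
set.seed(1)
# integer-valued coordinates, so many of the 5000 points coincide exactly
x <- sample(1:8, 5000, replace = TRUE)
y <- sample(1:8, 5000, replace = TRUE)

# one "petal" per duplicate observation at each location
sunflowerplot(x, y)

# with hundreds of duplicates per location, size-coding may be clearer
tab <- as.data.frame(table(x, y))
tab <- tab[tab$Freq > 0, ]
plot(as.numeric(as.character(tab$x)), as.numeric(as.character(tab$y)),
     cex = sqrt(tab$Freq) / 3,  # symbol size grows with frequency
     xlab = "x", ylab = "y")
```

xyTable() (used internally by sunflowerplot) returns the same frequency
information without plotting, if you prefer to draw the symbols yourself.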

Actually, the documentation of sunflowerplot is wrong in the botanical sense.
Sunflowers have composite flowers in capitula, and the things called 'petals' in
the documentation are ligulate, sterile ray-florets (each with vestigial petals
which are not easily visible in the sunflower, but in some other species you may
see three, occasionally two, teeth).

cheers, jari oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error on distance matrix

2008-01-10 Thread Jari Oksanen
Marc Moragues  scri.ac.uk> writes:

> 
> Hi,
> 
> I am trying to calculate a distance matrix on a binary data frame using
> dist.binary() {ade4}. This is the code I run and the error I get:
> 
> > sjlc.dist <- dist.binary(as.data.frame(data), method=2) #D = (a+d) /
> (a+b+c+d)
> Error in if (any(df < 0)) stop("non negative value expected in df") :
>  missing value where TRUE/FALSE needed
> 
> I don't know if the problem are the missing values in my data. If so how
> can I handle them?
> 
Dear Marc Moragues,

At least adding NA to a data frame gave me the same error message you report
above. The odds are good that NA is responsible (but we cannot know: we can only
guess). Further, it seems that ade4:::dist.binary has no option for handling NA
input. The problem is: what do you think should be done with NA? Should you get
an NA result? Should the whole observation be removed because of an NA? Should
the comparisons be based on pairwise omission of NA, meaning that index entries
within the same matrix are based on different data? Or should you impute values
for the missing entries (which is fun but tricky)?
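
Two of those options are easy to sketch in base R (the tiny data frame here is
purely illustrative):

```r
df <- data.frame(a = c(1, 0, NA, 1),
                 b = c(0, 1, 1, 0),
                 c = c(1, 1, 0, NA))

# option 1: drop every observation (row) that contains an NA
complete <- df[complete.cases(df), ]

# option 2: crude imputation, replacing NA by the column majority value (0/1)
mode01 <- function(v) as.integer(mean(v, na.rm = TRUE) >= 0.5)
imputed <- as.data.frame(lapply(df, function(v) ifelse(is.na(v), mode01(v), v)))
```

Either result can then be passed to dist.binary(); which option is defensible
depends on your data.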

One solution is the function designdist in vegan, where with some acrobatics you
can design your own dissimilarity indices. Function designdist uses a different
notation, because its author hates the misleading and dangerous 2x2
contingency-table notation. The following, however, seems to define the same
index as ade4:

designdist(data, "sqrt(1-(2*J+P-A-B)/P)")

See the documentation of vegan:::designdist for how to define indices there
(the sqrt(1-x) part comes from the way ade4 changes similarities into
dissimilarities).

BTW, don't call your data 'data'. R wisdom (see the fortunes package) tells you
not to call your dog "dog", but I'm not quite sure about this. At least in
yesterday's national horse races one of the winning horses was called 'Animal',
so why not...

cheers, jari oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error on distance matrix

2008-01-10 Thread Jari Oksanen
Jari Oksanen  oulu.fi> writes:

> 
> Marc Moragues  scri.ac.uk> writes:
> 
> > 
> > Hi,
> > 
> > I am trying to calculate a distance matrix on a binary data frame using
> > dist.binary() {ade4}. This is the code I run and the error I get:
> > 
> > > sjlc.dist <- dist.binary(as.data.frame(data), method=2) #D = (a+d) /
> > (a+b+c+d)
> > Error in if (any(df < 0)) stop("non negative value expected in df") :
> >  missing value where TRUE/FALSE needed
> > 
> > I don't know if the problem are the missing values in my data. If so how
> > can I handle them?
> > 
> Dear Marc Moragues,
> 
> At least adding NA to a data.frame gave the same error message as you report
> above. Odds are good for NA being responsible (but we cannot know: we only
> guess). 
...clip...
> One solution is to use function designdist in vegan where you can with some
> acrobary design your own dissimilarity indices. 

It is a bad habit to comment on one's own posts, but if you write stupid things
you should say so: that designdist() "handles" missing values is accidental,
unintentional and dangerous. You would still need to figure out *how* they are
handled, and the function should warn users who supply data with missing
values. I hope that will be changed.

>Function designdist uses
> different notations, because its author hates that misleading and dangerous 
> 2x2
> contingency table notation. The following, however, seems to define the same
> index as ade4:
> 
> designdist(data, "sqrt(1-(2*J+P-A-B)/P)")
> 
You can make this cuter:

designdist(data, "sqrt((A+B-2*J)/P)")

Cheers, Jari Oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] png() and pdf() alignment problems

2008-02-03 Thread Jari Oksanen
Dear Roland Kaiser,

I think there may be different pixel-to-point scaling in different devices. This
scaling seems to vary not only with the device but even with your physical
screen. This is what happens on my current box (MacBook):

 > quartz()
 > par("mar")/par("mai")
[1] 6 6 6 6
 > par("cra")
[1]  6 12
 > png()
 > par("mar")/par("mai")
[1] 4.686347 4.686347 4.686347 4.686347
 > par("cra")
[1] 13 16
 > dev.off()
 > pdf()
 > par("mar")/par("mai")
[1] 5 5 5 5
 > par("cra")
[1] 10.8 14.4
 > x11()
 > par("mar")/par("mai")
[1] 4.686347 4.686347 4.686347 4.686347
 > par("cra")
[1] 13 16

The quotient gives the height of a par margin line in inches, and par("cra")
gives the character dimensions in "rasters" (which may be close to the size in
points, but can differ a lot on some screens). For your alignment problem, I
think you want to make the margin proportions equal in the two devices, so you
must scale the figures instead of using physically equal dimensions. The
mar/mai ratio and cra do not change with device width and height, so you must
change the width and height until the figure looks like it should.

I have no idea whether this works, but I think this is still your problem...

Greetings to the Moderator,

Jari Oksanen


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] adonis (vegan package) and subsetted factors

2008-04-10 Thread Jari Oksanen

On 10 Apr 2008, at 18:57, tyler wrote:

> On Thu, Apr 10, 2008 at 04:47:44PM +0100, Gavin Simpson wrote:
>>
>> This behaviour arises from the following, using the in-built dune  
>> data:
>>
>>> newdune.env <- subset(dune.env, Management != "NM")
>>> newdune.env$Management
>> [1] BF SF SF SF HF SF HF HF BF BF HF SF SF HF
>> Levels: BF HF NM SF
>>
>> Notice this hasn't dropped the empty level "NM", and this is what is
>> catching out adonis --- it is not checking for empty levels in the
>> grouping factor, as this shows:
>>
>>> newdune <- dune[which(dune.env$Management != "NM"), ]
>>> adonis(newdune ~ Management*A1, data=newdune.env, permutations=100)
>>
>> Call:
>> adonis(formula = newdune ~ Management * A1, data = newdune.env,
>> permutations = 100)
>>
>>Df SumsOfSqs  MeanSqs  F.Model R2 Pr(>F)
>> Management 3.0   0.57288  0.19096  1.27735 0.2694  <0.01 ***
>>
>> For now, forcibly remove empty factor levels as per your second  
>> example,
>> but I'll take a look at fixing adonis()
>
> Great, thanks!
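
The interim workaround (forcibly dropping empty factor levels) can be sketched
like this; re-coercing with factor() does the trick:

```r
f <- factor(c("BF", "SF", "HF", "NM"))
sub <- f[f != "NM"]     # subsetting keeps the now-empty level "NM"
levels(sub)             # "BF" "HF" "NM" "SF"

sub2 <- factor(sub)     # re-coercing drops the unused level
levels(sub2)            # "BF" "HF" "SF"
```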

Howdy folks,

Fine that this was taken care of while my wife and I were watching Fellini's
8 1/2 on our laptop. Actually, the same problem recently emerged in varpart and
could be lurking elsewhere in vegan. With some luck, this fix would have made
its way into the latest minor CRAN release of vegan, which we had today.
Unfortunately the Earth is designed so that you folks over in Canada lag
behind, and you were still asleep when we made our release here in Europe. It
seems that CRAN was fast today, and the new version is already available as
source and as a Windows binary on the main CRAN site. With our faster release
cycles the fix will be on CRAN rather soon, and it is available immediately (or
from our morning, Canada's late night) in the packages at R-Forge.

Best wishes, Jari Oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] gam and ordination (vegan and labdsv surf and ordisurf)

2008-11-20 Thread Jari Oksanen
On Thu, 2008-11-20 at 09:45 -0500, stephen sefick wrote:
> #for instance this
> ordisurf(bug.4, env.savannah[,"TSS"]+env.savannah[,"TIN.TP"])
> 
Stephen,

According to the ordisurf documentation, this is correct if 'bug.4' is an
ordination result and the sum of those two env.savannah columns is the single
variable you want to plot. The independent variables will be the axes found in
'bug.4', and there can be only one response variable, but that can be a sum of
several variables. Whether such a model is sensible is up to the user.

If you want to plot several y-variables in the same graph instead of their sum,
you can use the argument add = TRUE for the second variable (default FALSE),
and ordisurf() will overlay the fitted surface on an existing plot.
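
A minimal sketch with vegan's built-in dune data; 'ord' and the two variables
below stand in for 'bug.4' and the env.savannah columns:

```r
library(vegan)
data(dune)
data(dune.env)

ord <- rda(dune)                    # any ordination result works as 'x'
ordisurf(ord, dune.env$A1)          # fitted surface for one variable
ordisurf(ord, as.numeric(dune.env$Moisture),
         add = TRUE, col = "blue")  # second surface overlaid on the same plot
```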

Cheers, Jari Oksanen 
-- 
Jari Oksanen <[EMAIL PROTECTED]>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Plotting two distance matrices

2007-10-10 Thread Jari Oksanen

> 
> I'm trying to plot two distance matrices against each other
> (74x74,phylogenetic distance and phenotypic distances). However R
> gives an error message:"Error in plot.new() : figure margins too
> large".
> 
> Is it because I have too many points to plot?

No. It *sounds* like you don't have distance matrices but data frames of
distances, because that is the error message you then get. The default plot of
a data frame is a pairs() plot of all columns (variables) against each other.
You will have a data frame if you read your distances from an external file (or
converted them into a data frame). Try changing your data sets into regular R
dist objects and plot those against each other (as.dist really does seem to
work directly on data frames):

plot(as.dist(d1), as.dist(d2))

This is only a guess, since the message did not contain sufficient information.
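
A small sketch of the conversion, with two synthetic distance tables standing
in for the real 74x74 data:

```r
set.seed(1)
pts <- matrix(rnorm(20), nrow = 10)

# distances stored as data frames, as they might arrive via read.table()
d1 <- as.data.frame(as.matrix(dist(pts)))
d2 <- as.data.frame(as.matrix(dist(pts * 2)))

plot(as.dist(d1), as.dist(d2))  # one point per pair of observations
```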

cheers, jari oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Canberra distance

2007-10-16 Thread Jari Oksanen
Frédéric Chiroleu  cirad.fr> writes:

> 
> Hi,
> 
> I misunderstand the definition of Canberra distance in R.
> 
> On Internet and in function description pages of dist() from stats and 
> Dist() from amap, Canberra distance between vectors x and y, d(x,y), is :
> 
> d(x,y) = sum(abs(x-y)/(x+y))
> 
> But in use, through simple examples, we find that the formula is :
> 
> d(x,y) = (NZ + 1)/NZ * sum(abs(x-y)/(x+y))
> 
> with NZ = nb of pairs of coordinates that are different from (0,0) (Non 
> Zeros)
> 
I think you should try another example. At least in my simple experiments the
multiplier seemed to be NZ/NZ, i.e. one, instead of your almost-one, and this
is also the documented behaviour. I could not find any difference from the
documentation. However, there is a note about "double zeros" (zero denominator
and numerator) in the dist documentation. Could that cause the difference you
see?
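
A quick check of the documented formula on vectors with no zero pairs (so the
double-zero rule does not enter):

```r
x <- c(1, 2, 3)
y <- c(2, 4, 1)

by_hand <- sum(abs(x - y) / (x + y))  # 1/3 + 1/3 + 1/2
by_dist <- as.numeric(dist(rbind(x, y), method = "canberra"))

all.equal(by_hand, by_dist)  # TRUE: no extra multiplier in this case
```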

If you really want to know how the distance is calculated, download the R
source file and look there. If you want to know how the index was originally
suggested to be calculated, you must find the Lance & Williams paper in Aust.
Comput. J. 1, 15-20, 1967 (I haven't found it, but would be curious to see it).

Cheers, jari oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] prcomp(X,center=F) ??

2009-03-08 Thread Jari Oksanen

Dear Agustin & the Listers,

Noncentred PCA is an old and established method. It is rarely used, but still
(methinks) it is used more often than it should be. There is nothing wrong with
having noncentred PCA in R, and it is a real PCA. Details follow.


On 08/03/2009, at 11:07 AM, Agustin Lobo wrote:


> I do not understand, from a PCA point of view, the option center=F
> of prcomp().
>
> According to the help page, the calculation in prcomp() "is done by
> a singular value decomposition of the (centered and possibly scaled)
> data matrix, not by using eigen on the covariance matrix" (as is
> done by princomp()).
>
> "This is generally the preferred method for numerical accuracy"
>
> The question is that while prcomp(X,center=T,scaling=F) is
> equivalent to princomp(X,scaling=F), prcomp(X,center=F) has no
> equivalent in princomp().


princomp() does not offer an explicit noncentred analysis because its algorithm
is based on the eigenanalysis of a covariance matrix computed with function
cov.wt(). While cov.wt() knows the argument 'center' (which is the US spelling
of centre), princomp() does not pass that argument on to cov.wt(). However, if
you supply a noncentred 'covmat' instead of 'x' (see ?princomp), princomp()
will give you a noncentred PCA.
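
A sketch of that covmat route, checked against prcomp(center = FALSE); the
n - 1 divisor below matches prcomp's convention:

```r
set.seed(1)
X <- matrix(rnorm(200), nrow = 100, ncol = 2)

ncov <- crossprod(X) / (nrow(X) - 1)  # noncentred "covariance" matrix
p1 <- princomp(covmat = ncov)         # noncentred PCA via covmat
p2 <- prcomp(X, center = FALSE)

all.equal(unname(p1$sdev), p2$sdev)   # the two routes agree
```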


We can only guess why the authors of princomp() did not allow a noncentred
analysis. However, I know why the author of the rda() function in vegan did not
explicitly allow it: he thought it is a Patently Bad Idea(TM). Further, he
suggests in the vegan documentation that if you think you need a noncentred
analysis, you are able to edit vegan::rda.default() to do so; if you cannot
edit the function, then you probably are wrong in assuming you need a
noncentred analysis. The princomp authors may think the same way (but they too
allow a noncentred analysis if you supply a noncentred 'covmat').


One of the sins of my youth was to publish a paper using noncentred PCA. I am
penitent. (However, I committed bigger sins, too.)


> Also, the rotation made with either the eigenvectors of
> prcomp(X,center=T,scaling=F) or the ones of princomp(X,scaling=F)
> yields PCs with a minimum correlation, as expected for a PCA.
> But the rotation made with the eigenvectors of prcomp(X,center=F)
> yields axes that are correlated.
>
> Therefore, prcomp(X,center=F) is not really a PCA.


PCA axes are *orthogonal*, but not necessarily uncorrelated (people often
mistake these for synonyms). Centred orthogonal axes are also uncorrelated, but
noncentred orthogonal axes are (or may be) correlated.


Your example code can be simplified a bit: prcomp returns the rotated matrix,
so you do not need the rotation %*% data multiplication after the analysis: you
get it directly from the result. Below I use multivariate normal random numbers
(MASS::mvrnorm) to generate correlated observations:

library(MASS)
x <- mvrnorm(100, mu = c(1, 3), Sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2))
pc <- prcomp(x, center = FALSE)

Here the scores are orthogonal:

crossprod(pc$x)

(the off-diagonal elements are numerically zero), but they are correlated:

cor(pc$x)

The only requirement is orthogonality; uncorrelatedness is collateral damage of
the centred analysis.


Cheers, Jari Oksanen



> See the following example, in which the second column of the
> data matrix X is linearly correlated with the first column:
>
> > X <- cbind(rnorm(100,100,50),rnorm(100,100,50))
> > X[,2] <- X[,1]*1.5-50 +runif(100,-70,70)
> > plot(X)
> > cor(X[,1],X[,2])
> [1] 0.903597
>
> > eigvnocent <- prcomp(X,center=F,scaling=F)[[1]]
> > eigvcent <- prcomp(X,center=T,scaling=F)[[1]]
> > eigvecnocent <- prcomp(X,center=F,scaling=F)[[2]]
> > eigveccent <- prcomp(X,center=T,scaling=F)[[2]]
>
> > PCnocent <- X%*%eigvecnocent
> > PCcent <- X%*%eigveccent
> > par(mfrow=c(2,2))
> > plot(X)
> > plot(PCnocent)
> > plot(PCcent)
>
> > cor(X[,1],X[,2])
> [1] 0.903597
> > cor(PCcent[,1],PCcent[,2])
> [1] -8.778818e-16
> > cor(PCnocent[,1],PCnocent[,2])
> [1] -0.6908334
>
> Also the help page of prcomp() states:
> "Details
>
> The calculation is done by a singular value decomposition of the
> (centered and possibly scaled) data matrix..."
>
> The parenthesis implies some ambiguity, but I do interpret the
> sentence as indicating that the calculation should always be done
> using a centered data matrix. Finally, all the examples in the help
> page use centering (or scaling, which implies centering).
>
> Therefore, why the option center=F?


There really is a small conflict between the docs and the function: the
function allows a noncentred analysis, and it also performs such an analysis if
asked. This is documented for the argument "center", but the option is later
ignored in the passage you cite, probably because the authors of the function
think that this offered option should not be used. You may submit...

Re: [R] CCA - manual selection

2009-03-20 Thread Jari Oksanen
Fabricius Domingos  gmail.com> writes:

> 
> Hello,
> 
> I am trying to obtain F-values for the explanatory (independent) variables of
> a CCA performed with the vegan package, to see which of them have a
> significant influence on my dependent variables (like the manual selection in
> Canoco), but I can't find any function (or package) that does such a thing.
> The dependent variables are species data, and the independent ones are
> environmental data.
> 
> Thank you.
> 
Dear Fabricius Domingos,

Use add1/drop1 on the vegan object with the argument test = "permutation". For
adding constraints, you need a 'scope' of variables to be considered for
addition to the model.

The following example shows how to do this with the vegan package and one of
its standard data sets:

> library(vegan)
> data(mite)
> data(mite.env)
> mbig <- cca(mite ~ ., mite.env)
> m0 <- cca(mite ~ 1, mite.env)
> add1(m0, scope = formula(mbig), test="perm")
> m0 <- update(m0, . ~ . + Topo)
> add1(m0, scope = formula(mbig), test="perm")
> m0 <- update(m0, . ~ . + WatrCont)

You can also see if some of the included variables should be dropped:

> drop1(m0, test="perm")

The usage of add1.cca is explained on its help page (see ?add1.cca).

Cheers, Jari Oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to calcualte Jaccard Coefficient

2009-03-25 Thread Jari Oksanen
Greg  gmail.com> writes:

> 
> You could try 'vegdist()' in the 'vegan' package.
> 
> 
Yep, you could. However, if you want the Jaccard index for binary data although
your data are not binary, you must set binary = TRUE in vegan::vegdist.

Indeed, the greatest problem with recommending Jaccard is that a huge number of
packages have this index, and any recommendation will probably miss some of
them. I think at least the following packages have the Jaccard index: ade4,
analogue, ecodist, labdsv, and probably many more.
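
A minimal vegan sketch; the binary = TRUE switch is the important part:

```r
library(vegan)
data(varespec)  # quantitative community data shipped with vegan

d_quant <- vegdist(varespec, method = "jaccard")                 # quantitative form
d_bin   <- vegdist(varespec, method = "jaccard", binary = TRUE)  # presence/absence

range(d_bin)  # binary Jaccard dissimilarities lie in [0, 1]
```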

Cheers, Jari Oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Kruskal's MDS results

2009-04-17 Thread Jari Oksanen
Dieter Vanderelst  ua.ac.be> writes:

> 
> A few people suggested taking a look at Ripley's book MASS. I know the formula
listed there.
> 
> The point is that the manual for the isoMDS function says it's stress output
is in "percent". Does this mean,
> the stress reported by isoMDS is just the stress value in MASS (which ranges
from 0 to 1) value multiplied by
> 100? I've haven't been able to find any resource that expresses stress in
values from 0 to 100. So, this
> would be a convention introduced by the authors of the package?

Yes, it is the stress multiplied by 100, which makes it "per cent" instead of
"per unit". You can also see this in the software. The extract is from the
VR_mds_fn function in MASS.c:


sstar = 0.0;
tstar = 0.0;
for (i = 0; i < n; i++) {
    tmp = y[i] - yf[i];
    sstar += tmp * tmp;
    tstar += y[i] * y[i];
}
ssq = 100 * sqrt(sstar / tstar);

The last line does the multiplication; the previous lines collect the terms for
the stress.

Best wishes, Jari Oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Kruskal's MDS results

2009-04-17 Thread Jari Oksanen
Dieter Vanderelst  ua.ac.be> writes:

> 

> The point is that the manual for the isoMDS function says it's stress output
is in "percent". Does this mean,
> the stress reported by isoMDS is just the stress value in MASS (which ranges
from 0 to 1) value multiplied by
> 100? I've haven't been able to find any resource that expresses stress in
values from 0 to 100. So, this
> would be a convention introduced by the authors of the package?
>
A comment about the novelty of using percentages. I also had a look at some
NMDS resources, and the first I found were two of Kruskal's papers that
happened to be on my desk (Psychometrika 29, 1-27 and Psychometrika 29,
115-129, both from 1964). Both expressed stress as percentages. So this is
certainly not a convention introduced by the authors of the package, who are
much too young to have done that before 1964.

Cheers, Jari Oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Re : PCA and automatic determination of the number of components

2009-04-20 Thread Jari Oksanen
justin bem  yahoo.fr> writes:

> 
> See ade4 or mva package.
>  Justin BEM
> BP 1917 Yaoundé
> 
I guess the problem was not finding PCA (which is easy to find), but finding an
automatic method of selecting ("determining" sounds as if the selection would
be correct in some objective sense) the number of components to retain. I think
neither ade4 nor mva gives much support here (in particular the latter, which
no longer exists).

The usual place to look is the Multivariate task view:

http://cran.r-project.org/web/views/Multivariate.html

Under the heading "Projection methods", and there under "Principal components",
the task view mentions the packages nFactors and paran, which help in selecting
the number of components to retain.

Are these task views really so invisible in R that people don't find them?
Usually they are the first place to look when you need something you don't
have. In statistics, I mean. If they are invisible, could they be made more
visible?

Cheers, Jari Oksanen

> 
> From: nikolay12  gmail.com>
> To: r-help  r-project.org
> Sent: Monday, 20 April 2009, 04:37:41
> Subject: [R] PCA and automatic determination of the number of components
> 
> Hi all, 
> 
> I have relatively small dataset on which I would like to perform a PCA. I am
> interested about a package that would also combine a method for determining
> the number of components (I know there are plenty of approaches to this
> problem). Any suggestions about a package/function?
> 
> thanks,
> 
> Nick

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Principle components analysis on a large dataset

2009-08-20 Thread Jari Oksanen
Moshe Olshansky  yahoo.com> writes:

> 
> Hi Misha,
> 
> Since PCA is a linear procedure and you have only 6000 observations, you do
not need 68000 variables. Using
> any 6000 of your variables so that the resulting 6000x6000 matrix is
non-singular will do. You can choose
> these 6000 variables (columns) randomly, hoping that the resulting matrix is
non-singular (and
> checking for this). Alternatively, you can try something like choosing one
"nice" column, then choosing
> the second one which is the mostly orthogonal to the first one (kind of
Gram-Schmidt), then choose the
> third one which is mostly orthogonal to the first two, etc. (I am not sure how
much rounoff may be a problem-
> try doing this using higher precision if you can). Note that you do not need
to load the entire 6000x68000
> matrix into memory (you can load several thousand columns, process them and
> discard them).
> Anyway, you will end up with a 6000x6000 matrix, i.e. 36,000,000 entries,
which can fit into a memory and you
> can perform the usual PCA on this matrix.
> 
> Good luck!
> 
> Moshe.
> 
> P.S. I am curious to see what other people think.
> 
I think this will give you *a* principal component analysis, but it won't give
you *the* principal component analysis, in the sense that the first principal
component would account for a certain proportion of the total variance etc. If
you try this, you will see that each random sample has different eigenvalues,
different proportions of eigenvalues, and a different sum of all eigenvalues,
just as you would expect for different data sets.

I even failed to create the raw data matrix of dimensions 68000 x 6000 (Error:
cannot allocate vector of size 3.0 Gb).
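
For the record, when observations are far fewer than variables, the exact
centred PCA can be obtained from the small n x n crossproduct matrix instead of
the huge p x p covariance matrix; a downsized sketch (20 x 500 standing in for
6000 x 68000):

```r
set.seed(1)
n <- 20; p <- 500
X <- matrix(rnorm(n * p), n, p)
Xc <- scale(X, center = TRUE, scale = FALSE)

# eigenanalysis of the n x n matrix: same nonzero eigenvalues as the p x p one
e <- eigen(tcrossprod(Xc) / (n - 1), symmetric = TRUE)

pc <- prcomp(X)
# the first n - 1 eigenvalues should agree with prcomp's variances
all.equal(e$values[1:(n - 1)], pc$sdev[1:(n - 1)]^2)
```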

Cheers, Jari Oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Ellipse: Major Axis Minor Axis

2009-09-01 Thread Jari Oksanen
Vishal  gmail.com> writes:

> 
> I have a matrix(3000x2) of numbers and I have plotted a scatterplot
> (defined in the ``car'' library.)
> 
> scatterplot(r$V1,r$V2,ellipse=TRUE)
> 
> The ellipse plotted is an error ellipse.
> 
> I want the find the major(semi), minor(semi) minor axis length of this
> ellipse.

> 
If you have the ellipse from cov.trob(), then you can get the semi-axis lengths
like this:

sqrt(eigen(cov.trob(mydataforellipse)$cov)$values)
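
Fleshed out slightly (the 95% radius here is an assumption about the confidence
level used for the ellipse):

```r
library(MASS)

set.seed(1)
xy <- cbind(rnorm(3000), 0.5 * rnorm(3000))

S  <- cov.trob(xy)$cov                   # robust covariance of the point cloud
ev <- eigen(S, symmetric = TRUE)$values  # sorted in decreasing order

r <- sqrt(qchisq(0.95, df = 2))          # radius multiplier for a 95% ellipse
semi_major <- r * sqrt(ev[1])
semi_minor <- r * sqrt(ev[2])
```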

Cheers, Jari Oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] "biplot" graphical options?

2009-09-02 Thread Jari Oksanen
Manca Marco (PATH  path.unimaas.nl> writes:
> I am struggling in the attempt to impose some graphical conditions (changing
point symbols, colors, etc)
> to biplot function (I am using it to visualize the results of princomp) but I
can't apparently manage to
> change anything but the axis... and I have been browsing manuals and vignettes
without finding any
> explicit suggestions on how to operate...
> 
> Can anyone, please, point my attention to the relevant documentation?

Howdy,

The relevant documentation in this case is ?biplot. The documentation to the
specific method is less useful (?biplot.princomp). That documentation tells you
that you can't do what you want to do. No choice. For instance, col (for colour)
can be of length 2 giving different colours for each set of points, but you
cannot vary the colour within the set. Further, you cannot turn off using arrows
and text or even suppress the plotting and add your own points. 

If you use RSiteSearch("biplot") you can find some packages with more
flexibility than stats:::biplot.princomp and stats:::biplot.default. Another
alternative is to write your own biplot function, or to draw the plot by hand
using basic graphics commands.
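
A hand-rolled sketch of the usual biplot ingredients with full graphical
control (the grouping and the arrow scaling factor are arbitrary choices for
illustration):

```r
pc <- prcomp(USArrests, scale. = TRUE)

grp <- cut(USArrests$UrbanPop, 2)    # illustrative two-group colouring
plot(pc$x[, 1:2], col = c("blue", "red")[grp], pch = 19)

# variable loadings as arrows, scaled by an arbitrary factor for visibility
arrows(0, 0, 3 * pc$rotation[, 1], 3 * pc$rotation[, 2], length = 0.1)
text(3.3 * pc$rotation[, 1:2], labels = rownames(pc$rotation))
```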

Best wishes, Jari Oksanen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How can I change characteristics of a cca biplot in R

2009-09-08 Thread Jari Oksanen
Gavin Simpson  ucl.ac.uk> writes:

> 
> On Mon, 2009-09-07 at 12:49 +0300, Rousi Heta wrote:
> > 
> > Hi,
> > 
> > I´m doing cca for a community data set in R and I have made a biplot
> > for my data. Otherwise everything seems to be allright but the biplot
> > is so messy I can´t read it well enough or publish it. I would like to
> > get the row numbers out of the plot: I want the species position and
> > the environmental variables in it as vectors but not the station
> > numbers. How do I get them out? 
> 
> Is this in vegan? If so, if obj contains your fitted cca model, then
> 
> plot(obj, display = c("species","bp"))
> 
> (or, if you have centroids for factor variables as well)
> 
> plot(obj, display = c("species","bp","cn"))
> 
> If you only want to plot the points rather than the labels, add argument
> 'type' to the call setting it to "p", e.g.:
> 
> plot(obj, display = c("species","bp"), type = "p")
> 
> > 
> > I have a raw species data set and a raw environmental data set and I
> > don't have station names or numbers in the data set but R puts the row
> > names in anyway. I understand they are necessary but I just don't want
> > to show them in the biplot.
> 
> All of this (and more) is explained in ?plot.cca (if you are talking
> about vegan::cca)
> 
> You might also like to look at ?orditkplot (which allows you to move the
> labels around yourself to get a clear plot) and ?orditorp with which you
> can set a priority for certain sites to be labelled with text whilst
> points are used for those sites that would cause text labels to overlap.
> 
Heta,

Indeed, if this is the cca of vegan (and not one of the other cca's), you may
also consider reading the documentation. You can use

vegandocs("intro")

which deals with cluttered plots in chapter 2.1, or

vegandocs("FAQ")

and read points 2.1.12 and 2.1.13.

In addition to the alternatives that Gav listed, these also mention the functions
ordipointlabel (whose result can be further processed with orditkplot) and
ordilabel. You can also see the plot.cca help page, which has some examples.
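A minimal sketch of the ordilabel approach, modelled on the vegan examples (the
dune data ship with vegan; priority controls which labels are drawn on top):

```r
## Labels on opaque backgrounds, so the top ones remain readable
library(vegan)
data(dune)
obj <- cca(dune)
plot(obj, type = "n")                      # empty ordination frame
ordilabel(obj, display = "sites", fill = "hotpink", col = "blue")
## species labels, with the most abundant species drawn last (on top):
ordilabel(obj, display = "species", font = 3, priority = colSums(dune))
```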

Cheers, Jari Oksanen



Re: [R] NA in cca (vegan)

2009-09-08 Thread Jari Oksanen
Gavin Simpson  ucl.ac.uk> writes:

> 
> On Fri, 2009-09-04 at 17:15 +0200, Kim Vanselow wrote:
> > Dear all,
> > I would like to calculate a cca (package vegan) with species and
> > environmental data. One of these environmental variables is
> > cos(EXPOSURE).
> > The problem: for flat releves there is no exposure. The value is
> > missing and I can't call it 0 as 0 stands for east and west.
> > The cca does not run with missing values. What can I do to make vegan
> > cca ignoring these missing values?
> > Thanks a lot,
> > Kim
> 
> 
> This is timely as Jari Oksanen (lead developer on vegan) has been
> looking into making this happen automatically in vegan ordination
> functions. The solution for something like cca is very simple but it
> gets more complicated when you might like to allow features like
> na.exclude etc and have all the functions that operate on objects of
> class "cca" work nicely.
> 
> For the moment, you should just process your data before it goes into
> cca. Here I assume that you have two data frames; i) Y is the species
> data, and ii) X the environmental data. Further I assume that only one
> variable in X has missings, lets call this Exposure:
> 
Kim,

A test version of NA handling in cca is now in the development version of vegan
at http://vegan.r-forge.r-project.org/. You may get the current source code or
slightly stale packages from that address (when writing this, the packages are
two to three days behind the current devel version). Instructions for
downloading the working version of vegan can be found on the same web site.  

Basically the development version does exactly the same thing as Gavin showed
you in his response: a "listwise" elimination of missing values. Indeed, it may
be better to do that manually and knowingly than to rely on perhaps surprising
automatic handling of missing values within the function. 

Your missing values are somewhat weird, as they are not really missing (= unknown
and unobserved); you just decided to use a coding system that does not cope
with your well-known and measured values. I would prefer to find a coding that
puts flat ground together with exposures giving similar conditions. In no case
should these values be regarded as NA, since they are available and known, and
censoring them from your data may distort your analysis. Perhaps having a new
variable (hasExposure, TRUE/FALSE) and coding flat releves as east/west (= 0) in
Exposure would make more sense. Indeed, the model term hasExposure*Exposure
would make sense, as this would separate flat ground from slopes of different
exposures. The interaction term and aliasing would take care of keeping flat
ground, with its known values, separate from exposed slopes.
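A sketch of that recoding (all names here are made up: `env` is the
environmental data frame, `spe` the species data, and `aspect` the measured
exposure in degrees, with NA marking flat ground):

```r
## Recode flat ground as known (hasExposure = FALSE) instead of NA
env$hasExposure <- !is.na(env$aspect)
env$Exposure <- ifelse(env$hasExposure, cos(env$aspect * pi / 180), 0)
## The interaction separates flat releves from sloping ones;
## no rows need to be dropped from the analysis:
mod <- cca(spe ~ hasExposure * Exposure, data = env)
```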

Cheers, Jari Oksanen



Re: [R] Mantel test least square line

2009-09-08 Thread Jari Oksanen
swertie  voila.fr> writes:
> 
> Hello, I performed a Mantel test and plotted communitiy similarities. I
> would like to add a least square line. I thought about using abline taking
> as slope the r-statistic of the Mantel test and calculating the y-intercept
> analytically. Is this method correct? Is there any function for this
> calculation? Thank you

If you have a Mantel statistic for two dist objects (as produced by dist(), 
as.dist(), or compatible functions), you can just use

abline(lm(ydist ~ xdist))

because a dist object is a vector with some extra attributes.

Of course, this does not quite make sense, since distances do not admit a least-
squares fit in any reasonable sense. People do it all the time, though 
(ecologists, and aquatic ecologists in particular, I mean).
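In full, a minimal sketch with made-up data (the variable names are arbitrary):

```r
## Scatter of two sets of pairwise distances with a least-squares line
set.seed(1)
xy <- matrix(rnorm(40), ncol = 2)      # e.g. site coordinates
sp <- matrix(rpois(60, 3), ncol = 3)   # e.g. community counts
xdist <- dist(xy)                      # geographic distances
ydist <- dist(sp)                      # community dissimilarities
plot(xdist, ydist)
abline(lm(ydist ~ xdist))              # works: dist objects are vectors
```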

Cheers, Jari Oksanen



Re: [R] Ellipse: Major Axis Minor Axis

2009-09-08 Thread Jari Oksanen
Vishal  gmail.com> writes:

> 
> Jari, thanks for the quick answer.
> 
> > sqrt(eigen(cov.trob(mydataforellipse)$cov)$values)
> 
> what will this return?
> 
> For my data, I get:
> 
> > sqrt(eigen(cov.trob(r)$cov)$values)
> [1] 1.857733e-05 4.953181e-06
> 
> Is this Left hand value the major or the semi major length?
> 
> I also try to plot a circuit keeping this as the radius/diameter
> I don't get a circle that intersects the major axis of the ellipse.
> 
Vishal,

Then you probably do not have an ellipse directly defined by the 
robust covariance matrix; rather, it is scaled up by some statistic.
Check the code of the function (or, in a happy case, its documentation)
to see what the scaling factor is. 

I checked by drawing an ellipse directly from the cov.trob covariance
matrix, and both circles fit nicely.
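To illustrate with simulated data (cov.trob and mvrnorm are in MASS; the square
roots of the eigenvalues are the semi-axis lengths, largest first):

```r
## Semi-axis lengths of the ellipse defined by a robust covariance matrix
library(MASS)
set.seed(42)
r  <- mvrnorm(200, mu = c(0, 0), Sigma = matrix(c(4, 1, 1, 1), 2))
ev <- eigen(cov.trob(r)$cov)
semi <- sqrt(ev$values)   # semi-major, then semi-minor axis length
ev$vectors[, 1]           # direction of the major axis
```

So the left-hand value in your output is the semi-major axis length, and a
circle with that radius should touch the ellipse at the ends of its major axis,
provided the ellipse is not scaled up by an extra statistic.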

Cheers, Jari Oksanen



Re: [R] (no subject)

2009-09-09 Thread Jari Oksanen
Hello Kim & Gav,

Gavin Simpson  ucl.ac.uk> writes:

> 
> On Wed, 2009-09-09 at 14:43 +0200, Kim Vanselow wrote:
> > Dear r-Community,
> > Step1: I would like to calculate a NMDS (package vegan, function
> > metaMDS) with species data.
> > Step2: Then I want to plot environmental variables over it, using
> > function envfit.
> > The Problem: One of these environmental variables is cos(EXPOSURE).
> > But for flat releves there is no exposure. The value is missing and I
> > can't call it 0 as 0 stands for east and west. Therefore I kicked all
> > releves with missing environmental variables. Both, metaMDS and envfit
> > then work without problems.
> > Now I want to bring the releves with missing environmetal variables
> > back to my ordination-plot.
> > 
> > Gavin Simpson gave me the advice to use the predict-function for the
> > same missing value problem when I was calculating a cca. This worked
> > without problem.
>

> 
> Also note that Jari has commented on how you are coding your Exposure
> variable; I glossed over that bit when providing my response and you
> should probably rethink your coding along the lines Jari suggested.
>
Yes, and Dylan Beaudette gave some more concrete ideas in this thread 
(provided it is exposure to the sun, and not to, say, wind or a pollution source).
 
> There isn't a predict function for metaMDS() because there aren't rules
> or relationships that would allow you to do what predict.cca does but
> for a nMDS. In the CCA case we were estimating the locations in
> ordination space for the left-out samples on the basis of their species
> composition and computing their site score as a weighted average of the
> species scores determined from the data that went into building the CCA.
>
Actually, you can add new points to NMDS. I once wrote a function for a user
who asked for this, but I did not include it in vegan, because using it needed
great technical skill and it was too tricky for a general package. For instance, 
you need a rectangular dissimilarity matrix between the new points and the 
old points (which you can get with Gav's distance() function in analogue). 

> 
> What you now want to do is superimpose upon that plot the env data where
> you might have missingness. envfit doesn't allow missings and it is not
> immediately clear to me how it might be modified to do so, certainly not
> without a lot of changes.
>
True. Seems to be bad design. You would have to change four functions and the
user interface to support this. Even then it would only be na.rm behaviour
(removing rows with any missing data), and if that is what you want, you can do
it by hand:

scor <- scores(mymetaMDSresult)
keep <- complete.cases(myenvdata)
envfit(scor[keep, ] ~ ., myenvdata[keep, ])
 
> Instead, ordisurf() may be more useful, but you will loose the nice
> "plot all vectors on the ordination at once" feature of envfit (you gain
> a lot with ordisurf though as there is no reason to assume anything is
> linear across an nMDS configuration).
> 
> A cursory look at the guts of ordisurf indicates that it can handle
> missing values in the (env) variable you wish to overlay onto the nMDS
> plot; the data is passed to mgcv::gam using the formula interface which
> deals with the NA.
>
Yes, this is true (in most cases).
 

> 
> You could also try capscale() also in vegan. This is like CCA and RDA
> but allows you to use any dissimilarity coefficient. It is a bit like a
> constrained Principal Coordinates Analysis. This can use the rda method
> for predict to do what you did with the CCA earlier.
> 
However, it does not handle missing values directly (not even in the R-Forge
version, and there are no immediate plans to change this). So you must
remove missing data rows manually.

Cheers, Jari Oksanen
