Hi All,
I have two functions (attached) which I would like to submit to
Octave-Forge. The first, poolstd.m, is a super simple function to pool
a set standard deviations. The second function performs a principal
component analysis akin to that provided by SPSS. This function can
accept several flags to use standard methods for identifying the number
of principal components to include.
I can provide a non-public data set for testing pca.m if needed.
Please let me know what modifications I should make to either so that
they can be added to the stats package.
Best,
Kyle
% Compute Pooled Standard Deviation from two vector arrays of precomputed
standard deviations and sample sizes.
%
% sd = poolstd(sd, n)
% where:
% sd = [sd_1, sd_2, ..., sd_k] % the standard deviation of several samples
from k seperate groups/subjects
% n = [n_1, n_2, ..., n_k] % the sample size of those k seperate
groups/subjects
%
% (n_1 - 1)sd_1^2 + (n_2 - 1)sd_2^2 + ... + (n_k - 1)sd_2^k
% SD^2 = ---------------------------------------------------------
% n_1 + n_2 + ... + n_k - k
%
% More information can be found at http://en.wikipedia.org/wiki/Pooled_variance
%
% Decemeber 2011 - Kyle Winfree, kyle.winf...@gmail.com
% This is free software (GPL).
function sd_pooled = poolstd(sd, n)
k = length(n);
sd_pooled = sqrt(sum((n - 1) .* (sd .^ 2)) / (sum(n) - k));
% testing:
% x1 = [2 4 5 2 1 3 1 3 4]
% x2 = [2 4 4 5 7 6 3 2 7]
% x3 = [3 5 9 7 1]
% sd = [std(x1), std(x2), std(x3)] % (1.3944 1.9437 3.1623)
% n = [length(x1), length(x2), length(x3)] % (9 9 5)
% k = length(n) % (3)
% sd_pooled = sqrt(sum((n - 1) .* (sd .^ 2)) / (sum(n) - k))
% function [D_ordered, return_n, pc, V, loadings_n] = pca(X, {flags})
%
% Perform Principal Component Analysis of X similar to that which would be
provided by programs like SPSS.
%
% flags can be as follows:
% pca(X, limit_type)
% where limit_type = {'all' | 'kaiser' | 'pa' | 'map' | 'scree'}
% all - return all components, this is the default if limit_type if
only X is provided
% kaiser - drop all eigenvalues with a value less than 1.0 (Kaiser
Criterion)
% pa - perform a parallel analysis and keep only those that provide
eigenvalues with a value better than that of a matched randomized set
% map - perform a Velicer's Minimum Average Partial (MAP) test, keep
those according to the year 2000 standard of map^4
% pca(X, limit_type, threshold)
% where limit_type = {'varaince' | 'first'}
% variance - return only enough to account for the amount of variance
specified by threshold
% first - return the first n components where n is the value of
threshold
% pca(X, limit_type, threshold, verbosity)
% where verbosity = {1 | 0}
% 1 - verbose
% 0 - quiet
% note that a threshold value must be provided if verbosity is specified.
For limit_types of all, kaiser, pa, or map, this threshold will have no effect.
%
% limit_type(s) of kaiser, pa, map, and scree will all produce a plot unless
verbosity is set to 0.
%
% Authors: Kyle Winfree (kyle.winf...@gmail.com)
% This program is free software (GPL).
function varargout = pca(varargin)
% check inputs to see if basic pca is requested, or if the number of
principal components to be returned should be some how shortened (Scree, PA,
MAP, etc).
switch nargin
case 1
X = varargin{1};
limit = 'all';
verbose = 0;
case 2
X = varargin{1};
limit = varargin{2}; % this can be a keyword only
verbose = 1;
case 3
X = varargin{1};
limit = varargin{2};
threshold = varargin{3};
verbose = 1;
case 4
X = varargin{1};
limit = varargin{2};
threshold = varargin{3};
verbose = varargin{4};
otherwise
error('Incorrect number of inputs given')
end
% report some basic descriptives, it is assumed that nanmean and nanstd
are available since this function is bound for the statistics package of
octave-forge. nanmean and nanstd are both available in the statistics package
of octave-forge.
X_mean = nanmean(X,1);
X_std = nanstd(X,1);
% Correlation Coef of X on X, commented out since this isn't need for
ergular use, and it can be easy found from the function input provided by the
user.
%corrcoef(X,X);
% find the size of our given X, use this later
[r, c] = size(X);
% if only the pcs are requested for output, use eig() instead of svd
% use eig
Z = zscore(X);
[V, D] = eig(cov(Z)); % V is the matrix composing the
eigenvectors, D are the actual eigenvalues
D_diag = diag(D);
D_ordered = flipud(D_diag);
return_n = length(D_diag);
contribs_ordered = D_ordered ./ sum(D_ordered);
cumulatives_ordered = cumsum(contribs_ordered);
if verbose
% provide output of these findings
printf('Total Variance Explained (contribution of each
factor component)\n');
printf('Component\t Eigenvalue \t %% of Variance \t
Cumulative %%\n');
for n = 1:return_n
printf(' %i \t\t %1.5f \t %1.5f \t %1.5f
\n', n, D_ordered(n), contribs_ordered(n), cumulatives_ordered(n));
end
printf('\n');
end
% interpret any limit flags:
if ischar(limit)
switch limit
case 'all'
% return_n should not be changed,
default is to return them all
% case 'economy'
% economy, do not return principal
components that have zero variance
% this is a feature slated to be added
later, as I'm not yet sure how I should do it.
case 'kaiser'
% kaiser, drop all eigenvalues less
than 1.0
return_n = sum(D_ordered >= 1);
% plot the mean eigenvalues found from
random data siimilar in shape to the given X
figure;
plot([1:length(D_ordered)],
ones(1,length(D_ordered)), 'ko-', 'LineWidth', 2);
hold on;
plot(D_ordered, 'ro-',
'LineWidth', 2);
legend('Kaiser Limit 1.0',
'Original X');
xlabel('Component Number');
ylabel('Eigenvalue');
title(['Kaiser Criterion, keep
first ', num2str(return_n)])
case 'pa'
% Horn's parallel analysis (pa), used a
Monte-Carlo based simulation that compares the observed eigenvalues with those
obtained from uncorrelated normal variables. Keep those with eigenvalues
greater than the corresponding eigenvalues found in the random data set
D_randoms = [];
printf('PA: ')
for t = 1:1000
if 0 == mod(t, 100)
printf('=')
end
R = [];
for i = 1:c
R = [R,
random('normal', X_mean(i), X_std(i), [size(X,1), 1])];
end
D_randoms = [D_randoms, pca(R)];
end
printf('\n');
D_percentile = mean(D_randoms, 2) +
2*std(D_randoms, 0, 2)
D_randoms = mean(D_randoms, 2)
return_n = sum(D_ordered >= D_randoms);
% plot the mean eigenvalues found from
random data siimilar in shape to the given X
figure;
plot(D_randoms, 'ko-',
'LineWidth', 2);
hold on;
plot(D_ordered, 'ro-',
'LineWidth', 2);
legend('Parallel Analysis',
'Original X');
xlabel('Component Number');
ylabel('Eigenvalue');
title(['Parallel Analysis of
1000 Iterations, keep first ', num2str(return_n)])
case 'map'
% Velicer's Minimum Average Partial
(MAP) Test
% the loadings are needed here
pc = Z * V;
loadings = corrcoef(X, pc); % note that
this is backwards, so the first pc loadings are on the right
loadings_ordered = fliplr(loadings);
% now for the map part
fm = [(1:c)]';
fm(1) =
(sum(sum(cov(Z).^2))-c)/(c*(c-1));
fm4 = fm;
fm4(1) =
(sum(sum(cov(Z).^4))-c)/(c*(c-1));
for m = 1:c - 1
loading_set =
loadings_ordered(:,1:m);
partcov = cov(Z) - (loading_set
* loading_set');
d = diag((1 ./
sqrt(diag(partcov))), 0);
pr = d * partcov * d;
fm(m+1) =
(sum(sum(pr.^2))-c)/(c*(c-1));
fm4(m+1) =
(sum(sum(pr.^4))-c)/(c*(c-1));
end
[fm_v, fm_i] = min(fm);
printf('Minimum Average Correlation^2
(yr 1976): %1.5f with %i factors.\n', fm_v, fm_i-1);
[fm4_v, fm4_i] = min(fm4);
printf('Minimum Average Correlation^4
(yr 2000): %1.5f with %i factors.\n', fm4_v, fm4_i-1);
% plot the fm and fm4
figure;
plot([0:length(fm)-1], fm,
'ro-', 'LineWidth', 2);
hold on;
plot([0:length(fm4)-1], fm4,
'bo-', 'LineWidth', 2);
legend('Average
Correlations^2', 'Average Correlations^4');
xlabel('Factors Included in
FM');
ylabel('MAP');
title(['Minimum Average Partial
Analysis, fm suggests keep ', num2str(fm_i-1), ', fm4 suggests keep ',
num2str(fm4_i-1)]);
return_n = fm_i-1;
case 'scree'
% scree plot, keep those to the left of
the inflection elbow
figure;
plot(D_ordered, 'ro-',
'LineWidth', 2);
legend('Original X');
xlabel('Component Number');
ylabel('Eigenvalue');
title(['Scree, keep number at
point of inflection']);
case 'variance'
% variance threshold, keep only enough
to account for the specified amount of variance (50% to 90%)
return_n = sum(cumulatives_ordered <
threshold);
case 'first'
% return first n, where n must be
supplied in threshold
return_n = threshold;
otherwise
warning('Limit keyword not recognized,
returning all principal components.');
end
else
warning('limit supplied must be of char type, optional
threshold must be of numeric type.')
end
% find the loadings
pc = Z * V;
loadings = corrcoef(X, pc); % note that this is backwards, so
the first pc loadings are on the right
loadings_ordered = fliplr(loadings);
h_squared = sumsq(loadings_ordered(:,1:return_n), 2);
high_loading = .63; % what is a mimimum loading needed to be
considered a high loading - Comrey and Lee (1992) - loadings in excess of .71
(50% overlapping variance) are considered excellent, .63 (40% overlapping
variance) are very good, .55 (30% overlapping variance) good, .45 (20%
overlapping variance) fair, and .32 (10% overlapping variance) poor.
n_high_loadings = sum(loadings_ordered(:,1:return_n) >
high_loading, 2);
if verbose
% provide these results in a readable output
printf('Variable Loadings (the linear combination of
variables to create factors)\n');
printf('Variable\t');
for c = 1:return_n
printf(' %i \t\t', c);
end
printf('h^2 \t\t High Loadings \n');
for r = 1:size(loadings_ordered, 1)
printf(' %i \t\t', r);
for c = 1:return_n
printf(' %1.5f \t',
loadings_ordered(r,c));
end
printf(' %1.5f \t %i \n', h_squared(r),
n_high_loadings(r));
end
end
% component correlation matrix, this is commented out because it isn't
used in regular use, and it can be easily found from the output provided by
this function. That being said, it might still be nice to return this, and
corrcoef(X, X) with nanmean(X) and nanstd(X) like SPSS does.
% corrcoef(pc, pc)
% return outputs
if nargout > 0
varargout{1} = D_ordered;
end
if nargout > 1
varargout{2} = return_n;
end
if nargout > 2
varargout{3} = pc;
end
if nargout > 3
varargout{4} = V;
end
if nargout > 4
varargout{5} = loadings_ordered(:, 1:return_n);
end
endfunction
------------------------------------------------------------------------------
Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need a complex
infrastructure or vast IT resources to deliver seamless, secure access to
virtual desktops. With this all-in-one solution, easily deploy virtual
desktops for less than the cost of PCs and save 60% on VDI infrastructure
costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox
_______________________________________________
Octave-dev mailing list
Octave-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/octave-dev