[OctDev] Octave-Forge Statistics Function Submission

Kyle N Winfree Mon, 02 Jan 2012 09:41:29 -0800

Hi All,

I have two functions (attached) which I would like to submit toOctave-Forge. The first, poolstd.m, is a super simple function to poola set standard deviations. The second function performs a principalcomponent analysis akin to that provided by SPSS. This function canaccept several flags to use standard methods for identifying the numberof principal components to include.I can provide a non-public data set for testing pca.m if needed.Please let me know what modifications I should make to either so thatthey can be added to the stats package.

Best,
Kyle

% Compute Pooled Standard Deviation from two vector arrays of precomputed 
standard deviations and sample sizes.
%
% sd = poolstd(sd, n)
% where:
%   sd = [sd_1, sd_2, ..., sd_k] % the standard deviation of several samples 
from k seperate groups/subjects
%   n = [n_1, n_2, ..., n_k] % the sample size of those k seperate 
groups/subjects
%
%        (n_1 - 1)sd_1^2 + (n_2 - 1)sd_2^2 + ... + (n_k - 1)sd_2^k
% SD^2 = ---------------------------------------------------------
%                        n_1 + n_2 + ... + n_k - k
%
% More information can be found at http://en.wikipedia.org/wiki/Pooled_variance
%
% Decemeber 2011 - Kyle Winfree, kyle.winf...@gmail.com
% This is free software (GPL).
function sd_pooled = poolstd(sd, n)


k = length(n);
sd_pooled = sqrt(sum((n - 1) .* (sd .^ 2)) / (sum(n) - k));

% testing:
% x1 = [2 4 5 2 1 3 1 3 4]
% x2 = [2 4 4 5 7 6 3 2 7]
% x3 = [3 5 9 7 1]
% sd = [std(x1), std(x2), std(x3)] % (1.3944   1.9437   3.1623)
% n = [length(x1), length(x2), length(x3)] % (9   9   5)
% k = length(n) % (3)
% sd_pooled = sqrt(sum((n - 1) .* (sd .^ 2)) / (sum(n) - k))

% function [D_ordered, return_n, pc, V, loadings_n] = pca(X, {flags})
%
% Perform Principal Component Analysis of X similar to that which would be 
provided by programs like SPSS.
%
% flags can be as follows:
%   pca(X, limit_type)
%     where limit_type = {'all' | 'kaiser' | 'pa' | 'map' | 'scree'}
%           all - return all components, this is the default if limit_type if 
only X is provided
%           kaiser - drop all eigenvalues with a value less than 1.0 (Kaiser 
Criterion)
%           pa - perform a parallel analysis and keep only those that provide 
eigenvalues with a value better than that of a matched randomized set
%           map - perform a Velicer's Minimum Average Partial (MAP) test, keep 
those according to the year 2000 standard of map^4
%   pca(X, limit_type, threshold)
%     where limit_type = {'varaince' | 'first'}
%           variance - return only enough to account for the amount of variance 
specified by threshold
%           first - return the first n components where n is the value of 
threshold
%   pca(X, limit_type, threshold, verbosity)
%     where verbosity = {1 | 0}
%           1 - verbose
%           0 - quiet
%     note that a threshold value must be provided if verbosity is specified.  
For limit_types of all, kaiser, pa, or map, this threshold will have no effect.
%
%   limit_type(s) of kaiser, pa, map, and scree will all produce a plot unless 
verbosity is set to 0.
%
% Authors: Kyle Winfree (kyle.winf...@gmail.com)
% This program is free software (GPL).

function varargout = pca(varargin)
        % check inputs to see if basic pca is requested, or if the number of 
principal components to be returned should be some how shortened (Scree, PA, 
MAP, etc).
        switch nargin
                case 1
                        X = varargin{1};
                        limit = 'all';
                        verbose = 0;
                case 2
                        X = varargin{1};
                        limit = varargin{2}; % this can be a keyword only
                        verbose = 1;
                case 3
                        X = varargin{1};
                        limit = varargin{2};
                        threshold = varargin{3};
                        verbose = 1;
                case 4
                        X = varargin{1};
                        limit = varargin{2};
                        threshold = varargin{3};
                        verbose = varargin{4};
                otherwise
                        error('Incorrect number of inputs given')
        end

        % report some basic descriptives, it is assumed that nanmean and nanstd 
are available since this function is bound for the statistics package of 
octave-forge.  nanmean and nanstd are both available in the statistics package 
of octave-forge.
                X_mean = nanmean(X,1);
                X_std = nanstd(X,1);

        % Correlation Coef of X on X, commented out since this isn't need for 
ergular use, and it can be easy found from the function input provided by the 
user.
                %corrcoef(X,X);

        % find the size of our given X, use this later
                [r, c] = size(X);

        % if only the pcs are requested for output, use eig() instead of svd
                % use eig
                Z = zscore(X);
                [V, D] = eig(cov(Z)); % V is the matrix composing the 
eigenvectors, D are the actual eigenvalues
                D_diag = diag(D);
                D_ordered = flipud(D_diag);
                return_n = length(D_diag);
                contribs_ordered = D_ordered ./ sum(D_ordered);
                cumulatives_ordered = cumsum(contribs_ordered);
        if verbose
                % provide output of these findings
                        printf('Total Variance Explained (contribution of each 
factor component)\n');
                        printf('Component\t Eigenvalue \t %% of Variance \t 
Cumulative %%\n');
                        for n = 1:return_n
                                printf(' %i \t\t  %1.5f \t  %1.5f \t  %1.5f 
\n', n, D_ordered(n), contribs_ordered(n), cumulatives_ordered(n));
                        end
                        printf('\n');
        end

        % interpret any limit flags:
                if ischar(limit)
                        switch limit
                                case 'all'
                                        % return_n should not be changed, 
default is to return them all

%                               case 'economy'
                                        % economy, do not return principal 
components that have zero variance
                                        % this is a feature slated to be added 
later, as I'm not yet sure how I should do it.

                                case 'kaiser'
                                        % kaiser, drop all eigenvalues less 
than 1.0
                                        return_n = sum(D_ordered >= 1);
                                        % plot the mean eigenvalues found from 
random data siimilar in shape to the given X
                                                figure;
                                                plot([1:length(D_ordered)], 
ones(1,length(D_ordered)), 'ko-', 'LineWidth', 2);
                                                hold on;
                                                plot(D_ordered, 'ro-', 
'LineWidth', 2);
                                                legend('Kaiser Limit 1.0', 
'Original X');
                                                xlabel('Component Number');
                                                ylabel('Eigenvalue');
                                                title(['Kaiser Criterion, keep 
first ', num2str(return_n)])

                                case 'pa'
                                        % Horn's parallel analysis (pa), used a 
Monte-Carlo based simulation that compares the observed eigenvalues with those 
obtained from uncorrelated normal variables.  Keep those with eigenvalues 
greater than the corresponding eigenvalues found in the random data set
                                        D_randoms = [];
                                        printf('PA: ')
                                        for t = 1:1000
                                                if 0 == mod(t, 100)
                                                        printf('=')
                                                end
                                                R = [];
                                                for i = 1:c
                                                        R = [R, 
random('normal', X_mean(i), X_std(i), [size(X,1), 1])];
                                                end
                                                D_randoms = [D_randoms, pca(R)];
                                        end
                                        printf('\n');
                                        D_percentile = mean(D_randoms, 2) + 
2*std(D_randoms, 0, 2)
                                        D_randoms = mean(D_randoms, 2)
                                        return_n = sum(D_ordered >= D_randoms);
                                        % plot the mean eigenvalues found from 
random data siimilar in shape to the given X
                                                figure;
                                                plot(D_randoms, 'ko-', 
'LineWidth', 2);
                                                hold on;
                                                plot(D_ordered, 'ro-', 
'LineWidth', 2);
                                                legend('Parallel Analysis', 
'Original X');
                                                xlabel('Component Number');
                                                ylabel('Eigenvalue');
                                                title(['Parallel Analysis of 
1000 Iterations, keep first ', num2str(return_n)])

                                case 'map'
                                        % Velicer's Minimum Average Partial 
(MAP) Test
                                        % the loadings are needed here
                                        pc = Z * V;
                                        loadings = corrcoef(X, pc); % note that 
this is backwards, so the first pc loadings are on the right
                                        loadings_ordered = fliplr(loadings);
                                        % now for the map part
                                        fm = [(1:c)]';
                                        fm(1) = 
(sum(sum(cov(Z).^2))-c)/(c*(c-1));
                                        fm4 = fm;
                                        fm4(1) = 
(sum(sum(cov(Z).^4))-c)/(c*(c-1));
                                        for m = 1:c - 1
                                                loading_set = 
loadings_ordered(:,1:m);
                                                partcov = cov(Z) - (loading_set 
* loading_set');
                                                d = diag((1 ./ 
sqrt(diag(partcov))), 0);
                                                pr = d * partcov * d;
                                                fm(m+1)  = 
(sum(sum(pr.^2))-c)/(c*(c-1));
                                                fm4(m+1) = 
(sum(sum(pr.^4))-c)/(c*(c-1));
                                        end
                                        [fm_v, fm_i] = min(fm);
                                        printf('Minimum Average Correlation^2 
(yr 1976): %1.5f with %i factors.\n', fm_v, fm_i-1);
                                        [fm4_v, fm4_i] = min(fm4);
                                        printf('Minimum Average Correlation^4 
(yr 2000): %1.5f with %i factors.\n', fm4_v, fm4_i-1);
                                        % plot the fm and fm4
                                                figure;
                                                plot([0:length(fm)-1], fm, 
'ro-', 'LineWidth', 2);
                                                hold on;
                                                plot([0:length(fm4)-1], fm4, 
'bo-', 'LineWidth', 2);
                                                legend('Average 
Correlations^2', 'Average Correlations^4');
                                                xlabel('Factors Included in 
FM');
                                                ylabel('MAP');
                                                title(['Minimum Average Partial 
Analysis, fm suggests keep ', num2str(fm_i-1), ', fm4 suggests keep ', 
num2str(fm4_i-1)]);
                                                return_n = fm_i-1;

                                case 'scree'
                                        % scree plot, keep those to the left of 
the inflection elbow
                                                figure;
                                                plot(D_ordered, 'ro-', 
'LineWidth', 2);
                                                legend('Original X');
                                                xlabel('Component Number');
                                                ylabel('Eigenvalue');
                                                title(['Scree, keep number at 
point of inflection']);

                                case 'variance'
                                        % variance threshold, keep only enough 
to account for the specified amount of variance (50% to 90%)
                                        return_n = sum(cumulatives_ordered < 
threshold);

                                case 'first'
                                        % return first n, where n must be 
supplied in threshold
                                        return_n = threshold;

                                otherwise
                                        warning('Limit keyword not recognized, 
returning all principal components.');
                        end
                else
                        warning('limit supplied must be of char type, optional 
threshold must be of numeric type.')
                end

        % find the loadings
                pc = Z * V;
                loadings = corrcoef(X, pc); % note that this is backwards, so 
the first pc loadings are on the right
                loadings_ordered = fliplr(loadings);
                h_squared = sumsq(loadings_ordered(:,1:return_n), 2);
                high_loading = .63; % what is a mimimum loading needed to be 
considered a high loading - Comrey and Lee (1992) - loadings in excess of .71 
(50% overlapping variance) are considered excellent, .63 (40% overlapping 
variance) are very good, .55 (30% overlapping variance) good, .45 (20% 
overlapping variance) fair, and .32 (10% overlapping variance) poor.
                n_high_loadings = sum(loadings_ordered(:,1:return_n) > 
high_loading, 2);
        if verbose
                % provide these results in a readable output
                        printf('Variable Loadings (the linear combination of 
variables to create factors)\n');
                        printf('Variable\t');
                        for c = 1:return_n
                                printf(' %i \t\t', c);
                        end
                        printf('h^2 \t\t High Loadings \n');
                        for r = 1:size(loadings_ordered, 1)
                                printf(' %i \t\t', r);
                                for c = 1:return_n
                                        printf('  %1.5f \t', 
loadings_ordered(r,c));
                                end
                                printf(' %1.5f \t  %i \n', h_squared(r), 
n_high_loadings(r));
                        end
        end

        % component correlation matrix, this is commented out because it isn't 
used in regular use, and it can be easily found from the output provided by 
this function.  That being said, it might still be nice to return this, and 
corrcoef(X, X) with nanmean(X) and nanstd(X) like SPSS does.
                % corrcoef(pc, pc)

        % return outputs
                if nargout > 0
                        varargout{1} = D_ordered;
                end
                if nargout > 1
                        varargout{2} = return_n;
                end
                if nargout > 2
                        varargout{3} = pc;
                end
                if nargout > 3
                        varargout{4} = V;
                end
                if nargout > 4
                        varargout{5} = loadings_ordered(:, 1:return_n);
                end
endfunction

------------------------------------------------------------------------------
Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need a complex
infrastructure or vast IT resources to deliver seamless, secure access to
virtual desktops. With this all-in-one solution, easily deploy virtual 
desktops for less than the cost of PCs and save 60% on VDI infrastructure 
costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox

_______________________________________________
Octave-dev mailing list
Octave-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/octave-dev

[OctDev] Octave-Forge Statistics Function Submission

Reply via email to