---------- Forwarded message ---------- From: Alexandre F. Souza <alexsouza.cb.ufrn...@gmail.com> Date: 2018-06-23 7:04 GMT-03:00 Subject: Is stepacross a must? To: Lista de discussao R-sig-ecology <r-sig-ecology@r-project.org>
Dear friends, I am in deep doubt regarding whether or not to apply vegan'a stepacross function to my dissimilarity matrix. The matrix is a simpson dissimilarity matrix with 665 rows reflecting beta diversity between localities along the brazilian atlantic florest, with over 4000 species most of which are singletons. They were retained for biological reasons. We are appying a community modelling approach with k-means in order to find the best number of clusters this dataset can be divided. The problem is that we are obtaining quite different results with or without applying the stepacross function. According to Tuomisto et al. 2012: ..."the stepacross method or extended dissimilarities (Williamson 1978, De’ath 1999), originally developed to improve the performance of ordination. Th e aim is to fi nd the shortest path between sites that do not share any species by using intermediate sites as stepping-stones. The process replaces all saturated dissimilarity values (or any dissimilarities larger than a specifi ed threshold value) with new values that can exceed unity. The new values can be as large as is necessary to reflect the amount of compositional heterogeneity in the dataset, so using extended dissimilarities instead of the original ones in Mantel tests and multiple regression on dissimilarity matrices can be expected to improve the performance of these methods." I reproduce the code below. Without stepacross we obtain dissimilarities by large dominated by 1, and as many as 25 groups. With stepacross they are more unimodal since the =1 distances were allowed to have higher values, but as few as 6 groups! Furthermore, the final number of groups varies more between runs with stepacross than without it. With stepacross it seems almost to alternate between 6 and 11 groups, with 6 being more frequent. What do you thin it would be wiser to do? I am leaning towards the stepacross.... but I am not that sure. The help page of the function mentions that in cases with high beta-diversity datasets stepacross must be used. Thanks for any thoughts, Alexandre simpdist = recluster.dist(sp, dist = "simpson") simp.step = stepacross(simpdist) dissimilarity_matrix<-as.matrix(simp.step) # Perform an ordination via Principal Coordinate Analysis (PCoA) with correction for negative eigenvalues [add = T] ordination_output<-cmdscale(dissimilarity_matrix, k=(nrow(dissimilarity_matrix)-1), eig = F, add = T, x.ret = FALSE) pcoa_scores<-as.data.frame(ordination_output[[1]]) # Extract the PCoA scores for each site # Get values of the evaluation criterion (Within-Groups SS) for k varying from 2 to 'number of sites minus 1' sum_of_squares <- foreach(i = 2:(ncol(dissimilarity_matrix)-1), .combine = 'cbind') %dopar% { # cbind the results of foreach classification<-kmeans(pcoa_scores, i, iter.max=1000, nstart=100)#usar inter.max=1000 e nstart=100 classification$tot.withinss} # the Within-Groups SS for k=i y<-t(as.matrix(sum_of_squares)) # convert Within-Group SS to a vector (y for piecewise regressions) x<-c(2:(ncol(dissimilarity_matrix)-1)) # number of clusters (x for piecewise regressions) # Create an empty matrix to receive the values of 'max-k entered a priori' and the respective 'optimal k selected' initial_kmax<-c(4:ncol(dissimilarity_matrix)) # At least 4 points to produce two lines (L-method) cluster_results<-cbind(initial_kmax, c(NA)) # Dataframe to receive values of the breakpoints # Perform multiple piecewise regression varying the max-k entered a priori from 4 to 'number of sites' for (j in 4:ncol(dissimilarity_matrix)){ # the maximum number of 'k' established a priori y_selected<-y[1:j] # Limit the number of starting points (y axis) x_selected<-x[1:j] # Limit the number of starting points (x axis) find_the_knee <- foreach(i = 2:(j-1), # calculate the RSE for all possible breakpoints (i) when starting with k-max = j .combine = 'cbind') %dopar% { # piecewise_k<-lm(y_selected~x_selected*(x_selected<i) + x_selected*(x_selected>=i)) summary(piecewise_k)[6]} # print the RSE for each piecewise regression find_the_knee<-c(do.call("cbind",find_the_knee)) # convert the previous output (list) to vector knee_found<-as.data.frame(cbind(c(2:(j-1)), find_the_knee)) # combine position of breakpoint and RSE value names(knee_found)<-c("breakpoint", "RSE") # rename knee_found<-na.omit(knee_found) # remove NaN values k_number<-knee_found[which(knee_found$RSE==min(knee_found$RSE)),1] # select breakpoint that minimizes RSE cluster_results[(j-3),2]<-k_number} # print the maximum number of initial groups (j) and the respective selected breakpoint (k) (Sys.time()-a) # How long it takes? ############################################################ ############################## # STEP 3 Inspect the variation in the values of optimal 'k' and identify the most frequent 'k' # Identify the most frequent 'k' (math-mode) most_frequent_k<-as.data.frame(cluster_results)[,2] optimal_k<-names(sort(-table(most_frequent_k)))[1] optimal_k # what is the most frequent value? hist(most_frequent_k, n=50, ylab="Frequency of k", xlab="Value of the optimal k", main="", cex.axis=1.0, cex.lab=1.2, las=1, col="grey") -- Dr. Alexandre F. Souza Professor Adjunto IV Universidade Federal do Rio Grande do Norte CB, Departamento de Ecologia Campus Universitário - Lagoa Nova 59072-970 - Natal, RN - Brasil lattes: lattes.cnpq.br/7844758818522706 http://www.esferacientifica.com.br http://www.docente.ufrn.br/alexsouza orcid.org/0000-0001-7468-3631 <http://www.docente.ufrn.br/alexsouza> [[alternative HTML version deleted]] _______________________________________________ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology