[R] Diptest- I'm getting significant values when I shouldn't?
From library(diptest): shouldn't the following almost always be non-significant for Hartigan's dip test?

dip(x = rnorm(1000))

I get dip statistics of around 0.0008, which, based on p-values taken from the lookup table (at N = 1000) via qDiptab, gives 0.02 < p < 0.05. Anyone familiar with Hartigan's dip test and what I may not be understanding?

Thanks, kbrownk

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Diptest- I'm getting significant values when I shouldn't?
Thanks, I found dip.test after posting. I reread the original paper and found that the reported probability is the probability that the dip is less than the given dip statistic. "Less" here is ambiguous to me, and it is strange that dip.test interpolates from the same p-value lookup table I was using (qDiptab) but returns very different p-values. Anyway, the dip.test p-values seem correct when I test them on different distributions.

So, now that I can conclude there are multimodal distributions, any suggestions for finding the best distribution fits? I'm looking for a procedure that can test different distribution types (normal, gamma, etc.), provides parameters such as means and SDs for each sub-distribution, and quantifies how good the fits were. I'm currently looking into Expectation-Maximization methods, particularly from the mixtools R package; normalmixEM looks like a good starting procedure.

Thanks, kbrownk

On Dec 22, 3:32 pm, Duncan Murdoch murdoch.dun...@gmail.com wrote:

On 21/12/2011 3:37 PM, kbrownk wrote: From library(diptest): Shouldn't the following almost always be non-significant for Hartigan's dip test? dip(x = rnorm(1000))

Well, it should be non-significant about 95% of the time.

I get dip scores of around 0.0008 which based on p-values taken from the table (at N=1000), using qDiptab, are 0.02 < p < 0.05. Anyone familiar with Hartigan's dip test and what I may not be understanding?

Why not use dip.test()? When I do that, I see the p-values are almost all quite large:

hist(replicate(1000, dip.test(x = rnorm(1000))$p.value))

Using runif() gives something apparently on the boundary, as you'd expect:

hist(replicate(1000, dip.test(x = runif(1000))$p.value))

Duncan Murdoch
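For anyone finding this thread later: the EM idea behind mixtools::normalmixEM can be sketched in a few lines of base R. This is only an illustration of the alternating E-step/M-step for a two-component normal mixture with made-up data and starting values; mixtools' actual implementation is more general.

```r
# Illustrative data: a hypothetical bimodal sample (two normals).
set.seed(123)
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 4, sd = 1))

# Arbitrary starting values (assumptions, not mixtools defaults).
lambda <- 0.5; mu <- c(-1, 5); sigma <- c(1, 1)

for (i in 1:200) {
  # E-step: posterior probability each point came from component 1
  d1 <- lambda * dnorm(x, mu[1], sigma[1])
  d2 <- (1 - lambda) * dnorm(x, mu[2], sigma[2])
  p1 <- d1 / (d1 + d2)
  # M-step: re-estimate mixing weight, means, and SDs from responsibilities
  lambda <- mean(p1)
  mu <- c(weighted.mean(x, p1), weighted.mean(x, 1 - p1))
  sigma <- c(sqrt(sum(p1 * (x - mu[1])^2) / sum(p1)),
             sqrt(sum((1 - p1) * (x - mu[2])^2) / sum(1 - p1)))
}
# mu should end up near the true component means (0 and 4)
```

The fitted mu, sigma, and lambda are the "parameters for each sub-distribution" asked about above; the mixture log-likelihood (which normalmixEM also reports) is one way to quantify fit.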
Re: [R] Help Transforming sums into observations
Thanks for the heads up. I don't have a #. My data is as you suggest. I tried to generalize my example because I'm open to reformatting for the solution to my problem. Thanks, kbrownk

On Dec 20, 6:14 pm, Sarah Goslee sarah.gos...@gmail.com wrote:

bindata <- 1:5
nobs <- c(2, 3, 1, 4, 3)
rep(bindata, times = nobs)
[1] 1 1 2 2 2 3 4 4 4 4 5 5 5

for the R part, and see below: Sarah

On Tue, Dec 20, 2011 at 5:45 PM, kbrownk kbro...@gmail.com wrote:

I need to measure kurtosis, skew, and maybe the dip test on some distributions I have. Currently my data is in the form of two vectors, x and y, where x is 10 bins and y is the number of observations found in each bin. The measures I want to run seem to require the actual observations laid out rather than already summed as I have them. Any suggestions on how to transform the data automatically? I have a semi-automated method in Excel but I think R will do a better job. A more specific example: my csv file looks like this:

Bin: 1,2,3, ... ,10
#Observations: 23,42,1,... 56

Really? By default R will treat everything after the # as a comment, so you'll need to watch out for the comment character option when you import it. You're also trying to use two or maybe three separate delimiters, which R can't easily handle. Why not use a proper CSV file with Comma Separated Values?

I need this transformed into a single vector like c(1,1,1,1,...,2,2,2,2,...,3,...,10,10,10,10,...) — the vector would have 23 1s, 42 2s, 1 3, etc. I actually have 68 of these vectors laid out in rows that I will measure separately, so my csv file actually looks like this:

Bin: 1,2,3, ... ,10
#Observations: 23,42,1,... 56
Bin: 1,2,3, ... ,10
#Observations: 13,33,32,...98
.
.
.
Bin: 1,2,3, ... ,10
#Observations: 11,76,55,...46

I want to automate the process.
-- Sarah Goslee, http://www.functionaldiversity.org
Re: [R] Help Transforming sums into observations
I ended up just using a VBA macro for Excel. Hopefully I can start transitioning to R for some of these tasks soon. Thanks, kbrownk

On Dec 20, 6:14 pm, Sarah Goslee sarah.gos...@gmail.com wrote:

bindata <- 1:5
nobs <- c(2, 3, 1, 4, 3)
rep(bindata, times = nobs)
[1] 1 1 2 2 2 3 4 4 4 4 5 5 5

for the R part, and see below: Sarah
-- Sarah Goslee, http://www.functionaldiversity.org
[R] Help Transforming sums into observations
I need to measure kurtosis, skew, and maybe the dip test on some distributions I have. Currently my data is in the form of two vectors, x and y, where x is 10 bins and y is the number of observations found in each bin. The measures I want to run seem to require the actual observations laid out rather than already summed as I have them. Any suggestions on how to transform the data automatically? I have a semi-automated method in Excel, but I think R will do a better job. A more specific example: my csv file looks like this:

Bin: 1,2,3, ... ,10
#Observations: 23,42,1,... 56

I need this transformed into a single vector like c(1,1,1,1,...,2,2,2,2,...,3,...,10,10,10,10,...) — the vector would have 23 1s, 42 2s, 1 3, etc. I actually have 68 of these vectors laid out in rows that I will measure separately, so my csv file actually looks like this:

Bin: 1,2,3, ... ,10
#Observations: 23,42,1,... 56
Bin: 1,2,3, ... ,10
#Observations: 13,33,32,...98
.
.
.
Bin: 1,2,3, ... ,10
#Observations: 11,76,55,...46

I want to automate the process. Thanks for any advice, kbrownk
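The rep() approach suggested in the replies extends directly to many rows of counts. A sketch, assuming (hypothetically) that the counts have already been read into a matrix with one row per distribution and 10 columns — the numbers below are made up to fill the elided values in the example:

```r
bins <- 1:10
# hypothetical counts: one row per distribution (only the first and last
# values per row come from the example above; the rest are invented)
counts <- rbind(c(23, 42, 1, 7, 9, 3, 12, 5, 8, 56),
                c(13, 33, 32, 4, 6, 2, 10, 9, 7, 98))

# expand each row: 23 ones, 42 twos, one 3, ... (a list, since row
# totals differ)
expanded <- apply(counts, 1, function(n) rep(bins, times = n))

length(expanded[[1]])  # equals sum(counts[1, ])
```

Each element of `expanded` can then be passed to kurtosis/skew/dip functions directly.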
Re: [R] Is there a way to print branch distances for hclust function?
Never mind on my reply post (it hasn't posted yet, but assuming it does): I found a way to use the merge and height values with Excel to subtract out the lower-level distance. Thanks for the help! kbrownk

On Dec 12, 1:38 am, Peter Langfelder peter.langfel...@gmail.com wrote:

On Sun, Dec 11, 2011 at 8:43 PM, kbrownk kbro...@gmail.com wrote:

The R function hclust is used to do cluster analysis, but based on R help I see no way to print the actual fusion distances (that is, the vertical distances for each connected branch pair seen in the cluster dendrogram). Any ideas? I'd like to use them to test for significant differences from the mean fusion distance (i.e., the Best Cut test). To perform a cluster analysis I'm using:

x <- dist(mydata, method = "euclidean")  # distance matrix
y <- hclust(x, method = "ward")          # clustering (i.e. fusion) method
plot(y)                                  # display dendrogram

Thanks, kbrownk

You need to dig a bit deeper in the help file :) The return value is a list that contains, among others, components 'merge' and 'height'. The 'merge' component tells you which objects were merged at each particular step, and the 'height' component tells you what the merging height at that step was. The (slightly) tricky part is to relate the merge component to actual objects - AFAIK there is no function for that. The function cutree() using the argument k and varying it between 2 and n should basically do it for you, but you need to match it to the entries in 'merge'. Maybe someone else knows a better way to do this.

HTH, Peter
Re: [R] Is there a way to print branch distances for hclust function?
Thanks, I had tried using height but I was using it wrong. The distances alone would be enough, except that height gives the distance all the way from 0 rather than from the adjacent merge. So if an initial merge has a height (distance) of 10, and that cluster then merges again at height 35, I'd want to see a distance value of 25, not 35. I will need a way to relate merges with heights so I can subtract out the lower-level distances like this. I have 67 total distances, so if some manual work is involved I can manage. I'm working on the advice you provided, which may provide some answers. Thanks, kbrownk
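The subtraction described above (merge height minus the height of the child merge feeding each branch) can be done directly from the 'merge' and 'height' components, no Excel needed. A sketch, using made-up data since the original mydata isn't shown; this is my reading of the goal, not a standard function:

```r
set.seed(1)
mydata <- matrix(rnorm(20), ncol = 2)  # 10 hypothetical observations
x <- dist(mydata, method = "euclidean")
y <- hclust(x, method = "ward.D")  # "ward" was renamed "ward.D" in newer R

# Height of the sub-merge feeding a branch: 0 for singletons (coded as
# negative numbers in $merge), otherwise the height of that earlier step.
child_height <- function(idx) {
  sapply(idx, function(i) if (i < 0) 0 else y$height[i])
}

# Two branches enter each merge; subtract each child's height from the
# merge's own height to get the vertical branch lengths seen in the plot.
branch_len <- cbind(y$height - child_height(y$merge[, 1]),
                    y$height - child_height(y$merge[, 2]))
```

So a cluster first formed at height 10 and re-merged at height 35 shows a branch length of 25 here, which matches the example in the message above.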
[R] Is there a way to print branch distances for hclust function?
The R function hclust is used to do cluster analysis, but based on R help I see no way to print the actual fusion distances (that is, the vertical distances for each connected branch pair seen in the cluster dendrogram). Any ideas? I'd like to use them to test for significant differences from the mean fusion distance (i.e., the Best Cut test). To perform a cluster analysis I'm using:

x <- dist(mydata, method = "euclidean")  # distance matrix
y <- hclust(x, method = "ward")          # clustering (i.e. fusion) method
plot(y)                                  # display dendrogram

Thanks, kbrownk
Re: [R] Help understanding cutree used for Dunn Index
I found a way to draw rectangles around whatever cutree cuts, so I think I now have a better idea: it looks for the k largest distances and cuts there, where k is the number of clusters you want defined. Then, I assume the Dunn index uses the same defined clusters to determine within- versus across-cluster similarities (i.e., distances). If I'm way off please let me know; otherwise, thanks for taking the time to read my question. kbrownk
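To make the connection concrete, here is a minimal base-R sketch of the Dunn index computed on a cutree() partition: the ratio of the smallest between-cluster distance to the largest within-cluster diameter (larger is better). The data are made up, and packages such as fpc or clValid provide ready-made versions of this.

```r
set.seed(42)
# Two well-separated hypothetical clusters
mydata <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                matrix(rnorm(20, mean = 5), ncol = 2))
x <- dist(mydata, method = "euclidean")
y <- hclust(x, method = "ward.D")
cl <- cutree(y, k = 2)  # the partition the Dunn index is evaluated on

dm <- as.matrix(x)
# smallest distance between points in different clusters
between <- min(dm[outer(cl, cl, "!=")])
# largest pairwise distance within any single cluster (its diameter)
within <- max(sapply(unique(cl), function(g) {
  idx <- which(cl == g)
  if (length(idx) < 2) 0 else max(dm[idx, idx])
}))
dunn <- between / within
```

This matches the intuition in the message above: the same cutree-defined clusters feed both the across-cluster and within-cluster distance terms.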