[R] Diptest- I'm getting significant values when I shouldn't?

2011-12-22 Thread kbrownk
From library(diptest):

Shouldn't the following almost always be non-significant for
Hartigan's dip test?

dip(x = rnorm(1000))

I get dip scores of around 0.0008, which, based on the p values taken from
the table (at N = 1000) using the command qDiptab, correspond to
0.02 < p < 0.05.
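
For reference, this is roughly the lookup I am attempting (a minimal
sketch; the "1000" row name and "0.95" column name are my assumptions
about how qDiptab is laid out):

library(diptest)

set.seed(1)
d <- dip(rnorm(1000))   # observed dip statistic for one N(0,1) sample

data(qDiptab)           # table of null quantiles of the dip statistic
qDiptab["1000", ]       # quantiles for n = 1000 at the probabilities in the column names

# The dip is significant at the 5% level only if it exceeds the upper
# 0.95 quantile for this sample size (column name assumed to be "0.95"):
d > qDiptab["1000", "0.95"]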

Anyone familiar with Hartigan's dip test and what I may not be
understanding?

Thanks,
kbrownk



Re: [R] Diptest- I'm getting significant values when I shouldn't?

2011-12-22 Thread kbrownk
Thanks, I found dip.test after posting. I reread the original paper and
found that the tabulated probability is the probability that the dip is
less than the given dip score. "Less" here is ambiguous to me, and it is
strange that dip.test interpolates from the same p-value lookup table I
was using (qDiptab) but returns very different p values. Anyway, the
dip.test p values seem correct when I test them on different distributions.

So, now that I can conclude some of my distributions are multimodal, any
suggestions for finding the best distribution fits? I'm looking for a
procedure that can test different distribution types (normal, gamma,
etc.), provides parameters such as means and SDs for each sub-
distribution, and quantifies how good the fits are. I'm currently
looking into Expectation-Maximization methods, particularly from the
mixtools R package. 'normalmixEM' looks like a good starting procedure.
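
For concreteness, here is the sort of starting point I have in mind (a
minimal sketch only; the toy bimodal data and k = 2 are placeholders,
not my real data):

library(mixtools)

# Toy bimodal data standing in for one of my distributions
set.seed(1)
x <- c(rnorm(500, mean = 0, sd = 1), rnorm(500, mean = 4, sd = 1))

fit <- normalmixEM(x, k = 2)   # EM fit of a two-component normal mixture
fit$mu       # estimated component means
fit$sigma    # estimated component SDs
fit$lambda   # estimated mixing proportions
fit$loglik   # log-likelihood, one rough measure of fit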
Thanks,
kbrownk
On Dec 22, 3:32 pm, Duncan Murdoch murdoch.dun...@gmail.com wrote:
 On 21/12/2011 3:37 PM, kbrownk wrote:

   From library(diptest):

  Shouldn't the following almost always be non-significant for
  Hartigan's dip test?

  dip(x = rnorm(1000))

 Well, it should be non-significant about 95% of the time

  I get dip scores of around 0.0008 which based on p values taken from
  the table (at N=1000), using the command: qDiptab, are 0.02 < p <
  0.05.

  Anyone familiar with Hartigan's dip test and what I may not be
  understanding?

 Why not use dip.test()?  When I do that, I see the p-values are almost
 all quite large:

 hist(replicate(1000, dip.test(x=rnorm(1000))$p.value))

 Using runif() gives something apparently on the boundary, as you'd expect:

 hist(replicate(1000, dip.test(x=runif(1000))$p.value))

 Duncan Murdoch




Re: [R] Help Transforming sums into observations

2011-12-21 Thread kbrownk
Thanks for the heads up. I don't actually have a # in my file; my data is
formatted as you suggest. I tried to generalize my example because I'm
open to reformatting the data for whatever solution works.
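
Building on the rep() suggestion, something like this sketch is what I
expect to end up with once the 68 rows of counts are read in (the counts
matrix here is made up, and the moments package is just one option for
skewness/kurtosis):

# Made-up stand-in for the real data: 68 rows of counts over 10 bins
set.seed(1)
count_matrix <- matrix(sample(0:60, 68 * 10, replace = TRUE), nrow = 68)
bins <- 1:10

# Expand each row of counts into a vector of raw observations
obs_list <- lapply(seq_len(nrow(count_matrix)),
                   function(i) rep(bins, times = count_matrix[i, ]))

# Per-distribution summaries on the expanded observations
library(moments)
sapply(obs_list, skewness)
sapply(obs_list, kurtosis)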

Thanks,
kbrownk

On Dec 20, 6:14 pm, Sarah Goslee sarah.gos...@gmail.com wrote:
  bindata <- 1:5
  nobs <- c(2, 3, 1, 4, 3)
  rep(bindata, times = nobs)

  [1] 1 1 2 2 2 3 4 4 4 4 5 5 5

 for the R part, and see below:

 Sarah

 On Tue, Dec 20, 2011 at 5:45 PM, kbrownk kbro...@gmail.com wrote:
  I need to measure kurtosis, skew, and maybe dip test on some
  distributions I have. Currently my data is in the form of 2 vectors x
  and y. Where x is 10 bins and y is the number of observations found in
  that bin. It seems that the measures I want to run require the actual
  observations laid out rather than already summed like I have them. Any
  suggestions on how to transform the data automatically? I have a semi-
  automated method in Excel but I think r will do a better job. I
  provide a more specific example below:

  My csv file with the data looks like this:
  Bin: 1,2,3, ... ,10     #Observations:  23,42,1,...  56

 Really? By default R will treat everything after the # as a comment, so
 you'll need to watch out for the comment character option when you
 import it.

 You're also trying to use two or maybe three separate delimiters, which
 R can't easily handle. Why not use a proper CSV file with Comma Separated
 Values?

  I need this transformed into a single vector like this:
  c(1,1,1,1...2,2,2,2...3,...10,10,10,10...) The vector would have 23
  1s, 42 2s, 1 3, etc.

  I actually have 68 of these vectors laid out in rows that I will
  measure separately, so my csv file actually looks like this:
  Bin: 1,2,3, ... ,10     #Observations:  23,42,1,... 56
  Bin: 1,2,3, ... ,10     #Observations:  13,33,32,...98
  .
  .
  .
  Bin: 1,2,3, ... ,10     #Observations:  11,76,55,...46

  I want to automate the process.

 --
 Sarah Goslee
 http://www.functionaldiversity.org




Re: [R] Help Transforming sums into observations

2011-12-21 Thread kbrownk
I ended up just using a VBA macro in Excel. Hopefully I can start
transitioning to R for some of these tasks soon.

Thanks,
kbrownk




[R] Help Transforming sums into observations

2011-12-20 Thread kbrownk
I need to measure kurtosis, skewness, and maybe run a dip test on some
distributions I have. Currently my data is in the form of two vectors,
x and y, where x is 10 bins and y is the number of observations found in
each bin. It seems that the measures I want to run require the actual
observations laid out rather than already summed as I have them. Any
suggestions on how to transform the data automatically? I have a semi-
automated method in Excel, but I think R will do a better job. I
provide a more specific example below:

My csv file with the data looks like this:
Bin: 1,2,3, ... ,10 #Observations:  23,42,1,...  56

I need this transformed into a single vector like this:
c(1,1,1,1...2,2,2,2...3,...10,10,10,10...) The vector would have 23
1s, 42 2s, 1 3, etc.

I actually have 68 of these vectors laid out in rows that I will
measure separately, so my csv file actually looks like this:
Bin: 1,2,3, ... ,10 #Observations:  23,42,1,... 56
Bin: 1,2,3, ... ,10 #Observations:  13,33,32,...98
.
.
.
Bin: 1,2,3, ... ,10 #Observations:  11,76,55,...46

I want to automate the process.


Thanks for any advice,
kbrownk



Re: [R] Is there a way to print branch distances for hclust function?

2011-12-13 Thread kbrownk
Never mind my other reply (it hasn't posted yet, but assuming it does).
I found a way to use the merge and height values, with Excel, to
subtract out the lower-level distances. Thanks for the help!

kbrownk

On Dec 12, 1:38 am, Peter Langfelder peter.langfel...@gmail.com
wrote:
 On Sun, Dec 11, 2011 at 8:43 PM, kbrownk kbro...@gmail.com wrote:
  The R function hclust is used to do cluster analysis, but based on the R
  help I see no way to print the actual fusion distances (that is, the
  vertical distances for each connected branch pair seen in the cluster
  dendrogram).

  Any ideas? I'd like to use them to test for significant differences from
  the mean fusion distance (i.e. the Best Cut Test).

  To perform a cluster analysis I'm using:

  x <- dist(mydata, method = "euclidean") # distance matrix
  y <- hclust(x, method = "ward") # clustering (i.e. fusion) method
  plot(y) # display dendrogram

  Thanks,
 kbrownk

 You need to dig a bit deeper in the help file :) The return value is a
 list that contains, among others, components
 'merge' and 'height'. The 'merge' component tells you which objects
 were merged at each particular step, and the 'height' component tells
 you what the merging height at that step was. The (slightly) tricky
 part is to relate the merge component to actual objects - AFAIK there
 is no function for that. The function cutree() using the argument k
 and varying it between 2 and n should basically do it for you but you
 need to match it to the entries in 'merge'. Maybe someone else knows a
 better way to do this.

 HTH,

 Peter




Re: [R] Is there a way to print branch distances for hclust function?

2011-12-13 Thread kbrownk
Thanks, I had tried using height, but I was using it wrong. If I had the
branch distances alone that would be enough, except that height returns
the distance all the way down to 0, rather than to the adjacent merge.
So, if an initial merge at height 0 reaches height 10, and that object
then merges again at height 35, I'd want to see a distance value of 25,
not 35. So I will need a way to relate merges with heights so I can
subtract out the lower-level distances like this. I have 67 total
distances, so if some manual work is involved I can manage. I'm working
from the advice you provided, which may provide some answers.
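
To make the subtraction concrete, here is a rough sketch of what I mean
(toy data in place of mydata; it assumes that a negative entry in
y$merge is a single observation sitting at height 0 and a positive entry
refers to an earlier merge row):

# Toy data standing in for mydata
set.seed(1)
mydata <- matrix(rnorm(40), ncol = 2)

x <- dist(mydata, method = "euclidean")
y <- hclust(x, method = "ward.D")   # "ward" in the R versions current in 2011

# Height each child of a merge had already reached: 0 for single
# observations (negative entries of y$merge), otherwise the height
# of the earlier merge that entry points to
child_height <- apply(y$merge, c(1, 2),
                      function(id) if (id < 0) 0 else y$height[id])

# Branch length for each of the two branches formed at each merge:
# the merge's own height minus the height its child already reached
branch_len <- y$height - child_height
branch_len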

Thanks,
kbrownk




[R] Is there a way to print branch distances for hclust function?

2011-12-11 Thread kbrownk
The R function hclust is used to do cluster analysis, but based on the R
help I see no way to print the actual fusion distances (that is, the
vertical distances for each connected branch pair seen in the cluster
dendrogram).

Any ideas? I'd like to use them to test for significant differences from
the mean fusion distance (i.e. the Best Cut Test).

To perform a cluster analysis I'm using:

x <- dist(mydata, method = "euclidean") # distance matrix
y <- hclust(x, method = "ward") # clustering (i.e. fusion) method
plot(y) # display dendrogram

Thanks,
kbrownk



Re: [R] Help understanding cutree used for Dunn Index

2011-12-09 Thread kbrownk
I found a way to draw rectangles around whatever cutree cuts, so I think I
now have a better idea: it looks for the k largest merge distances and cuts
there, where k is the number of clusters you want defined. Then, I assume
the Dunn index uses those same defined clusters to compare within-cluster
vs. across-cluster similarities (i.e. distances).
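
In case it helps anyone else, here is roughly the kind of thing I mean
(toy data; rect.hclust() is one way to draw the rectangles, and dunn()
from the clValid package is just one implementation of the Dunn index,
not something named earlier in this thread):

library(clValid)   # provides dunn(); one possible Dunn index implementation

set.seed(1)
mydata <- matrix(rnorm(60), ncol = 2)   # toy data
d  <- dist(mydata)
hc <- hclust(d)

plot(hc)
rect.hclust(hc, k = 3)    # rectangles around the k = 3 cut of the dendrogram

cl <- cutree(hc, k = 3)              # cluster membership for each observation
dunn(distance = d, clusters = cl)    # Dunn index for this partition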

If I'm way off please let me know, otherwise thanks for taking the time to 
read my question.

kbrownk