[R] Kolmogorov-Smirnov-Test on binned data, I guess gumbel-distributed data

2011-11-02 Thread Jochen1980
Hi R-Users,

I have read several texts on KS tests. Most of those authors state that
KS tests are not suitable for binned data, but some of them refer to 'other'
authors who claim that KS tests are okay for binned data.

I searched for sources and cannot find examples confirming that it is okay
to use KS tests on binned data - do you have any links to articles or
tutorials?
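For binned counts, the usual alternative to the KS test is a chi-squared goodness-of-fit test against the binned Gumbel probabilities. A minimal sketch with simulated data and made-up parameter values (0.092 / 0.022, loosely echoing estimates from later posts); note the caveat that estimating mue and beta from the same data reduces the degrees of freedom, which chisq.test() does not correct for:

```r
# Chi-squared GOF sketch: compare observed bin counts with the
# probabilities a Gumbel(mue, beta) puts on the same bins.
# mue/beta are invented here; plug in your own estimates.
set.seed(1)
mue <- 0.092; beta <- 0.022
smp <- mue - beta * log(-log(runif(1000)))      # Gumbel sample via inverse CDF
pg  <- function(q) exp(-exp(-(q - mue) / beta)) # Gumbel CDF

h <- hist(smp, breaks = 20, plot = FALSE)
expected_p <- diff(pg(h$breaks))                     # probability per bin
expected_p[1] <- expected_p[1] + pg(h$breaks[1])     # fold in lower tail mass
k <- length(expected_p)
expected_p[k] <- expected_p[k] + (1 - pg(max(h$breaks)))  # fold in upper tail

# Bins with tiny expected counts should be merged in practice;
# the warning chisq.test() emits about that is suppressed here.
res <- suppressWarnings(chisq.test(h$counts, p = expected_p))
print(res)
```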

Anyway, I am looking for a test that backs up my belief that my data is
Gumbel-distributed. I estimated the Gumbel parameters mue and beta, and after
looking at the resulting plots, in my opinion they look quite good!

You can find the plot, the related data, and the R script here:
www.jochen-bauer.net/downloads/kstest/Rplots-1000.pdf
http://www.jochen-bauer.net/downloads/kstest/rm2700-1000.txt
http://www.jochen-bauer.net/downloads/kstest/rcalc.R

The story about the data:
I am wondering which test I should choose if the KS test is not appropriate. I
get very high p-values for the histogram heights of data row 1 versus the
Gumbel distribution function fitted at the bin midpoints. Most of the time,
the KS test reports distances around 0.01 and p-values of 0.99 or 1. That
seems strange to me, too high. On the other hand my plots look good and, as
you can see, in my first experiment I sampled 1000 values. In a second
experiment I created only 50 random values for the Gumbel parameter
estimation. I am trying to reduce the number of permutations so I can produce
results faster, but I have to find out at what point the data fails the
Gumbel assumption. The results surprised me: I expected my tests and plots to
get worse, but I still got high p-values for the KS test and still a
nice-looking plot.

www.jochen-bauer.net/downloads/kstest/Rplots-0050.pdf
http://www.jochen-bauer.net/downloads/kstest/rm2700-0050.txt
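One way to sidestep the binning question entirely is to run ks.test() on the raw, unbinned values against the fitted Gumbel CDF. A sketch with a simulated sample (the parameter values are invented); the usual caveat applies: if mue and beta were estimated from the same sample, the reported p-value is too optimistic:

```r
# One-sample KS test on raw observations against a fully specified
# Gumbel CDF. ks.test() accepts a CDF given as an R function.
set.seed(2)
mue <- 0.092; beta <- 0.022
x <- mue - beta * log(-log(runif(1000)))  # stand-in Gumbel sample

pgumbel <- function(q) exp(-exp(-(q - mue) / beta))  # Gumbel CDF
res <- ks.test(x, pgumbel)
print(res)
```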

Moreover, besides the shuffled data of my randomisation test there are
real data values. I calculated the p-value of my real data point occurring
under the estimated Gumbel distribution. Those p-values from the
1000-permutation experiment and the 50-permutation experiment correlate
strongly, around 0.98, according to both the Pearson and Spearman
correlation coefficients. I guess this supports the observation that neither
my plots nor the KS tests are getting worse.

I hope I have described my current situation clearly and that you can give me
some hints - literature, other tests, or confirmation of my guess that my
data is Gumbel-distributed.

Thanks in advance.

Jochen



--
View this message in context: 
http://r.789695.n4.nabble.com/Kolmogorov-Smirnov-Test-on-binned-data-I-guess-gumbel-distributed-data-tp3983781p3983781.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problems with ks.test()

2011-07-30 Thread Jochen1980
Hi, I used the KS function from another library (kstwo() from Numerical
Recipes, a mathematics book) to test this myself, and the same thing happens
there - I cannot understand why this occurs. I hope someone can
'enlighten' me.

--
View this message in context: 
http://r.789695.n4.nabble.com/Problems-with-ks-test-tp3703469p3706072.html


[R] Problems with ks.test()

2011-07-29 Thread Jochen1980
Hi, 

I have two data point vectors and now want to run ks.test(). If you print
both vectors you will see that they match pretty well. Here is a picture:
http://www.jochen-bauer.net/downloads/kstest-r-help-list-plot.png

As you can see, there is a histogram, and the Gumbel density function is
plotted on top of it. I took the bin midpoints and the bin heights as
vector1 and computed the distribution values at all bin midpoints as vector2.

I pass these two vectors to ks.test(). Are these the right vectors if I want
to decide afterwards whether my experiment data is Gumbel-distributed?

Surprisingly, the p-value changes tremendously when I keep more digits from
my theoretical formula. If I round to 0 digits, p is 1; if I round to 4
digits, p drops to almost 0 - how can this happen? I thought more digits
would give more accurate results!
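A likely explanation, sketched with invented toy numbers: ks.test() treats its two arguments as raw samples, so this setup compares the distributions of the *bar heights*, not of the underlying data. With 0 digits most heights tie at 0 in both vectors, so the two ECDFs of the heights nearly coincide; with 4 digits the small theoretical values become distinct nonzero numbers and the ECDFs separate:

```r
# Toy bar heights illustrating how rounding flips the two-sample KS result.
counts <- c(0, 0, 24, 74, 98, 133, 0, 0, 0, 0)     # observed histogram heights
smooth <- c(0.006, 0.622, 10.094, 49.271, 112.776,
            160.174, 2.490, 0.157, 0.010, 0.001)   # fitted theoretical values

# Ties trigger a warning from ks.test(); suppressed for a clean demo.
r0 <- suppressWarnings(ks.test(counts, round(smooth, 0)))  # many shared zeros
r4 <- suppressWarnings(ks.test(counts, round(smooth, 4)))  # zeros disappear
print(c(p_0_digits = r0$p.value, p_4_digits = r4$p.value))
```

The p-value collapses with more digits not because the fit got worse, but because the vectors being compared are not samples from the distributions in question.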

### Case 0 digits ###
  [1]   0   0   0   0   0  24  74  98 133 147 134 120  89  69  46  31  16   7
 [19]   7   3   2   0   0   0  (entries [22]-[100] are all 0)
  [1]   0   0   0   0   1  10  49 113 160 168 147 113  81  55  37  24  15  10
 [19]   6   4   2   2   1   1  (entries [25]-[100] are all 0)
[1] Results
[1] Analysis of the input data
[1] Mean:  0.104537195
[1] SD:  0.0277657985898433
[1] Parameter estimation assuming a Gumbel distribution
[1] Mue:  0.0920411082987717
[1] Beta:  0.0216489043196013
[1] KS test -  1000  values,  100  bins, x: bin midpoints, y1, y2 = histogram heights
[1] KST D:  0.04
[1] KST P:  1

### Case 4 digits ###
  [1]   0   0   0   0   0  24  74  98 133 147 134 120  89  69  46  31  16   7
 [19]   7   3   2   0   0   0  (entries [22]-[100] are all 0)
  [1]   0.000   0.000   0.000   0.006   0.622  10.094  49.271 112.776 160.174
 [10] 168.419 146.527 113.137  81.026  55.344  36.690  23.870  15.347   9.793
 [19]   6.220   3.939   2.490   1.572   0.992   0.625   0.394   0.248   0.157
 [28]   0.099   0.062   0.039   0.025   0.016   0.010   0.006   0.004   0.002
 [37]   0.002   0.001   0.001   0.000  (entries [40]-[100] are all 0.000)
[1] Results
[1] Analysis of the input data
[1] Mean:  0.104537195
[1] SD:  0.0277657985898433
[1] Parameter estimation assuming a Gumbel distribution
[1] Mue:  0.0920411082987717
[1] Beta:  0.0216489043196013
[1] KS test -  1000  values,  100  bins, x: bin midpoints, y1, y2 = histogram heights
[1] KST D:  0.2
[1] KST P:  0.0366

Thanks in advance for some help.
Jochen

--
View this message in context: 
http://r.789695.n4.nabble.com/Problems-with-ks-test-tp3703469p3703469.html


[R] Use ks.test() or an alternative in C/C++ application

2011-07-23 Thread Jochen1980
Hi, 

I am looking for a way to run a KS test in my C/C++ application.
Unfortunately I cannot find a library or function in C or C++ that does
the job. For my other numerical work the GNU Scientific Library was
recommended to me. What should I do now?

I have read that there are ways to call R from C++ code. How does that work,
and is it a good option if the test will be called very often and the
program will run on a compute cluster?
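If no suitable C/C++ library turns up, the two-sided one-sample D statistic itself is short enough to port by hand; the p-value computation is more work, so R's ks.test() stays the reference. A sketch in plain R, deliberately using nothing beyond sort() and max() so it translates to C/C++ line by line, and checked against the built-in:

```r
# One-sample Kolmogorov-Smirnov D statistic in portable style.
# D = max distance between the ECDF of x and the theoretical CDF.
ks_statistic <- function(x, cdf) {
  n <- length(x)
  x <- sort(x)
  Fx <- cdf(x)                          # theoretical CDF at sorted sample
  d_plus  <- max((1:n) / n - Fx)        # ECDF jumps above the CDF
  d_minus <- max(Fx - (0:(n - 1)) / n)  # CDF above the ECDF
  max(d_plus, d_minus)
}

set.seed(1)
x <- rnorm(200)
D <- ks_statistic(x, pnorm)
print(D)
```

The result agrees with `ks.test(x, "pnorm")$statistic`, which is a useful sanity check before porting.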

Thanks in advance!

--
View this message in context: 
http://r.789695.n4.nabble.com/Use-ks-test-or-an-alternative-in-C-C-application-tp3688640p3688640.html


Re: [R] ecdf() to nls() - how to transform data?

2011-07-18 Thread Jochen1980
Ok, I think I have a good idea for solving my problem and need someone to
second me on it - or tell me what a fool I am :-) .

My problem: I did curve fitting with nls() and used knots() on the ecdf()
output to collect data for the nls input data frame. I ran into trouble
because the x and y columns of my data.frame() had unequal lengths. I
computed 1000 values, and there were always some ties; tied values are
returned only once by the knots() function.

Okay, I thought about it, and now I use 100 percentile values for the
data frame - nls() works fine, and 100 percentiles are always available, so
no exceptions occur. Is it wise to use 100 cumulative percentiles as the
curve-fitting input instead of the ecdf() data?
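A sketch of the percentile idea, using simulated gamma data as a stand-in for the real values: build the data frame from quantile() output and fit the Gumbel CDF with nls(). This always yields exactly 100 rows, ties or not:

```r
# Fit the Gumbel CDF F(x) = exp(-exp(-(x - mu)/beta)) to 100 empirical
# percentiles. The data and the start values are stand-ins, not the
# poster's estimates.
set.seed(42)
smp <- rgamma(1000, shape = 2)            # stand-in for the real data

probs <- (1:100) / 100                    # 100 cumulative percentiles
df <- data.frame(x = quantile(smp, probs), y = probs)

fit <- nls(y ~ exp(-exp(-(x - mu) / beta)),
           data  = df,
           start = list(mu = median(smp), beta = sd(smp)))
print(coef(fit))
```

One design note: percentiles weight every probability level equally, whereas knots()-based data weights regions with many observations more heavily; for a rough least-squares CDF fit either is defensible.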

Thanks in advance.

--
View this message in context: 
http://r.789695.n4.nabble.com/ecdf-to-nls-how-to-transform-data-tp3671754p3676932.html


Re: [R] ecdf() to nls() - how to transform data?

2011-07-17 Thread Jochen1980
Thanks David and Peter!

I tried to improve my R script to get closer to my goal.
I guess I have to use nls(), because later I want to work with the
Levenberg-Marquardt algorithm, and if I understand correctly the LM
algorithm uses least squares as well, whereas fitdistr() uses maximum
likelihood. Anyway, I am wondering how to create the appropriate data.frame
for nls()? In my opinion all the data is there:
I have the absolute numbers in my dataset from the internet.
The histogram/density function is right, as you can see in graph
number 3 after running the following example.
The cumulative density is right as well - so how do I find x and y for the
data frame in line 116?
Is it necessary, or easier, to get this example working with absolute numbers?
The two easy nls() tutorials worked out as well, so it can't be that hard, I
guess, but I have been sitting here for two days and have no idea what to do
now!? Thanks in advance for a little more input, guys.


#
### R script: example fit ###
#

cat( " *\n" )
cat( "  Script curvefit - from histogram to function\n" )
cat( "\n" )

# Preparation
setwd( "/home/joba/jobascripts/rproject/" )
rm( list=ls( all=TRUE ) )

# Organise X11: 2 rows, 2 graphs each
par( mfrow=c( 2, 2 ) )

### tutorial 1 

# Build data.frame()
x <- seq( 0.0, 0.30, length=10 )                                      # seq( from, to, length )
y <- c( 0.04, 0.06, 0.12, 0.14, 0.15, 0.16, 0.18, 0.19, 0.20, 0.26 ) # values of target function
df <- data.frame( x=x, y=y )

#print( df )
plot( df, main="Tutorial 1" )

# Formula: y = x^a, look for a
res <- nls( y ~ I(x^exponent), data = df, start = list( exponent = 1 ), trace = TRUE )
print( round( summary(res)$coefficients[1], 3 ))
lines( x, predict( res, list( x = x ) ), lty = 1, col = "blue" )

# Formula: y = x^a + b, look for a and b
res2 <- nls( y ~ I( x^exponent + gamma ), data = df,
             start = list( exponent=1, gamma=0 ), trace = TRUE )
print( round( summary(res2)$coefficients[1], 3 ))
lines( x, predict( res2, list( x = x ) ), lty = 1, col = "red" )


### tutorial 2 

len <- 24
x <- runif( len )                # 24 random values between 0 and 1
y <- x^3 + rnorm( len, 0, 0.06 ) # compute y-values and add some noise
ds <- data.frame( x = x, y = y ) # build data.frame
#str( ds )

# Compute prediction
s <- seq( 0, 1, length = 100 )   # grid on which to evaluate the fitted curve
m <- nls( y ~ I(x^power), data = ds, start = list( power = 1 ), trace = TRUE )
#print( class(m) )
print( summary(m) )

# Plot fitted curve and known curve
power <- round( summary(m)$coefficients[1], 3 )    # get power from summary
power.se <- round( summary(m)$coefficients[2], 3 ) # get standard error from summary
plot( y ~ x, main = "Tutorial 2", sub = "Blue: fit; green: known" )
lines( s, s^3, lty = 2, col = "green" )                       # known real curve
lines( s, predict( m, list( x = s )), lty = 1, col = "blue" ) # predicted values
text( 0, 0.5, paste( "y = x^(", power, " +/- ", power.se, ")", sep="" ), pos=4 )

### real data #
# --- Create points ---

# Real data and density
hgd <- read.csv( file="http://www.jochen-bauer.net/downloads/curve-fitting-tutorial/data01.txt",
                 sep=",", head=FALSE )
#print( hgd$V1 )
print( summary( hgd$V1 ) )
denspoints <- density( hgd$V1 )

# Define variables for later use
msa <- "MSA-Name"
c1 <- "col1"
c2 <- "col2"
histtitle <- paste( msa, "Columns:", c1, ",", c2 )

#
# --- Plot ---

# Organise X11: 1 row, 2 graphs
#par( mfrow=c( 1, 2 ) )

# Histogram
histo <- hist( hgd$V1, freq=FALSE, main=histtitle, xlab="U-Value", ylab="Density" )

# Density line
lines( denspoints, col="green", lwd=1 )

# Plot cumulatives
plot( ecdf( hgd$V1 ), col="green", lwd=1, main="", xlab="U-Value",
      ylab="Probability" ) # real data

#
# --- Fit ---

# Matching Gumbel distribution, as a tip from an expert

# Gumbel distribution function, cumulative:
# F(x) = exp( - exp( - ( x - mue ) / beta ) )  # Thanks to Peter Dalgaard :-)

# Starting guesses: mue = median of data, beta = sd of data
mueStart <- median( hgd$V1 )
betaStart <- sd( hgd$V1 )
print( paste( "mueStart: ", mueStart ))
print( paste( "betaStart: ", betaStart ))

# Build data.frame()
# x = min of uvals to max of uvals
# y = probabilities, which add to 1; normalized histograms could help?

### XXX # X #
x <- seq( 0.0, 0.30, length=10 )                                      # seq( from, to, length )
y <- c( 0.04, 0.06, 0.12, 0.14, 0.15, 0.16, 0.18, 0.19, 0.20, 0.26 ) # values of target function
df <- data.frame( x=x, y=y )
print( df )

# Formula: Gumbel CDF, look for mue and beta
res <- nls( y ~ exp( - exp( - ( x - mue ) / beta )), data = df, start = list(

Re: [R] ecdf() to nls() - how to transform data?

2011-07-17 Thread Jochen1980
Hi David,

my first attempt to work through your code was successful; my predicted line
is pretty close to the ecdf function. Though I have no idea why you inverted
the Gumbel function and what the advantage of that strategy is.

I interpret your (1:100)/100 trick like this: you build a sequence of 100
tick values, and in addition 100 is identical to the number of created points.

In my real-world situation, ecdf() will produce ties (equal values), so I do
not know in advance how many values knots() will return, and this then leads
to a data frame with unequal column lengths and an error.
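One tie-proof alternative to pairing knots() with (1:100)/100 is to evaluate the ecdf at its own knots, so x and y always have the same length. A sketch with deliberately tied data (rounding a simulated sample forces ties):

```r
# Build the nls() input from the ecdf itself, so ties do not matter:
# evaluating the ecdf at its own knots gives exactly one y per unique x.
set.seed(7)
p <- round(rgamma(100, 2), 1)   # rounding to 1 digit forces ties
ep <- ecdf(p)

x <- knots(ep)                  # unique sorted values: fewer than 100 here
y <- ep(x)                      # matching cumulative probabilities
df <- data.frame(x = x, y = y)  # always well-formed, whatever the ties
print(nrow(df))
```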

R-Code:
### Tutorial 3, gamma-distribution points fitted to a Gumbel CDF

# --- create points --- #
# http://r.789695.n4.nabble.com/ecdf-to-nls-how-to-transform-data-td3671754.html
print( "Tutorial 3" )
p <- rgamma( 100, 2 )
print( p )
ep <- ecdf( p )
x <- knots( ep )
hist( p, main="Tutorial 3" )
ypre <- seq( from=1, to=100, by=1 )
y <- ypre / 100
df <- data.frame( x=x, y=y )
plot( df )
res <- nls( y ~ exp( - exp( - ( x - mue ) / beta )),
            start=list( mue=2, beta=2 ), data=df, trace=TRUE )
print( summary(res) )
#print( str(res))
lines( x, predict( res, list( x=x )), lty = 1, col = "blue" ) # predicted values

Console-Output:
[1] Tutorial 3
  [1] 4.75748951 5.36642043 2.03025702 1.80216440 1.40745277 0.91393251
  [7] 1.30575505 2.07451301 1.18500815 0.24972503 1.36604865 2.36786796
 [13] 1.37386798 1.72390843 1.93700139 0.78722468 2.92828385 1.51165415
 [19] 4.17000960 2.98356424 1.20195996 1.85899533 0.94863757 1.50438053
 [25] 0.50854274 3.32760075 2.21334316 0.58463817 2.26985676 2.88033389
 [31] 2.00718903 1.19043470 1.62945752 0.32010478 0.54837755 1.32965131
 [37] 3.32024573 0.53825226 3.30077557 0.46426513 1.00681720 0.59276030
 [43] 1.31431609 2.27822419 2.56404713 0.52459218 1.05228996 2.23799673
 [49] 1.14438962 2.41311612 1.40254244 1.20379073 0.44195457 0.95880408
 [55] 1.45254027 4.49645001 2.18826796 1.07161515 1.62617544 3.15506003
 [61] 2.15611491 2.43261017 1.40293461 2.79977886 3.44503201 3.25282551
 [67] 0.91570531 0.70243512 5.67971774 3.55532192 1.71515046 2.97718949
 [73] 0.47145131 0.32879089 0.60735231 0.92388638 4.29277857 1.73681839
 [79] 0.09387383 2.81281295 1.84419058 2.63070353 1.52298124 2.89761263
 [85] 1.05251987 1.24258703 3.09130229 2.66738574 3.17035060 2.15287327
 [91] 2.63042751 1.69736779 3.21126033 1.56920999 2.68985477 0.82952068
 [97] 3.62230865 1.65286592 2.76798783 2.08091935

Formula: y ~ exp(-exp(-(x - mue)/beta))

Parameters:
     Estimate Std. Error t value Pr(>|t|)
mue  1.382744   0.005926   233.3   <2e-16 ***
beta 0.984836   0.008816   111.7   <2e-16 ***




--
View this message in context: 
http://r.789695.n4.nabble.com/ecdf-to-nls-how-to-transform-data-tp3671754p3674227.html


[R] ecdf() to nls() - how to transform data?

2011-07-16 Thread Jochen1980
Hi,

I am using the ecdf function and want to use the ecdf() data points as the
data argument for nls().
nls() expects a 'list' or 'environment' as the data type, but
knots(ecdf(mydata)) gives me 'numeric'.
What should I do now?

Thanks in advance - Jochen

Here is the code:
#
# --- Fit ---
# Gumbel distribution function, cumulative,
# http://en.wikipedia.org/wiki/Gumbel_distribution
# F(x) = exp( - exp( - ( x - mue ) / beta ) )
# formula for the fitting function:
# formula: y ~ exp( - exp( - ( x - mue ) / beta ) )
# data: ecdf( hgd$V1 )
# start: list( mue=0.1, beta=0.025 )
gpfunction <- ecdf( hgd$V1 )
gplistrange <- seq( 0.0, 1, by=0.001 )
gplist <- gpfunction( gplistrange )
print( gplist )
print( class( gplist ) )
print( "---" )
#res <- nls( y ~ exp( - exp( - ( x - mue ) / beta ) ), data=df,
#            start=list( mue=0.1, beta=0.025 ), trace=TRUE )
#print( summary( res ) )
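A sketch of the missing step, with simulated Gumbel data (mue = 0.1 and beta = 0.025, the start values from the commented-out call, reused here as the true parameters): wrap the knots and the ecdf values in a data.frame, which nls() accepts as its data argument because a data frame is a list:

```r
# From numeric knots() output to an nls()-ready data.frame.
set.seed(123)
mue <- 0.1; beta <- 0.025
smp <- mue - beta * log(-log(runif(500)))   # Gumbel(mue, beta) sample

Fn <- ecdf(smp)
df <- data.frame(x = knots(Fn), y = Fn(knots(Fn)))  # a data.frame IS a list

res <- nls(y ~ exp(-exp(-(x - m) / b)), data = df,
           start = list(m = median(smp), b = sd(smp)))
print(coef(res))
```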

--
View this message in context: 
http://r.789695.n4.nabble.com/ecdf-to-nls-how-to-transform-data-tp3671754p3671754.html


Re: [R] Add a density line to a cumulative histogram - second try

2011-07-15 Thread Jochen1980
Thanks, I found the function ecdf(), which does the job.

plot( ecdf( nvtpoints ), col="blue", lwd=1, add=TRUE )


--
View this message in context: 
http://r.789695.n4.nabble.com/Add-a-density-line-to-a-cumulative-histogram-second-try-tp3666969p3669310.html


[R] Add a density line to a cumulative histogram - second try

2011-07-14 Thread Jochen1980
Hi list,

this is my second try at a first post on this list (I tried to post via
email and nothing appeared in my inbox, so now I am using the Nabble web
interface) - I hope you only have to read one post in your inbox! Okay, my
question ...

I was able to plot a histogram and add the density() line to this plot.
I was able to plot a cumulative form of this histogram.
However, I was not able to add a density line to the cumulative histogram.

You can watch a picture of the histograms here:
http://www.jochen-bauer.net/downloads/histo-cumulative-density.png

Source:

# Histogram
histo <- hist( hgd$V1, freq=FALSE )

# Draw the density estimate
denspoints <- density( hgd$V1 )
lines( denspoints, col="green", lwd=2 )

# Cumulative distribution of relative frequencies
relcum <- cumsum( histo$counts ) / sum( histo$counts )
barplot( relcum, names.arg=round( histo$mids, 2 ), col="green",
         ylab="Probability" )

# Cumulative density function?
Thanks in advance - Jochen

--
View this message in context: 
http://r.789695.n4.nabble.com/Add-a-density-line-to-a-cumulative-histogram-second-try-tp3666969p3666969.html