Re: [R] identify the distribution of the data

2023-09-27 Thread Bogdan Tanasa
Dear all,

Thank you for your insights, suggestions and for sharing your knowledge. I
have found the package fitdistrplus to meet our needs.

Warm regards,

Bogdan

On Wed, Feb 8, 2023 at 11:10 PM PIKAL Petr  wrote:

> Hi
>
> Others gave you more fundamental answers. To check the possible
> distribution
> you could use package
>
> https://cran.r-project.org/web/packages/fitdistrplus/index.html
>
> Cheers
> Petr

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] identify the distribution of the data

2023-02-09 Thread Richard O'Keefe
fitdistrplus is a great package.
But the documentation for the fitdist function makes
something very clear:
  fitdist(data, distr, ...)
distr [is] A character string "name" naming a distribution
  for which the corresponding density function dname,
  the corresponding distribution function pname and the
  corresponding quantile function qname must be defined,
  or directly the density function.

That is, it fits the *parameters* of a distribution,
it does not infer the *form* of the distribution.
If you tell it 'distr = "norm"' it will fit a mean and
standard deviation.  It will not pick distr = "norm" by
itself, which is what I think the OP wanted.

descdist from that package will *help*, but its advice
is not infallible, and it considers a limited range of
distributions.  (It doesn't deal with circular ones,
for example.)
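
A minimal sketch of that distinction, assuming fitdistrplus is installed. The sample x and the shortlist of candidate families are made up for illustration, and comparing the fits by AIC is only one possible way to choose among them; it is not something fitdist does for you.

```r
# fitdist() estimates parameters for a family you name; choosing between
# families is a separate step, done here by comparing AIC across fits.
set.seed(42)
x <- rlnorm(200, meanlog = 1, sdlog = 0.5)   # hypothetical positive sample

candidates <- c("norm", "lnorm", "gamma", "weibull")   # assumed shortlist

if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  # Cullen and Frey graph: skewness and kurtosis of x against common families
  fitdistrplus::descdist(x, boot = 100)

  # One maximum-likelihood fit per candidate family
  fits <- lapply(candidates, function(d) fitdistrplus::fitdist(x, d))
  names(fits) <- candidates

  # Lower AIC means a better trade-off between fit and number of parameters
  aic <- vapply(fits, function(f) f$aic, numeric(1))
  print(sort(aic))

  # Parameter estimates for the family with the lowest AIC
  print(fits[[names(which.min(aic))]]$estimate)
}
```

The ranking only covers the families placed in candidates, so a distribution that was never listed can never be selected.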

On Thu, 9 Feb 2023 at 20:10, PIKAL Petr  wrote:

> Hi
>
> Others gave you more fundamental answers. To check the possible
> distribution
> you could use package
>
> https://cran.r-project.org/web/packages/fitdistrplus/index.html
>
> Cheers
> Petr


Re: [R] identify the distribution of the data

2023-02-08 Thread PIKAL Petr
Hi

Others gave you more fundamental answers. To check the possible distribution
you could use package

https://cran.r-project.org/web/packages/fitdistrplus/index.html

Cheers
Petr

> -Original Message-
> From: R-help  On Behalf Of Bogdan Tanasa
> Sent: Wednesday, February 8, 2023 5:35 PM
> To: r-help 
> Subject: [R] identify the distribution of the data
> 
> Dear all,
> 
> I do have dataframes with numerical values such as 1,9, 20, 51, 100 etc
> 
> Which way do you recommend to use in order to identify the type of the
> distribution of the data (normal, poisson, bernoulli, exponential,
> log-normal etc ..)
> 
> Thanks so much,
> 
> Bogdan


Re: [R] identify the distribution of the data

2023-02-08 Thread Spencer Graves




On 2/8/23 12:06 PM, Ebert,Timothy Aaron wrote:

IMO) The best approach is to develop a good understanding of the individual 
processes that resulted in the observed values. The blend of those processes 
then results in the distribution of the observed values. This is seldom done, 
and often not possible to do. The alternatives depend on why you are doing this.

0) Sometimes the nature of the data suggests a distribution. You list integer 
values. If all observations are integers (counts, for example) then Poisson may 
be appropriate. With only two possible values, maybe the binomial distribution. 
Continuous data might be normally distributed (Gaussian distribution). If I 
roll one six-sided die many times I will have a uniform distribution (assuming 
a fair die). I could then try the same task but roll 2 dice and add the result. 
I still have discrete values, but the shape is closer to Gaussian. The 
distribution looks more and more Gaussian as I add more dice together in each 
roll.



	  I concur:  The application will often suggest a distribution, e.g., 
Poisson, binomial or negative binomial for nonnegative integers, Weibull 
for lifetime data, etc.



	  I love normal probability plots -- the qqnorm function.  This can 
identify outliers or multimodality or the need for a transformation. 
Continuous data that are always positive are often log-normal -- or a 
mixture of log-normals.



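x <- rnorm(100)   # standard normal sample
X <- exp(x)       # exponentiate, so X is log-normal
# normal Q-Q plot with the data on a log-scaled x axis; a log-normal sample lies near a straight line
qqnorm(X, datax=TRUE, log='x')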
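
As a rough illustration of the multimodality point, here is a sketch with two made-up samples: a single log-normal and an equal mixture of two log-normals. The parameter values are arbitrary, and the quantile correlation at the end is just a convenient numeric companion to the plots, not a formal test.

```r
set.seed(7)

# Hypothetical samples: one log-normal, and an equal mixture of two log-normals
single <- rlnorm(300, meanlog = 1, sdlog = 0.3)
mix <- c(rlnorm(150, meanlog = 0, sdlog = 0.3),
         rlnorm(150, meanlog = 2, sdlog = 0.3))

# Normal Q-Q plots of the logs: a straight line for the single log-normal,
# a step-like S shape for the mixture
op <- par(mfrow = c(1, 2))
qqnorm(log(single), main = "single log-normal"); qqline(log(single))
qqnorm(log(mix), main = "mixture of two"); qqline(log(mix))
par(op)

# Correlation between sample quantiles and normal quantiles
qq_cor <- function(v) {
  q <- qqnorm(v, plot.it = FALSE)
  cor(q$x, q$y)
}
qq_cors <- c(single = qq_cor(log(single)), mixture = qq_cor(log(mix)))
print(qq_cors)
```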


	  The central limit theorem says that the distribution of almost any 
sum of random variables will be more nearly normal than the 
distributions of individual summands.  It also says that almost any 
product of positive random variables will be more nearly log-normal than 
the distributions of individual components of the product.  This 
application to products is less well known and occasionally controversial.



https://en.wikipedia.org/wiki/Gibrat%27s_law
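
A small simulation sketch of both statements, using the dice from the quoted text for sums. The choice of factors uniform on [0.5, 1.5], the number of factors, and the helper names are arbitrary assumptions made only for this illustration.

```r
set.seed(1)
n <- 10000

# Sample skewness: 0 for a symmetric distribution such as the normal
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Sum of k fair six-sided dice, repeated n times
dice_sum <- function(k) {
  rowSums(matrix(sample(1:6, n * k, replace = TRUE), nrow = n))
}

# Product of k positive factors, each uniform on [0.5, 1.5]
pos_prod <- function(k) {
  apply(matrix(runif(n * k, 0.5, 1.5), nrow = n), 1, prod)
}

p <- pos_prod(20)
skew_raw <- skewness(p)        # the product itself is strongly right-skewed
skew_log <- skewness(log(p))   # its log is a sum of logs, so nearly symmetric
print(c(product = skew_raw, log_product = skew_log))

op <- par(mfrow = c(2, 2))
hist(dice_sum(1), breaks = 0:6 + 0.5, main = "1 die", xlab = "")
hist(dice_sum(10), main = "Sum of 10 dice", xlab = "")
qqnorm(p, main = "Product of 20 factors"); qqline(p)
qqnorm(log(p), main = "log(product)"); qqline(log(p))
par(op)
```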


  Spencer Graves

  



Re: [R] identify the distribution of the data

2023-02-08 Thread Ebert,Timothy Aaron
IMO) The best approach is to develop a good understanding of the individual 
processes that resulted in the observed values. The blend of those processes 
then results in the distribution of the observed values. This is seldom done, 
and often not possible to do. The alternatives depend on why you are doing 
this. 

0) Sometimes the nature of the data suggests a distribution. You list integer 
values. If all observations are integers (counts, for example) then Poisson may 
be appropriate. With only two possible values, maybe the binomial distribution. 
Continuous data might be normally distributed (Gaussian distribution). If I 
roll one six-sided die many times I will have a uniform distribution (assuming 
a fair die). I could then try the same task but roll 2 dice and add the result. 
I still have discrete values, but the shape is closer to Gaussian. The 
distribution looks more and more Gaussian as I add more dice together in each 
roll. 
 
1) Try a simulation. Draw 5 values from a normal distribution and make a 
histogram. Then do it again. Is it easy to see that both samples are from the 
same distribution? For me, the answer is no. So increase the sample size 
until you are confident that any two draws come from the same distribution. 
At 1 million, I think most people would not be able to detect any difference 
between the two histograms. This exercise calibrates your judgment. How 
does your sample size compare to your choice in this exercise?

2) Given that you have sufficient data (see above), can you see the 
distribution in your data? Is that good enough?

3) Are you doing this as part of following the assumptions of statistical 
models? In such tests for normality, we tend to assume that a failure to reject 
the null hypothesis is sufficient proof that the null hypothesis is true. 
However, in most other cases we are told that a failure to reject the null 
hypothesis is not sufficient to prove the null hypothesis. You need to work 
this out for yourself: there is a large body of literature on the importance, 
consequences, and alternatives of testing model assumptions, with (sometimes) 
widely divergent viewpoints. 

4) There are hundreds of distributions (see 
https://cran.r-project.org/web/views/Distributions.html), but the common 
distributions are listed on sites like this one:  
https://www.stat.umn.edu/geyer/old/5101/rlook.html. Given so many choices, you 
can probably find one that will fit your data reasonably well. The reliability 
of that answer depends on how many data points you have. Is 
that really informative to the problem you are trying to solve? Answering "what 
distribution do these data follow?" is not usually the goal.
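
Points 1) and 3) above can be tried with a rough sketch like the following. The sample sizes, the use of exponential data as the "clearly not normal" case, the Shapiro-Wilk test, and the 5% level are all arbitrary choices for illustration.

```r
set.seed(123)

# Point 1: two independent draws from the same N(0, 1) at each sample size.
# Top row is draw A, bottom row is draw B.
sizes <- c(5, 100, 10000)
op <- par(mfrow = c(2, 3))
for (n in sizes) hist(rnorm(n), main = paste("draw A, n =", n), xlab = "")
for (n in sizes) hist(rnorm(n), main = paste("draw B, n =", n), xlab = "")
par(op)

# Point 3: how often does a normality test reject data that are not normal?
# Exponential samples, Shapiro-Wilk test at the 5% level, 1000 replicates.
reject_rate <- function(n, reps = 1000) {
  mean(replicate(reps, shapiro.test(rexp(n))$p.value < 0.05))
}
rates <- sapply(c(5, 20, 100), reject_rate)
names(rates) <- c("n=5", "n=20", "n=100")
print(rates)
```

With very small samples the test usually fails to reject even though the data are exponential, which is why a non-rejection says little about whether the normal assumption holds.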

Regards,
Tim
 

-Original Message-
From: R-help  On Behalf Of Bert Gunter
Sent: Wednesday, February 8, 2023 12:00 PM
To: Bogdan Tanasa 
Cc: r-help 
Subject: Re: [R] identify the distribution of the data

1. This is a statistical question, which usually is inappropriate here:
this list is about R language (including packages) programming.

2. IMO (so others may disagree), your question indicates a profound 
misunderstanding of basic statistical issues. Maybe you phrased it poorly 
or I misunderstand, but "identify the type of distribution" is basically a 
meaningless query. Explaining why this is so and what may be more meaningful 
would require a deep dive into statistics. You might try referencing a basic 
statistical text and/or online tutorials. Try searching on "Goodness of fit", 
"statistical modeling" or the like.

Cheers,
Bert


Re: [R] identify the distribution of the data

2023-02-08 Thread Bert Gunter
1. This is a statistical question, which usually is inappropriate here:
this list is about R language (including packages) programming.

2. IMO (so others may disagree), your question indicates a profound
misunderstanding of basic statistical issues. Maybe you phrased it
poorly or I misunderstand, but "identify the type of distribution" is
basically a meaningless query. Explaining why this is so and what may be
more meaningful would require a deep dive into statistics. You might try
referencing a basic statistical text and/or online tutorials. Try searching
on "Goodness of fit", "statistical modeling" or the like.

Cheers,
Bert

On Wed, Feb 8, 2023 at 8:35 AM Bogdan Tanasa  wrote:

> Dear all,
>
> I do have dataframes with numerical values such as 1,9, 20, 51, 100 etc
>
> Which way do you recommend to use in order to identify the type of the
> distribution of the data (normal, poisson, bernoulli, exponential,
> log-normal etc ..)
>
> Thanks so much,
>
> Bogdan


[R] identify the distribution of the data

2023-02-08 Thread Bogdan Tanasa
Dear all,

I have data frames with numerical values such as 1, 9, 20, 51, 100, etc.

Which approach do you recommend for identifying the type of
distribution of the data (normal, Poisson, Bernoulli, exponential,
log-normal, etc.)?

Thanks so much,

Bogdan
