Hi Jim

> If those values represent response times in a system, then when I was
> responsible for characterizing what the system would do from the
> viewpoint of an SLA (service level agreement) with customers using the
> system, we usually specified that "90% of the transactions would have
> a response time of --- or less". This took care of most "long tails".
> So it depends on how you are planning to use this data. We usually
> monitored the 90th or 95th percentile to see how a system was
> operating day to day.
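As a rough illustration of the percentile check described above, here is a minimal sketch in R using the `ml` list posted earlier in this thread. The helper name `pct_summary` is made up for this example; it relies only on `quantile()` and `sapply()` from base R, and the 0.90 and 0.95 levels are the ones mentioned in the quoted text.

```r
# Waiting times as posted earlier in the thread
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# Hypothetical helper: median plus the 90th and 95th percentiles
# (quantile() uses its default interpolation, type 7)
pct_summary <- function(x) quantile(x, probs = c(0.5, 0.9, 0.95))

# One column per period, one row per percentile
pct <- sapply(ml, pct_summary)
print(pct)
```

On these data the 90th and 95th percentiles are higher for y2 (45.8 and 70.4) than for y1 (37.2 and 41.05), which matches the summary quoted further down.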

I get the point. This can be an option. I will discuss it with my colleagues.

Thank you for your time and your answer.

Best regards
Petr

>
> On Thu, Aug 18, 2011 at 8:52 AM, Petr PIKAL <petr.pi...@precheza.cz> wrote:
> > Hello Jim
> >
> > Thank you and see within text.
> >
> > jim holtman <jholt...@gmail.com> wrote on 18.08.2011 14:09:11:
> >
> >> I am not sure why you say that "lapply(ml, mean)" shows (incorrectly)
> >> that the second year has a larger average; it is correct for the data:
> >>
> >> > lapply(ml, my.func)
> >> $y1
> >>     Count      Mean        SD       Min    Median       90%       95%
> >>  18.00000  16.83333  12.42980   4.00000  12.50000  37.20000  41.05000
> >>       Max       Sum
> >>  47.00000 303.00000
> >>
> >> $y2
> >>     Count      Mean        SD       Min    Median       90%       95%
> >>  15.00000  20.06667  25.27694   4.00000  11.00000  45.80000  70.40000
> >>       Max       Sum
> >>  97.00000 301.00000
> >>
> >>
> >> You have a larger "outlier" in the second year that causes the mean to
> >> be higher. The median is lower, but I usually look at the 90th
> >> percentile if I am looking at response time from a system and again
> >> the second year has a higher value.
> >>
> >> So exactly why do you not "trust" your data?
> >
> > Well, I trust them; however, the mean is a "correct" central value only
> > when the data are normally distributed or at least symmetrical. As the
> > values are heavily skewed, I feel that I should not use the mean to
> > compare such sets. Anyway, t.test tells me that there is no difference
> > between y2 and y1.
> >
> > > t.test(ml[[1]], ml[[2]])
> >
> >         Welch Two Sample t-test
> >
> > data:  ml[[1]] and ml[[2]]
> > t = -0.452, df = 19.557, p-value = 0.6563
> > alternative hypothesis: true difference in means is not equal to 0
> > 95 percent confidence interval:
> >  -18.17781  11.71115
> > sample estimates:
> > mean of x mean of y
> >  16.83333  20.06667
> >
> > So based on this I will probably never get a conclusive result, as the
> > sd will be quite high due to the "outliers".
> >
> > When I do
> >
> > plot(ecdf(ml[[2]]))
> > plot(ecdf(ml[[1]]), add=TRUE, col=2)
> >
> > it seems to me that both sets are almost the same and that they differ
> > substantially only in those "outlier" values.
> >
> > If I decrease the small values of y2, e.g.
> >
> > ml[[2]][ml[[2]]<20] <- ml[[2]][ml[[2]]<20]/2
> >
> > I get nearly the same mean
> >
> > lapply(ml, mean)
> > $y1
> > [1] 16.83333
> >
> > $y2
> > [1] 16.1
> >
> > and t.test tells me that there is no difference between those two sets,
> > although I know that most events take half of the time and only a few
> > last longer, so for me such a set is better (we improved performance for
> > most of the time, however there are still scarce events which take a
> > long time).
> >
> > plot(ecdf(ml[[2]]))
> > plot(ecdf(ml[[1]]), add=TRUE, col=2)
> >
> > So the question still stands: what procedure should I use to compare two
> > or more sets with such a long-tailed distribution? Trimmed mean? Median?
> > ...
> >
> > Thanks.
> >
> > Regards
> > Petr
> >
> >>
> >> On Thu, Aug 18, 2011 at 7:49 AM, Petr PIKAL <petr.pi...@precheza.cz> wrote:
> >> > Hello all
> >> >
> >> > I am trying to find a way to compare sets of waiting times during
> >> > different periods. I tried to learn something from queueing theory and
> >> > also used R search. There are plenty of ways, but I need to find the
> >> > easiest, quite simple one.
> >> > Here is a list with actual waiting times.
> >> >
> >> > ml <- structure(list(y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15,
> >> > 18, 36, 5, 24, 15, 40, 10), y2 = c(97, 10, 26, 11, 11, 10, 5,
> >> > 13, 19, 5, 5, 59, 4, 16, 10)), .Names = c("y1", "y2"))
> >> >
> >> > par(mfrow=c(1,2))
> >> > lapply(ml, hist)
> >> >
> >> > shows that there are more long waiting times in the first year
> >> >
> >> > lapply(ml, mean)
> >> >
> >> > shows (incorrectly) that in the second year there is longer average
> >> > waiting time.
> >> >
> >> > lapply(ml, mean)
> >> >
> >> > gives me completely reversed values.
> >> >
> >> > Can you please give me some hints what to use for "correct" and
> >> > "simple" comparison of waiting times in two or more periods.
> >> >
> >> > Thank you
> >> > Petr
> >> >
> >> > ______________________________________________
> >> > R-help@r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >
> >>
> >> --
> >> Jim Holtman
> >> Data Munger Guru
> >>
> >> What is the problem that you are trying to solve?
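
For the trimmed-mean and median options raised in the question above, a minimal sketch in base R on the posted data might look like this. The helper name `robust_summary` and the trimming fraction of 0.1 are assumptions chosen for illustration, not something settled in the thread.

```r
# Waiting times as posted in the original message
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# Hypothetical helper: plain mean next to two location estimates
# that are less sensitive to a long right tail
robust_summary <- function(x, trim = 0.1) {
  c(mean    = mean(x),
    trimmed = mean(x, trim = trim),  # trims a fraction `trim` of observations from each end
    median  = median(x))
}

# One column per period, one row per estimate
res <- sapply(ml, robust_summary)
print(round(res, 2))
```

On the posted data both the 10% trimmed mean and the median come out lower for y2 than for y1, whereas the plain mean is higher for y2. Which of these is appropriate depends on whether the rare long waits are the thing being measured or a nuisance to be discounted.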