[jira] [Updated] (MATH-1307) Create a base class for all RNGs
[ https://issues.apache.org/jira/browse/MATH-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otmar Ertl updated MATH-1307:
-----------------------------
    Attachment: BaseRandomGeneratorFloatGenerationPerformanceTest.java

I have played around with the generation of floats and doubles from ints. On my environment I made the interesting observation that the following code, which avoids the floating-point multiplication,
{code}
public double nextDouble2() {
    final long high = ((long) (nextInt() >>> 6)) << 26;
    final int low = nextInt() >>> 6;
    return Double.longBitsToDouble(0x3ff0000000000000L | high | low) - 1.;
}
{code}
is about 20% faster than
{code}
public double nextDouble() {
    final long high = ((long) (nextInt() >>> 6)) << 26;
    final int low = nextInt() >>> 6;
    return (high | low) * 0x1.0p-52d;
}
{code}
Please see [^BaseRandomGeneratorFloatGenerationPerformanceTest.java], for which I got the following output:
{code}
nextDouble2: Time = 24.267s, Sum = 4.757900840407535E9
nextFloat:   Time = 24.304s, Sum = 1.6777216E7
nextDouble:  Time = 29.654s, Sum = 4.757900840407535E9
nextFloat2:  Time = 24.276s, Sum = 1.6777216E7
{code}

> Create a base class for all RNGs
> --------------------------------
>
>         Key: MATH-1307
>         URL: https://issues.apache.org/jira/browse/MATH-1307
>     Project: Commons Math
>  Issue Type: Improvement
>    Reporter: Gilles
>    Assignee: Gilles
>    Priority: Minor
>      Labels: api, inheritance
>     Fix For: 4.0
>
> Attachments: BaseRandomGenerator.java, BaseRandomGeneratorFloatGenerationPerformanceTest.java
>
>
> I proposed to create a base class which the existing abstract classes {{AbstractRandomGenerator}} and {{BitsStreamGenerator}} will extend.
> This would allow defining {{nextBytes(byte[])}} at the base class level. The code for that method is almost identical in the two hierarchies: they only differ in a call to either {{nextInt()}} or {{next(32)}} respectively; the latter is however the same as the former, in disguise, and is not subject to change given the type of the return value.
> As a corollary, the new base class can be the unique place where to add utilities such as the one proposed in MATH-1306.
> *Update:* {{AbstractRandomGenerator}} and {{BitsStreamGenerator}} are both obsoleted by the class proposed in this report.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
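The identical sums for nextDouble and nextDouble2 in the benchmark output above are expected: for any 52-bit integer m, (1 + m*2^-52) - 1 equals m*2^-52 exactly, since every intermediate value is exactly representable. A standalone check (a sketch; java.util.Random stands in for the generator's nextInt()):

```java
import java.util.Random;

public class NextDoubleEquivalenceSketch {
    // Multiplication variant: scale a 52-bit integer into [0, 1).
    static double viaMultiply(long bits52) {
        return bits52 * 0x1.0p-52d;
    }

    // Bit-manipulation variant: OR the 52 bits into the significand of 1.0
    // (bit pattern 0x3ff0000000000000L), giving a value in [1, 2), then
    // subtract 1.
    static double viaBits(long bits52) {
        return Double.longBitsToDouble(0x3ff0000000000000L | bits52) - 1.0;
    }

    public static void main(String[] args) {
        final Random rng = new Random(0); // stand-in for the generator
        for (int k = 0; k < 1_000_000; k++) {
            final long high = ((long) (rng.nextInt() >>> 6)) << 26;
            final int low = rng.nextInt() >>> 6;
            final long bits52 = high | low;
            if (viaMultiply(bits52) != viaBits(bits52)) {
                throw new AssertionError("mismatch at bits " + bits52);
            }
        }
        System.out.println("both variants agree bit-for-bit");
    }
}
```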
[jira] [Commented] (MATH-1296) DescriptiveStatistics return geometric mean as 0 when product of values is zero, expected to return NaN
[ https://issues.apache.org/jira/browse/MATH-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036575#comment-15036575 ]

Otmar Ertl commented on MATH-1296:
----------------------------------
Gilles, I agree, it is the documentation that needs to be fixed. The geometric mean should be 0. By the way, I have checked the documentation for Kurtosis.java and Skewness.java. The documentation for Kurtosis.java already describes that NaN is returned for fewer than 4 values. The documentation of Skewness.java also needs to be improved; the corresponding information is hidden in the documentation of its evaluate() method.

> DescriptiveStatistics return geometric mean as 0 when product of values is zero, expected to return NaN
> -------------------------------------------------------------------------------------------------------
>
>         Key: MATH-1296
>         URL: https://issues.apache.org/jira/browse/MATH-1296
>     Project: Commons Math
>  Issue Type: Bug
> Affects Versions: 3.4.1
>    Reporter: Anmol
>    Priority: Trivial
>      Labels: easyfix
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> {code}
> @Test
> public void test() {
>     DescriptiveStatistics stats = new DescriptiveStatistics();
>     stats.addValue(1);
>     stats.addValue(2);
>     stats.addValue(4);
>     System.out.println(stats.getGeometricMean()); // prints 2
>     stats.addValue(0);
>     System.out.println(stats.getGeometricMean()); // prints 0, expected NaN as per the documentation
> }
> {code}
> The class in consideration is: org.apache.commons.math3.stat.descriptive.DescriptiveStatistics

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
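The 0 result follows directly from the usual log-based evaluation of the geometric mean (Commons Math's GeometricMean builds on SumOfLogs): Math.log(0.0) is negative infinity, so the mean of the logs is negative infinity, and exp of that is 0. A minimal sketch, not the Commons Math implementation:

```java
public class GeometricMeanSketch {
    // Geometric mean via the mean of logarithms. If any value is 0,
    // Math.log contributes -Infinity and the final exp yields 0.
    static double geometricMean(double... values) {
        double sumOfLogs = 0.0;
        for (double v : values) {
            sumOfLogs += Math.log(v); // Math.log(0.0) == Double.NEGATIVE_INFINITY
        }
        return Math.exp(sumOfLogs / values.length);
    }

    public static void main(String[] args) {
        System.out.println(geometricMean(1, 2, 4));    // ≈ 2.0
        System.out.println(geometricMean(1, 2, 4, 0)); // 0.0
    }
}
```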
[jira] [Resolved] (MATH-1254) Unit test CorrelatedRandomVectorGeneratorTest is flaky
[ https://issues.apache.org/jira/browse/MATH-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1254. -- Resolution: Fixed Fix Version/s: 3.6 4.0 4.0: ad12d97cbb4c35017d55c4105174671f1c38b36a 3.6: 635c35be7446c55740765bb67b8bfa21f1c93de5 > Unit test CorrelatedRandomVectorGeneratorTest is flaky > -- > > Key: MATH-1254 > URL: https://issues.apache.org/jira/browse/MATH-1254 > Project: Commons Math > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Otmar Ertl >Priority: Minor > Fix For: 4.0, 3.6 > > Attachments: seed628.patch > > > CorrelatedRandomVectorGeneratorTest.testSampleWithZeroCovariance sometimes > fails. The failure can be reproduced by setting the seed of the > JDKRandomGenerator in createSampler equal to 628. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1253) Skewness could get more precision from slightly reordered code.
[ https://issues.apache.org/jira/browse/MATH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025190#comment-15025190 ]

Otmar Ertl commented on MATH-1253:
----------------------------------
Provided that there is no overflow, I do not think that my proposal is more accurate. The relative error of a sum of floating-point numbers is the same after scaling the numbers by the same factor.

> Skewness could get more precision from slightly reordered code.
> ---------------------------------------------------------------
>
>         Key: MATH-1253
>         URL: https://issues.apache.org/jira/browse/MATH-1253
>     Project: Commons Math
>  Issue Type: Bug
> Affects Versions: 3.5
>    Reporter: Bill Murphy
>    Priority: Minor
>
> In Skewness.java, approx line 180, there is code like:
> {noformat}
> double accum3 = 0.0;
> for (int i = begin; i < begin + length; i++) {
>     final double d = values[i] - m;
>     accum3 += d * d * d;
> }
> accum3 /= variance * FastMath.sqrt(variance);
> {noformat}
> If the division was moved into the for loop, accum3 would be less likely to overflow to Infinity (or -Infinity). This might allow computation to return a result in a case such as:
> {noformat}
> double[] numArray = { 1.234E11, 1.234E51, 1.234E101, 1.234E151 };
> Skewness skew = new Skewness();
> double sk = skew.evaluate( numArray );
> {noformat}
> Currently, this returns NaN, but I'd prefer it returned approx 1.154700538379252.
> The change I'm proposing would have the code instead read like:
> {noformat}
> double accum3 = 0.0;
> double divisor = variance * FastMath.sqrt(variance);
> for (int i = begin; i < begin + length; i++) {
>     final double d = values[i] - m;
>     accum3 += d * d * d / divisor;
> }
> {noformat}
> Thanks!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
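The overflow behavior under discussion can be reproduced without Commons Math. Note that the proposal as literally written (d * d * d / divisor) can still overflow for the reporter's data, since both the cubes and the divisor themselves exceed Double.MAX_VALUE; the sketch below instead normalizes each deviation by the standard deviation before cubing, which keeps every intermediate finite. This is an illustration of the overflow issue, not the library code:

```java
public class SkewnessOverflowSketch {
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    static double sampleVariance(double[] values, double m) {
        double s = 0;
        for (double v : values) {
            final double d = v - m;
            s += d * d;
        }
        return s / (values.length - 1);
    }

    // Cube-then-divide, as in the current code: for very large inputs the
    // cubes and the divisor both overflow to Infinity, giving NaN.
    static double cubeThenDivide(double[] values, double m, double variance) {
        double accum3 = 0;
        for (double v : values) {
            final double d = v - m;
            accum3 += d * d * d;
        }
        return accum3 / (variance * Math.sqrt(variance));
    }

    // Normalize-then-cube: dividing each deviation by the standard
    // deviation before cubing keeps all intermediates finite.
    static double normalizeThenCube(double[] values, double m, double variance) {
        final double stddev = Math.sqrt(variance);
        double accum3 = 0;
        for (double v : values) {
            final double z = (v - m) / stddev;
            accum3 += z * z * z;
        }
        return accum3;
    }

    public static void main(String[] args) {
        final double[] values = {1.234E11, 1.234E51, 1.234E101, 1.234E151};
        final double m = mean(values);
        final double variance = sampleVariance(values, m);
        System.out.println(cubeThenDivide(values, m, variance));    // NaN
        System.out.println(normalizeThenCube(values, m, variance)); // finite, about 3.0
    }
}
```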
[jira] [Commented] (MATH-1285) Description API ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999300#comment-14999300 ]

Otmar Ertl commented on MATH-1285:
----------------------------------
Added the definition of the distribution to the javadoc.
3.6: 29dfd9bb660cbcfb76aabc80458f9452c991bd35
4.0: 396610625b41f07085c1f43cd6e5e97d69097ba1

> Description API ZipfDistribution
> --------------------------------
>
>         Key: MATH-1285
>         URL: https://issues.apache.org/jira/browse/MATH-1285
>     Project: Commons Math
>  Issue Type: Improvement
> Environment: All
>    Reporter: Pim van der Hoorn
>    Priority: Minor
>      Labels: easyfix
>     Fix For: 4.0, 3.6
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The description of the constructors for the ZipfDistribution class specifies an argument of type double, called exponent. The documentation refers to a MathWorld page where the distribution is defined. However, there the exponent p is the exponent of the cdf, not of the pdf (which is then p + 1). After running some tests I found that the exponent in the constructor is the one for the pdf. This can create confusion among users. Therefore I recommend specifying in the documentation which exponent is meant.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Resolved] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1269. -- Resolution: Fixed Fix Version/s: 3.6 4.0 > FastMath.exp may return NaN for non-NaN arguments > - > > Key: MATH-1269 > URL: https://issues.apache.org/jira/browse/MATH-1269 > Project: Commons Math > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1269-fix-tempC-infinity.patch, MATH-1269.patch, > MATH-1269.patch, MATH-1269_fix_z.patch > > > I have observed that FastMath.exp(709.8125) returns NaN. However, the > exponential function must never return NaN (if the argument is not NaN). The > result must always be non-negative or positive infinity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992406#comment-14992406 ] Otmar Ertl commented on MATH-1269: -- fixed 3.6: 8aecb842d32464a98eaab722da84902f8126957b 4.0: a94ff90ab6cd2d92ccb2eb1fd7913b4e5256f02b > FastMath.exp may return NaN for non-NaN arguments > - > > Key: MATH-1269 > URL: https://issues.apache.org/jira/browse/MATH-1269 > Project: Commons Math > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1269-fix-tempC-infinity.patch, MATH-1269.patch, > MATH-1269.patch, MATH-1269_fix_z.patch > > > I have observed that FastMath.exp(709.8125) returns NaN. However, the > exponential function must never return NaN (if the argument is not NaN). The > result must always be non-negative or positive infinity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1285) Description API ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977040#comment-14977040 ]

Otmar Ertl commented on MATH-1285:
----------------------------------
Thanks for reporting this. I have changed the reference to Wikipedia's article about Zipf's law. See https://en.wikipedia.org/wiki/Zipf's_law.
3.6: 5f03f0d702574a9c5655778ee28fe3e23d86a9b0
4.0: 8ed2209b1f8e2452d71ef8c3149f3ed3d89d4dfa

> Description API ZipfDistribution
> --------------------------------
>
>         Key: MATH-1285
>         URL: https://issues.apache.org/jira/browse/MATH-1285
>     Project: Commons Math
>  Issue Type: Improvement
> Environment: All
>    Reporter: Pim van der Hoorn
>    Priority: Minor
>      Labels: easyfix
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The description of the constructors for the ZipfDistribution class specifies an argument of type double, called exponent. The documentation refers to a MathWorld page where the distribution is defined. However, there the exponent p is the exponent of the cdf, not of the pdf (which is then p + 1). After running some tests I found that the exponent in the constructor is the one for the pdf. This can create confusion among users. Therefore I recommend specifying in the documentation which exponent is meant.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Resolved] (MATH-1285) Description API ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otmar Ertl resolved MATH-1285.
------------------------------
    Resolution: Fixed
 Fix Version/s: 3.6
                4.0

> Description API ZipfDistribution
> --------------------------------
>
>         Key: MATH-1285
>         URL: https://issues.apache.org/jira/browse/MATH-1285
>     Project: Commons Math
>  Issue Type: Improvement
> Environment: All
>    Reporter: Pim van der Hoorn
>    Priority: Minor
>      Labels: easyfix
>     Fix For: 4.0, 3.6
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The description of the constructors for the ZipfDistribution class specifies an argument of type double, called exponent. The documentation refers to a MathWorld page where the distribution is defined. However, there the exponent p is the exponent of the cdf, not of the pdf (which is then p + 1). After running some tests I found that the exponent in the constructor is the one for the pdf. This can create confusion among users. Therefore I recommend specifying in the documentation which exponent is meant.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Reopened] (MATH-1285) Description API ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otmar Ertl reopened MATH-1285:
------------------------------
Reopened the issue because the javadoc needs to be further improved.

> Description API ZipfDistribution
> --------------------------------
>
>         Key: MATH-1285
>         URL: https://issues.apache.org/jira/browse/MATH-1285
>     Project: Commons Math
>  Issue Type: Improvement
> Environment: All
>    Reporter: Pim van der Hoorn
>    Priority: Minor
>      Labels: easyfix
>     Fix For: 4.0, 3.6
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The description of the constructors for the ZipfDistribution class specifies an argument of type double, called exponent. The documentation refers to a MathWorld page where the distribution is defined. However, there the exponent p is the exponent of the cdf, not of the pdf (which is then p + 1). After running some tests I found that the exponent in the constructor is the one for the pdf. This can create confusion among users. Therefore I recommend specifying in the documentation which exponent is meant.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otmar Ertl updated MATH-1269:
-----------------------------
    Attachment: MATH-1269-fix-tempC-infinity.patch

One more proposal. The problem is the calculation of (1+z)(tempA+tempB) if tempA is infinite. In this case tempC is also infinite. If z < 0 at the same time, we get NaN for tempC*z + tempB + tempA. The proposed patch returns infinity in the case where tempC is positive infinity. In all other cases tempC*z cannot be negative infinity, and the evaluation of tempC*z + tempB + tempA is not a problem even if tempA or tempB is positive infinity.

> FastMath.exp may return NaN for non-NaN arguments
> -------------------------------------------------
>
>         Key: MATH-1269
>         URL: https://issues.apache.org/jira/browse/MATH-1269
>     Project: Commons Math
>  Issue Type: Bug
> Affects Versions: 4.0
>    Reporter: Otmar Ertl
> Attachments: MATH-1269-fix-tempC-infinity.patch, MATH-1269.patch, MATH-1269.patch, MATH-1269_fix_z.patch
>
> I have observed that FastMath.exp(709.8125) returns NaN. However, the exponential function must never return NaN (if the argument is not NaN). The result must always be non-negative or positive infinity.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
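The NaN mechanism described above is easy to reproduce in isolation: an infinite tempC multiplied by a negative z gives negative infinity, and adding positive infinity to that yields NaN. The concrete values below are arbitrary stand-ins, not taken from FastMath:

```java
public class InfinityNaNSketch {
    // Mirrors the expression tempC*z + tempB + tempA from the discussion.
    static double combine(double tempA, double tempB, double tempC, double z) {
        return tempC * z + tempB + tempA;
    }

    public static void main(String[] args) {
        // tempA infinite forces tempC = (1+z)*(tempA+tempB) infinite too;
        // with z < 0, tempC*z is -Infinity and adding +Infinity gives NaN.
        final double result = combine(Double.POSITIVE_INFINITY, 0.0,
                                      Double.POSITIVE_INFINITY, -0.25);
        System.out.println(result); // NaN
    }
}
```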
[jira] [Resolved] (MATH-1277) Incorrect Kendall Tau calc due to data type mismatch
[ https://issues.apache.org/jira/browse/MATH-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otmar Ertl resolved MATH-1277.
------------------------------
    Resolution: Fixed
 Fix Version/s: 3.6
                4.0

> Incorrect Kendall Tau calc due to data type mismatch
> ----------------------------------------------------
>
>         Key: MATH-1277
>         URL: https://issues.apache.org/jira/browse/MATH-1277
>     Project: Commons Math
>  Issue Type: Bug
> Affects Versions: 3.5
>    Reporter: Marc Rosen
>    Priority: Minor
>      Labels: Kendall, correlation, tau
>     Fix For: 4.0, 3.6
>
> The Kendall Tau calculation should return a number from -1.0 to 1.0. Due to a mixing of ints and longs, a mistake occurs on large columns (arrays) passed to the function. An array size of > 50350 triggers the condition in my case, although it may be data dependent. The 3.5 library returns 2.6 as a result (outside of the defined range of Kendall Tau). With the cast to long below, the result returns to its expected value.
> The affected method is commons.math3.stat.correlation.KendallsCorrelation.correlation. Here's the sample code I used; I added the cast to long of swaps:
> {noformat}
> int swaps = 1077126315;
> final long numPairs = sum(50350 - 1);
> long tiedXPairs = 0;
> long tiedXYPairs = 0;
> long tiedYPairs = 0;
>
> final long concordantMinusDiscordant = numPairs - tiedXPairs - tiedYPairs + tiedXYPairs - 2 * (long) swaps;
> final double nonTiedPairsMultiplied = 1.6e18;
> double myTest = concordantMinusDiscordant / FastMath.sqrt(nonTiedPairsMultiplied);
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MATH-1277) Incorrect Kendall Tau calc due to data type mismatch
[ https://issues.apache.org/jira/browse/MATH-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877457#comment-14877457 ]

Otmar Ertl commented on MATH-1277:
----------------------------------
Thanks for reporting! Fixed the bug in the following commits:
* 81ce1b183aa4fb56e7710ebe0274740c268118fa (3.6)
* fb0078159d2463da149de54018fca79a9447153e (4.0)

> Incorrect Kendall Tau calc due to data type mismatch
> ----------------------------------------------------
>
>         Key: MATH-1277
>         URL: https://issues.apache.org/jira/browse/MATH-1277
>     Project: Commons Math
>  Issue Type: Bug
> Affects Versions: 3.5
>    Reporter: Marc Rosen
>    Priority: Minor
>      Labels: Kendall, correlation, tau
>     Fix For: 4.0, 3.6
>
> The Kendall Tau calculation should return a number from -1.0 to 1.0. Due to a mixing of ints and longs, a mistake occurs on large columns (arrays) passed to the function. An array size of > 50350 triggers the condition in my case, although it may be data dependent. The 3.5 library returns 2.6 as a result (outside of the defined range of Kendall Tau). With the cast to long below, the result returns to its expected value.
> The affected method is commons.math3.stat.correlation.KendallsCorrelation.correlation. Here's the sample code I used; I added the cast to long of swaps:
> {noformat}
> int swaps = 1077126315;
> final long numPairs = sum(50350 - 1);
> long tiedXPairs = 0;
> long tiedXYPairs = 0;
> long tiedYPairs = 0;
>
> final long concordantMinusDiscordant = numPairs - tiedXPairs - tiedYPairs + tiedXYPairs - 2 * (long) swaps;
> final double nonTiedPairsMultiplied = 1.6e18;
> double myTest = concordantMinusDiscordant / FastMath.sqrt(nonTiedPairsMultiplied);
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
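The root cause is a plain int overflow: 2 * swaps is evaluated in 32-bit arithmetic before being widened to long, so for the reporter's value the product wraps around to a negative number. A standalone demonstration:

```java
public class KendallOverflowSketch {
    // Without the cast: the multiplication happens in int arithmetic and
    // overflows; only the wrapped result is widened to long.
    static long withoutCast(int swaps) {
        return 2 * swaps;
    }

    // With the cast: swaps is widened first, so the multiplication is
    // carried out in 64-bit arithmetic and cannot overflow here.
    static long withCast(int swaps) {
        return 2 * (long) swaps;
    }

    public static void main(String[] args) {
        final int swaps = 1077126315; // value from the bug report
        System.out.println(withoutCast(swaps)); // -2140714666 (wrapped)
        System.out.println(withCast(swaps));    // 2154252630
    }
}
```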
[jira] [Resolved] (MATH-1276) Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling
[ https://issues.apache.org/jira/browse/MATH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1276. -- Resolution: Fixed Fix Version/s: 3.6 4.0 > Override GeometricDistribution.inverseCumulativeProbability() to speed up > sampling > -- > > Key: MATH-1276 > URL: https://issues.apache.org/jira/browse/MATH-1276 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1276.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1276) Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling
[ https://issues.apache.org/jira/browse/MATH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900035#comment-14900035 ] Otmar Ertl commented on MATH-1276: -- Gilles, thank you for your suggestions. I have split the patch into multiple commits: 2fd6c8fa1e392f985ee860d8ee154a612ddde569 (4.0) df1db29ab401e5fd867f142f949e8feda12604d4 (3.6) > Override GeometricDistribution.inverseCumulativeProbability() to speed up > sampling > -- > > Key: MATH-1276 > URL: https://issues.apache.org/jira/browse/MATH-1276 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1276.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1276) Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling
[ https://issues.apache.org/jira/browse/MATH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1276: - Flags: Patch > Override GeometricDistribution.inverseCumulativeProbability() to speed up > sampling > -- > > Key: MATH-1276 > URL: https://issues.apache.org/jira/browse/MATH-1276 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Attachments: MATH-1276.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MATH-1276) Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling
Otmar Ertl created MATH-1276: Summary: Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling Key: MATH-1276 URL: https://issues.apache.org/jira/browse/MATH-1276 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1276) Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling
[ https://issues.apache.org/jira/browse/MATH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1276: - Attachment: MATH-1276.patch Any objections to the proposed patch? > Override GeometricDistribution.inverseCumulativeProbability() to speed up > sampling > -- > > Key: MATH-1276 > URL: https://issues.apache.org/jira/browse/MATH-1276 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Attachments: MATH-1276.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
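For context, the speedup of such an override comes from the closed-form inverse CDF of the geometric distribution: on the support {0, 1, 2, ...} with success probability p, F(k) = 1 - (1-p)^(k+1), which inverts to k = ceil(log(1-u)/log(1-p)) - 1. The sketch below illustrates that formula; it is not the attached patch:

```java
public class GeometricInverseCdfSketch {
    // Inverse CDF of the geometric distribution on {0, 1, 2, ...} with
    // success probability p: the smallest k with 1 - (1-p)^(k+1) >= u.
    static int inverseCumulativeProbability(double p, double u) {
        if (u <= 0.0) {
            return 0;
        }
        // k = ceil(log(1-u) / log(1-p)) - 1; log1p improves accuracy
        // when p or u is close to 0.
        return (int) Math.ceil(Math.log1p(-u) / Math.log1p(-p)) - 1;
    }

    public static void main(String[] args) {
        // With p = 0.5: F(0) = 0.5 and F(1) = 0.75, so u = 0.6 maps to k = 1.
        System.out.println(inverseCumulativeProbability(0.5, 0.6)); // 1
    }
}
```

Sampling then reduces to applying this function to a uniform variate, which is why overriding inverseCumulativeProbability() avoids the generic numeric root search of the base class.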
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790993#comment-14790993 ]

Otmar Ertl commented on MATH-1246:
----------------------------------
Sounds OK to me; due to its slow convergence, avoiding Monte Carlo as much as possible is generally a good idea.

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>         Key: MATH-1246
>         URL: https://issues.apache.org/jira/browse/MATH-1246
>     Project: Commons Math
>  Issue Type: Bug
>    Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the distribution of a D-statistic for m-n sets with no ties. No warning or special handling is delivered in the presence of ties.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Resolved] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
[ https://issues.apache.org/jira/browse/MATH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1274. -- Resolution: Fixed > Represent Kolmogorov-Smirnov statistic as long value > > > Key: MATH-1274 > URL: https://issues.apache.org/jira/browse/MATH-1274 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1274.patch > > > In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is > internally represented as double value. However, the KS-statistic can be > accurately represented as long value if multiplied by both sample sizes. This > approach saves some floating point operations and avoids difficulties when > comparing two KS-statistics. Currently, Precision.compareTo with some > tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
[ https://issues.apache.org/jira/browse/MATH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790896#comment-14790896 ] Otmar Ertl commented on MATH-1274: -- Applied the patch after renaming the methods to integralXxx as proposed by Phil, thanks! fb7e1e265dd9e560b3a3127a6593b6602f60026c (4.0) 1c9c43c1d4bb76d7e47cdfc9c6681a38305a95e4 (3.6) > Represent Kolmogorov-Smirnov statistic as long value > > > Key: MATH-1274 > URL: https://issues.apache.org/jira/browse/MATH-1274 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1274.patch > > > In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is > internally represented as double value. However, the KS-statistic can be > accurately represented as long value if multiplied by both sample sizes. This > approach saves some floating point operations and avoids difficulties when > comparing two KS-statistics. Currently, Precision.compareTo with some > tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
[ https://issues.apache.org/jira/browse/MATH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1274: - Fix Version/s: 3.6 4.0 > Represent Kolmogorov-Smirnov statistic as long value > > > Key: MATH-1274 > URL: https://issues.apache.org/jira/browse/MATH-1274 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1274.patch > > > In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is > internally represented as double value. However, the KS-statistic can be > accurately represented as long value if multiplied by both sample sizes. This > approach saves some floating point operations and avoids difficulties when > comparing two KS-statistics. Currently, Precision.compareTo with some > tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746924#comment-14746924 ]

Otmar Ertl commented on MATH-1246:
----------------------------------
The p-value is the probability that the observed KS-statistic is smaller than the KS-statistic I get if two random samples of the same sizes are drawn from the underlying distribution. In the no-ties case this value can be calculated exactly without knowing the underlying distribution. In the case of ties, the p-value cannot be calculated exactly. There are different approaches to calculating an approximation of the p-value for the tie case:
* Approximation of the underlying distribution by the observed data, which definitely makes sense for bootstrapping, where the sample sizes are usually large. However, in our case the underlying distribution is estimated from small sample sizes, since this is the domain of the exactP method. Therefore, I doubt that the calculated p-value deserves the label "exact" in this case.
* Assumption that all orderings of observed equal values are equally likely, which of course is also an approximation.
I still do not understand why the first approach should be the true one.

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>         Key: MATH-1246
>         URL: https://issues.apache.org/jira/browse/MATH-1246
>     Project: Commons Math
>  Issue Type: Bug
>    Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the distribution of a D-statistic for m-n sets with no ties. No warning or special handling is delivered in the presence of ties.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
[ https://issues.apache.org/jira/browse/MATH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1274: - Attachment: MATH-1274.patch > Represent Kolmogorov-Smirnov statistic as long value > > > Key: MATH-1274 > URL: https://issues.apache.org/jira/browse/MATH-1274 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Attachments: MATH-1274.patch > > > In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is > internally represented as double value. However, the KS-statistic can be > accurately represented as long value if multiplied by both sample sizes. This > approach saves some floating point operations and avoids difficulties when > comparing two KS-statistics. Currently, Precision.compareTo with some > tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
[ https://issues.apache.org/jira/browse/MATH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1274: - Flags: Patch > Represent Kolmogorov-Smirnov statistic as long value > > > Key: MATH-1274 > URL: https://issues.apache.org/jira/browse/MATH-1274 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Attachments: MATH-1274.patch > > > In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is > internally represented as double value. However, the KS-statistic can be > accurately represented as long value if multiplied by both sample sizes. This > approach saves some floating point operations and avoids difficulties when > comparing two KS-statistics. Currently, Precision.compareTo with some > tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
Otmar Ertl created MATH-1274: Summary: Represent Kolmogorov-Smirnov statistic as long value Key: MATH-1274 URL: https://issues.apache.org/jira/browse/MATH-1274 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is internally represented as double value. However, the KS-statistic can be accurately represented as long value if multiplied by both sample sizes. This approach saves some floating point operations and avoids difficulties when comparing two KS-statistics. Currently, Precision.compareTo with some tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
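The integral representation can be sketched as follows: walk the merged, sorted samples and track n·(number of x seen) − m·(number of y seen); the maximum absolute value of this counter equals D·m·n exactly in long arithmetic. This is an illustrative sketch, not the attached patch, and ties receive no special treatment here:

```java
import java.util.Arrays;

public class IntegralKSSketch {
    // Returns D * m * n as an exact long, where D is the 2-sample
    // Kolmogorov-Smirnov statistic. Each x-step raises the scaled CDF
    // difference by n, each y-step lowers it by m.
    static long integralD(double[] x, double[] y) {
        final double[] xs = x.clone();
        final double[] ys = y.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        final long m = xs.length;
        final long n = ys.length;
        long diff = 0;
        long max = 0;
        int i = 0, j = 0;
        while (i < xs.length && j < ys.length) {
            if (xs[i] <= ys[j]) {
                diff += n;
                i++;
            } else {
                diff -= m;
                j++;
            }
            max = Math.max(max, Math.abs(diff));
        }
        // Once one sample is exhausted, |diff| shrinks monotonically to 0,
        // so the maximum has already been recorded.
        return max;
    }

    public static void main(String[] args) {
        // D = 0.5 for these samples, so the integral statistic is 0.5 * 4 * 4.
        System.out.println(integralD(new double[]{1, 3, 3, 5},
                                     new double[]{2, 4, 4, 6})); // 8
    }
}
```

Comparing two statistics then becomes an exact long comparison, with no need for Precision.compareTo and a tolerance.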
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745495#comment-14745495 ]

Otmar Ertl commented on MATH-1246:
----------------------------------
After some research I have the feeling we are discussing how to define zero divided by zero. There are at least two methods to calculate a reasonable p-value in the presence of ties:
# The method you have proposed, which seems to be also known as the permutation method. Averaging only over some permutations and averaging over all possible permutations correspond to the bootstrap method and the current exactP() implementation, respectively.
# Another method is to add some jitter to the sampled values to break ties. (This google search https://www.google.com/?gfe_rd=cr=qCL4VaKvNIWI8QfLibD4Bg_rd=cr=1#q=jitter+kolmogorov+smirnov immediately gives a couple of references.) This method corresponds to the method I have proposed. Adding small random values to ties to get a strict ordering corresponds to choosing a random ordering. Averaging over all possible orderings would also lead to a well-defined p-value.
Maybe the user should be able to choose the method for resolving ties?

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>         Key: MATH-1246
>         URL: https://issues.apache.org/jira/browse/MATH-1246
>     Project: Commons Math
>  Issue Type: Bug
>    Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the distribution of a D-statistic for m-n sets with no ties. No warning or special handling is delivered in the presence of ties.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
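The jitter approach mentioned in item 2 above can be sketched as follows. The helper and the jitter scale are hypothetical; a real implementation would have to choose the scale relative to the spacing of the data so that only ties are affected:

```java
import java.util.Arrays;
import java.util.Random;

public class JitterSketch {
    // Perturb each observation by a tiny random offset so that ties are
    // broken by a random ordering (assumed helper, not Commons Math API).
    static double[] jitter(double[] sample, Random rng, double scale) {
        final double[] out = new double[sample.length];
        for (int i = 0; i < sample.length; i++) {
            out[i] = sample[i] + (rng.nextDouble() - 0.5) * scale;
        }
        return out;
    }

    public static void main(String[] args) {
        final Random rng = new Random(628);
        final double[] x = {1, 3, 3, 5};
        // After jittering, the tied 3s become distinct with probability 1,
        // so the D-statistic of the jittered samples is well defined.
        System.out.println(Arrays.toString(jitter(x, rng, 1e-9)));
    }
}
```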
[jira] [Comment Edited] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745495#comment-14745495 ] Otmar Ertl edited comment on MATH-1246 at 9/15/15 2:00 PM: ---
After some research I have the feeling we are discussing how to define zero divided by zero. There are at least two methods to calculate a reasonable p-value in the presence of ties:
# The method you have proposed, which also seems to be known as the permutation method. Averaging over only some permutations corresponds to the bootstrap method, while averaging over all possible permutations corresponds to the current exactP() implementation.
# Another method is to add some jitter to the sampled values to break ties. (This google search https://www.google.com/?gfe_rd=cr=qCL4VaKvNIWI8QfLibD4Bg_rd=cr=1#q=jitter+kolmogorov+smirnov immediately gives a couple of references.) This method corresponds to the one I have proposed. Adding small random values to ties to obtain a strict ordering corresponds to choosing a random ordering. Averaging over all possible orderings would also lead to a well-defined p-value.
Maybe the user should be able to choose the method for resolving ties?
> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---
>
> Key: MATH-1246
> URL: https://issues.apache.org/jira/browse/MATH-1246
> Project: Commons Math
> Issue Type: Bug
> Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the
> distribution of a D-statistic for m-n sets with no ties. No warning or
> special handling is delivered in the presence of ties.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
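The jitter idea in method 2 can be sketched as follows. This is an illustrative helper, not a Commons Math API; the class name and the jitter scale are assumptions, and the scale should be chosen small relative to the gaps between distinct sample values so that only ties can change their relative order:

```java
import java.util.Random;

// Sketch of the jitter approach described above: perturb each sample value by
// a random amount small enough that only exact ties can change their relative
// order. All names and the scale parameter are illustrative assumptions.
public class JitterTies {
    public static double[] jitter(double[] sample, double scale, Random rng) {
        double[] out = new double[sample.length];
        for (int i = 0; i < sample.length; i++) {
            // uniform jitter in (-scale/2, scale/2) breaks exact ties
            out[i] = sample[i] + (rng.nextDouble() - 0.5) * scale;
        }
        return out;
    }
}
```

After jittering both samples, the usual D-statistic is well defined, and one jittered run corresponds to one random resolution of the ties.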
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744096#comment-14744096 ] Otmar Ertl commented on MATH-1246: --
I still have doubts:
# It makes a difference whether there are ties only within one sample or whether the same value exists at least once in both samples. In the first case the D-statistic is well defined; in the latter case it is undefined. For example, if x = (1, 3, 3, 5) and y = (2, 4, 4, 6), D = 0.5. On the other hand, if x = (1, 3, 3, 5) and y = (2, 3, 3, 6), the D-statistic could be any value between 0.25 and 0.75. The current implementation returns the minimum (0.25 in this case), but this seems to be a quite arbitrary choice. Furthermore, the implementation does not distinguish between these two cases (see the hasTies() method).
# If the current implementation of exactP() follows the definition you described, I do not really understand why the following two statements return different values:
{code}
new KolmogorovSmirnovTest().exactP(new double[]{0.9, 1.0, 1.1}, new double[]{0.0, 0.0}, false)
new KolmogorovSmirnovTest().exactP(new double[]{1.0, 1.0, 1.0}, new double[]{0.0, 0.0}, false)
{code}
The D-statistic is well-defined and equal to 1 in both cases.
# \[1\] describes estimating the p-value using bootstrapping. I am not sure whether an exact definition can be derived from there, since bootstrapping in general is not a consistent estimation method.
[jira] [Commented] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739384#comment-14739384 ] Otmar Ertl commented on MATH-1269: --
+1 for the 2nd patch. I have run the test over the range (MAX_LONG - 1, MAX_LONG + 1); I think that should be enough to assume that there is a clean transition from finite to infinite.
> FastMath.exp may return NaN for non-NaN arguments
> -
>
> Key: MATH-1269
> URL: https://issues.apache.org/jira/browse/MATH-1269
> Project: Commons Math
> Issue Type: Bug
> Affects Versions: 4.0
> Reporter: Otmar Ertl
> Attachments: MATH-1269.patch, MATH-1269.patch
>
> I have observed that FastMath.exp(709.8125) returns NaN. However, the
> exponential function must never return NaN (if the argument is not NaN). The
> result must always be non-negative or positive infinity.
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739536#comment-14739536 ] Otmar Ertl commented on MATH-1246: --
The Monte Carlo approach can be modified by simultaneously sampling D. Here is an outline of how this sampling could be achieved:
# First, determine the set of points P = (p_i) for which equal values exist in both samples.
# Determine the maximum difference of the CDFs over all values not included in P.
# Determine for each point p_i whether it is possible at all to get a CDF difference that is larger than the calculated maximum. If not, those points can be excluded from P. Otherwise, remember the difference of the CDFs d_i just before that point and the numbers of equal values in both samples, n_i and m_i, respectively.
# Within each Monte Carlo iteration, generate for each point p_i a random ordering of the n_i and m_i equal values (using a function similar to fillBooleanArrayRandomlyWithFixedNumberTrueValues). Determine the maximum differences of the CDFs at all points p_i using the random ordering and d_i, and take the maximum of them and the maximum calculated in 2), which gives us the sampled (observed) D-statistic that is finally compared to curD.
Anyway, we should find the right definition first before implementing anything.
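Step 4 mentions a helper similar to fillBooleanArrayRandomlyWithFixedNumberTrueValues. A generic version of such a helper might look like the following sketch (illustrative names, not the Commons Math internal): it places exactly k true values uniformly at random via a Fisher-Yates shuffle.

```java
import java.util.Random;

// Sketch of a helper like the one referenced above: fill a boolean array of
// length n with exactly k true values, with every arrangement equally likely.
public class RandomBooleanFill {
    public static boolean[] randomFill(int n, int k, Random rng) {
        boolean[] b = new boolean[n];
        for (int i = 0; i < k; i++) {
            b[i] = true; // start with the first k entries true
        }
        // Fisher-Yates shuffle makes every arrangement of the k trues
        // equally likely while preserving their count
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            boolean tmp = b[i];
            b[i] = b[j];
            b[j] = tmp;
        }
        return b;
    }
}
```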
[jira] [Commented] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737181#comment-14737181 ] Otmar Ertl commented on MATH-1269: --
I have an objection. The patch changes the result of FastMath.exp(709.0001) from 8.219229343394329E307 to Double.POSITIVE_INFINITY. Although this kind of error is less harmful than the one caused by the bug, I would prefer a more rigorous solution.
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737500#comment-14737500 ] Otmar Ertl commented on MATH-1246: --
I am thinking of another way to treat ties: The probability that two values sampled from a continuous distribution are equal is zero. One of them is always greater than the other. However, represented as doubles, we cannot distinguish them. Therefore, the best we can do is to treat both cases as equally likely. For example, if we have x = (0, 3, 5) and y = (5, 6, 7), we get two different values for the observed D-statistic. If we assume the value 5 in x to be smaller than that in y, we would get D=3. Otherwise, we would get D=2, both with probability 0.5. In the general case, we can determine a discrete distribution describing all possible values of the observed D-statistic. Finally, we calculate the p-value for each of those possible values and calculate the weighted average, which we take as the final p-value. Does this make sense? If yes, I think there is a way to adapt the new Monte Carlo approach.
[jira] [Commented] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737436#comment-14737436 ] Otmar Ertl commented on MATH-1269: --
In my opinion, simply changing NaNs to infinity can only be a short-term solution. Such a transformation at the end of the function is like admitting that we do not really know how our code works. This is not acceptable for such an essential function as the exponential function.
[jira] [Comment Edited] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737500#comment-14737500 ] Otmar Ertl edited comment on MATH-1246 at 9/9/15 9:19 PM: --
I am thinking of another way to treat ties: The probability that two values sampled from a continuous distribution are equal is zero. One of them is always greater than the other. However, represented as doubles, we cannot distinguish them. Therefore, the best we can do is to treat both cases as equally likely. For example, if we have x = (0, 3, 5) and y = (5, 6, 7), we get two different values for the observed D-statistic. If we assume the value 5 in x to be smaller than that in y, we would get D=1. Otherwise, we would get D=2/3, both with probability 0.5. In the general case, we can determine a discrete distribution describing all possible values of the observed D-statistic. Finally, we calculate the p-value for each of those possible values and calculate the weighted average, which we take as the final p-value. Does this make sense? If yes, I think there is a way to adapt the new Monte Carlo approach.
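The tie-resolution idea can be illustrated for the example x = (0, 3, 5), y = (5, 6, 7): resolving the tie at 5 both ways yields the two possible observed D values, each with probability 0.5. The dStatistic helper below is a plain two-sample ECDF comparison written for this sketch, not the Commons Math implementation:

```java
import java.util.Arrays;

// Illustrative sketch: compute the two-sample D-statistic (maximum absolute
// difference of the two empirical CDFs), then resolve a tie both ways by
// nudging the tied value slightly below or above its counterpart.
public class TieResolution {
    public static double dStatistic(double[] x, double[] y) {
        double[] xs = x.clone(), ys = y.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        double d = 0;
        int i = 0, j = 0;
        while (i < xs.length && j < ys.length) {
            double t = Math.min(xs[i], ys[j]); // next jump point
            while (i < xs.length && xs[i] <= t) i++;
            while (j < ys.length && ys[j] <= t) j++;
            d = Math.max(d, Math.abs((double) i / xs.length - (double) j / ys.length));
        }
        return d;
    }
}
```

For the example, nudging x's 5 just below y's 5 gives D = 1, nudging it just above gives D = 2/3, matching the two cases described in the comment.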
[jira] [Commented] (MATH-1262) Heap overflow in org.apache.commons.math4.special.BesselJ
[ https://issues.apache.org/jira/browse/MATH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734949#comment-14734949 ] Otmar Ertl commented on MATH-1262: --
+1 for updating the Javadoc and adding a note that memory needs are proportional to the order.
> Heap overflow in org.apache.commons.math4.special.BesselJ
> -
>
> Key: MATH-1262
> URL: https://issues.apache.org/jira/browse/MATH-1262
> Project: Commons Math
> Issue Type: Bug
> Affects Versions: 4.0
> Reporter: Neil Walkinshaw
> Priority: Minor
> Labels: newbie, performance
>
> This test case:
> {code:java}
> import org.apache.commons.math4.special.BesselJ;
> public class BesselJHeapBug {
>     public static void main(String[] args) {
>         BesselJ.value(1182054491, 3.589635306407856E-8D);
>     }
> }
> {code}
> throws the following exception:
> {code}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.commons.math4.special.BesselJ.rjBesl(BesselJ.java:247)
>     at org.apache.commons.math4.special.BesselJ.value(BesselJ.java:161)
>     at BesselJHeapBug.main(BesselJHeapBug.java:5)
> {code}
[jira] [Comment Edited] (MATH-1262) Heap overflow in org.apache.commons.math4.special.BesselJ
[ https://issues.apache.org/jira/browse/MATH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721676#comment-14721676 ] Otmar Ertl edited comment on MATH-1262 at 9/6/15 8:46 AM: --
-This [paper|http://www.cl.cam.ac.uk/~jrh13/papers/bessel.pdf] could be interesting.- EDIT: The paper states that the approach could be extended to moderate orders. However, I doubt that the method could handle large orders such as 1182054491 as given in the test case.
[jira] [Commented] (MATH-1262) Heap overflow in org.apache.commons.math4.special.BesselJ
[ https://issues.apache.org/jira/browse/MATH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732386#comment-14732386 ] Otmar Ertl commented on MATH-1262: --
The paper [C. Schwartz, Numerical Calculation of Bessel Functions, 2012|http://arxiv.org/pdf/1209.1547] demonstrates an interesting approach to calculating Bessel functions. Unfortunately, it does not contain an algorithm that can be implemented straightforwardly.
[jira] [Created] (MATH-1269) FastMath.exp may return NaN
Otmar Ertl created MATH-1269:
Summary: FastMath.exp may return NaN
Key: MATH-1269
URL: https://issues.apache.org/jira/browse/MATH-1269
Project: Commons Math
Issue Type: Bug
Affects Versions: 4.0
Reporter: Otmar Ertl
I have observed that FastMath.exp(709.8125) returns NaN. However, the exponential function must never return NaN. The result must always be non-negative or positive infinity.
[jira] [Updated] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1269: -
Summary: FastMath.exp may return NaN for non-NaN arguments (was: FastMath.exp may return NaN)
[jira] [Updated] (MATH-1269) FastMath.exp may return NaN
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1269: -
Description: I have observed that FastMath.exp(709.8125) returns NaN. However, the exponential function must never return NaN (if the argument is not NaN). The result must always be non-negative or positive infinity. (was: I have observed that FastMath.exp(709.8125) returns NaN. However, the exponential function must never return NaN. The result must always be non-negative or positive infinity.)
[jira] [Commented] (MATH-1262) Heap overflow in org.apache.commons.math4.special.BesselJ
[ https://issues.apache.org/jira/browse/MATH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721676#comment-14721676 ] Otmar Ertl commented on MATH-1262: --
This [paper|http://www.cl.cam.ac.uk/~jrh13/papers/bessel.pdf] could be interesting.
[jira] [Commented] (MATH-1262) Heap overflow in org.apache.commons.math4.special.BesselJ
[ https://issues.apache.org/jira/browse/MATH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711484#comment-14711484 ] Otmar Ertl commented on MATH-1262: --
The current implementation allocates a double array with size equal to order+1. Hence, the OutOfMemoryError is not very surprising in this case with order equal to 1182054491. To fix this, we would need to implement a more space-efficient algorithm like the one used in Boost, see https://github.com/mirror/boost/blob/master/boost/math/special_functions/detail/bessel_jn.hpp.
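One way to avoid the O(order) array is a backward (Miller-type) three-term recurrence that keeps only two consecutive values in memory. The sketch below illustrates the idea only; it is not the Commons Math or Boost implementation, the starting offset and rescaling guard are rough assumptions, and handling extreme orders like the one in the test case would need far more care:

```java
// Sketch of an O(1)-memory Bessel J_n(x) via Miller's backward recurrence
// J_{k-1}(x) = (2k/x) J_k(x) - J_{k+1}(x), normalized with the identity
// J_0(x) + 2*sum_{k>=1} J_{2k}(x) = 1. Illustrative only; start offset and
// rescaling threshold are crude assumptions.
public class BesselMiller {
    public static double besselJ(int n, double x) {
        int start = Math.max(n, (int) x) + 50; // crude starting index
        double fkp1 = 0.0, fk = 1e-30;         // arbitrary small seed
        double norm = 0.0, result = 0.0;
        for (int k = start; k >= 1; k--) {
            double fkm1 = (2.0 * k / x) * fk - fkp1;
            fkp1 = fk;
            fk = fkm1;                          // fk now holds f_{k-1}
            if (k - 1 == n) {
                result = fk;                    // remember the value at order n
            }
            if ((k - 1) % 2 == 0 && k - 1 > 0) {
                norm += 2.0 * fk;               // even-order normalization terms
            }
            if (Math.abs(fk) > 1e100) {         // guard against overflow
                fk *= 1e-100; fkp1 *= 1e-100; norm *= 1e-100; result *= 1e-100;
            }
        }
        norm += fk;                             // add the J_0 term
        return result / norm;
    }
}
```

Only two recurrence values are alive at any time, so memory stays constant regardless of the order.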
[jira] [Resolved] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1220. --
Resolution: Fixed
More efficient sample() method for ZipfDistribution
---
Key: MATH-1220
URL: https://issues.apache.org/jira/browse/MATH-1220
Project: Commons Math
Issue Type: Improvement
Reporter: Otmar Ertl
Assignee: Otmar Ertl
Fix For: 4.0, 3.6
Attachments: Zipf Rejection Inversion Sampling Notes.pdf, patch_v1, patch_v2, patch_v3
Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time.
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709928#comment-14709928 ] Otmar Ertl commented on MATH-1220: --
Introduced the rejection-inversion sampling approach with commits
* 484196fc2e11b632964895b6662dc783d4538b35 (3.6)
* 9c51e5316babbd370bc32aed0fee134216726ec9 (4.0)
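For contrast with the constant-time rejection-inversion approach introduced above, the naive inverse-CDF sampler criticized in the issue description can be sketched as follows; each sample walks the CDF term by term and costs O(N) pow() evaluations (illustrative code, not the Commons Math implementation):

```java
import java.util.Random;

// Sketch of the O(N) inefficiency described in the issue: naive inverse-CDF
// sampling for a Zipf distribution over {1..n} with exponent s. Class and
// method names are illustrative.
public class NaiveZipfSampler {
    public static int sample(int n, double s, Random rng) {
        double norm = 0;
        for (int k = 1; k <= n; k++) {
            norm += Math.pow(k, -s); // generalized harmonic number H_{n,s}
        }
        double u = rng.nextDouble() * norm;
        double cdf = 0;
        for (int k = 1; k <= n; k++) {
            cdf += Math.pow(k, -s);  // one pow() per CDF term: O(n) per sample
            if (u <= cdf) {
                return k;            // first k whose cumulative mass covers u
            }
        }
        return n;
    }
}
```

Rejection-inversion replaces this linear walk with a constant number of evaluations per sample by inverting an integrable bounding function of the discrete probabilities.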
[jira] [Commented] (MATH-990) Improve performance of MathArrays.sortInPlace
[ https://issues.apache.org/jira/browse/MATH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707967#comment-14707967 ] Otmar Ertl commented on MATH-990: -
Applied the patch in the following commits:
* 72a46babeb0da02071e47425ebd55da915b98178 (4.0)
* 32c5f86125ea0ce1741a386eff0da5f8455f879c (3.6)
Improve performance of MathArrays.sortInPlace
-
Key: MATH-990
URL: https://issues.apache.org/jira/browse/MATH-990
Project: Commons Math
Issue Type: Improvement
Affects Versions: 3.2
Reporter: Gilles
Assignee: Otmar Ertl
Priority: Minor
Labels: benchmark
Fix For: 4.0
Attachments: patch
Performance suffers from lots of copying.
[jira] [Updated] (MATH-990) Improve performance of MathArrays.sortInPlace
[ https://issues.apache.org/jira/browse/MATH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-990: -
Fix Version/s: 3.6
[jira] [Resolved] (MATH-990) Improve performance of MathArrays.sortInPlace
[ https://issues.apache.org/jira/browse/MATH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-990. -
Resolution: Fixed
[jira] [Commented] (MATH-990) Improve performance of MathArrays.sortInPlace
[ https://issues.apache.org/jira/browse/MATH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705515#comment-14705515 ] Otmar Ertl commented on MATH-990: -
I think I do not have the rights to reassign this issue.
[jira] [Commented] (MATH-1258) compute() method in classes in org.apache.commons.math4.ml.distance package
[ https://issues.apache.org/jira/browse/MATH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705509#comment-14705509 ] Otmar Ertl commented on MATH-1258: --
I have just added the missing entries to changes.xml. I think we can close this issue now; it seems that I do not have the required rights to do that.
compute() method in classes in org.apache.commons.math4.ml.distance package
---
Key: MATH-1258
URL: https://issues.apache.org/jira/browse/MATH-1258
Project: Commons Math
Issue Type: Bug
Reporter: Gunel Jahangirova
Priority: Minor
Attachments: patch.diff, pre-MATH-1258.patch
Hi! There are five classes, CanberraDistance, ChebyshevDistance, EarthMoversDistance, EuclideanDistance and ManhattanDistance, in the org.apache.commons.math4.ml.distance package, which compute different types of distances. Each of them contains a method compute(double[] a, double[] b) that accepts two double arrays. However, if the length of array a is greater than the length of array b, the compute() method in all five classes throws java.lang.ArrayIndexOutOfBoundsException. For example:
private void test0() {
    CanberraDistance distance = new CanberraDistance();
    final double[] a = { 1, 2, 3, 4, 9, 4 };
    final double[] b = { -5, -6, 7, 4, 3 };
    distance.compute(a, b);
}
[jira] [Resolved] (MATH-1258) compute() method in classes in org.apache.commons.math4.ml.distance package
[ https://issues.apache.org/jira/browse/MATH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1258. --
Resolution: Fixed
Fix Version/s: 3.6, 4.0
[jira] [Commented] (MATH-1258) compute() method in classes in org.apache.commons.math4.ml.distance package
[ https://issues.apache.org/jira/browse/MATH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705674#comment-14705674 ] Otmar Ertl commented on MATH-1258: --
Thanks! It worked now.
[jira] [Commented] (MATH-1258) compute() method in classes in org.apache.commons.math4.ml.distance package
[ https://issues.apache.org/jira/browse/MATH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705180#comment-14705180 ] Otmar Ertl commented on MATH-1258: --
Committed my patch:
7934bfea106206d2840ba062eef105001601588a (3.6)
5ca0a1c3564d35293a5ecf03e5f32e6f0f6f445c (4.0)
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705207#comment-14705207 ] Otmar Ertl commented on MATH-1220: -- If there are no concerns, I will commit [^patch_v3]. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: Zipf Rejection Inversion Sampling Notes.pdf, patch_v1, patch_v2, patch_v3 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1258) compute() method in classes in org.apache.commons.math4.ml.distance package
[ https://issues.apache.org/jira/browse/MATH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1258: - Attachment: patch.diff I also do not like throwing an ArrayIndexOutOfBoundsException, for two reasons: * The current behavior is asymmetric: distance(a, b) succeeds while distance(b, a) fails with an exception if a and b have different lengths. Since distance functions are symmetric by definition, their behavior should be symmetric too, even in case of invalid parameters. * The element-by-element functions in MathArrays throw a DimensionMismatchException instead of an ArrayIndexOutOfBoundsException. For the sake of consistency, I think throwing a DimensionMismatchException is more appropriate for distance functions as well. Please see the attached patch. compute() method in classes in org.apache.commons.math4.ml.distance package --- Key: MATH-1258 URL: https://issues.apache.org/jira/browse/MATH-1258 Project: Commons Math Issue Type: Bug Reporter: Gunel Jahangirova Priority: Minor Attachments: patch.diff -- This message was sent by Atlassian JIRA (v6.3.4#6332)
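The symmetric check argued for above can be sketched as follows. This is an illustration only, not the attached patch: IllegalArgumentException stands in for Commons Math's DimensionMismatchException, and the manhattan() helper is a hypothetical stand-in for the library's compute() methods.

```java
// Sketch: a symmetric dimension check performed before any array access,
// so that compute(a, b) and compute(b, a) fail in exactly the same way.
public class DistanceCheckSketch {
    // IllegalArgumentException stands in for DimensionMismatchException here.
    static void checkEqualLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "dimension mismatch: " + a.length + " != " + b.length);
        }
    }

    // Hypothetical stand-in for a distance implementation such as ManhattanDistance.compute().
    static double manhattan(double[] a, double[] b) {
        checkEqualLength(a, b); // fails symmetrically, before any indexing
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = { 1, 2, 3, 4, 9, 4 };
        double[] b = { -5, -6, 7, 4, 3 };
        boolean abFails = false, baFails = false;
        try { manhattan(a, b); } catch (IllegalArgumentException e) { abFails = true; }
        try { manhattan(b, a); } catch (IllegalArgumentException e) { baFails = true; }
        // Both argument orders now fail alike, unlike the asymmetric
        // ArrayIndexOutOfBoundsException behavior described in the issue.
        System.out.println(abFails + " " + baFails); // true true
    }
}
```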
[jira] [Commented] (MATH-1257) NormalDistribution.cumulativeProbability() suffers from cancellation
[ https://issues.apache.org/jira/browse/MATH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703489#comment-14703489 ] Otmar Ertl commented on MATH-1257: -- +1 for the patch, it eliminates one source of error. For values much smaller than -1 the regularizedGammaQ function evaluates to a very small positive value. The computation used to be like 0.5*((regularizedGammaQ - 1) + 1). With the patch the computation is now 0.5*regularizedGammaQ, obviously improving accuracy. NormalDistribution.cumulativeProbability() suffers from cancellation Key: MATH-1257 URL: https://issues.apache.org/jira/browse/MATH-1257 Project: Commons Math Issue Type: Bug Affects Versions: 3.5 Reporter: Bill Murphy Priority: Minor Attachments: MATH-1257.patch I see the following around line 194: {noformat} return 0.5 * (1 + Erf.erf(dev / (standardDeviation * SQRT2))); {noformat} When erf() returns a value very close to -1, this cancels in the addition with the 1.0, which leads to poor precision in the results. I would suggest changing this line to read more like: {noformat} return 0.5 * Erf.erfc(-dev / (standardDeviation * SQRT2)); {noformat} Should you want some test cases for extreme values (one might argue that within 10 standard deviations isn't all that extreme) then you can check the following: http://www.jstatsoft.org/v52/i07/ then look in the v52i07-xls.zip at replication-01-distribution-standard-normal.xls I think you will also find that evaluation of expressions such as {noformat}NormalDistribution( 0, 1 ).cumulativeProbability( -10.0 );{noformat} are pretty far off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
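The cancellation described in the comment can be reproduced in isolation. The value of q below is made up purely for illustration, standing in for the tiny regularizedGammaQ result in the far lower tail:

```java
// Sketch of the cancellation the patch removes: with q a tiny positive value,
// 0.5 * ((q - 1) + 1) collapses to 0 because q - 1 rounds to exactly -1.0
// (q is far below half an ulp of 1), while 0.5 * q keeps full precision.
public class CancellationDemo {
    public static void main(String[] args) {
        double q = 1.5e-23;                            // hypothetical tail value
        double withCancellation = 0.5 * ((q - 1) + 1); // (q - 1) rounds to -1.0
        double direct = 0.5 * q;                       // what the patch computes
        System.out.println(withCancellation); // 0.0
        System.out.println(direct);           // 7.5E-24
    }
}
```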
[jira] [Updated] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1220: - Attachment: patch_v3 I have revised the patch that makes use of rejection-inversion sampling. I have improved documentation and inlined a proof that shows that the generated random numbers are really Zipf distributed. I think the new approach should be more comprehensible now. I would be grateful for a review. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: Zipf Rejection Inversion Sampling Notes.pdf, patch_v1, patch_v2, patch_v3
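For context, here is a minimal sketch of why the old approach is expensive: inverting the Zipf CDF by a linear scan needs up to N power evaluations per sample. This is an illustration only (names, the seed, and parameters n and s are made up); the attached patch replaces this with rejection-inversion sampling, which is O(1) per sample.

```java
import java.util.Random;

// Naive O(N) Zipf sampling by CDF inversion, for illustration only.
public class NaiveZipfSampler {
    public static void main(String[] args) {
        int n = 1000;   // number of points of the distribution (illustrative)
        double s = 1.2; // exponent (illustrative)

        // Normalization constant H(n, s) = sum_{k=1..n} k^-s: O(N) pow() calls.
        double h = 0;
        for (int k = 1; k <= n; k++) {
            h += Math.pow(k, -s);
        }

        Random rng = new Random(42);
        double u = rng.nextDouble() * h;

        // Invert the CDF by accumulating terms until u is reached:
        // another O(N) pow() calls in the worst case, for every single sample.
        int sample = n;
        double cum = 0;
        for (int k = 1; k <= n; k++) {
            cum += Math.pow(k, -s);
            if (cum >= u) {
                sample = k;
                break;
            }
        }
        System.out.println(sample >= 1 && sample <= n); // true: sample is in range
    }
}
```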
[jira] [Updated] (MATH-1254) Unit test CorrelatedRandomVectorGeneratorTest is flaky
[ https://issues.apache.org/jira/browse/MATH-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1254: - Attachment: seed628.patch Unit test CorrelatedRandomVectorGeneratorTest is flaky -- Key: MATH-1254 URL: https://issues.apache.org/jira/browse/MATH-1254 Project: Commons Math Issue Type: Bug Affects Versions: 4.0 Reporter: Otmar Ertl Priority: Minor Attachments: seed628.patch CorrelatedRandomVectorGeneratorTest.testSampleWithZeroCovariance sometimes fails. The failure can be reproduced by setting the seed of the JDKRandomGenerator in createSampler equal to 628. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MATH-1254) Unit test CorrelatedRandomVectorGeneratorTest is flaky
Otmar Ertl created MATH-1254: Summary: Unit test CorrelatedRandomVectorGeneratorTest is flaky Key: MATH-1254 URL: https://issues.apache.org/jira/browse/MATH-1254 Project: Commons Math Issue Type: Bug Affects Versions: 4.0 Reporter: Otmar Ertl Priority: Minor CorrelatedRandomVectorGeneratorTest.testSampleWithZeroCovariance sometimes fails. The failure can be reproduced by setting the seed of the JDKRandomGenerator in createSampler equal to 628. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1253) Skewness could get more precision from slightly reordered code.
[ https://issues.apache.org/jira/browse/MATH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662994#comment-14662994 ] Otmar Ertl commented on MATH-1253: -- The given example is problematic anyway, because the numbers (1.234E11, 1.234E51, 1.234E101, 1.234E151) are many orders of magnitude apart, for which cancellation effects are a major issue when using floating point types. I can reproduce the NaN result for values that have either large or small exponents, for example (1.234E148, 1.234E149, 1.234E150, 1.234E151) and (1.234E-148, 1.234E-149, 1.234E-150, 1.234E-151). The proposed code does not avoid these over-/underflows either, because the division is performed after the two multiplications which cause the over-/underflow. However, the following code would work: {code} final double variance = (accum - (accum2 * accum2 / length)) / (length - 1); final double stdDevInverse = 1d / FastMath.sqrt(variance); double accum3 = 0.0; for (int i = begin; i < begin + length; i++) { final double d = (values[i] - m) * stdDevInverse; accum3 += d * d * d; } {code} (Here the inverse is precomputed, in order to avoid additional divisions, which are likely to be more expensive than multiplications.) Nevertheless, I am not sure if we really should fix this. The variance calculation is also prone to over-/underflows for numbers greater than sqrt(Double.MAX_VALUE) and smaller than sqrt(Double.MIN_VALUE), respectively. So the fix would slightly extend the valid input range from (cbrt(Double.MIN_VALUE), cbrt(Double.MAX_VALUE)) to (sqrt(Double.MIN_VALUE), sqrt(Double.MAX_VALUE)) at the expense of an additional multiplication within the loop. Of course, if there were also an accurate variance calculation method that avoids over-/underflows for values outside of (sqrt(Double.MIN_VALUE), sqrt(Double.MAX_VALUE)), this fix would make much more sense to me. Skewness could get more precision from slightly reordered code. 
--- Key: MATH-1253 URL: https://issues.apache.org/jira/browse/MATH-1253 Project: Commons Math Issue Type: Bug Affects Versions: 3.5 Reporter: Bill Murphy Priority: Minor In Skewness.java, approx line 180, there is code like: {noformat} double accum3 = 0.0; for (int i = begin; i < begin + length; i++) { final double d = values[i] - m; accum3 += d * d * d; } accum3 /= variance * FastMath.sqrt(variance); {noformat} If the division was moved into the for loop, accum3 would be less likely to overflow to Infinity (or -Infinity). This might allow the computation to return a result in a case such as: {noformat} double[] numArray = { 1.234E11, 1.234E51, 1.234E101, 1.234E151 }; Skewness skew = new Skewness(); double sk = skew.evaluate( numArray ); {noformat} Currently, this returns NaN, but I'd prefer it returned approx 1.154700538379252. The change I'm proposing would have the code instead read like: {noformat} double accum3 = 0.0; double divisor = variance * FastMath.sqrt(variance); for (int i = begin; i < begin + length; i++) { final double d = values[i] - m; accum3 += d * d * d / divisor; } {noformat} Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
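The reordering discussed above can be demonstrated in a self-contained form. This is an illustration, not the Commons Math code: Math.sqrt stands in for FastMath.sqrt, the bias-correction factor of the sample skewness is omitted, and the input values are chosen only to trigger the overflow in the naive ordering.

```java
// Sketch: cubing raw deviations overflows for large inputs, while scaling
// each deviation by 1/stdDev before cubing keeps the accumulation finite.
public class SkewnessScalingDemo {
    public static void main(String[] args) {
        double[] values = { 1.234e148, 1.234e149, 1.234e150, 1.234e151 };
        int n = values.length;

        double m = 0;
        for (double v : values) m += v;
        m /= n;

        double accum = 0, accum2 = 0;
        for (double v : values) {
            double d = v - m;
            accum += d * d;
            accum2 += d;
        }
        double variance = (accum - (accum2 * accum2 / n)) / (n - 1);
        double stdDevInverse = 1d / Math.sqrt(variance);

        // Naive ordering: cube first, divide later -> d*d*d overflows.
        double naive = 0;
        for (double v : values) {
            double d = v - m;
            naive += d * d * d; // |d|^3 exceeds Double.MAX_VALUE here
        }
        naive /= variance * Math.sqrt(variance);

        // Reordered (as suggested in the comment): scale before cubing.
        double scaled = 0;
        for (double v : values) {
            double d = (v - m) * stdDevInverse; // d is now of order 1
            scaled += d * d * d;
        }

        System.out.println(Double.isNaN(naive) || Double.isInfinite(naive)); // true
        System.out.println(Double.isFinite(scaled));                         // true
    }
}
```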
[jira] [Commented] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631814#comment-14631814 ] Otmar Ertl commented on MATH-1242: -- Fixed shuffle algorithm and added unit test in the following commits: * 4.0 cf4416a84203ff9d360d4398da3b166e6d0c72b9 * 3.6 49b4bbd41b116c55083300d864e824e22b7127df Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: bug_fix_and_unit_test.patch, modified.patch, patch, patch The advantages of the new implementation are: * It has linear run-time complexity because sorting of data is avoided * It consumes less than half as many random numbers (min(n,m) instead of (n+m)) * It allocates less memory The speed-up can be verified using the testTwoSampleMonteCarloPerformance() unit test. Here is the output of the test on my machine using the old monteCarloP() implementation {noformat} n=2, m=5000, time=16.894s n=3, m=, time=11.917s n=4, m=2500, time=8.69s n=5, m=2000, time=7.38s n=6, m=1666, time=6.235s n=7, m=1428, time=5.472s n=8, m=1250, time=4.517s n=9, m=, time=4.246s n=10, m=1000, time=3.797s n=11, m=909, time=3.502s n=12, m=833, time=3.249s n=13, m=769, time=3.0s n=14, m=714, time=2.816s n=15, m=666, time=2.669s n=16, m=625, time=2.44s n=17, m=588, time=2.443s n=18, m=555, time=2.349s n=19, m=526, time=2.219s n=20, m=500, time=2.135s n=21, m=476, time=2.022s n=22, m=454, time=1.952s n=23, m=434, time=1.907s n=24, m=416, time=1.823s n=25, m=400, time=1.793s n=26, m=384, time=1.783s n=27, m=370, time=1.655s n=28, m=357, time=1.602s n=29, m=344, time=1.549s n=30, m=333, time=1.513s n=31, m=322, time=1.492s n=32, m=312, time=1.382s n=33, m=303, time=1.391s n=34, m=294, time=1.383s n=35, m=285, time=1.454s n=36, m=277, time=1.433s n=37, m=270, time=1.402s n=38, m=263, time=1.374s n=39, m=256, time=1.286s n=40, m=250, time=1.334s 
n=41, m=243, time=1.328s n=42, m=238, time=1.306s n=43, m=232, time=1.316s n=44, m=227, time=1.273s n=45, m=222, time=1.304s n=46, m=217, time=1.265s n=47, m=212, time=1.254s n=48, m=208, time=1.243s n=49, m=204, time=1.236s n=50, m=200, time=1.237s n=51, m=196, time=1.201s n=52, m=192, time=1.227s n=53, m=188, time=1.179s n=54, m=185, time=1.162s n=55, m=181, time=1.163s n=56, m=178, time=1.16s n=57, m=175, time=1.167s n=58, m=172, time=1.168s n=59, m=169, time=1.169s n=60, m=166, time=1.167s n=61, m=163, time=1.171s n=62, m=161, time=1.183s n=63, m=158, time=1.132s n=64, m=156, time=1.127s n=65, m=153, time=1.153s n=66, m=151, time=1.139s n=67, m=149, time=1.107s n=68, m=147, time=1.103s n=69, m=144, time=1.118s n=70, m=142, time=1.124s n=71, m=140, time=1.12s n=72, m=138, time=1.143s n=73, m=136, time=1.146s n=74, m=135, time=1.143s n=75, m=133, time=1.141s n=76, m=131, time=1.138s n=77, m=129, time=1.137s n=78, m=128, time=1.1s n=79, m=126, time=1.111s n=80, m=125, time=1.109s n=81, m=123, time=1.145s n=82, m=121, time=1.135s n=83, m=120, time=1.122s n=84, m=119, time=1.132s n=85, m=117, time=1.137s n=86, m=116, time=1.133s n=87, m=114, time=1.115s n=88, m=113, time=1.123s n=89, m=112, time=1.121s n=90, m=111, time=1.134s n=91, m=109, time=1.145s n=92, m=108, time=1.156s n=93, m=107, time=1.153s n=94, m=106, time=1.137s n=95, m=105, time=1.134s n=96, m=104, time=1.147s n=97, m=103, time=1.151s n=98, m=102, time=1.147s n=99, m=101, time=1.181s n=100, m=100, time=1.189s {noformat} and this is the output using the new implementation {noformat} n=2, m=5000, time=0.602s n=3, m=, time=0.4s n=4, m=2500, time=0.192s n=5, m=2000, time=0.163s n=6, m=1666, time=0.144s n=7, m=1428, time=0.129s n=8, m=1250, time=0.115s n=9, m=, time=0.117s n=10, m=1000, time=0.11s n=11, m=909, time=0.106s n=12, m=833, time=0.11s n=13, m=769, time=0.102s n=14, m=714, time=0.104s n=15, m=666, time=0.103s n=16, m=625, time=0.094s n=17, m=588, time=0.102s n=18, m=555, time=0.106s n=19, m=526, 
time=0.106s n=20, m=500, time=0.151s n=21, m=476, time=0.118s n=22, m=454, time=0.12s n=23, m=434, time=0.123s n=24, m=416, time=0.126s n=25, m=400, time=0.127s n=26, m=384, time=0.13s n=27, m=370, time=0.132s n=28, m=357, time=0.135s n=29, m=344, time=0.137s n=30, m=333, time=0.14s n=31, m=322, time=0.143s n=32, m=312, time=0.133s n=33, m=303, time=0.149s n=34, m=294, time=0.162s n=35, m=285, time=0.156s n=36, m=277, time=0.157s n=37, m=270, time=0.16s n=38, m=263, time=0.168s n=39, m=256, time=0.148s n=40, m=250, time=0.184s n=41, m=243, time=0.175s n=42, m=238, time=0.175s n=43, m=232, time=0.179s n=44, m=227, time=0.186s n=45, m=222,
[jira] [Resolved] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1242. -- Resolution: Fixed Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: bug_fix_and_unit_test.patch, modified.patch, patch, patch
[jira] [Updated] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1242: - Attachment: bug_fix_and_unit_test.patch Here is an update of the fix. I have moved the shuffle algorithm to a static package-private method which is now explicitly tested. If you agree with the patch, I will try to do my first commit. Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: bug_fix_and_unit_test.patch, modified.patch, patch, patch
[jira] [Updated] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1242: - Attachment: patch There is a bug in the simplified Fisher-Yates shuffle algorithm. The random number r needs to be drawn from 0 <= r <= k. Since the upper bound parameter of nextInt is exclusive, we need to pass k+1 instead of k. Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: modified.patch, patch, patch
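The off-by-one fixed above can be illustrated with a minimal Fisher-Yates sketch. This is not the Commons Math implementation: java.util.Random stands in for the library's random generator, and the seed and array are arbitrary.

```java
import java.util.Random;

// Sketch of a Fisher-Yates shuffle step: the swap index r must be uniform
// on 0..k INCLUSIVE, so nextInt (whose bound is exclusive) gets k + 1.
// Passing k instead would mean r == k can never occur, i.e. the element at
// position k could never stay in place, biasing the shuffle.
public class ShuffleSketch {
    static void shuffle(int[] a, Random rng) {
        for (int k = a.length - 1; k > 0; k--) {
            int r = rng.nextInt(k + 1); // inclusive upper bound k, hence k + 1
            int tmp = a[r];
            a[r] = a[k];
            a[k] = tmp;
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(123);
        int[] a = { 0, 1, 2, 3, 4 };
        shuffle(a, rng);
        long sum = 0;
        for (int v : a) sum += v;
        System.out.println(sum); // 10: shuffling permutes elements, never drops any
    }
}
```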
[jira] [Commented] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602970#comment-14602970 ] Otmar Ertl commented on MATH-1242: -- Your modifications look fine. Thanks! Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: modified.patch, patch The advantages of the new implementation are: * It has linear run-time complexity because sorting of data is avoided * Consumes less than half random numbers (min(n,m) instead of (n+m)) * Allocates less memory The speed-up can be verified using the testTwoSampleMonteCarloPerformance() unit test. Here is the output on my machine of the test using the old monteCarloP() implementation {noformat} n=2, m=5000, time=16.894s n=3, m=, time=11.917s n=4, m=2500, time=8.69s n=5, m=2000, time=7.38s n=6, m=1666, time=6.235s n=7, m=1428, time=5.472s n=8, m=1250, time=4.517s n=9, m=, time=4.246s n=10, m=1000, time=3.797s n=11, m=909, time=3.502s n=12, m=833, time=3.249s n=13, m=769, time=3.0s n=14, m=714, time=2.816s n=15, m=666, time=2.669s n=16, m=625, time=2.44s n=17, m=588, time=2.443s n=18, m=555, time=2.349s n=19, m=526, time=2.219s n=20, m=500, time=2.135s n=21, m=476, time=2.022s n=22, m=454, time=1.952s n=23, m=434, time=1.907s n=24, m=416, time=1.823s n=25, m=400, time=1.793s n=26, m=384, time=1.783s n=27, m=370, time=1.655s n=28, m=357, time=1.602s n=29, m=344, time=1.549s n=30, m=333, time=1.513s n=31, m=322, time=1.492s n=32, m=312, time=1.382s n=33, m=303, time=1.391s n=34, m=294, time=1.383s n=35, m=285, time=1.454s n=36, m=277, time=1.433s n=37, m=270, time=1.402s n=38, m=263, time=1.374s n=39, m=256, time=1.286s n=40, m=250, time=1.334s n=41, m=243, time=1.328s n=42, m=238, time=1.306s n=43, m=232, time=1.316s n=44, m=227, time=1.273s n=45, m=222, time=1.304s n=46, m=217, time=1.265s n=47, m=212, time=1.254s n=48, 
m=208, time=1.243s n=49, m=204, time=1.236s n=50, m=200, time=1.237s n=51, m=196, time=1.201s n=52, m=192, time=1.227s n=53, m=188, time=1.179s n=54, m=185, time=1.162s n=55, m=181, time=1.163s n=56, m=178, time=1.16s n=57, m=175, time=1.167s n=58, m=172, time=1.168s n=59, m=169, time=1.169s n=60, m=166, time=1.167s n=61, m=163, time=1.171s n=62, m=161, time=1.183s n=63, m=158, time=1.132s n=64, m=156, time=1.127s n=65, m=153, time=1.153s n=66, m=151, time=1.139s n=67, m=149, time=1.107s n=68, m=147, time=1.103s n=69, m=144, time=1.118s n=70, m=142, time=1.124s n=71, m=140, time=1.12s n=72, m=138, time=1.143s n=73, m=136, time=1.146s n=74, m=135, time=1.143s n=75, m=133, time=1.141s n=76, m=131, time=1.138s n=77, m=129, time=1.137s n=78, m=128, time=1.1s n=79, m=126, time=1.111s n=80, m=125, time=1.109s n=81, m=123, time=1.145s n=82, m=121, time=1.135s n=83, m=120, time=1.122s n=84, m=119, time=1.132s n=85, m=117, time=1.137s n=86, m=116, time=1.133s n=87, m=114, time=1.115s n=88, m=113, time=1.123s n=89, m=112, time=1.121s n=90, m=111, time=1.134s n=91, m=109, time=1.145s n=92, m=108, time=1.156s n=93, m=107, time=1.153s n=94, m=106, time=1.137s n=95, m=105, time=1.134s n=96, m=104, time=1.147s n=97, m=103, time=1.151s n=98, m=102, time=1.147s n=99, m=101, time=1.181s n=100, m=100, time=1.189s {noformat} and this is the output using the new implementation {noformat} n=2, m=5000, time=0.602s n=3, m=, time=0.4s n=4, m=2500, time=0.192s n=5, m=2000, time=0.163s n=6, m=1666, time=0.144s n=7, m=1428, time=0.129s n=8, m=1250, time=0.115s n=9, m=, time=0.117s n=10, m=1000, time=0.11s n=11, m=909, time=0.106s n=12, m=833, time=0.11s n=13, m=769, time=0.102s n=14, m=714, time=0.104s n=15, m=666, time=0.103s n=16, m=625, time=0.094s n=17, m=588, time=0.102s n=18, m=555, time=0.106s n=19, m=526, time=0.106s n=20, m=500, time=0.151s n=21, m=476, time=0.118s n=22, m=454, time=0.12s n=23, m=434, time=0.123s n=24, m=416, time=0.126s n=25, m=400, time=0.127s n=26, m=384, 
time=0.13s n=27, m=370, time=0.132s n=28, m=357, time=0.135s n=29, m=344, time=0.137s n=30, m=333, time=0.14s n=31, m=322, time=0.143s n=32, m=312, time=0.133s n=33, m=303, time=0.149s n=34, m=294, time=0.162s n=35, m=285, time=0.156s n=36, m=277, time=0.157s n=37, m=270, time=0.16s n=38, m=263, time=0.168s n=39, m=256, time=0.148s n=40, m=250, time=0.184s n=41, m=243, time=0.175s n=42, m=238, time=0.175s n=43, m=232, time=0.179s n=44, m=227, time=0.186s n=45, m=222, time=0.186s n=46, m=217, time=0.188s n=47, m=212, time=0.192s n=48, m=208, time=0.195s n=49, m=204, time=0.199s n=50, m=200, time=0.21s n=51, m=196, time=0.205s n=52, m=192, time=0.208s
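The linear-time idea behind the new monteCarloP() can be illustrated with a rough sketch. This is not the attached patch: all identifiers are invented, and this naive walk still draws one random number per pooled observation (the actual patch further reduces consumption to min(n,m)); it only shows how the Monte Carlo p-value can be estimated without sorting or per-iteration allocation.

```java
import java.util.Random;

// Sketch: estimate the two-sample KS p-value by simulating random orderings
// of the pooled sample as an integer random walk. Names are illustrative.
public class KsMonteCarloSketch {

    // One simulated pooled ordering: does |n*m*(Fn - Fm)| ever reach n*m*d?
    public static boolean walkExceeds(int n, int m, long dTimesNm, Random rng) {
        int remN = n, remM = m;
        long cur = 0; // cur = n*m*(Fn(t) - Fm(t)), tracked exactly in integers
        while (remN + remM > 0) {
            // Next-smallest pooled value comes from sample 1 with probability remN/(remN+remM).
            if (rng.nextInt(remN + remM) < remN) { remN--; cur += m; }
            else { remM--; cur -= n; }
            if (Math.abs(cur) >= dTimesNm) return true;
        }
        return false;
    }

    public static double monteCarloP(double d, int n, int m, int iterations, Random rng) {
        long dTimesNm = Math.round(d * n * m); // compare in scaled integer units
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            if (walkExceeds(n, m, dTimesNm, rng)) hits++;
        }
        return (double) hits / iterations;
    }
}
```

Tracking the deviation as the integer n*m*(Fn - Fm) avoids floating-point comparisons inside the inner loop, the same kind of trick the discussion relies on.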
[jira] [Updated] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1242: - Attachment: patch Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch
[jira] [Created] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
Otmar Ertl created MATH-1242: Summary: Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch
[jira] [Updated] (MATH-1236) Improve Kolmogorov-Smirnov statistic for two samples
[ https://issues.apache.org/jira/browse/MATH-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1236: - Attachment: patch Improve Kolmogorov-Smirnov statistic for two samples Key: MATH-1236 URL: https://issues.apache.org/jira/browse/MATH-1236 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch I reimplemented the Kolmogorov-Smirnov test statistic for two samples because of its problems with identical values. Although this has already been fixed in the meantime (see MATH-1197), the proposed code is still an improvement, because it is simpler and does not require binary searches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MATH-1236) Improve Kolmogorov-Smirnov statistic for two samples
Otmar Ertl created MATH-1236: Summary: Improve Kolmogorov-Smirnov statistic for two samples Key: MATH-1236 URL: https://issues.apache.org/jira/browse/MATH-1236 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch I reimplemented the Kolmogorov-Smirnov test statistic for two samples because of its problems with identical values. Although this has already been fixed in the meantime (see MATH-1197), the proposed code is still an improvement, because it is simpler and does not require binary searches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
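A simpler, search-free two-sample statistic of the kind described above can be computed in a single merge pass over the two sorted samples. The sketch below is illustrative (it is not the attached patch; names are invented); ties are stepped over in both samples before the deviation is recorded, which is what makes identical values unproblematic.

```java
import java.util.Arrays;

// Sketch: D = sup |F_x - F_y| over the pooled sample, via one merge pass.
public class KsTwoSampleStatisticSketch {

    public static double statistic(double[] x, double[] y) {
        double[] sx = x.clone(), sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length, m = sy.length;
        int i = 0, j = 0;
        long cur = 0, best = 0; // cur = n*m*(F_x - F_y), kept exactly in integers
        while (i < n && j < m) {
            double v = Math.min(sx[i], sy[j]);
            // Advance past all occurrences of v in BOTH samples before recording,
            // so identical values cannot inflate the deviation.
            while (i < n && sx[i] == v) { i++; cur += m; }
            while (j < m && sy[j] == v) { j++; cur -= n; }
            best = Math.max(best, Math.abs(cur));
        }
        // Once one sample is exhausted, |cur| can only shrink back to 0.
        return (double) best / ((long) n * m);
    }
}
```

After one sample is exhausted, the remaining steps all move the walk monotonically back toward zero, so the loop can stop early without missing the supremum.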
[jira] [Updated] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1220: - Attachment: Zipf Rejection Inversion Sampling Notes.pdf I attached my notes that should make it clearer how the original algorithm was transformed. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: Zipf Rejection Inversion Sampling Notes.pdf, patch_v1, patch_v2 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526002#comment-14526002 ] Otmar Ertl commented on MATH-1220: -- Yes, my proposed method outperforms the rejection inversion method for large exponents. However, I suppose large exponents (greater than 10) are not very interesting for data modelling. The exponent is usually close to 1 or even less than 1. See, for example, the following paper about word frequencies: http://colala.bcs.rochester.edu/papers/piantadosi2014zipfs.pdf. The original rejection inversion method cannot handle exponents in that range. The second patch contains an adapted rejection inversion algorithm that covers all non-negative exponents. (I still favor including 0 as an allowed value for the exponent. It is easier for data modelling if you are allowed to choose the parameter from a closed interval. Furthermore, an exponent equal to 0 represents a meaningful distribution, since it corresponds to a uniform distribution.) More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: patch_v1, patch_v2 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1220: - Attachment: patch_v2 I have found some time to read the paper you mentioned: http://epub.wu.ac.at/1176/. The algorithm described there is superior to the method I have proposed. The only drawback is that it is restricted to exponents larger than 1. However, I have found a way to transform the algorithm so that it should work for any non-negative exponent (see patch_v2). More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: patch_v1, patch_v2 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl reopened MATH-1220: -- More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: patch_v1, patch_v2 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521921#comment-14521921 ] Otmar Ertl commented on MATH-1220: -- Since we are sampling from a finite number of points, convergence of the infinite series is irrelevant. An exponent equal to 0 corresponds to a uniform distribution. Yes, the tail includes the points from 2 to N, because the first point (the head) is treated differently in order to limit the rejection rate. Otherwise, the rejection rate could become arbitrarily large for large exponents. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520123#comment-14520123 ] Otmar Ertl commented on MATH-1220: -- Caching generalizedHarmonic(numberOfElements, exponent) makes sense. The inverse cumulative probability could be computed more efficiently by simply summing up the probabilities of the points until the searched probability is reached. Furthermore, I would allow the exponent to be non-negative. Currently, it is restricted to positive values. I have developed the method myself. I do not know if a similar method can be found in the literature. So far, apart from this math library, I have no plans to publish it anywhere else. I am not sure if I could find the time to write a paper. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
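The summation idea for the inverse cumulative probability can be sketched as follows. This is a hypothetical helper, not code from the patch; the Zipf pmf is p(k) ∝ k^(-s) for k = 1..N, and a linear scan replaces repeated O(N) CDF evaluations inside a numeric solver.

```java
// Sketch: invert the Zipf CDF by accumulating probabilities until the
// target probability p is reached. Names are illustrative.
public class ZipfInverseCdfSketch {

    public static int inverseCumulativeProbability(double p, int numberOfElements, double exponent) {
        double h = 0; // normalization: generalized harmonic number H(N, s)
        for (int k = 1; k <= numberOfElements; k++) h += Math.pow(k, -exponent);
        double cum = 0;
        for (int k = 1; k <= numberOfElements; k++) {
            cum += Math.pow(k, -exponent) / h;
            if (cum >= p) return k;
        }
        return numberOfElements; // guard against floating-point rounding
    }
}
```

Note that exponent 0 works naturally here and yields the uniform distribution, matching the argument in the thread for allowing a closed parameter interval.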
[jira] [Comment Edited] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514257#comment-14514257 ] Otmar Ertl edited comment on MATH-1220 at 4/27/15 3:10 PM: --- Do you mean using a pre-calculated CDF table and sampling using binary search? This would have O(N) initialization costs, O(N) space costs, and sampling would require O(log(N)) time. All of that is constant for the method I have proposed. If initialization and memory costs are not an issue, sampling using such a pre-calculated table for the CDF is probably faster for smaller N. The proposed rejection sampling algorithm divides the support into two parts: the head, which currently only consists of the first point (equal to 1), and the tail, which consists of all remaining points. Rejection sampling is only used to sample points from the tail. Instead, it would be possible to take the first X (<=N) points as the head and to use such a pre-calculated CDF table in combination with binary search for sampling points from the head. For the tail, the rejection method can still be applied. When developing the algorithm I had that in mind, but finally decided to set X = 1 to minimize the initialization costs. The optimal value for X is likely to be use-case dependent and difficult to determine. I do not think that a pure CDF table based approach (X=N) is feasible for all N, because initialization and memory costs could get very large. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
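The pure CDF-table alternative (X=N) discussed in the comment, with its O(N) initialization, O(N) memory, and O(log N) sampling, can be sketched like this (hypothetical class, not Commons Math code):

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: Zipf sampling from a pre-calculated CDF table via binary search.
public class ZipfTableSamplerSketch {
    private final double[] cdf; // cdf[k-1] = P(X <= k)
    private final Random rng;

    // O(n) initialization cost, O(n) memory.
    public ZipfTableSamplerSketch(int n, double s, Random rng) {
        this.rng = rng;
        cdf = new double[n];
        double sum = 0;
        for (int k = 1; k <= n; k++) { sum += Math.pow(k, -s); cdf[k - 1] = sum; }
        for (int k = 0; k < n; k++) cdf[k] /= sum;
    }

    // O(log n) per draw: smallest k with cdf[k-1] >= u.
    public int sample() {
        int i = Arrays.binarySearch(cdf, rng.nextDouble());
        return (i >= 0 ? i : -i - 1) + 1; // binarySearch returns -(insertion point)-1 on miss
    }
}
```

This makes the trade-off in the comment concrete: every draw is cheap, but both setup time and memory grow linearly with N, which is why the head/tail split with X = 1 was chosen instead.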
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516401#comment-14516401 ] Otmar Ertl commented on MATH-1220: -- The avg. number of iterations until a random value is accepted can be bounded by a constant that is independent of the number of points and the exponent. Therefore, the O(1) runtime complexity is ensured. See the output of the testPerformance() unit test, which shows that the avg. number of consumed uniformly distributed random numbers is limited, which is directly connected to the number of iterations. I can imagine generalizing the rejection sampling method in its basic form. Basically, you need to define two methods: a method to generate a value from the instrumental distribution and a method to calculate the corresponding acceptance rate for that value. However, there are many rejection sampling methods that do not fit into this basic scheme (see, for example, the Ziggurat algorithm or the Monty Python method for generating normally distributed values). Furthermore, defining a method to calculate the acceptance rate is not feasible in all cases, because it restricts transformations of the inequality (acceptance rate > uniformly distributed random value). For example, the proposed method could also be optimized without the definition of an explicit method for the acceptance rate (a division could be replaced by a multiplication). More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.)
I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
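The "basic scheme" mentioned in the comment, an instrumental (proposal) sampler plus an acceptance-probability function, can be sketched generically. The interface and names below are hypothetical, not an actual Commons Math API:

```java
import java.util.Random;

// Sketch of the two-method rejection sampling scheme: propose, then accept
// with the given probability; repeat until accepted.
public class RejectionSamplerSketch {
    public interface Proposal { double next(Random rng); }
    public interface AcceptanceRate { double at(double value); } // must lie in [0,1]

    public static double sample(Proposal proposal, AcceptanceRate acceptance, Random rng) {
        while (true) {
            double v = proposal.next(rng);
            // Accept when a uniform random number falls below the acceptance rate.
            if (rng.nextDouble() < acceptance.at(v)) return v;
        }
    }
}
```

As a usage example, the triangular density f(x) = 2x on [0,1] is obtained from a uniform proposal with acceptance rate x. The comment's caveat also shows up here: forcing the acceptance rate into an explicit method fixes the form of the accept inequality, ruling out algebraic rewrites such as replacing a division by a multiplication.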
[jira] [Created] (MATH-1220) More efficient sample() method for ZipfDistribution
Otmar Ertl created MATH-1220: Summary: More efficient sample() method for ZipfDistribution Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1220: - Attachment: patch_v1 More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513221#comment-14513221 ] Otmar Ertl commented on MATH-1220: -- The patch overrides the sample() method of ZipfDistribution with an implementation that uses rejection sampling. To demonstrate the speed of the new method, the (ignored) unit test testPerformance() can be used. Using the default sample() method from AbstractIntegerDistribution instead, the test will not finish in reasonable time. The generation of a single random value consumes at most 2 uniformly distributed random numbers on average, depending on the parameters of the Zipf distribution (see the output of testPerformance()). More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-990) Improve performance of MathArrays.sortInPlace
[ https://issues.apache.org/jira/browse/MATH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-990: Attachment: patch The performance can be further improved, while still using Collections.sort, by avoiding the generic Pair class (see attached patch). A reduction of the run time to about 80% of the original (a factor of 0.8) was observed in the case of 2 parallel arrays. Improve performance of MathArrays.sortInPlace - Key: MATH-990 URL: https://issues.apache.org/jira/browse/MATH-990 Project: Commons Math Issue Type: Improvement Affects Versions: 3.2 Reporter: Gilles Assignee: Gilles Priority: Minor Labels: benchmark Fix For: 4.0 Attachments: patch Performance suffers from lots of copying. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
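One way to sort parallel arrays without a Pair per element (an illustrative sketch with invented names; the attached patch may differ in detail) is to sort an index array once and then apply the resulting permutation to every array, so only Integer indices are boxed:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: sort x ascending and reorder each array in yList by the same
// permutation, avoiding a Pair object per element.
public class SortInPlaceSketch {

    public static void sortInPlace(double[] x, double[]... yList) {
        int n = x.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> x[i])); // sort indices by key
        double[] tmp = new double[n];
        reorder(x, idx, tmp);
        for (double[] y : yList) reorder(y, idx, tmp);
    }

    private static void reorder(double[] a, Integer[] idx, double[] tmp) {
        for (int i = 0; i < idx.length; i++) tmp[i] = a[idx[i]];
        System.arraycopy(tmp, 0, a, 0, idx.length);
    }
}
```

Boxing one Integer per element is still cheaper than allocating and comparing a generic Pair per element, which is consistent with the reported speed-up.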
[jira] [Commented] (MATH-1159) Add quartiles to SummaryStatistics
[ https://issues.apache.org/jira/browse/MATH-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180422#comment-14180422 ] Otmar Ertl commented on MATH-1159: -- If multiple quantile values need to be computed simultaneously (e.g. lower quartile, median, and upper quartile), it would make sense to use the P² algorithm for histograms that uses more than 5 marker positions, as described in the original paper. Depending on the desired quantile values, different sets of marker positions could be used. Using a PSquarePercentile object for each of the desired quantiles has some overhead. For example, each of them stores the minimum and the maximum value. The 2nd and 4th marker positions used for the median correspond to the 3rd marker positions of the lower and upper quartiles, respectively. In summary, a P² algorithm implementation that calculates all desired quantiles simultaneously would be much more efficient. Add quartiles to SummaryStatistics -- Key: MATH-1159 URL: https://issues.apache.org/jira/browse/MATH-1159 Project: Commons Math Issue Type: Improvement Affects Versions: 3.3 Reporter: Phil Steitz Using PSquarePercentile, we can add quartile computation to SummaryStatistics. Since maintaining quartiles will add some overhead, implementation should allow the feature to be turned off via some kind of constructor flag. This does open the can of worms regarding turning on / off other stats, which is probably a good thing to think about as the implementation of this feature is developed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1158) Factory method to provide sampling from distributions
[ https://issues.apache.org/jira/browse/MATH-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180464#comment-14180464 ] Otmar Ertl commented on MATH-1158: --

I like your proposal. I think there is much more to deprecate. Since you deprecated all methods that use the RandomGenerator instance, it would be consistent to also deprecate all the constructors of the RealDistribution implementations that accept a RandomGenerator instance. Why not separate the sampling functionality for IntegerDistribution too?

Factory method to provide sampling from distributions - Key: MATH-1158 URL: https://issues.apache.org/jira/browse/MATH-1158 Project: Commons Math Issue Type: New Feature Affects Versions: 3.3 Reporter: Gilles Priority: Minor Labels: API, deprecation Fix For: 3.4 Attachments: MATH-1158.patch

A proposal to separate the sampling functionality from the core of the distribution classes.
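A hypothetical sketch of the separation being proposed: the distribution keeps only its mathematical functions, and a factory method binds a generator to a distribution's inverse CDF to produce a sampler. The names (`Sampler`, `Samplers.inversion`) are illustrative, not the actual Commons Math API from the patch:

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// A sampler produced by a factory; the distribution class itself never sees the generator.
interface Sampler {
    double sample();
}

final class Samplers {
    /** Inversion sampling: feed one uniform deviate through the inverse CDF. */
    static Sampler inversion(DoubleUnaryOperator inverseCdf, Random rng) {
        return () -> inverseCdf.applyAsDouble(rng.nextDouble());
    }
}
```

For instance, an exponential sampler with rate 1 would be `Samplers.inversion(u -> -Math.log(1 - u), rng)`; the same factory pattern could cover IntegerDistribution as suggested above.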
[jira] [Commented] (MATH-1154) Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization
[ https://issues.apache.org/jira/browse/MATH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163330#comment-14163330 ] Otmar Ertl commented on MATH-1154: --

I agree with Phil that my proposed patch does not address the core problem, which is definitely MATH-1124. The patch was intended as a short-term fix that improves performance substantially with minimally invasive code changes. Once the root issue is solved, both patches proposed here (mine and Thomas') will become obsolete anyway. The question is when MATH-1124 will be resolved. If it is fixed for the next release, neither patch needs to be applied. If not, it makes sense to apply one of them for the short term. The remaining question is: do you only want a speed-up of the statistical tests (Thomas' patch), or also of the distribution classes (my patch)?

Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization Key: MATH-1154 URL: https://issues.apache.org/jira/browse/MATH-1154 Project: Commons Math Issue Type: Bug Affects Versions: 3.3 Reporter: Otmar Ertl Attachments: MATH-1154.patch, math3.patch

Some statistical tests defined in the stat.inference package (e.g. BinomialTest or ChiSquareTest) are unnecessarily slow (up to a factor of 20 slower than necessary). The reason is the implicit, slow initialization of a default (Well19937c) random generator instance each time a test is performed. The affected tests create a distribution instance in order to use some methods defined therein. However, they do not use any method for random generation. Nevertheless, a random number generator instance is automatically created when creating a distribution instance, which is the reason for the serious slowdown. The problem is related to MATH-1124.

There are the following solutions:
1) Fix the affected statistical tests by passing a lightweight RandomGenerator implementation (or even null) to the constructor of the distribution.
2) Use for all distributions a RandomGenerator implementation that lazily creates the Well19937c instance as late as possible. This would also solve MATH-1124.

I will attach a patch proposal together with a performance test that demonstrates the speed-up after a fix.
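Solution 2 above can be sketched as a delegating generator that defers construction of the expensive default generator until the first random number is actually requested. This is a minimal illustration using java.util.Random as a stand-in for Well19937c; the class name and the RandomGenerator subset shown are hypothetical:

```java
import java.util.Random;
import java.util.function.Supplier;

final class LazyRandom {
    private final Supplier<Random> factory;
    private Random delegate; // not created until first use

    LazyRandom(Supplier<Random> factory) { this.factory = factory; }

    private Random delegate() {
        if (delegate == null) delegate = factory.get(); // expensive init happens here, only if needed
        return delegate;
    }

    double nextDouble() { return delegate().nextDouble(); }

    boolean initialized() { return delegate != null; }
}
```

A distribution constructed with such a generator pays nothing for it unless sample() is actually called, which is exactly the case the statistical tests hit.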
[jira] [Commented] (MATH-1124) Instances of AbstractRealDistribution require a random generator.
[ https://issues.apache.org/jira/browse/MATH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14162234#comment-14162234 ] Otmar Ertl commented on MATH-1124: --

What about the C++ STL design choice? The corresponding Java design would be to pass a random generator instance to the sample method. For instance, have a look at the code example given at http://www.cplusplus.com/reference/random/uniform_real_distribution/

Instances of AbstractRealDistribution require a random generator. - Key: MATH-1124 URL: https://issues.apache.org/jira/browse/MATH-1124 Project: Commons Math Issue Type: Bug Reporter: Ajo Fod

A couple of observations: ... The default random generator takes a while to instantiate. ... Many functions of these distributions don't require a random generator. Generally speaking, only sampling requires it. So, why force the default constructor to initialize with a new random generator ... why not use a global generic or simple generator? Or do away with the random generator except for sampling? This issue was observed with the TDistribution class, but it is probably applicable to many classes as well.
[jira] [Comment Edited] (MATH-1124) Instances of AbstractRealDistribution require a random generator.
[ https://issues.apache.org/jira/browse/MATH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14162234#comment-14162234 ] Otmar Ertl edited comment on MATH-1124 at 10/7/14 6:25 PM: --

What about the C++ STL design choice? The corresponding Java design would be to pass a random generator instance to the sample method. For instance, have a look at the code example given at http://www.cplusplus.com/reference/random/uniform_real_distribution/

EDIT: This would also allow the distribution classes to be entirely immutable.
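The C++-style design mentioned above can be sketched as follows: the distribution object is immutable and holds only its parameters, and the generator is passed to sample(), mirroring `distribution(generator)` in C++ `<random>`. This is a hypothetical illustration (using java.util.Random as a stand-in for RandomGenerator), not the Commons Math API:

```java
import java.util.Random;

final class UniformRealDist {
    private final double lo, hi; // parameters only; no generator field, so fully immutable

    UniformRealDist(double lo, double hi) { this.lo = lo; this.hi = hi; }

    /** The generator is supplied per call, as in the C++ STL design. */
    double sample(Random rng) { return lo + (hi - lo) * rng.nextDouble(); }
}
```

With this design, constructing a distribution costs nothing RNG-related, and the same immutable instance can be shared across threads, each with its own generator.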
[jira] [Commented] (MATH-1124) Instances of AbstractRealDistribution require a random generator.
[ https://issues.apache.org/jira/browse/MATH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14162367#comment-14162367 ] Otmar Ertl commented on MATH-1124: --

@Ole Ersoy: For some distributions it is quite common to use multiple uniformly distributed random numbers to generate a single random number following the given distribution (e.g. the normal distribution). Therefore, passing the random number generator is a much better design choice.
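The normal distribution illustrates this well: the Box-Muller transform consumes two uniform deviates per Gaussian deviate, which is why the sampler needs access to the generator itself rather than a single pre-drawn uniform value. A minimal sketch:

```java
import java.util.Random;

final class BoxMuller {
    /** One Gaussian deviate from two uniform deviates (Box-Muller transform). */
    static double nextGaussian(Random rng) {
        double u1 = rng.nextDouble(); // first uniform
        double u2 = rng.nextDouble(); // second uniform
        // 1 - u1 lies in (0, 1], so the logarithm is always finite.
        return Math.sqrt(-2.0 * Math.log(1.0 - u1)) * Math.cos(2.0 * Math.PI * u2);
    }
}
```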
[jira] [Commented] (MATH-1124) Instances of AbstractRealDistribution require a random generator.
[ https://issues.apache.org/jira/browse/MATH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163124#comment-14163124 ] Otmar Ertl commented on MATH-1124: --

If testing is your major concern, it is fairly easy to write a RandomGenerator mock, using AbstractRandomGenerator, that returns a predefined sequence of random numbers. In my opinion, your proposed interface {quote} sample(double[] uniformRNArray) {quote} is much more complex than {quote} sample(RandomGenerator rng) {quote} More importantly, there are distributions for which rejection sampling is used to generate random numbers. In this case, the number of uniformly distributed random numbers required to generate a single random number obeying the given distribution is itself random (see http://en.wikipedia.org/wiki/Rejection_sampling).
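A small illustration of the rejection-sampling point: sampling the triangular density f(x) = 2x on [0, 1] by rejection from a uniform proposal. Because each accepted sample consumes a random number of uniform deviates, an interface that takes a fixed-size array of uniforms cannot work in general. The example itself is hypothetical, not taken from Commons Math:

```java
import java.util.Random;

final class RejectionSampler {
    /** Samples f(x) = 2x on [0, 1] by rejection against a uniform envelope (M = 2). */
    static double sampleTriangular(Random rng) {
        while (true) {
            double x = rng.nextDouble(); // proposal from U(0, 1)
            double u = rng.nextDouble(); // acceptance test
            // Accept with probability f(x) / (M * g(x)) = 2x / 2 = x.
            if (u <= x) return x;
            // Otherwise loop: two more uniforms are drawn on the next attempt.
        }
    }
}
```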
[jira] [Created] (MATH-1154) Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization
Otmar Ertl created MATH-1154: Summary: Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization Key: MATH-1154 URL: https://issues.apache.org/jira/browse/MATH-1154 Project: Commons Math Issue Type: Bug Affects Versions: 3.3 Reporter: Otmar Ertl
[jira] [Updated] (MATH-1154) Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization
[ https://issues.apache.org/jira/browse/MATH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1154: - Attachment: math3.patch

This patch demonstrates a fix using lazy initialization of the default random number generator instances. Furthermore, a test is included, which gave the following results before the fix:

{noformat}
statistical tests performance test (calls per timed block: 10, timed blocks: 100, time unit: ms)
name             time/call       std error       total time  ratio       difference
binomial test 1  1.38289492e-02  1.71975630e-04  1.3829e+05  1.e+00      0.e+00
binomial test 2  1.38270752e-02  1.61613547e-04  1.3827e+05  9.9986e-01  -1.87395300e+01
chi square test  2.67553017e-02  2.29903602e-04  2.6755e+05  1.9347e+00  1.29263525e+05
{noformat}

and after the fix:

{noformat}
statistical tests performance test (calls per timed block: 10, timed blocks: 100, time unit: ms)
name             time/call       std error       total time  ratio       difference
binomial test 1  7.26630369e-04  5.87472596e-05  7.2663e+03  1.e+00      0.e+00
binomial test 2  7.27780967e-04  2.44728991e-05  7.2778e+03  1.0016e+00  1.15059780e+01
chi square test  5.21210430e-04  3.14044354e-05  5.2121e+03  7.1730e-01  -2.05419939e+03
{noformat}

A speedup of up to a factor of 20 can be seen.

Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization Key: MATH-1154 URL: https://issues.apache.org/jira/browse/MATH-1154 Project: Commons Math Issue Type: Bug Affects Versions: 3.3 Reporter: Otmar Ertl Attachments: math3.patch