[jira] [Updated] (MATH-1307) Create a base class for all RNGs
[ https://issues.apache.org/jira/browse/MATH-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otmar Ertl updated MATH-1307:
-----------------------------
    Attachment: BaseRandomGeneratorFloatGenerationPerformanceTest.java

I have played around with the generation of floats and doubles from ints. On my environment I made the interesting observation that the following code, which avoids the floating-point multiplication,
{code}
public double nextDouble2() {
    final long high = ((long) (nextInt() >>> 6)) << 26;
    final int low = nextInt() >>> 6;
    return Double.longBitsToDouble(0x3ff0000000000000L | high | low) - 1.;
}
{code}
is about 20% faster than
{code}
public double nextDouble() {
    final long high = ((long) (nextInt() >>> 6)) << 26;
    final int low = nextInt() >>> 6;
    return (high | low) * 0x1.0p-52d;
}
{code}
Please see [^BaseRandomGeneratorFloatGenerationPerformanceTest.java], for which I got the following output:
{code}
nextDouble2: Time = 24.267s, Sum = 4.757900840407535E9
nextFloat:   Time = 24.304s, Sum = 1.6777216E7
nextDouble:  Time = 29.654s, Sum = 4.757900840407535E9
nextFloat2:  Time = 24.276s, Sum = 1.6777216E7
{code}

> Create a base class for all RNGs
> --------------------------------
>
>         Key: MATH-1307
>         URL: https://issues.apache.org/jira/browse/MATH-1307
>     Project: Commons Math
>  Issue Type: Improvement
>    Reporter: Gilles
>    Assignee: Gilles
>    Priority: Minor
>      Labels: api, inheritance
>     Fix For: 4.0
>
> Attachments: BaseRandomGenerator.java, BaseRandomGeneratorFloatGenerationPerformanceTest.java
>
>
> I proposed to create a base class which the existing abstract classes {{AbstractRandomGenerator}} and {{BitsStreamGenerator}} will extend.
> This would allow defining {{nextBytes(byte[])}} at the base class level. The code for that method is almost identical in the two hierarchies: they only differ in a call to either {{nextInt()}} or {{next(32)}} respectively; the latter is however the same as the former, in disguise, and is not subject to change given the type of the return value.
> As a corollary, the new base class can be the unique place where to add utilities such as the one proposed in MATH-1306.
> *Update:* {{AbstractRandomGenerator}} and {{BitsStreamGenerator}} are both obsoleted by the class proposed in this report.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
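The identical sums for nextDouble and nextDouble2 in the benchmark output above are expected: for any 52-bit integer m, (1 + m*2^-52) - 1 equals m*2^-52 exactly, since every intermediate value is exactly representable. A standalone check (a sketch; java.util.Random stands in for the generator's nextInt()):

```java
import java.util.Random;

public class NextDoubleEquivalenceSketch {
    // Multiplication variant: scale a 52-bit integer into [0, 1).
    static double viaMultiply(long bits52) {
        return bits52 * 0x1.0p-52d;
    }

    // Bit-manipulation variant: OR the 52 bits into the significand of 1.0
    // (bit pattern 0x3ff0000000000000L), giving a value in [1, 2), then
    // subtract 1.
    static double viaBits(long bits52) {
        return Double.longBitsToDouble(0x3ff0000000000000L | bits52) - 1.0;
    }

    public static void main(String[] args) {
        final Random rng = new Random(0); // stand-in for the generator
        for (int k = 0; k < 1_000_000; k++) {
            final long high = ((long) (rng.nextInt() >>> 6)) << 26;
            final int low = rng.nextInt() >>> 6;
            final long bits52 = high | low;
            if (viaMultiply(bits52) != viaBits(bits52)) {
                throw new AssertionError("mismatch at bits " + bits52);
            }
        }
        System.out.println("both variants agree bit-for-bit");
    }
}
```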
[jira] [Commented] (MATH-1296) DescriptiveStatistics return geometric mean as 0 when product of values is zero, expected to return NaN
[ https://issues.apache.org/jira/browse/MATH-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036575#comment-15036575 ]

Otmar Ertl commented on MATH-1296:
----------------------------------
Gilles, I agree, it is the documentation that needs to be fixed. The geometric mean should be 0. By the way, I have checked the documentation for Kurtosis.java and Skewness.java. The documentation for Kurtosis.java already describes that NaN is returned for fewer than 4 values. The documentation of Skewness.java also needs to be improved; the corresponding information is hidden in the documentation of its evaluate() method.

> DescriptiveStatistics return geometric mean as 0 when product of values is zero, expected to return NaN
> -------------------------------------------------------------------------------------------------------
>
>         Key: MATH-1296
>         URL: https://issues.apache.org/jira/browse/MATH-1296
>     Project: Commons Math
>  Issue Type: Bug
> Affects Versions: 3.4.1
>    Reporter: Anmol
>    Priority: Trivial
>      Labels: easyfix
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> {code}
> @Test
> public void test() {
>     DescriptiveStatistics stats = new DescriptiveStatistics();
>     stats.addValue(1);
>     stats.addValue(2);
>     stats.addValue(4);
>     System.out.println(stats.getGeometricMean()); // prints 2
>     stats.addValue(0);
>     System.out.println(stats.getGeometricMean()); // prints 0, expected NaN as per the documentation
> }
> {code}
> The class in consideration is: org.apache.commons.math3.stat.descriptive.DescriptiveStatistics

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
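The 0 result follows directly from the usual log-based evaluation of the geometric mean (Commons Math's GeometricMean builds on SumOfLogs): Math.log(0.0) is negative infinity, so the mean of the logs is negative infinity, and exp of that is 0. A minimal sketch, not the Commons Math implementation:

```java
public class GeometricMeanSketch {
    // Geometric mean via the mean of logarithms. If any value is 0,
    // Math.log contributes -Infinity and the final exp yields 0.
    static double geometricMean(double... values) {
        double sumOfLogs = 0.0;
        for (double v : values) {
            sumOfLogs += Math.log(v); // Math.log(0.0) == Double.NEGATIVE_INFINITY
        }
        return Math.exp(sumOfLogs / values.length);
    }

    public static void main(String[] args) {
        System.out.println(geometricMean(1, 2, 4));    // ≈ 2.0
        System.out.println(geometricMean(1, 2, 4, 0)); // 0.0
    }
}
```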
[jira] [Resolved] (MATH-1254) Unit test CorrelatedRandomVectorGeneratorTest is flaky
[ https://issues.apache.org/jira/browse/MATH-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1254. -- Resolution: Fixed Fix Version/s: 3.6 4.0 4.0: ad12d97cbb4c35017d55c4105174671f1c38b36a 3.6: 635c35be7446c55740765bb67b8bfa21f1c93de5 > Unit test CorrelatedRandomVectorGeneratorTest is flaky > -- > > Key: MATH-1254 > URL: https://issues.apache.org/jira/browse/MATH-1254 > Project: Commons Math > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Otmar Ertl >Priority: Minor > Fix For: 4.0, 3.6 > > Attachments: seed628.patch > > > CorrelatedRandomVectorGeneratorTest.testSampleWithZeroCovariance sometimes > fails. The failure can be reproduced by setting the seed of the > JDKRandomGenerator in createSampler equal to 628. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1253) Skewness could get more precision from slightly reordered code.
[ https://issues.apache.org/jira/browse/MATH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025190#comment-15025190 ]

Otmar Ertl commented on MATH-1253:
----------------------------------
Provided that there is no overflow, I do not think that my proposal is more accurate. The relative error of a sum of floating-point numbers is the same after scaling the numbers by the same factor.

> Skewness could get more precision from slightly reordered code.
> ---------------------------------------------------------------
>
>         Key: MATH-1253
>         URL: https://issues.apache.org/jira/browse/MATH-1253
>     Project: Commons Math
>  Issue Type: Bug
> Affects Versions: 3.5
>    Reporter: Bill Murphy
>    Priority: Minor
>
> In Skewness.java, approx line 180, there is code like:
> {noformat}
> double accum3 = 0.0;
> for (int i = begin; i < begin + length; i++) {
>     final double d = values[i] - m;
>     accum3 += d * d * d;
> }
> accum3 /= variance * FastMath.sqrt(variance);
> {noformat}
> If the division was moved into the for loop, accum3 would be less likely to overflow to Infinity (or -Infinity). This might allow computation to return a result in a case such as:
> {noformat}
> double[] numArray = { 1.234E11, 1.234E51, 1.234E101, 1.234E151 };
> Skewness skew = new Skewness();
> double sk = skew.evaluate( numArray );
> {noformat}
> Currently, this returns NaN, but I'd prefer it returned approx 1.154700538379252.
> The change I'm proposing would have the code instead read like:
> {noformat}
> double accum3 = 0.0;
> double divisor = variance * FastMath.sqrt(variance);
> for (int i = begin; i < begin + length; i++) {
>     final double d = values[i] - m;
>     accum3 += d * d * d / divisor;
> }
> {noformat}
> Thanks!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
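The overflow behavior under discussion can be reproduced without Commons Math. Note that the proposal as literally written (d * d * d / divisor) can still overflow for the reporter's data, since both the cubes and the divisor themselves exceed Double.MAX_VALUE; the sketch below instead normalizes each deviation by the standard deviation before cubing, which keeps every intermediate finite. This is an illustration of the overflow issue, not the library code:

```java
public class SkewnessOverflowSketch {
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    static double sampleVariance(double[] values, double m) {
        double s = 0;
        for (double v : values) {
            final double d = v - m;
            s += d * d;
        }
        return s / (values.length - 1);
    }

    // Cube-then-divide, as in the current code: for very large inputs the
    // cubes and the divisor both overflow to Infinity, giving NaN.
    static double cubeThenDivide(double[] values, double m, double variance) {
        double accum3 = 0;
        for (double v : values) {
            final double d = v - m;
            accum3 += d * d * d;
        }
        return accum3 / (variance * Math.sqrt(variance));
    }

    // Normalize-then-cube: dividing each deviation by the standard
    // deviation before cubing keeps all intermediates finite.
    static double normalizeThenCube(double[] values, double m, double variance) {
        final double stddev = Math.sqrt(variance);
        double accum3 = 0;
        for (double v : values) {
            final double z = (v - m) / stddev;
            accum3 += z * z * z;
        }
        return accum3;
    }

    public static void main(String[] args) {
        final double[] values = {1.234E11, 1.234E51, 1.234E101, 1.234E151};
        final double m = mean(values);
        final double variance = sampleVariance(values, m);
        System.out.println(cubeThenDivide(values, m, variance));    // NaN
        System.out.println(normalizeThenCube(values, m, variance)); // finite, about 3.0
    }
}
```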
[jira] [Commented] (MATH-1285) Description API ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999300#comment-14999300 ]

Otmar Ertl commented on MATH-1285:
----------------------------------
Added the definition of the distribution to the javadoc.
3.6: 29dfd9bb660cbcfb76aabc80458f9452c991bd35
4.0: 396610625b41f07085c1f43cd6e5e97d69097ba1

> Description API ZipfDistribution
> --------------------------------
>
>         Key: MATH-1285
>         URL: https://issues.apache.org/jira/browse/MATH-1285
>     Project: Commons Math
>  Issue Type: Improvement
> Environment: All
>    Reporter: Pim van der Hoorn
>    Priority: Minor
>      Labels: easyfix
>     Fix For: 4.0, 3.6
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The description of the constructors for the ZipfDistribution class specifies an argument of type double, called exponent. The documentation refers to a MathWorld page where the distribution is defined. However, there the exponent p is the exponent of the cdf, not of the pdf (which is then p + 1). After running some tests I found that the exponent in the constructor is the one for the pdf. This can create confusion among users. Therefore I recommend specifying in the documentation which exponent is meant.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Resolved] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1269. -- Resolution: Fixed Fix Version/s: 3.6 4.0 > FastMath.exp may return NaN for non-NaN arguments > - > > Key: MATH-1269 > URL: https://issues.apache.org/jira/browse/MATH-1269 > Project: Commons Math > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1269-fix-tempC-infinity.patch, MATH-1269.patch, > MATH-1269.patch, MATH-1269_fix_z.patch > > > I have observed that FastMath.exp(709.8125) returns NaN. However, the > exponential function must never return NaN (if the argument is not NaN). The > result must always be non-negative or positive infinity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992406#comment-14992406 ] Otmar Ertl commented on MATH-1269: -- fixed 3.6: 8aecb842d32464a98eaab722da84902f8126957b 4.0: a94ff90ab6cd2d92ccb2eb1fd7913b4e5256f02b > FastMath.exp may return NaN for non-NaN arguments > - > > Key: MATH-1269 > URL: https://issues.apache.org/jira/browse/MATH-1269 > Project: Commons Math > Issue Type: Bug >Affects Versions: 4.0 >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1269-fix-tempC-infinity.patch, MATH-1269.patch, > MATH-1269.patch, MATH-1269_fix_z.patch > > > I have observed that FastMath.exp(709.8125) returns NaN. However, the > exponential function must never return NaN (if the argument is not NaN). The > result must always be non-negative or positive infinity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1285) Description API ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977040#comment-14977040 ]

Otmar Ertl commented on MATH-1285:
----------------------------------
Thanks for reporting this. I have changed the reference to Wikipedia's article about Zipf's law. See https://en.wikipedia.org/wiki/Zipf's_law.
3.6: 5f03f0d702574a9c5655778ee28fe3e23d86a9b0
4.0: 8ed2209b1f8e2452d71ef8c3149f3ed3d89d4dfa

> Description API ZipfDistribution
> --------------------------------
>
>         Key: MATH-1285
>         URL: https://issues.apache.org/jira/browse/MATH-1285
>     Project: Commons Math
>  Issue Type: Improvement
> Environment: All
>    Reporter: Pim van der Hoorn
>    Priority: Minor
>      Labels: easyfix
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The description of the constructors for the ZipfDistribution class specifies an argument of type double, called exponent. The documentation refers to a MathWorld page where the distribution is defined. However, there the exponent p is the exponent of the cdf, not of the pdf (which is then p + 1). After running some tests I found that the exponent in the constructor is the one for the pdf. This can create confusion among users. Therefore I recommend specifying in the documentation which exponent is meant.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Resolved] (MATH-1285) Description API ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otmar Ertl resolved MATH-1285.
------------------------------
    Resolution: Fixed
 Fix Version/s: 3.6
                4.0

> Description API ZipfDistribution
> --------------------------------
>
>         Key: MATH-1285
>         URL: https://issues.apache.org/jira/browse/MATH-1285
>     Project: Commons Math
>  Issue Type: Improvement
> Environment: All
>    Reporter: Pim van der Hoorn
>    Priority: Minor
>      Labels: easyfix
>     Fix For: 4.0, 3.6
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The description of the constructors for the ZipfDistribution class specifies an argument of type double, called exponent. The documentation refers to a MathWorld page where the distribution is defined. However, there the exponent p is the exponent of the cdf, not of the pdf (which is then p + 1). After running some tests I found that the exponent in the constructor is the one for the pdf. This can create confusion among users. Therefore I recommend specifying in the documentation which exponent is meant.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Reopened] (MATH-1285) Description API ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otmar Ertl reopened MATH-1285:
------------------------------
Reopened the issue because the javadoc needs to be further improved.

> Description API ZipfDistribution
> --------------------------------
>
>         Key: MATH-1285
>         URL: https://issues.apache.org/jira/browse/MATH-1285
>     Project: Commons Math
>  Issue Type: Improvement
> Environment: All
>    Reporter: Pim van der Hoorn
>    Priority: Minor
>      Labels: easyfix
>     Fix For: 4.0, 3.6
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The description of the constructors for the ZipfDistribution class specifies an argument of type double, called exponent. The documentation refers to a MathWorld page where the distribution is defined. However, there the exponent p is the exponent of the cdf, not of the pdf (which is then p + 1). After running some tests I found that the exponent in the constructor is the one for the pdf. This can create confusion among users. Therefore I recommend specifying in the documentation which exponent is meant.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otmar Ertl updated MATH-1269:
-----------------------------
    Attachment: MATH-1269-fix-tempC-infinity.patch

One more proposal. The problem is the calculation of (1+z)(tempA+tempB) if tempA is infinite. In this case tempC is also infinite. If z < 0 at the same time, we get NaN for tempC*z + tempB + tempA. The proposed patch returns infinity in the case where tempC is positive infinity. In all other cases tempC*z cannot be negative infinity, and the evaluation of tempC*z + tempB + tempA is not a problem even if tempA or tempB is positive infinity.

> FastMath.exp may return NaN for non-NaN arguments
> -------------------------------------------------
>
>         Key: MATH-1269
>         URL: https://issues.apache.org/jira/browse/MATH-1269
>     Project: Commons Math
>  Issue Type: Bug
> Affects Versions: 4.0
>    Reporter: Otmar Ertl
> Attachments: MATH-1269-fix-tempC-infinity.patch, MATH-1269.patch, MATH-1269.patch, MATH-1269_fix_z.patch
>
> I have observed that FastMath.exp(709.8125) returns NaN. However, the exponential function must never return NaN (if the argument is not NaN). The result must always be non-negative or positive infinity.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
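The NaN mechanism described above is easy to reproduce in isolation: an infinite tempC multiplied by a negative z gives negative infinity, and adding positive infinity to that yields NaN. The concrete values below are arbitrary stand-ins, not taken from FastMath:

```java
public class InfinityNaNSketch {
    // Mirrors the expression tempC*z + tempB + tempA from the discussion.
    static double combine(double tempA, double tempB, double tempC, double z) {
        return tempC * z + tempB + tempA;
    }

    public static void main(String[] args) {
        // tempA infinite forces tempC = (1+z)*(tempA+tempB) infinite too;
        // with z < 0, tempC*z is -Infinity and adding +Infinity gives NaN.
        final double result = combine(Double.POSITIVE_INFINITY, 0.0,
                                      Double.POSITIVE_INFINITY, -0.25);
        System.out.println(result); // NaN
    }
}
```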
[jira] [Resolved] (MATH-1277) Incorrect Kendall Tau calc due to data type mismatch
[ https://issues.apache.org/jira/browse/MATH-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otmar Ertl resolved MATH-1277.
------------------------------
    Resolution: Fixed
 Fix Version/s: 3.6
                4.0

> Incorrect Kendall Tau calc due to data type mismatch
> ----------------------------------------------------
>
>         Key: MATH-1277
>         URL: https://issues.apache.org/jira/browse/MATH-1277
>     Project: Commons Math
>  Issue Type: Bug
> Affects Versions: 3.5
>    Reporter: Marc Rosen
>    Priority: Minor
>      Labels: Kendall, correlation, tau
>     Fix For: 4.0, 3.6
>
> The Kendall Tau calculation should return a number from -1.0 to 1.0. Due to a mixing of ints and longs, a mistake occurs on large columns (arrays) passed to the function. An array size of > 50350 triggers the condition in my case, although it may be data dependent. The 3.5 library returns 2.6 as a result (outside of the defined range of Kendall Tau). With the cast to long below, the result returns to its expected value.
> The affected method is commons.math3.stat.correlation.KendallsCorrelation.correlation. Here's the sample code I used; I added the cast to long of swaps:
> {noformat}
> int swaps = 1077126315;
> final long numPairs = sum(50350 - 1);
> long tiedXPairs = 0;
> long tiedXYPairs = 0;
> long tiedYPairs = 0;
>
> final long concordantMinusDiscordant = numPairs - tiedXPairs - tiedYPairs + tiedXYPairs - 2 * (long) swaps;
> final double nonTiedPairsMultiplied = 1.6e18;
> double myTest = concordantMinusDiscordant / FastMath.sqrt(nonTiedPairsMultiplied);
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MATH-1277) Incorrect Kendall Tau calc due to data type mismatch
[ https://issues.apache.org/jira/browse/MATH-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877457#comment-14877457 ]

Otmar Ertl commented on MATH-1277:
----------------------------------
Thanks for reporting! Fixed the bug in the following commits:
* 81ce1b183aa4fb56e7710ebe0274740c268118fa (3.6)
* fb0078159d2463da149de54018fca79a9447153e (4.0)

> Incorrect Kendall Tau calc due to data type mismatch
> ----------------------------------------------------
>
>         Key: MATH-1277
>         URL: https://issues.apache.org/jira/browse/MATH-1277
>     Project: Commons Math
>  Issue Type: Bug
> Affects Versions: 3.5
>    Reporter: Marc Rosen
>    Priority: Minor
>      Labels: Kendall, correlation, tau
>     Fix For: 4.0, 3.6
>
> The Kendall Tau calculation should return a number from -1.0 to 1.0. Due to a mixing of ints and longs, a mistake occurs on large columns (arrays) passed to the function. An array size of > 50350 triggers the condition in my case, although it may be data dependent. The 3.5 library returns 2.6 as a result (outside of the defined range of Kendall Tau). With the cast to long below, the result returns to its expected value.
> The affected method is commons.math3.stat.correlation.KendallsCorrelation.correlation. Here's the sample code I used; I added the cast to long of swaps:
> {noformat}
> int swaps = 1077126315;
> final long numPairs = sum(50350 - 1);
> long tiedXPairs = 0;
> long tiedXYPairs = 0;
> long tiedYPairs = 0;
>
> final long concordantMinusDiscordant = numPairs - tiedXPairs - tiedYPairs + tiedXYPairs - 2 * (long) swaps;
> final double nonTiedPairsMultiplied = 1.6e18;
> double myTest = concordantMinusDiscordant / FastMath.sqrt(nonTiedPairsMultiplied);
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
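The root cause is a plain int overflow: 2 * swaps is evaluated in 32-bit arithmetic before being widened to long, so for the reporter's value the product wraps around to a negative number. A standalone demonstration:

```java
public class KendallOverflowSketch {
    // Without the cast: the multiplication happens in int arithmetic and
    // overflows; only the wrapped result is widened to long.
    static long withoutCast(int swaps) {
        return 2 * swaps;
    }

    // With the cast: swaps is widened first, so the multiplication is
    // carried out in 64-bit arithmetic and cannot overflow here.
    static long withCast(int swaps) {
        return 2 * (long) swaps;
    }

    public static void main(String[] args) {
        final int swaps = 1077126315; // value from the bug report
        System.out.println(withoutCast(swaps)); // -2140714666 (wrapped)
        System.out.println(withCast(swaps));    // 2154252630
    }
}
```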
[jira] [Resolved] (MATH-1276) Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling
[ https://issues.apache.org/jira/browse/MATH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1276. -- Resolution: Fixed Fix Version/s: 3.6 4.0 > Override GeometricDistribution.inverseCumulativeProbability() to speed up > sampling > -- > > Key: MATH-1276 > URL: https://issues.apache.org/jira/browse/MATH-1276 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1276.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1276) Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling
[ https://issues.apache.org/jira/browse/MATH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14900035#comment-14900035 ] Otmar Ertl commented on MATH-1276: -- Gilles, thank you for your suggestions. I have split the patch into multiple commits: 2fd6c8fa1e392f985ee860d8ee154a612ddde569 (4.0) df1db29ab401e5fd867f142f949e8feda12604d4 (3.6) > Override GeometricDistribution.inverseCumulativeProbability() to speed up > sampling > -- > > Key: MATH-1276 > URL: https://issues.apache.org/jira/browse/MATH-1276 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1276.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1276) Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling
[ https://issues.apache.org/jira/browse/MATH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1276: - Flags: Patch > Override GeometricDistribution.inverseCumulativeProbability() to speed up > sampling > -- > > Key: MATH-1276 > URL: https://issues.apache.org/jira/browse/MATH-1276 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Attachments: MATH-1276.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MATH-1276) Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling
Otmar Ertl created MATH-1276: Summary: Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling Key: MATH-1276 URL: https://issues.apache.org/jira/browse/MATH-1276 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1276) Override GeometricDistribution.inverseCumulativeProbability() to speed up sampling
[ https://issues.apache.org/jira/browse/MATH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1276: - Attachment: MATH-1276.patch Any objections to the proposed patch? > Override GeometricDistribution.inverseCumulativeProbability() to speed up > sampling > -- > > Key: MATH-1276 > URL: https://issues.apache.org/jira/browse/MATH-1276 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Attachments: MATH-1276.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
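For context, the speedup of such an override comes from the closed-form inverse CDF of the geometric distribution: on the support {0, 1, 2, ...} with success probability p, F(k) = 1 - (1-p)^(k+1), which inverts to k = ceil(log(1-u)/log(1-p)) - 1. The sketch below illustrates that formula; it is not the attached patch:

```java
public class GeometricInverseCdfSketch {
    // Inverse CDF of the geometric distribution on {0, 1, 2, ...} with
    // success probability p: the smallest k with 1 - (1-p)^(k+1) >= u.
    static int inverseCumulativeProbability(double p, double u) {
        if (u <= 0.0) {
            return 0;
        }
        // k = ceil(log(1-u) / log(1-p)) - 1; log1p improves accuracy
        // when p or u is close to 0.
        return (int) Math.ceil(Math.log1p(-u) / Math.log1p(-p)) - 1;
    }

    public static void main(String[] args) {
        // With p = 0.5: F(0) = 0.5 and F(1) = 0.75, so u = 0.6 maps to k = 1.
        System.out.println(inverseCumulativeProbability(0.5, 0.6)); // 1
    }
}
```

Sampling then reduces to applying this function to a uniform variate, which is why overriding inverseCumulativeProbability() avoids the generic numeric root search of the base class.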
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790993#comment-14790993 ]

Otmar Ertl commented on MATH-1246:
----------------------------------
Sounds OK to me; due to its slow convergence, avoiding Monte Carlo as much as possible is generally a good idea.

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>         Key: MATH-1246
>         URL: https://issues.apache.org/jira/browse/MATH-1246
>     Project: Commons Math
>  Issue Type: Bug
>    Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the distribution of a D-statistic for m-n sets with no ties. No warning or special handling is delivered in the presence of ties.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Resolved] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
[ https://issues.apache.org/jira/browse/MATH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1274. -- Resolution: Fixed > Represent Kolmogorov-Smirnov statistic as long value > > > Key: MATH-1274 > URL: https://issues.apache.org/jira/browse/MATH-1274 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1274.patch > > > In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is > internally represented as double value. However, the KS-statistic can be > accurately represented as long value if multiplied by both sample sizes. This > approach saves some floating point operations and avoids difficulties when > comparing two KS-statistics. Currently, Precision.compareTo with some > tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
[ https://issues.apache.org/jira/browse/MATH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790896#comment-14790896 ] Otmar Ertl commented on MATH-1274: -- Applied the patch after renaming the methods to integralXxx as proposed by Phil, thanks! fb7e1e265dd9e560b3a3127a6593b6602f60026c (4.0) 1c9c43c1d4bb76d7e47cdfc9c6681a38305a95e4 (3.6) > Represent Kolmogorov-Smirnov statistic as long value > > > Key: MATH-1274 > URL: https://issues.apache.org/jira/browse/MATH-1274 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1274.patch > > > In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is > internally represented as double value. However, the KS-statistic can be > accurately represented as long value if multiplied by both sample sizes. This > approach saves some floating point operations and avoids difficulties when > comparing two KS-statistics. Currently, Precision.compareTo with some > tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
[ https://issues.apache.org/jira/browse/MATH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1274: - Fix Version/s: 3.6 4.0 > Represent Kolmogorov-Smirnov statistic as long value > > > Key: MATH-1274 > URL: https://issues.apache.org/jira/browse/MATH-1274 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Fix For: 4.0, 3.6 > > Attachments: MATH-1274.patch > > > In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is > internally represented as double value. However, the KS-statistic can be > accurately represented as long value if multiplied by both sample sizes. This > approach saves some floating point operations and avoids difficulties when > comparing two KS-statistics. Currently, Precision.compareTo with some > tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746924#comment-14746924 ]

Otmar Ertl commented on MATH-1246:
----------------------------------
The p-value is the probability that the observed KS-statistic is smaller than the KS-statistic I get if two random samples of the same sizes are drawn from the underlying distribution. In the no-ties case this value can be calculated exactly without knowing the underlying distribution. In the case of ties, the p-value cannot be calculated exactly. There are different approaches to calculating an approximation of the p-value for the tie case:
* Approximation of the underlying distribution by the observed data, which definitely makes sense for bootstrapping, where the sample sizes are usually large. However, in our case the underlying distribution is estimated from small sample sizes, since this is the domain of the exactP method. Therefore, I doubt that the calculated p-value deserves the label "exact" in this case.
* Assumption that all orderings of observed equal values are equally likely, which of course is also an approximation.
I still do not understand why the first approach should be the true one.

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>         Key: MATH-1246
>         URL: https://issues.apache.org/jira/browse/MATH-1246
>     Project: Commons Math
>  Issue Type: Bug
>    Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the distribution of a D-statistic for m-n sets with no ties. No warning or special handling is delivered in the presence of ties.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
[ https://issues.apache.org/jira/browse/MATH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1274: - Attachment: MATH-1274.patch > Represent Kolmogorov-Smirnov statistic as long value > > > Key: MATH-1274 > URL: https://issues.apache.org/jira/browse/MATH-1274 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Attachments: MATH-1274.patch > > > In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is > internally represented as double value. However, the KS-statistic can be > accurately represented as long value if multiplied by both sample sizes. This > approach saves some floating point operations and avoids difficulties when > comparing two KS-statistics. Currently, Precision.compareTo with some > tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
[ https://issues.apache.org/jira/browse/MATH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1274: - Flags: Patch > Represent Kolmogorov-Smirnov statistic as long value > > > Key: MATH-1274 > URL: https://issues.apache.org/jira/browse/MATH-1274 > Project: Commons Math > Issue Type: Improvement >Reporter: Otmar Ertl > Attachments: MATH-1274.patch > > > In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is > internally represented as double value. However, the KS-statistic can be > accurately represented as long value if multiplied by both sample sizes. This > approach saves some floating point operations and avoids difficulties when > comparing two KS-statistics. Currently, Precision.compareTo with some > tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MATH-1274) Represent Kolmogorov-Smirnov statistic as long value
Otmar Ertl created MATH-1274: Summary: Represent Kolmogorov-Smirnov statistic as long value Key: MATH-1274 URL: https://issues.apache.org/jira/browse/MATH-1274 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl In KolmogorovSmirnovTest.java the 2-sample Kolmogorov-Smirnov statistic is internally represented as double value. However, the KS-statistic can be accurately represented as long value if multiplied by both sample sizes. This approach saves some floating point operations and avoids difficulties when comparing two KS-statistics. Currently, Precision.compareTo with some tolerance must be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
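The integral representation can be sketched as follows: walk the merged, sorted samples and track n·(number of x seen) − m·(number of y seen); the maximum absolute value of this counter equals D·m·n exactly in long arithmetic. This is an illustrative sketch, not the attached patch, and ties receive no special treatment here:

```java
import java.util.Arrays;

public class IntegralKSSketch {
    // Returns D * m * n as an exact long, where D is the 2-sample
    // Kolmogorov-Smirnov statistic. Each x-step raises the scaled CDF
    // difference by n, each y-step lowers it by m.
    static long integralD(double[] x, double[] y) {
        final double[] xs = x.clone();
        final double[] ys = y.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        final long m = xs.length;
        final long n = ys.length;
        long diff = 0;
        long max = 0;
        int i = 0, j = 0;
        while (i < xs.length && j < ys.length) {
            if (xs[i] <= ys[j]) {
                diff += n;
                i++;
            } else {
                diff -= m;
                j++;
            }
            max = Math.max(max, Math.abs(diff));
        }
        // Once one sample is exhausted, |diff| shrinks monotonically to 0,
        // so the maximum has already been recorded.
        return max;
    }

    public static void main(String[] args) {
        // D = 0.5 for these samples, so the integral statistic is 0.5 * 4 * 4.
        System.out.println(integralD(new double[]{1, 3, 3, 5},
                                     new double[]{2, 4, 4, 6})); // 8
    }
}
```

Comparing two statistics then becomes an exact long comparison, with no need for Precision.compareTo and a tolerance.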
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745495#comment-14745495 ]

Otmar Ertl commented on MATH-1246:
----------------------------------
After some research I have the feeling we are discussing how to define zero divided by zero. There are at least two methods to calculate a reasonable p-value in the presence of ties:
# The method you have proposed, which seems to be also known as the permutation method. Averaging only over some permutations and averaging over all possible permutations correspond to the bootstrap method and the current exactP() implementation, respectively.
# Another method is to add some jitter to the sampled values to break ties. (This google search https://www.google.com/?gfe_rd=cr=qCL4VaKvNIWI8QfLibD4Bg_rd=cr=1#q=jitter+kolmogorov+smirnov immediately gives a couple of references.) This method corresponds to the method I have proposed. Adding small random values to ties to get a strict ordering corresponds to choosing a random ordering. Averaging over all possible orderings would also lead to a well-defined p-value.
Maybe the user should be able to choose the method for resolving ties?

> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>         Key: MATH-1246
>         URL: https://issues.apache.org/jira/browse/MATH-1246
>     Project: Commons Math
>  Issue Type: Bug
>    Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the distribution of a D-statistic for m-n sets with no ties. No warning or special handling is delivered in the presence of ties.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
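The jitter approach mentioned in item 2 above can be sketched as follows. The helper and the jitter scale are hypothetical; a real implementation would have to choose the scale relative to the spacing of the data so that only ties are affected:

```java
import java.util.Arrays;
import java.util.Random;

public class JitterSketch {
    // Perturb each observation by a tiny random offset so that ties are
    // broken by a random ordering (assumed helper, not Commons Math API).
    static double[] jitter(double[] sample, Random rng, double scale) {
        final double[] out = new double[sample.length];
        for (int i = 0; i < sample.length; i++) {
            out[i] = sample[i] + (rng.nextDouble() - 0.5) * scale;
        }
        return out;
    }

    public static void main(String[] args) {
        final Random rng = new Random(628);
        final double[] x = {1, 3, 3, 5};
        // After jittering, the tied 3s become distinct with probability 1,
        // so the D-statistic of the jittered samples is well defined.
        System.out.println(Arrays.toString(jitter(x, rng, 1e-9)));
    }
}
```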
[jira] [Comment Edited] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745495#comment-14745495 ] Otmar Ertl edited comment on MATH-1246 at 9/15/15 2:00 PM: ---
After some research I have the feeling we are discussing how to define zero divided by zero. There are at least two methods to calculate a reasonable p-value in the presence of ties:
# The method you have proposed, which also seems to be known as the permutation method. Averaging over only some permutations corresponds to the bootstrap method, while averaging over all possible permutations corresponds to the current exactP() implementation.
# Another method is to add some jitter to the sampled values to break ties. (This google search https://www.google.com/?gfe_rd=cr=qCL4VaKvNIWI8QfLibD4Bg_rd=cr=1#q=jitter+kolmogorov+smirnov immediately gives a couple of references.) This method corresponds to the one I have proposed. Adding small random values to ties to obtain a strict ordering corresponds to choosing a random ordering. Averaging over all possible orderings would also lead to a well-defined p-value.
Maybe the user should be able to choose the method for resolving ties?
> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---
>
> Key: MATH-1246
> URL: https://issues.apache.org/jira/browse/MATH-1246
> Project: Commons Math
> Issue Type: Bug
> Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the
> distribution of a D-statistic for m-n sets with no ties. No warning or
> special handling is delivered in the presence of ties.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
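The jitter idea in method 2 can be sketched as follows. This is an illustrative helper, not a Commons Math API; the class name and the jitter scale are assumptions, and the scale should be chosen small relative to the gaps between distinct sample values so that only ties can change their relative order:

```java
import java.util.Random;

// Sketch of the jitter approach described above: perturb each sample value by
// a random amount small enough that only exact ties can change their relative
// order. All names and the scale parameter are illustrative assumptions.
public class JitterTies {
    public static double[] jitter(double[] sample, double scale, Random rng) {
        double[] out = new double[sample.length];
        for (int i = 0; i < sample.length; i++) {
            // uniform jitter in (-scale/2, scale/2) breaks exact ties
            out[i] = sample[i] + (rng.nextDouble() - 0.5) * scale;
        }
        return out;
    }
}
```

After jittering both samples, the usual D-statistic is well defined, and one jittered run corresponds to one random resolution of the ties.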
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744096#comment-14744096 ] Otmar Ertl commented on MATH-1246: --
I still have doubts:
# It makes a difference whether there are ties only within one sample or whether the same value exists at least once in both samples. In the first case the D-statistic is well defined; in the latter case it is undefined. For example, if x = (1, 3, 3, 5) and y = (2, 4, 4, 6), D = 0.5. On the other hand, if x = (1, 3, 3, 5) and y = (2, 3, 3, 6), the D-statistic could be any value between 0.25 and 0.75. The current implementation returns the minimum (0.25 in this case), but this seems to be a quite arbitrary choice. Furthermore, the implementation does not distinguish between these two cases (see the hasTies() method).
# If the current implementation of exactP() follows the definition you described, I do not really understand why the following two statements return different values:
{code}
new KolmogorovSmirnovTest().exactP(new double[]{0.9, 1.0, 1.1}, new double[]{0.0, 0.0}, false)
new KolmogorovSmirnovTest().exactP(new double[]{1.0, 1.0, 1.0}, new double[]{0.0, 0.0}, false)
{code}
The D-statistic is well-defined and equal to 1 in both cases.
# \[1\] describes estimating the p-value using bootstrapping. I am not sure whether an exact definition can be derived from there, since bootstrapping in general is not a consistent estimation method.
[jira] [Commented] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739384#comment-14739384 ] Otmar Ertl commented on MATH-1269: --
+1 for the 2nd patch. I have run the test over the range (MAX_LONG - 1, MAX_LONG + 1); I think that should be enough to assume that there is a clean transition from finite to infinite.
> FastMath.exp may return NaN for non-NaN arguments
> -
>
> Key: MATH-1269
> URL: https://issues.apache.org/jira/browse/MATH-1269
> Project: Commons Math
> Issue Type: Bug
> Affects Versions: 4.0
> Reporter: Otmar Ertl
> Attachments: MATH-1269.patch, MATH-1269.patch
>
> I have observed that FastMath.exp(709.8125) returns NaN. However, the
> exponential function must never return NaN (if the argument is not NaN). The
> result must always be non-negative or positive infinity.
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739536#comment-14739536 ] Otmar Ertl commented on MATH-1246: --
The Monte Carlo approach can be modified by simultaneously sampling D. Here is an outline of how this sampling could be achieved:
# First, determine the set of points P = (p_i) for which equal values exist in both samples.
# Determine the maximum difference of the CDFs over all values not included in P.
# Determine for each point p_i whether it is possible at all to get a CDF difference that is larger than the calculated maximum. If not, those points can be excluded from P. Otherwise, remember the difference of the CDFs d_i just before that point and the numbers of equal values in both samples, n_i and m_i, respectively.
# Within each Monte Carlo iteration, generate for each point p_i a random ordering of the n_i and m_i equal values (using a function similar to fillBooleanArrayRandomlyWithFixedNumberTrueValues). Determine the maximum differences of the CDFs at all points p_i using the random ordering and d_i, and take the maximum of them and the maximum calculated in 2), which gives us the sampled (observed) D-statistic that is finally compared to curD.
Anyway, we should find the right definition first before implementing anything.
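Step 4 mentions a helper similar to fillBooleanArrayRandomlyWithFixedNumberTrueValues. A generic version of such a helper might look like the following sketch (illustrative names, not the Commons Math internal): it places exactly k true values uniformly at random via a Fisher-Yates shuffle.

```java
import java.util.Random;

// Sketch of a helper like the one referenced above: fill a boolean array of
// length n with exactly k true values, with every arrangement equally likely.
public class RandomBooleanFill {
    public static boolean[] randomFill(int n, int k, Random rng) {
        boolean[] b = new boolean[n];
        for (int i = 0; i < k; i++) {
            b[i] = true; // start with the first k entries true
        }
        // Fisher-Yates shuffle makes every arrangement of the k trues
        // equally likely while preserving their count
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            boolean tmp = b[i];
            b[i] = b[j];
            b[j] = tmp;
        }
        return b;
    }
}
```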
[jira] [Commented] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737181#comment-14737181 ] Otmar Ertl commented on MATH-1269: --
I have an objection. The patch changes the result of FastMath.exp(709.0001) from 8.219229343394329E307 to Double.POSITIVE_INFINITY. Although this kind of error is less harmful than the one caused by the bug, I would prefer a more rigorous solution.
[jira] [Commented] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737500#comment-14737500 ] Otmar Ertl commented on MATH-1246: --
I am thinking of another way to treat ties: The probability that two values sampled from a continuous distribution are equal is zero. One of them is always greater than the other. However, represented as doubles, we cannot distinguish them. Therefore, the best we can do is to treat both cases as equally likely. For example, if we have x = (0, 3, 5) and y = (5, 6, 7), we get two different values for the observed D-statistic. If we assume the value 5 in x to be smaller than that in y, we would get D=3. Otherwise, we would get D=2, both with probability 0.5. In the general case, we can determine a discrete distribution describing all possible values of the observed D-statistic. Finally, we calculate the p-value for each of those possible values and calculate the weighted average, which we take as the final p-value. Does this make sense? If yes, I think there is a way to adapt the new Monte Carlo approach.
[jira] [Commented] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737436#comment-14737436 ] Otmar Ertl commented on MATH-1269: --
In my opinion, simply changing NaNs to infinity can only be a short-term solution. Such a transformation at the end of the function is like admitting that we do not really know how our code works. This is not acceptable for such an essential function as the exponential function.
[jira] [Comment Edited] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737500#comment-14737500 ] Otmar Ertl edited comment on MATH-1246 at 9/9/15 9:19 PM: --
I am thinking of another way to treat ties: The probability that two values sampled from a continuous distribution are equal is zero. One of them is always greater than the other. However, represented as doubles, we cannot distinguish them. Therefore, the best we can do is to treat both cases as equally likely. For example, if we have x = (0, 3, 5) and y = (5, 6, 7), we get two different values for the observed D-statistic. If we assume the value 5 in x to be smaller than that in y, we would get D=1. Otherwise, we would get D=2/3, both with probability 0.5. In the general case, we can determine a discrete distribution describing all possible values of the observed D-statistic. Finally, we calculate the p-value for each of those possible values and calculate the weighted average, which we take as the final p-value. Does this make sense? If yes, I think there is a way to adapt the new Monte Carlo approach.
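The tie-resolution idea can be illustrated for the example x = (0, 3, 5), y = (5, 6, 7): resolving the tie at 5 both ways yields the two possible observed D values, each with probability 0.5. The dStatistic helper below is a plain two-sample ECDF comparison written for this sketch, not the Commons Math implementation:

```java
import java.util.Arrays;

// Illustrative sketch: compute the two-sample D-statistic (maximum absolute
// difference of the two empirical CDFs), then resolve a tie both ways by
// nudging the tied value slightly below or above its counterpart.
public class TieResolution {
    public static double dStatistic(double[] x, double[] y) {
        double[] xs = x.clone(), ys = y.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        double d = 0;
        int i = 0, j = 0;
        while (i < xs.length && j < ys.length) {
            double t = Math.min(xs[i], ys[j]); // next jump point
            while (i < xs.length && xs[i] <= t) i++;
            while (j < ys.length && ys[j] <= t) j++;
            d = Math.max(d, Math.abs((double) i / xs.length - (double) j / ys.length));
        }
        return d;
    }
}
```

For the example, nudging x's 5 just below y's 5 gives D = 1, nudging it just above gives D = 2/3, matching the two cases described in the comment.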
[jira] [Commented] (MATH-1262) Heap overflow in org.apache.commons.math4.special.BesselJ
[ https://issues.apache.org/jira/browse/MATH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734949#comment-14734949 ] Otmar Ertl commented on MATH-1262: --
+1 for updating the Javadoc and adding a note that memory needs are proportional to the order.
> Heap overflow in org.apache.commons.math4.special.BesselJ
> -
>
> Key: MATH-1262
> URL: https://issues.apache.org/jira/browse/MATH-1262
> Project: Commons Math
> Issue Type: Bug
> Affects Versions: 4.0
> Reporter: Neil Walkinshaw
> Priority: Minor
> Labels: newbie, performance
>
> This test case:
> {code:java}
> import org.apache.commons.math4.special.BesselJ;
> public class BesselJHeapBug {
>     public static void main(String[] args) {
>         BesselJ.value(1182054491, 3.589635306407856E-8D);
>     }
> }
> {code}
> throws the following exception:
> {code}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.commons.math4.special.BesselJ.rjBesl(BesselJ.java:247)
>     at org.apache.commons.math4.special.BesselJ.value(BesselJ.java:161)
>     at BesselJHeapBug.main(BesselJHeapBug.java:5)
> {code}
[jira] [Comment Edited] (MATH-1262) Heap overflow in org.apache.commons.math4.special.BesselJ
[ https://issues.apache.org/jira/browse/MATH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721676#comment-14721676 ] Otmar Ertl edited comment on MATH-1262 at 9/6/15 8:46 AM: --
-This [paper|http://www.cl.cam.ac.uk/~jrh13/papers/bessel.pdf] could be interesting.- EDIT: The paper states that the approach could be extended to moderate orders. However, I doubt that the method could handle large orders such as 1182054491 as given in the test case.
[jira] [Commented] (MATH-1262) Heap overflow in org.apache.commons.math4.special.BesselJ
[ https://issues.apache.org/jira/browse/MATH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732386#comment-14732386 ] Otmar Ertl commented on MATH-1262: --
The paper [C. Schwartz, Numerical Calculation of Bessel Functions, 2012|http://arxiv.org/pdf/1209.1547] demonstrates an interesting approach to calculating Bessel functions. Unfortunately, it does not contain an algorithm that can be implemented straightforwardly.
[jira] [Created] (MATH-1269) FastMath.exp may return NaN
Otmar Ertl created MATH-1269:
Summary: FastMath.exp may return NaN
Key: MATH-1269
URL: https://issues.apache.org/jira/browse/MATH-1269
Project: Commons Math
Issue Type: Bug
Affects Versions: 4.0
Reporter: Otmar Ertl
I have observed that FastMath.exp(709.8125) returns NaN. However, the exponential function must never return NaN. The result must always be non-negative or positive infinity.
[jira] [Updated] (MATH-1269) FastMath.exp may return NaN for non-NaN arguments
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1269: -
Summary: FastMath.exp may return NaN for non-NaN arguments (was: FastMath.exp may return NaN)
[jira] [Updated] (MATH-1269) FastMath.exp may return NaN
[ https://issues.apache.org/jira/browse/MATH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1269: -
Description: I have observed that FastMath.exp(709.8125) returns NaN. However, the exponential function must never return NaN (if the argument is not NaN). The result must always be non-negative or positive infinity. (was: I have observed that FastMath.exp(709.8125) returns NaN. However, the exponential function must never return NaN. The result must always be non-negative or positive infinity.)
[jira] [Commented] (MATH-1262) Heap overflow in org.apache.commons.math4.special.BesselJ
[ https://issues.apache.org/jira/browse/MATH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721676#comment-14721676 ] Otmar Ertl commented on MATH-1262: --
This [paper|http://www.cl.cam.ac.uk/~jrh13/papers/bessel.pdf] could be interesting.
[jira] [Commented] (MATH-1262) Heap overflow in org.apache.commons.math4.special.BesselJ
[ https://issues.apache.org/jira/browse/MATH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711484#comment-14711484 ] Otmar Ertl commented on MATH-1262: --
The current implementation allocates a double array with size equal to order+1. Hence, the OutOfMemoryError is not very surprising in this case with order equal to 1182054491. To fix this, we would need to implement a more space-efficient algorithm like the one used in Boost, see https://github.com/mirror/boost/blob/master/boost/math/special_functions/detail/bessel_jn.hpp.
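One way to avoid the O(order) array is a backward (Miller-type) three-term recurrence that keeps only two consecutive values in memory. The sketch below illustrates the idea only; it is not the Commons Math or Boost implementation, the starting offset and rescaling guard are rough assumptions, and handling extreme orders like the one in the test case would need far more care:

```java
// Sketch of an O(1)-memory Bessel J_n(x) via Miller's backward recurrence
// J_{k-1}(x) = (2k/x) J_k(x) - J_{k+1}(x), normalized with the identity
// J_0(x) + 2*sum_{k>=1} J_{2k}(x) = 1. Illustrative only; start offset and
// rescaling threshold are crude assumptions.
public class BesselMiller {
    public static double besselJ(int n, double x) {
        int start = Math.max(n, (int) x) + 50; // crude starting index
        double fkp1 = 0.0, fk = 1e-30;         // arbitrary small seed
        double norm = 0.0, result = 0.0;
        for (int k = start; k >= 1; k--) {
            double fkm1 = (2.0 * k / x) * fk - fkp1;
            fkp1 = fk;
            fk = fkm1;                          // fk now holds f_{k-1}
            if (k - 1 == n) {
                result = fk;                    // remember the value at order n
            }
            if ((k - 1) % 2 == 0 && k - 1 > 0) {
                norm += 2.0 * fk;               // even-order normalization terms
            }
            if (Math.abs(fk) > 1e100) {         // guard against overflow
                fk *= 1e-100; fkp1 *= 1e-100; norm *= 1e-100; result *= 1e-100;
            }
        }
        norm += fk;                             // add the J_0 term
        return result / norm;
    }
}
```

Only two recurrence values are alive at any time, so memory stays constant regardless of the order.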
[jira] [Resolved] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1220. --
Resolution: Fixed
More efficient sample() method for ZipfDistribution
---
Key: MATH-1220
URL: https://issues.apache.org/jira/browse/MATH-1220
Project: Commons Math
Issue Type: Improvement
Reporter: Otmar Ertl
Assignee: Otmar Ertl
Fix For: 4.0, 3.6
Attachments: Zipf Rejection Inversion Sampling Notes.pdf, patch_v1, patch_v2, patch_v3
Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time.
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709928#comment-14709928 ] Otmar Ertl commented on MATH-1220: --
Introduced the rejection-inversion sampling approach with commits
* 484196fc2e11b632964895b6662dc783d4538b35 (3.6)
* 9c51e5316babbd370bc32aed0fee134216726ec9 (4.0)
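For contrast with the constant-time rejection-inversion approach introduced above, the naive inverse-CDF sampler criticized in the issue description can be sketched as follows; each sample walks the CDF term by term and costs O(N) pow() evaluations (illustrative code, not the Commons Math implementation):

```java
import java.util.Random;

// Sketch of the O(N) inefficiency described in the issue: naive inverse-CDF
// sampling for a Zipf distribution over {1..n} with exponent s. Class and
// method names are illustrative.
public class NaiveZipfSampler {
    public static int sample(int n, double s, Random rng) {
        double norm = 0;
        for (int k = 1; k <= n; k++) {
            norm += Math.pow(k, -s); // generalized harmonic number H_{n,s}
        }
        double u = rng.nextDouble() * norm;
        double cdf = 0;
        for (int k = 1; k <= n; k++) {
            cdf += Math.pow(k, -s);  // one pow() per CDF term: O(n) per sample
            if (u <= cdf) {
                return k;            // first k whose cumulative mass covers u
            }
        }
        return n;
    }
}
```

Rejection-inversion replaces this linear walk with a constant number of evaluations per sample by inverting an integrable bounding function of the discrete probabilities.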
[jira] [Commented] (MATH-990) Improve performance of MathArrays.sortInPlace
[ https://issues.apache.org/jira/browse/MATH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707967#comment-14707967 ] Otmar Ertl commented on MATH-990: -
Applied the patch in the following commits:
* 72a46babeb0da02071e47425ebd55da915b98178 (4.0)
* 32c5f86125ea0ce1741a386eff0da5f8455f879c (3.6)
Improve performance of MathArrays.sortInPlace
-
Key: MATH-990
URL: https://issues.apache.org/jira/browse/MATH-990
Project: Commons Math
Issue Type: Improvement
Affects Versions: 3.2
Reporter: Gilles
Assignee: Otmar Ertl
Priority: Minor
Labels: benchmark
Fix For: 4.0
Attachments: patch
Performance suffers from lots of copying.
[jira] [Updated] (MATH-990) Improve performance of MathArrays.sortInPlace
[ https://issues.apache.org/jira/browse/MATH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-990: -
Fix Version/s: 3.6
[jira] [Resolved] (MATH-990) Improve performance of MathArrays.sortInPlace
[ https://issues.apache.org/jira/browse/MATH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-990. -
Resolution: Fixed
[jira] [Commented] (MATH-990) Improve performance of MathArrays.sortInPlace
[ https://issues.apache.org/jira/browse/MATH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705515#comment-14705515 ] Otmar Ertl commented on MATH-990: -
I think I do not have the rights to reassign this issue.
[jira] [Commented] (MATH-1258) compute() method in classes in org.apache.commons.math4.ml.distance package
[ https://issues.apache.org/jira/browse/MATH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705509#comment-14705509 ] Otmar Ertl commented on MATH-1258: --
I have just added the missing entries to changes.xml. I think we can close this issue now; it seems that I do not have the required rights to do that.
compute() method in classes in org.apache.commons.math4.ml.distance package
---
Key: MATH-1258
URL: https://issues.apache.org/jira/browse/MATH-1258
Project: Commons Math
Issue Type: Bug
Reporter: Gunel Jahangirova
Priority: Minor
Attachments: patch.diff, pre-MATH-1258.patch
Hi! There are five classes, CanberraDistance, ChebyshevDistance, EarthMoversDistance, EuclideanDistance and ManhattanDistance, in the org.apache.commons.math4.ml.distance package, which compute different types of distances. Each of them contains a method compute(double[] a, double[] b) that accepts two double arrays. However, if the length of array a is greater than the length of array b, the compute() method in all five classes throws java.lang.ArrayIndexOutOfBoundsException. For example:
private void test0() {
    CanberraDistance distance = new CanberraDistance();
    final double[] a = { 1, 2, 3, 4, 9, 4 };
    final double[] b = { -5, -6, 7, 4, 3 };
    distance.compute(a, b);
}
[jira] [Resolved] (MATH-1258) compute() method in classes in org.apache.commons.math4.ml.distance package
[ https://issues.apache.org/jira/browse/MATH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1258. --
Resolution: Fixed
Fix Version/s: 3.6, 4.0
[jira] [Commented] (MATH-1258) compute() method in classes in org.apache.commons.math4.ml.distance package
[ https://issues.apache.org/jira/browse/MATH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705674#comment-14705674 ] Otmar Ertl commented on MATH-1258: --
Thanks! It worked now.
[jira] [Commented] (MATH-1258) compute() method in classes in org.apache.commons.math4.ml.distance package
[ https://issues.apache.org/jira/browse/MATH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705180#comment-14705180 ] Otmar Ertl commented on MATH-1258: --
Committed my patch:
7934bfea106206d2840ba062eef105001601588a (3.6)
5ca0a1c3564d35293a5ecf03e5f32e6f0f6f445c (4.0)
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705207#comment-14705207 ] Otmar Ertl commented on MATH-1220: -- If there are no concerns, I will commit [^patch_v3]. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: Zipf Rejection Inversion Sampling Notes.pdf, patch_v1, patch_v2, patch_v3 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1258) compute() method in classes in org.apache.commons.math4.ml.distance package
[ https://issues.apache.org/jira/browse/MATH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1258: - Attachment: patch.diff I also do not like throwing an ArrayIndexOutOfBoundsException, for two reasons: * The current behavior is asymmetric: distance(a, b) succeeds while distance(b, a) fails with an exception if a and b have different lengths. Since distance functions are symmetric by definition, their behavior should be symmetric too, even in case of invalid parameters. * The element-by-element functions in MathArrays throw a DimensionMismatchException instead of an ArrayIndexOutOfBoundsException. For the sake of consistency, I think throwing a DimensionMismatchException is more appropriate for distance functions as well. Please see the attached patch. compute() method in classes in org.apache.commons.math4.ml.distance package --- Key: MATH-1258 URL: https://issues.apache.org/jira/browse/MATH-1258 Project: Commons Math Issue Type: Bug Reporter: Gunel Jahangirova Priority: Minor Attachments: patch.diff -- This message was sent by Atlassian JIRA (v6.3.4#6332)
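The symmetric check argued for above can be sketched as follows. This is an illustration only, not the attached patch: IllegalArgumentException stands in for Commons Math's DimensionMismatchException, and the manhattan() helper is a hypothetical stand-in for the library's compute() methods.

```java
// Sketch: a symmetric dimension check performed before any array access,
// so that compute(a, b) and compute(b, a) fail in exactly the same way.
public class DistanceCheckSketch {
    // IllegalArgumentException stands in for DimensionMismatchException here.
    static void checkEqualLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "dimension mismatch: " + a.length + " != " + b.length);
        }
    }

    // Hypothetical stand-in for a distance implementation such as ManhattanDistance.compute().
    static double manhattan(double[] a, double[] b) {
        checkEqualLength(a, b); // fails symmetrically, before any indexing
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = { 1, 2, 3, 4, 9, 4 };
        double[] b = { -5, -6, 7, 4, 3 };
        boolean abFails = false, baFails = false;
        try { manhattan(a, b); } catch (IllegalArgumentException e) { abFails = true; }
        try { manhattan(b, a); } catch (IllegalArgumentException e) { baFails = true; }
        // Both argument orders now fail alike, unlike the asymmetric
        // ArrayIndexOutOfBoundsException behavior described in the issue.
        System.out.println(abFails + " " + baFails); // true true
    }
}
```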
[jira] [Commented] (MATH-1257) NormalDistribution.cumulativeProbability() suffers from cancellation
[ https://issues.apache.org/jira/browse/MATH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703489#comment-14703489 ] Otmar Ertl commented on MATH-1257: -- +1 for the patch, it eliminates one source of error. For values much smaller than -1 the regularizedGammaQ function evaluates to a very small positive value. The computation used to be like 0.5*((regularizedGammaQ - 1) + 1). With the patch the computation is now 0.5*regularizedGammaQ, obviously improving accuracy. NormalDistribution.cumulativeProbability() suffers from cancellation Key: MATH-1257 URL: https://issues.apache.org/jira/browse/MATH-1257 Project: Commons Math Issue Type: Bug Affects Versions: 3.5 Reporter: Bill Murphy Priority: Minor Attachments: MATH-1257.patch I see the following around line 194: {noformat} return 0.5 * (1 + Erf.erf(dev / (standardDeviation * SQRT2))); {noformat} When erf() returns a value very close to -1, this cancels in the addition with the 1.0, which leads to poor precision in the results. I would suggest changing this line to read more like: {noformat} return 0.5 * Erf.erfc(-dev / (standardDeviation * SQRT2)); {noformat} Should you want some test cases for extreme values (one might argue that within 10 standard deviations isn't all that extreme) then you can check the following: http://www.jstatsoft.org/v52/i07/ then look in the v52i07-xls.zip at replication-01-distribution-standard-normal.xls I think you will also find that evaluation of expressions such as {noformat}NormalDistribution( 0, 1 ).cumulativeProbability( -10.0 );{noformat} are pretty far off. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
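The cancellation described in the comment can be reproduced in isolation. The value of q below is made up purely for illustration, standing in for the tiny regularizedGammaQ result in the far lower tail:

```java
// Sketch of the cancellation the patch removes: with q a tiny positive value,
// 0.5 * ((q - 1) + 1) collapses to 0 because q - 1 rounds to exactly -1.0
// (q is far below half an ulp of 1), while 0.5 * q keeps full precision.
public class CancellationDemo {
    public static void main(String[] args) {
        double q = 1.5e-23;                            // hypothetical tail value
        double withCancellation = 0.5 * ((q - 1) + 1); // (q - 1) rounds to -1.0
        double direct = 0.5 * q;                       // what the patch computes
        System.out.println(withCancellation); // 0.0
        System.out.println(direct);           // 7.5E-24
    }
}
```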
[jira] [Updated] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1220: - Attachment: patch_v3 I have revised the patch that makes use of rejection-inversion sampling. I have improved documentation and inlined a proof that shows that the generated random numbers are really Zipf distributed. I think the new approach should be more comprehensible now. I would be grateful for a review. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: Zipf Rejection Inversion Sampling Notes.pdf, patch_v1, patch_v2, patch_v3
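For context, here is a minimal sketch of why the old approach is expensive: inverting the Zipf CDF by a linear scan needs up to N power evaluations per sample. This is an illustration only (names, the seed, and parameters n and s are made up); the attached patch replaces this with rejection-inversion sampling, which is O(1) per sample.

```java
import java.util.Random;

// Naive O(N) Zipf sampling by CDF inversion, for illustration only.
public class NaiveZipfSampler {
    public static void main(String[] args) {
        int n = 1000;   // number of points of the distribution (illustrative)
        double s = 1.2; // exponent (illustrative)

        // Normalization constant H(n, s) = sum_{k=1..n} k^-s: O(N) pow() calls.
        double h = 0;
        for (int k = 1; k <= n; k++) {
            h += Math.pow(k, -s);
        }

        Random rng = new Random(42);
        double u = rng.nextDouble() * h;

        // Invert the CDF by accumulating terms until u is reached:
        // another O(N) pow() calls in the worst case, for every single sample.
        int sample = n;
        double cum = 0;
        for (int k = 1; k <= n; k++) {
            cum += Math.pow(k, -s);
            if (cum >= u) {
                sample = k;
                break;
            }
        }
        System.out.println(sample >= 1 && sample <= n); // true: sample is in range
    }
}
```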
[jira] [Updated] (MATH-1254) Unit test CorrelatedRandomVectorGeneratorTest is flaky
[ https://issues.apache.org/jira/browse/MATH-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1254: - Attachment: seed628.patch Unit test CorrelatedRandomVectorGeneratorTest is flaky -- Key: MATH-1254 URL: https://issues.apache.org/jira/browse/MATH-1254 Project: Commons Math Issue Type: Bug Affects Versions: 4.0 Reporter: Otmar Ertl Priority: Minor Attachments: seed628.patch CorrelatedRandomVectorGeneratorTest.testSampleWithZeroCovariance sometimes fails. The failure can be reproduced by setting the seed of the JDKRandomGenerator in createSampler equal to 628. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MATH-1254) Unit test CorrelatedRandomVectorGeneratorTest is flaky
Otmar Ertl created MATH-1254: Summary: Unit test CorrelatedRandomVectorGeneratorTest is flaky Key: MATH-1254 URL: https://issues.apache.org/jira/browse/MATH-1254 Project: Commons Math Issue Type: Bug Affects Versions: 4.0 Reporter: Otmar Ertl Priority: Minor CorrelatedRandomVectorGeneratorTest.testSampleWithZeroCovariance sometimes fails. The failure can be reproduced by setting the seed of the JDKRandomGenerator in createSampler equal to 628. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1253) Skewness could get more precision from slightly reordered code.
[ https://issues.apache.org/jira/browse/MATH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662994#comment-14662994 ] Otmar Ertl commented on MATH-1253: -- The given example is problematic anyway, because the numbers (1.234E11, 1.234E51, 1.234E101, 1.234E151) are many orders of magnitude apart, for which cancellation effects are a major issue when using floating point types. I can reproduce the NaN result for values that have either large or small exponents, for example (1.234E148, 1.234E149, 1.234E150, 1.234E151) and (1.234E-148, 1.234E-149, 1.234E-150, 1.234E-151). The proposed code does not avoid these over-/underflows either, because the division is performed after the two multiplications which cause the over-/underflow. However, the following code would work: {code} final double variance = (accum - (accum2 * accum2 / length)) / (length - 1); final double stdDevInverse = 1d / FastMath.sqrt(variance); double accum3 = 0.0; for (int i = begin; i < begin + length; i++) { final double d = (values[i] - m) * stdDevInverse; accum3 += d * d * d; } {code} (Here the inverse is precomputed, in order to avoid additional divisions, which are likely to be more expensive than multiplications.) Nevertheless, I am not sure if we really should fix this. The variance calculation is also prone to over-/underflows for numbers greater than sqrt(Double.MAX_VALUE) and smaller than sqrt(Double.MIN_VALUE), respectively. So the fix would slightly extend the valid input range from (cbrt(Double.MIN_VALUE), cbrt(Double.MAX_VALUE)) to (sqrt(Double.MIN_VALUE), sqrt(Double.MAX_VALUE)) at the expense of an additional multiplication within the loop. Of course, if there were also an accurate variance calculation method that avoids over-/underflows for values outside of (sqrt(Double.MIN_VALUE), sqrt(Double.MAX_VALUE)), this fix would make much more sense to me. Skewness could get more precision from slightly reordered code. 
--- Key: MATH-1253 URL: https://issues.apache.org/jira/browse/MATH-1253 Project: Commons Math Issue Type: Bug Affects Versions: 3.5 Reporter: Bill Murphy Priority: Minor In Skewness.java, approx line 180, there is code like: {noformat} double accum3 = 0.0; for (int i = begin; i < begin + length; i++) { final double d = values[i] - m; accum3 += d * d * d; } accum3 /= variance * FastMath.sqrt(variance); {noformat} If the division was moved into the for loop, accum3 would be less likely to overflow to Infinity (or -Infinity). This might allow the computation to return a result in a case such as: {noformat} double[] numArray = { 1.234E11, 1.234E51, 1.234E101, 1.234E151 }; Skewness skew = new Skewness(); double sk = skew.evaluate( numArray ); {noformat} Currently, this returns NaN, but I'd prefer it returned approx 1.154700538379252. The change I'm proposing would have the code instead read like: {noformat} double accum3 = 0.0; double divisor = variance * FastMath.sqrt(variance); for (int i = begin; i < begin + length; i++) { final double d = values[i] - m; accum3 += d * d * d / divisor; } {noformat} Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
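The reordering discussed above can be demonstrated in a self-contained form. This is an illustration, not the Commons Math code: Math.sqrt stands in for FastMath.sqrt, the bias-correction factor of the sample skewness is omitted, and the input values are chosen only to trigger the overflow in the naive ordering.

```java
// Sketch: cubing raw deviations overflows for large inputs, while scaling
// each deviation by 1/stdDev before cubing keeps the accumulation finite.
public class SkewnessScalingDemo {
    public static void main(String[] args) {
        double[] values = { 1.234e148, 1.234e149, 1.234e150, 1.234e151 };
        int n = values.length;

        double m = 0;
        for (double v : values) m += v;
        m /= n;

        double accum = 0, accum2 = 0;
        for (double v : values) {
            double d = v - m;
            accum += d * d;
            accum2 += d;
        }
        double variance = (accum - (accum2 * accum2 / n)) / (n - 1);
        double stdDevInverse = 1d / Math.sqrt(variance);

        // Naive ordering: cube first, divide later -> d*d*d overflows.
        double naive = 0;
        for (double v : values) {
            double d = v - m;
            naive += d * d * d; // |d|^3 exceeds Double.MAX_VALUE here
        }
        naive /= variance * Math.sqrt(variance);

        // Reordered (as suggested in the comment): scale before cubing.
        double scaled = 0;
        for (double v : values) {
            double d = (v - m) * stdDevInverse; // d is now of order 1
            scaled += d * d * d;
        }

        System.out.println(Double.isNaN(naive) || Double.isInfinite(naive)); // true
        System.out.println(Double.isFinite(scaled));                         // true
    }
}
```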
[jira] [Commented] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631814#comment-14631814 ] Otmar Ertl commented on MATH-1242: -- Fixed shuffle algorithm and added unit test in the following commits: * 4.0 cf4416a84203ff9d360d4398da3b166e6d0c72b9 * 3.6 49b4bbd41b116c55083300d864e824e22b7127df Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: bug_fix_and_unit_test.patch, modified.patch, patch, patch The advantages of the new implementation are: * It has linear run-time complexity because sorting of data is avoided * It consumes less than half as many random numbers (min(n,m) instead of (n+m)) * It allocates less memory The speed-up can be verified using the testTwoSampleMonteCarloPerformance() unit test. Here is the output of the test on my machine using the old monteCarloP() implementation {noformat} n=2, m=5000, time=16.894s n=3, m=, time=11.917s n=4, m=2500, time=8.69s n=5, m=2000, time=7.38s n=6, m=1666, time=6.235s n=7, m=1428, time=5.472s n=8, m=1250, time=4.517s n=9, m=, time=4.246s n=10, m=1000, time=3.797s n=11, m=909, time=3.502s n=12, m=833, time=3.249s n=13, m=769, time=3.0s n=14, m=714, time=2.816s n=15, m=666, time=2.669s n=16, m=625, time=2.44s n=17, m=588, time=2.443s n=18, m=555, time=2.349s n=19, m=526, time=2.219s n=20, m=500, time=2.135s n=21, m=476, time=2.022s n=22, m=454, time=1.952s n=23, m=434, time=1.907s n=24, m=416, time=1.823s n=25, m=400, time=1.793s n=26, m=384, time=1.783s n=27, m=370, time=1.655s n=28, m=357, time=1.602s n=29, m=344, time=1.549s n=30, m=333, time=1.513s n=31, m=322, time=1.492s n=32, m=312, time=1.382s n=33, m=303, time=1.391s n=34, m=294, time=1.383s n=35, m=285, time=1.454s n=36, m=277, time=1.433s n=37, m=270, time=1.402s n=38, m=263, time=1.374s n=39, m=256, time=1.286s n=40, m=250, time=1.334s 
n=41, m=243, time=1.328s n=42, m=238, time=1.306s n=43, m=232, time=1.316s n=44, m=227, time=1.273s n=45, m=222, time=1.304s n=46, m=217, time=1.265s n=47, m=212, time=1.254s n=48, m=208, time=1.243s n=49, m=204, time=1.236s n=50, m=200, time=1.237s n=51, m=196, time=1.201s n=52, m=192, time=1.227s n=53, m=188, time=1.179s n=54, m=185, time=1.162s n=55, m=181, time=1.163s n=56, m=178, time=1.16s n=57, m=175, time=1.167s n=58, m=172, time=1.168s n=59, m=169, time=1.169s n=60, m=166, time=1.167s n=61, m=163, time=1.171s n=62, m=161, time=1.183s n=63, m=158, time=1.132s n=64, m=156, time=1.127s n=65, m=153, time=1.153s n=66, m=151, time=1.139s n=67, m=149, time=1.107s n=68, m=147, time=1.103s n=69, m=144, time=1.118s n=70, m=142, time=1.124s n=71, m=140, time=1.12s n=72, m=138, time=1.143s n=73, m=136, time=1.146s n=74, m=135, time=1.143s n=75, m=133, time=1.141s n=76, m=131, time=1.138s n=77, m=129, time=1.137s n=78, m=128, time=1.1s n=79, m=126, time=1.111s n=80, m=125, time=1.109s n=81, m=123, time=1.145s n=82, m=121, time=1.135s n=83, m=120, time=1.122s n=84, m=119, time=1.132s n=85, m=117, time=1.137s n=86, m=116, time=1.133s n=87, m=114, time=1.115s n=88, m=113, time=1.123s n=89, m=112, time=1.121s n=90, m=111, time=1.134s n=91, m=109, time=1.145s n=92, m=108, time=1.156s n=93, m=107, time=1.153s n=94, m=106, time=1.137s n=95, m=105, time=1.134s n=96, m=104, time=1.147s n=97, m=103, time=1.151s n=98, m=102, time=1.147s n=99, m=101, time=1.181s n=100, m=100, time=1.189s {noformat} and this is the output using the new implementation {noformat} n=2, m=5000, time=0.602s n=3, m=, time=0.4s n=4, m=2500, time=0.192s n=5, m=2000, time=0.163s n=6, m=1666, time=0.144s n=7, m=1428, time=0.129s n=8, m=1250, time=0.115s n=9, m=, time=0.117s n=10, m=1000, time=0.11s n=11, m=909, time=0.106s n=12, m=833, time=0.11s n=13, m=769, time=0.102s n=14, m=714, time=0.104s n=15, m=666, time=0.103s n=16, m=625, time=0.094s n=17, m=588, time=0.102s n=18, m=555, time=0.106s n=19, m=526, 
time=0.106s n=20, m=500, time=0.151s n=21, m=476, time=0.118s n=22, m=454, time=0.12s n=23, m=434, time=0.123s n=24, m=416, time=0.126s n=25, m=400, time=0.127s n=26, m=384, time=0.13s n=27, m=370, time=0.132s n=28, m=357, time=0.135s n=29, m=344, time=0.137s n=30, m=333, time=0.14s n=31, m=322, time=0.143s n=32, m=312, time=0.133s n=33, m=303, time=0.149s n=34, m=294, time=0.162s n=35, m=285, time=0.156s n=36, m=277, time=0.157s n=37, m=270, time=0.16s n=38, m=263, time=0.168s n=39, m=256, time=0.148s n=40, m=250, time=0.184s n=41, m=243, time=0.175s n=42, m=238, time=0.175s n=43, m=232, time=0.179s n=44, m=227, time=0.186s n=45, m=222,
[jira] [Resolved] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl resolved MATH-1242. -- Resolution: Fixed Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: bug_fix_and_unit_test.patch, modified.patch, patch, patch
[jira] [Updated] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1242: - Attachment: bug_fix_and_unit_test.patch Here is an update of the fix. I have moved the shuffle algorithm to a static package-private method which is now explicitly tested. If you agree with the patch, I will try to do my first commit. Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: bug_fix_and_unit_test.patch, modified.patch, patch, patch
[jira] [Updated] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1242: - Attachment: patch There is a bug in the simplified Fisher-Yates shuffle algorithm. The random number r needs to be drawn from 0 <= r <= k. Since the upper bound parameter of nextInt is exclusive, we need to pass k+1 instead of k. Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: modified.patch, patch, patch
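The off-by-one fixed above can be illustrated with a minimal Fisher-Yates sketch. This is not the Commons Math implementation: java.util.Random stands in for the library's random generator, and the seed and array are arbitrary.

```java
import java.util.Random;

// Sketch of a Fisher-Yates shuffle step: the swap index r must be uniform
// on 0..k INCLUSIVE, so nextInt (whose bound is exclusive) gets k + 1.
// Passing k instead would mean r == k can never occur, i.e. the element at
// position k could never stay in place, biasing the shuffle.
public class ShuffleSketch {
    static void shuffle(int[] a, Random rng) {
        for (int k = a.length - 1; k > 0; k--) {
            int r = rng.nextInt(k + 1); // inclusive upper bound k, hence k + 1
            int tmp = a[r];
            a[r] = a[k];
            a[k] = tmp;
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(123);
        int[] a = { 0, 1, 2, 3, 4 };
        shuffle(a, rng);
        long sum = 0;
        for (int v : a) sum += v;
        System.out.println(sum); // 10: shuffling permutes elements, never drops any
    }
}
```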
[jira] [Commented] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602970#comment-14602970 ] Otmar Ertl commented on MATH-1242: -- Your modifications look fine. Thanks! Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: modified.patch, patch The advantages of the new implementation are: * It has linear run-time complexity because sorting of data is avoided * Consumes less than half random numbers (min(n,m) instead of (n+m)) * Allocates less memory The speed-up can be verified using the testTwoSampleMonteCarloPerformance() unit test. Here is the output on my machine of the test using the old monteCarloP() implementation {noformat} n=2, m=5000, time=16.894s n=3, m=, time=11.917s n=4, m=2500, time=8.69s n=5, m=2000, time=7.38s n=6, m=1666, time=6.235s n=7, m=1428, time=5.472s n=8, m=1250, time=4.517s n=9, m=, time=4.246s n=10, m=1000, time=3.797s n=11, m=909, time=3.502s n=12, m=833, time=3.249s n=13, m=769, time=3.0s n=14, m=714, time=2.816s n=15, m=666, time=2.669s n=16, m=625, time=2.44s n=17, m=588, time=2.443s n=18, m=555, time=2.349s n=19, m=526, time=2.219s n=20, m=500, time=2.135s n=21, m=476, time=2.022s n=22, m=454, time=1.952s n=23, m=434, time=1.907s n=24, m=416, time=1.823s n=25, m=400, time=1.793s n=26, m=384, time=1.783s n=27, m=370, time=1.655s n=28, m=357, time=1.602s n=29, m=344, time=1.549s n=30, m=333, time=1.513s n=31, m=322, time=1.492s n=32, m=312, time=1.382s n=33, m=303, time=1.391s n=34, m=294, time=1.383s n=35, m=285, time=1.454s n=36, m=277, time=1.433s n=37, m=270, time=1.402s n=38, m=263, time=1.374s n=39, m=256, time=1.286s n=40, m=250, time=1.334s n=41, m=243, time=1.328s n=42, m=238, time=1.306s n=43, m=232, time=1.316s n=44, m=227, time=1.273s n=45, m=222, time=1.304s n=46, m=217, time=1.265s n=47, m=212, time=1.254s n=48, 
m=208, time=1.243s n=49, m=204, time=1.236s n=50, m=200, time=1.237s n=51, m=196, time=1.201s n=52, m=192, time=1.227s n=53, m=188, time=1.179s n=54, m=185, time=1.162s n=55, m=181, time=1.163s n=56, m=178, time=1.16s n=57, m=175, time=1.167s n=58, m=172, time=1.168s n=59, m=169, time=1.169s n=60, m=166, time=1.167s n=61, m=163, time=1.171s n=62, m=161, time=1.183s n=63, m=158, time=1.132s n=64, m=156, time=1.127s n=65, m=153, time=1.153s n=66, m=151, time=1.139s n=67, m=149, time=1.107s n=68, m=147, time=1.103s n=69, m=144, time=1.118s n=70, m=142, time=1.124s n=71, m=140, time=1.12s n=72, m=138, time=1.143s n=73, m=136, time=1.146s n=74, m=135, time=1.143s n=75, m=133, time=1.141s n=76, m=131, time=1.138s n=77, m=129, time=1.137s n=78, m=128, time=1.1s n=79, m=126, time=1.111s n=80, m=125, time=1.109s n=81, m=123, time=1.145s n=82, m=121, time=1.135s n=83, m=120, time=1.122s n=84, m=119, time=1.132s n=85, m=117, time=1.137s n=86, m=116, time=1.133s n=87, m=114, time=1.115s n=88, m=113, time=1.123s n=89, m=112, time=1.121s n=90, m=111, time=1.134s n=91, m=109, time=1.145s n=92, m=108, time=1.156s n=93, m=107, time=1.153s n=94, m=106, time=1.137s n=95, m=105, time=1.134s n=96, m=104, time=1.147s n=97, m=103, time=1.151s n=98, m=102, time=1.147s n=99, m=101, time=1.181s n=100, m=100, time=1.189s {noformat} and this is the output using the new implementation {noformat} n=2, m=5000, time=0.602s n=3, m=, time=0.4s n=4, m=2500, time=0.192s n=5, m=2000, time=0.163s n=6, m=1666, time=0.144s n=7, m=1428, time=0.129s n=8, m=1250, time=0.115s n=9, m=, time=0.117s n=10, m=1000, time=0.11s n=11, m=909, time=0.106s n=12, m=833, time=0.11s n=13, m=769, time=0.102s n=14, m=714, time=0.104s n=15, m=666, time=0.103s n=16, m=625, time=0.094s n=17, m=588, time=0.102s n=18, m=555, time=0.106s n=19, m=526, time=0.106s n=20, m=500, time=0.151s n=21, m=476, time=0.118s n=22, m=454, time=0.12s n=23, m=434, time=0.123s n=24, m=416, time=0.126s n=25, m=400, time=0.127s n=26, m=384, 
time=0.13s n=27, m=370, time=0.132s n=28, m=357, time=0.135s n=29, m=344, time=0.137s n=30, m=333, time=0.14s n=31, m=322, time=0.143s n=32, m=312, time=0.133s n=33, m=303, time=0.149s n=34, m=294, time=0.162s n=35, m=285, time=0.156s n=36, m=277, time=0.157s n=37, m=270, time=0.16s n=38, m=263, time=0.168s n=39, m=256, time=0.148s n=40, m=250, time=0.184s n=41, m=243, time=0.175s n=42, m=238, time=0.175s n=43, m=232, time=0.179s n=44, m=227, time=0.186s n=45, m=222, time=0.186s n=46, m=217, time=0.188s n=47, m=212, time=0.192s n=48, m=208, time=0.195s n=49, m=204, time=0.199s n=50, m=200, time=0.21s n=51, m=196, time=0.205s n=52, m=192, time=0.208s
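The linear-time idea behind the new monteCarloP() can be illustrated with a rough sketch. This is not the attached patch: all identifiers are invented, and this naive walk still draws one random number per pooled observation (the actual patch further reduces consumption to min(n,m)); it only shows how the Monte Carlo p-value can be estimated without sorting or per-iteration allocation.

```java
import java.util.Random;

// Sketch: estimate the two-sample KS p-value by simulating random orderings
// of the pooled sample as an integer random walk. Names are illustrative.
public class KsMonteCarloSketch {

    // One simulated pooled ordering: does |n*m*(Fn - Fm)| ever reach n*m*d?
    public static boolean walkExceeds(int n, int m, long dTimesNm, Random rng) {
        int remN = n, remM = m;
        long cur = 0; // cur = n*m*(Fn(t) - Fm(t)), tracked exactly in integers
        while (remN + remM > 0) {
            // Next-smallest pooled value comes from sample 1 with probability remN/(remN+remM).
            if (rng.nextInt(remN + remM) < remN) { remN--; cur += m; }
            else { remM--; cur -= n; }
            if (Math.abs(cur) >= dTimesNm) return true;
        }
        return false;
    }

    public static double monteCarloP(double d, int n, int m, int iterations, Random rng) {
        long dTimesNm = Math.round(d * n * m); // compare in scaled integer units
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            if (walkExceeds(n, m, dTimesNm, rng)) hits++;
        }
        return (double) hits / iterations;
    }
}
```

Tracking the deviation as the integer n*m*(Fn - Fm) avoids floating-point comparisons inside the inner loop, the same kind of trick the discussion relies on.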
[jira] [Updated] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
[ https://issues.apache.org/jira/browse/MATH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1242: - Attachment: patch Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch
[jira] [Created] (MATH-1242) Speed up KolmogorovSmirnovTest.monteCarloP()
Otmar Ertl created MATH-1242: Summary: Speed up KolmogorovSmirnovTest.monteCarloP() Key: MATH-1242 URL: https://issues.apache.org/jira/browse/MATH-1242 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch
[jira] [Updated] (MATH-1236) Improve Kolmogorov-Smirnov statistic for two samples
[ https://issues.apache.org/jira/browse/MATH-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1236: - Attachment: patch Improve Kolmogorov-Smirnov statistic for two samples Key: MATH-1236 URL: https://issues.apache.org/jira/browse/MATH-1236 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch I reimplemented the Kolmogorov-Smirnov test statistic for two samples because of its problems with identical values. Although this has already been fixed in the meantime (see MATH-1197), the proposed code is still an improvement, because it is simpler and does not require binary searches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MATH-1236) Improve Kolmogorov-Smirnov statistic for two samples
Otmar Ertl created MATH-1236: Summary: Improve Kolmogorov-Smirnov statistic for two samples Key: MATH-1236 URL: https://issues.apache.org/jira/browse/MATH-1236 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch I reimplemented the Kolmogorov-Smirnov test statistic for two samples because of its problems with identical values. Although this has already been fixed in the meantime (see MATH-1197), the proposed code is still an improvement, because it is simpler and does not require binary searches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
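A simpler, search-free two-sample statistic of the kind described above can be computed in a single merge pass over the two sorted samples. The sketch below is illustrative (it is not the attached patch; names are invented); ties are stepped over in both samples before the deviation is recorded, which is what makes identical values unproblematic.

```java
import java.util.Arrays;

// Sketch: D = sup |F_x - F_y| over the pooled sample, via one merge pass.
public class KsTwoSampleStatisticSketch {

    public static double statistic(double[] x, double[] y) {
        double[] sx = x.clone(), sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length, m = sy.length;
        int i = 0, j = 0;
        long cur = 0, best = 0; // cur = n*m*(F_x - F_y), kept exactly in integers
        while (i < n && j < m) {
            double v = Math.min(sx[i], sy[j]);
            // Advance past all occurrences of v in BOTH samples before recording,
            // so identical values cannot inflate the deviation.
            while (i < n && sx[i] == v) { i++; cur += m; }
            while (j < m && sy[j] == v) { j++; cur -= n; }
            best = Math.max(best, Math.abs(cur));
        }
        // Once one sample is exhausted, |cur| can only shrink back to 0.
        return (double) best / ((long) n * m);
    }
}
```

After one sample is exhausted, the remaining steps all move the walk monotonically back toward zero, so the loop can stop early without missing the supremum.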
[jira] [Updated] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1220: - Attachment: Zipf Rejection Inversion Sampling Notes.pdf I attached my notes that should make it clearer how the original algorithm was transformed. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: Zipf Rejection Inversion Sampling Notes.pdf, patch_v1, patch_v2 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526002#comment-14526002 ] Otmar Ertl commented on MATH-1220: -- Yes, my proposed method outperforms the rejection inversion method for large exponents. However, I suppose large exponents (greater than 10) are not very interesting for data modelling. The exponent is usually close to 1 or even less than 1. See, for example, the following paper about word frequencies: http://colala.bcs.rochester.edu/papers/piantadosi2014zipfs.pdf. The original rejection inversion method cannot handle exponents in that range. The second patch contains an adapted rejection inversion algorithm that covers all non-negative exponents. (I still favor including 0 as an allowed value for the exponent. It is easier for data modelling if you are allowed to choose the parameter from a closed interval. Furthermore, an exponent equal to 0 represents a meaningful distribution, since it corresponds to a uniform distribution.) More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: patch_v1, patch_v2 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1220: - Attachment: patch_v2 I have found some time to read the paper you mentioned: http://epub.wu.ac.at/1176/. The algorithm described there is superior to the method I have proposed. The only drawback is that it is restricted to exponents larger than 1. However, I have found a way to transform the algorithm so that it should work for any non-negative exponent (see patch_v2). More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: patch_v1, patch_v2 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl reopened MATH-1220: -- More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Fix For: 4.0, 3.6 Attachments: patch_v1, patch_v2 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521921#comment-14521921 ] Otmar Ertl commented on MATH-1220: -- Since we are sampling from a finite number of points, convergence of the infinite series is irrelevant. An exponent equal to 0 corresponds to a uniform distribution. Yes, the tail includes the points from 2 to N, because the first point (the head) is treated differently in order to limit the rejection rate. Otherwise, the rejection rate could become arbitrarily large for large exponents. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520123#comment-14520123 ] Otmar Ertl commented on MATH-1220: -- Caching generalizedHarmonic(numberOfElements, exponent) makes sense. The inverse cumulative probability could be computed more efficiently by simply summing up the probabilities of the points until the searched probability is reached. Furthermore, I would allow the exponent to be non-negative. Currently, it is restricted to positive values. I have developed the method myself. I do not know if a similar method can be found in the literature. So far, apart from this math library, I have no plans to publish it anywhere else. I am not sure if I could find the time to write a paper. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
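The summation idea for the inverse cumulative probability can be sketched as follows. This is a hypothetical helper, not code from the patch; the Zipf pmf is p(k) ∝ k^(-s) for k = 1..N, and a linear scan replaces repeated O(N) CDF evaluations inside a numeric solver.

```java
// Sketch: invert the Zipf CDF by accumulating probabilities until the
// target probability p is reached. Names are illustrative.
public class ZipfInverseCdfSketch {

    public static int inverseCumulativeProbability(double p, int numberOfElements, double exponent) {
        double h = 0; // normalization: generalized harmonic number H(N, s)
        for (int k = 1; k <= numberOfElements; k++) h += Math.pow(k, -exponent);
        double cum = 0;
        for (int k = 1; k <= numberOfElements; k++) {
            cum += Math.pow(k, -exponent) / h;
            if (cum >= p) return k;
        }
        return numberOfElements; // guard against floating-point rounding
    }
}
```

Note that exponent 0 works naturally here and yields the uniform distribution, matching the argument in the thread for allowing a closed parameter interval.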
[jira] [Comment Edited] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514257#comment-14514257 ] Otmar Ertl edited comment on MATH-1220 at 4/27/15 3:10 PM: --- Do you mean using a pre-calculated CDF table and sampling using binary search? This would have O(N) initialization costs, O(N) space costs, and sampling would require O(log(N)) time. All of that is constant for the method I have proposed. If initialization and memory costs are not an issue, sampling using such a pre-calculated table for the CDF is probably faster for smaller N. The proposed rejection sampling algorithm divides the support into two parts: the head, which currently only consists of the first point (equal to 1), and the tail, which consists of all remaining points. Rejection sampling is only used to sample points from the tail. Instead, it would be possible to take the first X (<=N) points as the head and to use such a pre-calculated CDF table in combination with binary search for sampling points from the head. For the tail, the rejection method can still be applied. When developing the algorithm I had that in mind, but finally decided to set X = 1 to minimize the initialization costs. The optimal value for X is likely to be use-case dependent and difficult to determine. I do not think that a pure CDF table based approach (X=N) is feasible for all N, because initialization and memory costs could get very large. More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
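The pure CDF-table alternative (X=N) discussed in the comment, with its O(N) initialization, O(N) memory, and O(log N) sampling, can be sketched like this (hypothetical class, not Commons Math code):

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: Zipf sampling from a pre-calculated CDF table via binary search.
public class ZipfTableSamplerSketch {
    private final double[] cdf; // cdf[k-1] = P(X <= k)
    private final Random rng;

    // O(n) initialization cost, O(n) memory.
    public ZipfTableSamplerSketch(int n, double s, Random rng) {
        this.rng = rng;
        cdf = new double[n];
        double sum = 0;
        for (int k = 1; k <= n; k++) { sum += Math.pow(k, -s); cdf[k - 1] = sum; }
        for (int k = 0; k < n; k++) cdf[k] /= sum;
    }

    // O(log n) per draw: smallest k with cdf[k-1] >= u.
    public int sample() {
        int i = Arrays.binarySearch(cdf, rng.nextDouble());
        return (i >= 0 ? i : -i - 1) + 1; // binarySearch returns -(insertion point)-1 on miss
    }
}
```

This makes the trade-off in the comment concrete: every draw is cheap, but both setup time and memory grow linearly with N, which is why the head/tail split with X = 1 was chosen instead.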
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516401#comment-14516401 ] Otmar Ertl commented on MATH-1220: -- The avg. number of iterations until a random value is accepted can be bounded by a constant that is independent of the number of points and the exponent. Therefore, the O(1) runtime complexity is ensured. See the output of the testPerformance() unit test, which shows that the avg. number of consumed uniformly distributed random numbers is limited, which is directly connected to the number of iterations. I can imagine generalizing the rejection sampling method in its basic form. Basically, you need to define two methods: a method to generate a value from the instrumental distribution and a method to calculate the corresponding acceptance rate for that value. However, there are many rejection sampling methods that do not fit into this basic scheme (see, for example, the Ziggurat algorithm or the Monty Python method for generating normally distributed values). Furthermore, defining a method to calculate the acceptance rate is not feasible in all cases, because it restricts transformations of the inequality (acceptance rate > uniformly distributed random value). For example, the proposed method could also be optimized without the definition of an explicit method for the acceptance rate (a division could be replaced by a multiplication). More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.)
I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
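The "basic scheme" mentioned in the comment, an instrumental (proposal) sampler plus an acceptance-probability function, can be sketched generically. The interface and names below are hypothetical, not an actual Commons Math API:

```java
import java.util.Random;

// Sketch of the two-method rejection sampling scheme: propose, then accept
// with the given probability; repeat until accepted.
public class RejectionSamplerSketch {
    public interface Proposal { double next(Random rng); }
    public interface AcceptanceRate { double at(double value); } // must lie in [0,1]

    public static double sample(Proposal proposal, AcceptanceRate acceptance, Random rng) {
        while (true) {
            double v = proposal.next(rng);
            // Accept when a uniform random number falls below the acceptance rate.
            if (rng.nextDouble() < acceptance.at(v)) return v;
        }
    }
}
```

As a usage example, the triangular density f(x) = 2x on [0,1] is obtained from a uniform proposal with acceptance rate x. The comment's caveat also shows up here: forcing the acceptance rate into an explicit method fixes the form of the accept inequality, ruling out algebraic rewrites such as replacing a division by a multiplication.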
[jira] [Created] (MATH-1220) More efficient sample() method for ZipfDistribution
Otmar Ertl created MATH-1220: Summary: More efficient sample() method for ZipfDistribution Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1220: - Attachment: patch_v1 More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
[ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513221#comment-14513221 ] Otmar Ertl commented on MATH-1220: -- The patch overrides the sample() method of ZipfDistribution with an implementation that uses rejection sampling. To demonstrate the speed of the new method, the (ignored) unit test testPerformance() can be used. Using the default sample() method from AbstractIntegerDistribution instead, the test will not finish in reasonable time. The generation of a single random value consumes at most 2 uniformly distributed random numbers on average, depending on the parameters of the Zipf distribution (see the output of testPerformance()). More efficient sample() method for ZipfDistribution --- Key: MATH-1220 URL: https://issues.apache.org/jira/browse/MATH-1220 Project: Commons Math Issue Type: Improvement Reporter: Otmar Ertl Attachments: patch_v1 Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated by inverting the CDF. However, the current implementation uses O(N) power function evaluations to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.) I propose to use rejection sampling instead, which allows the generation of a single random value in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MATH-990) Improve performance of MathArrays.sortInPlace
[ https://issues.apache.org/jira/browse/MATH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-990: Attachment: patch The performance can be further improved, while still using Collections.sort, by avoiding the generic Pair class (see attached patch). A reduction of the run time to about 80% of the original (a factor of 0.8) was observed in the case of 2 parallel arrays. Improve performance of MathArrays.sortInPlace - Key: MATH-990 URL: https://issues.apache.org/jira/browse/MATH-990 Project: Commons Math Issue Type: Improvement Affects Versions: 3.2 Reporter: Gilles Assignee: Gilles Priority: Minor Labels: benchmark Fix For: 4.0 Attachments: patch Performance suffers from lots of copying. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
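One way to sort parallel arrays without a Pair per element (an illustrative sketch with invented names; the attached patch may differ in detail) is to sort an index array once and then apply the resulting permutation to every array, so only Integer indices are boxed:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: sort x ascending and reorder each array in yList by the same
// permutation, avoiding a Pair object per element.
public class SortInPlaceSketch {

    public static void sortInPlace(double[] x, double[]... yList) {
        int n = x.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> x[i])); // sort indices by key
        double[] tmp = new double[n];
        reorder(x, idx, tmp);
        for (double[] y : yList) reorder(y, idx, tmp);
    }

    private static void reorder(double[] a, Integer[] idx, double[] tmp) {
        for (int i = 0; i < idx.length; i++) tmp[i] = a[idx[i]];
        System.arraycopy(tmp, 0, a, 0, idx.length);
    }
}
```

Boxing one Integer per element is still cheaper than allocating and comparing a generic Pair per element, which is consistent with the reported speed-up.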
[jira] [Commented] (MATH-1159) Add quartiles to SummaryStatistics
[ https://issues.apache.org/jira/browse/MATH-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180422#comment-14180422 ] Otmar Ertl commented on MATH-1159: -- If multiple quantile values need to be computed simultaneously (e.g. lower quartile, median, and upper quartile), it would make sense to use the P² algorithm for histograms that uses more than 5 marker positions, as described in the original paper. Depending on the desired quantile values, different sets of marker positions could be used. Using a PSquarePercentile object for each of the desired quantiles has some overhead. For example, each of them stores the minimum and the maximum value. The 2nd and 4th marker positions used for the median correspond to the 3rd marker positions of the lower and upper quartiles, respectively. In summary, a P² algorithm implementation that calculates all desired quantiles simultaneously would be much more efficient. Add quartiles to SummaryStatistics -- Key: MATH-1159 URL: https://issues.apache.org/jira/browse/MATH-1159 Project: Commons Math Issue Type: Improvement Affects Versions: 3.3 Reporter: Phil Steitz Using PSquarePercentile, we can add quartile computation to SummaryStatistics. Since maintaining quartiles will add some overhead, implementation should allow the feature to be turned off via some kind of constructor flag. This does open the can of worms regarding turning on / off other stats, which is probably a good thing to think about as the implementation of this feature is developed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1158) Factory method to provide sampling from distributions
[ https://issues.apache.org/jira/browse/MATH-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180464#comment-14180464 ] Otmar Ertl commented on MATH-1158: --

I like your proposal. I think there is much more to deprecate. Since you deprecated all methods that use the RandomGenerator instance, it would be consistent to also deprecate all the constructors of the RealDistribution implementations that accept a RandomGenerator instance. Why not separate the sampling functionality for IntegerDistribution too?

Factory method to provide sampling from distributions - Key: MATH-1158 URL: https://issues.apache.org/jira/browse/MATH-1158 Project: Commons Math Issue Type: New Feature Affects Versions: 3.3 Reporter: Gilles Priority: Minor Labels: API, deprecation Fix For: 3.4 Attachments: MATH-1158.patch

A proposal to separate the sampling functionality from the core of the distribution classes.
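A hypothetical sketch of the separation being proposed: the distribution keeps only its mathematical functions, and a factory method binds a generator to a distribution's inverse CDF to produce a sampler. The names (`Sampler`, `Samplers.inversion`) are illustrative, not the actual Commons Math API from the patch:

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// A sampler produced by a factory; the distribution class itself never sees the generator.
interface Sampler {
    double sample();
}

final class Samplers {
    /** Inversion sampling: feed one uniform deviate through the inverse CDF. */
    static Sampler inversion(DoubleUnaryOperator inverseCdf, Random rng) {
        return () -> inverseCdf.applyAsDouble(rng.nextDouble());
    }
}
```

For instance, an exponential sampler with rate 1 would be `Samplers.inversion(u -> -Math.log(1 - u), rng)`; the same factory pattern could cover IntegerDistribution as suggested above.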
[jira] [Commented] (MATH-1154) Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization
[ https://issues.apache.org/jira/browse/MATH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163330#comment-14163330 ] Otmar Ertl commented on MATH-1154: --

I agree with Phil that my proposed patch does not address the core problem, which is definitely MATH-1124. The patch was intended as a short-term fix that improves performance substantially with minimally invasive code changes. Once the root issue is solved, both patches proposed here (mine and Thomas') will become obsolete anyway. The question is when MATH-1124 will be resolved. If it is fixed for the next release, neither patch needs to be applied. If not, it makes sense to apply one of them for the short term. The remaining question is: do you only want a speed-up of the statistical tests (Thomas' patch), or also of the distribution classes (my patch)?

Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization Key: MATH-1154 URL: https://issues.apache.org/jira/browse/MATH-1154 Project: Commons Math Issue Type: Bug Affects Versions: 3.3 Reporter: Otmar Ertl Attachments: MATH-1154.patch, math3.patch

Some statistical tests defined in the stat.inference package (e.g. BinomialTest or ChiSquareTest) are unnecessarily slow (up to a factor of 20 slower than necessary). The reason is the implicit, slow initialization of a default (Well19937c) random generator instance each time a test is performed. The affected tests create a distribution instance in order to use some methods defined therein. However, they do not use any method for random generation. Nevertheless, a random number generator instance is automatically created when creating a distribution instance, which is the reason for the serious slowdown. The problem is related to MATH-1124.

There are the following solutions:
1) Fix the affected statistical tests by passing a lightweight RandomGenerator implementation (or even null) to the constructor of the distribution.
2) Use for all distributions a RandomGenerator implementation that lazily creates the Well19937c instance as late as possible. This would also solve MATH-1124.

I will attach a patch proposal together with a performance test that demonstrates the speed-up after a fix.
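Solution 2 above can be sketched as a delegating generator that defers construction of the expensive default generator until the first random number is actually requested. This is a minimal illustration using java.util.Random as a stand-in for Well19937c; the class name and the RandomGenerator subset shown are hypothetical:

```java
import java.util.Random;
import java.util.function.Supplier;

final class LazyRandom {
    private final Supplier<Random> factory;
    private Random delegate; // not created until first use

    LazyRandom(Supplier<Random> factory) { this.factory = factory; }

    private Random delegate() {
        if (delegate == null) delegate = factory.get(); // expensive init happens here, only if needed
        return delegate;
    }

    double nextDouble() { return delegate().nextDouble(); }

    boolean initialized() { return delegate != null; }
}
```

A distribution constructed with such a generator pays nothing for it unless sample() is actually called, which is exactly the case the statistical tests hit.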
[jira] [Commented] (MATH-1124) Instances of AbstractRealDistribution require a random generator.
[ https://issues.apache.org/jira/browse/MATH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14162234#comment-14162234 ] Otmar Ertl commented on MATH-1124: --

What about the C++ STL design choice? The corresponding Java design would be to pass a random generator instance to the sample method. For instance, have a look at the code example given at http://www.cplusplus.com/reference/random/uniform_real_distribution/

Instances of AbstractRealDistribution require a random generator. - Key: MATH-1124 URL: https://issues.apache.org/jira/browse/MATH-1124 Project: Commons Math Issue Type: Bug Reporter: Ajo Fod

A couple of observations: ... The default random generator takes a while to instantiate. ... Many functions of these distributions don't require a random generator. Generally speaking, only sampling requires it. So, why force the default constructor to initialize with a new random generator ... why not use a global generic or simple generator? Or do away with the random generator except for sampling? This issue was observed with the TDistribution class, but it is probably applicable to many classes as well.
[jira] [Comment Edited] (MATH-1124) Instances of AbstractRealDistribution require a random generator.
[ https://issues.apache.org/jira/browse/MATH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14162234#comment-14162234 ] Otmar Ertl edited comment on MATH-1124 at 10/7/14 6:25 PM: --

What about the C++ STL design choice? The corresponding Java design would be to pass a random generator instance to the sample method. For instance, have a look at the code example given at http://www.cplusplus.com/reference/random/uniform_real_distribution/

EDIT: This would also allow the distribution classes to be entirely immutable.
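The C++-style design mentioned above can be sketched as follows: the distribution object is immutable and holds only its parameters, and the generator is passed to sample(), mirroring `distribution(generator)` in C++ `<random>`. This is a hypothetical illustration (using java.util.Random as a stand-in for RandomGenerator), not the Commons Math API:

```java
import java.util.Random;

final class UniformRealDist {
    private final double lo, hi; // parameters only; no generator field, so fully immutable

    UniformRealDist(double lo, double hi) { this.lo = lo; this.hi = hi; }

    /** The generator is supplied per call, as in the C++ STL design. */
    double sample(Random rng) { return lo + (hi - lo) * rng.nextDouble(); }
}
```

With this design, constructing a distribution costs nothing RNG-related, and the same immutable instance can be shared across threads, each with its own generator.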
[jira] [Commented] (MATH-1124) Instances of AbstractRealDistribution require a random generator.
[ https://issues.apache.org/jira/browse/MATH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14162367#comment-14162367 ] Otmar Ertl commented on MATH-1124: --

@Ole Ersoy: For some distributions it is quite common to use multiple uniformly distributed random numbers to generate a single random number following the given distribution (e.g. the normal distribution). Therefore, passing the random number generator is a much better design choice.
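The normal distribution illustrates this well: the Box-Muller transform consumes two uniform deviates per Gaussian deviate, which is why the sampler needs access to the generator itself rather than a single pre-drawn uniform value. A minimal sketch:

```java
import java.util.Random;

final class BoxMuller {
    /** One Gaussian deviate from two uniform deviates (Box-Muller transform). */
    static double nextGaussian(Random rng) {
        double u1 = rng.nextDouble(); // first uniform
        double u2 = rng.nextDouble(); // second uniform
        // 1 - u1 lies in (0, 1], so the logarithm is always finite.
        return Math.sqrt(-2.0 * Math.log(1.0 - u1)) * Math.cos(2.0 * Math.PI * u2);
    }
}
```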
[jira] [Commented] (MATH-1124) Instances of AbstractRealDistribution require a random generator.
[ https://issues.apache.org/jira/browse/MATH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163124#comment-14163124 ] Otmar Ertl commented on MATH-1124: --

If testing is your major concern, it is fairly easy to write a RandomGenerator mock, using AbstractRandomGenerator, that returns a predefined sequence of random numbers. In my opinion, your proposed interface {quote} sample(double[] uniformRNArray) {quote} is much more complex than {quote} sample(RandomGenerator rng) {quote} More importantly, there are distributions for which rejection sampling is used to generate random numbers. In this case, the number of uniformly distributed random numbers required to generate a single random number obeying the given distribution is itself random (see http://en.wikipedia.org/wiki/Rejection_sampling).
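A small illustration of the rejection-sampling point: sampling the triangular density f(x) = 2x on [0, 1] by rejection from a uniform proposal. Because each accepted sample consumes a random number of uniform deviates, an interface that takes a fixed-size array of uniforms cannot work in general. The example itself is hypothetical, not taken from Commons Math:

```java
import java.util.Random;

final class RejectionSampler {
    /** Samples f(x) = 2x on [0, 1] by rejection against a uniform envelope (M = 2). */
    static double sampleTriangular(Random rng) {
        while (true) {
            double x = rng.nextDouble(); // proposal from U(0, 1)
            double u = rng.nextDouble(); // acceptance test
            // Accept with probability f(x) / (M * g(x)) = 2x / 2 = x.
            if (u <= x) return x;
            // Otherwise loop: two more uniforms are drawn on the next attempt.
        }
    }
}
```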
[jira] [Created] (MATH-1154) Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization
Otmar Ertl created MATH-1154: Summary: Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization Key: MATH-1154 URL: https://issues.apache.org/jira/browse/MATH-1154 Project: Commons Math Issue Type: Bug Affects Versions: 3.3 Reporter: Otmar Ertl
[jira] [Updated] (MATH-1154) Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization
[ https://issues.apache.org/jira/browse/MATH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otmar Ertl updated MATH-1154: - Attachment: math3.patch

This patch demonstrates a fix using lazy initialization of the default random number generator instances. Furthermore, a test is included, which gave the following results before the fix:

{noformat}
statistical tests performance test (calls per timed block: 10, timed blocks: 100, time unit: ms)
name             time/call       std error       total time  ratio       difference
binomial test 1  1.38289492e-02  1.71975630e-04  1.3829e+05  1.e+00      0.e+00
binomial test 2  1.38270752e-02  1.61613547e-04  1.3827e+05  9.9986e-01  -1.87395300e+01
chi square test  2.67553017e-02  2.29903602e-04  2.6755e+05  1.9347e+00  1.29263525e+05
{noformat}

and after the fix:

{noformat}
statistical tests performance test (calls per timed block: 10, timed blocks: 100, time unit: ms)
name             time/call       std error       total time  ratio       difference
binomial test 1  7.26630369e-04  5.87472596e-05  7.2663e+03  1.e+00      0.e+00
binomial test 2  7.27780967e-04  2.44728991e-05  7.2778e+03  1.0016e+00  1.15059780e+01
chi square test  5.21210430e-04  3.14044354e-05  5.2121e+03  7.1730e-01  -2.05419939e+03
{noformat}

A speedup of up to a factor of 20 can be seen.

Statistical tests in stat.inference package are very slow due to implicit RandomGenerator initialization Key: MATH-1154 URL: https://issues.apache.org/jira/browse/MATH-1154 Project: Commons Math Issue Type: Bug Affects Versions: 3.3 Reporter: Otmar Ertl Attachments: math3.patch