[jira] [Commented] (MATH-1441) SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on every call

2018-01-27 Thread Gilles (JIRA)

[ 
https://issues.apache.org/jira/browse/MATH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342131#comment-16342131
 ] 

Gilles commented on MATH-1441:
--

{quote}it's the `inverseCumulativeProbability` calculation immediately after 
that's expensive.
{quote}
In your case, the use of a fixed "alpha" (0.05 by default) makes all those calls 
always compute the same value; so it seems that both "n" and "alpha" (and their 
expensive dependents) must be cached.
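Caching on both keys could look something like the following sketch (illustrative only, not a Commons Math API; `tCriticalValue` here is a cheap stand-in for the expensive `TDistribution#inverseCumulativeProbability` call, and all names are hypothetical):

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: memoize an expensive computation keyed on (n, alpha).
public class CriticalValueCache {
    // Composite key combining sample size and significance level (Java 8 compatible).
    private static final class Key {
        final long n;
        final double alpha;
        Key(long n, double alpha) { this.n = n; this.alpha = alpha; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return n == k.n && Double.compare(alpha, k.alpha) == 0;
        }
        @Override public int hashCode() { return Objects.hash(n, alpha); }
    }

    private final Map<Key, Double> cache = new ConcurrentHashMap<>();

    // Stand-in for TDistribution#inverseCumulativeProbability; the formula
    // below is a placeholder, not the real t quantile.
    static double tCriticalValue(long n, double alpha) {
        return Math.sqrt(n) * Math.log(1.0 / alpha);
    }

    public double get(long n, double alpha) {
        // computeIfAbsent evaluates the expensive function at most once per key.
        return cache.computeIfAbsent(new Key(n, alpha),
                k -> tCriticalValue(k.n, k.alpha));
    }

    public int size() { return cache.size(); }

    public static void main(String[] args) {
        CriticalValueCache c = new CriticalValueCache();
        double a = c.get(100, 0.05);
        double b = c.get(100, 0.05); // second call hits the cache
        System.out.println(a == b ? "hit" : "miss");
        System.out.println(c.size()); // only one entry stored
    }
}
```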
{quote}I might be interested in contributing something like that if you're 
interested
{quote}
Sure.
{quote}what's the minimum JVM version you're targeting in Commons Math 4.0?
{quote}
24.3 ;)
 As it is, we should rather focus on the new "Commons Statistics" component 
which I referred to above. As a new project, it is targeted at Java 8.

It would be great if you could post on the ML with a proposal on how to move 
that code to the new repository. If you intend to work on that, I can create a 
new module: {{commons-statistics-regression}} (?).
{quote}sometimes a pleasant surprise, to work with people who care about their 
open-source software.
{quote}
Some would say that I care too much. See ML archive for details...

Actually, we are looking for developers who'd like to care too (in the longer 
term).
 And "Commons Math" is too broad for that. As an example, I'm not a direct user 
of anything in {{o.a.c.math4.stat}}, so I should not be the one digging into 
unknown code to figure out why things are the way they are (with the consequence 
that a review could uncover design problems; cf. other JIRA reports about it).

In the new module, we should take the time to define a stable API.

> SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on 
> every call
> -
>
> Key: MATH-1441
> URL: https://issues.apache.org/jira/browse/MATH-1441
> Project: Commons Math
>  Issue Type: Improvement
>Affects Versions: 3.6.1
> Environment: Java 8, Linux x64.
>Reporter: Max Aller
>Priority: Minor
>  Labels: performance
> Attachments: visualvm screenshot.png
>
>
> SimpleRegression#getSlopeConfidenceInterval, when called a lot (on the order 
> of 100k or 1M times), is surprisingly slow - 3M calls, on my 3rd gen i7 
> machine, takes *75 seconds*. This is primarily because recalculating the 
> inverse cumulative probability, and reconstructing the TDistribution object 
> itself, is somewhat expensive, entailing a lot of `log` and `sqrt` and 
> iteration and all that stuff.
> The confidence interval is based on two values - *n* and *alpha*. I'd posit 
> that alpha will _usually_ be one of a very small set of values, and n, well, 
> at least in my case, I'm always calling this method with the same number of 
> data points - a moving window of time-series data. But I recognize n might be 
> all over the place for some users.
> In any event, I strongly believe some level of caching would greatly benefit 
> the users of Commons Math. I've forked my own version of 
> getSlopeConfidenceInterval that uses caching. Here's a test case 
> demonstrating that:
> {code:java}
> class SlowRegressionsTest {
>     @Test
>     void slow() {
>         SimpleRegression simpleRegression = new SimpleRegression();
>         simpleRegression.addData(0.0, 0.0);
>         simpleRegression.addData(1.0, 1.0);
>         simpleRegression.addData(2.0, 2.0);
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < 3_000_000; i++) {
>             simpleRegression.getSlopeConfidenceInterval();
>         }
>         long duration = System.currentTimeMillis() - start;
>         System.out.printf("`slow` took %dms%n", duration);
>     }
>
>     @Test
>     void fast() {
>         SimpleRegression simpleRegression = new SimpleRegression();
>         simpleRegression.addData(0.0, 0.0);
>         simpleRegression.addData(1.0, 1.0);
>         simpleRegression.addData(2.0, 2.0);
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < 3_000_000; i++) {
>             SimpleRegressionUtilsKt.fastGetSlopeConfidenceInterval(simpleRegression);
>         }
>         long duration = System.currentTimeMillis() - start;
>         System.out.printf("`fast` took %dms%n", duration);
>     }
> }
> {code}
> which prints out
> {noformat}
> `fast` took 159ms
> `slow` took 75304ms{noformat}
> Nearly 500x faster - 53ns/call. My particular solution is written in Kotlin 
> for Java 8, so not of direct relevance to you, but here it is:
> {code:java}
> package math
>
> import org.apache.commons.math3.distribution.TDistribution
> import org.apache.commons.math3.exception.OutOfRangeException
> import org.apache.commons.math3.exception.util.LocalizedFormats
> import org.apache.commons.math3.stat.regression.SimpleRegression
> import java.util.concurrent.ConcurrentHashMap
>
> @JvmOverloads
> fun SimpleRegression.fastGetSlopeConfidenceInterval(alpha: Double = 0.05): Double {
>     if (n < 3) {
>         return Double.NaN
>     }
>     if (alpha >= 1 || alpha <= 0) {
>         throw OutOfRangeException(
>             LocalizedFormats.SIGNIFICANCE_LEVEL,
>             alpha, 0, 1
>         )
>     }
>     // No advertised NotStrictlyPositiveException here - will return NaN above
>     // PATCH: use cached inverse cumulative probability
>     return slopeStdErr * getInverseCumulativeProbability(n, alpha)
> }
>
> private val cache = ConcurrentHashMap<Key, Double>()
>
> private data class Key(val n: Long, val alpha: Double)
>
> private fun getInverseCumulativeProbability(n: Long, alpha: Double): Double =
>     cache.getOrPut(Key(n, alpha)) {
>         TDistribution((n - 2).toDouble()).inverseCumulativeProbability(1.0 - alpha / 2.0)
>     }
> {code}
> Limitations: 1. Kotlin, 2. ConcurrentHashMap is unbounded here.
> I don't know how/if Commons Math does caching elsewhere, but it'd sure be 
> handy here, I believe. What are your thoughts?

[jira] [Commented] (MATH-1441) SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on every call

2018-01-26 Thread Max Aller (JIRA)

[ 
https://issues.apache.org/jira/browse/MATH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341976#comment-16341976
 ] 

Max Aller commented on MATH-1441:
-

I concur that constructing the TDistribution object isn't the bottleneck here – 
it's the `inverseCumulativeProbability` calculation immediately after that's 
expensive. So...an effective caching solution will also need to encompass that 
calculation. Though generating 1M+ TDistribution objects, as in my test case, 
doesn't help.

Benchmark-wise, after dropping the executions from 3M to 1M for my sanity's 
sake, the execution time improved from 9.873s to 9.600s with caching – a 
modest ~3% improvement. So I agree rolling back was the right call.
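For what it's worth, `currentTimeMillis` loops like the one in my test case are easy to mislead (JIT warm-up, dead-code elimination); JMH is the usual answer, but even a plain harness is more trustworthy with a warm-up phase and a consumed result. A minimal sketch, with a stand-in workload rather than the real `getSlopeConfidenceInterval` call:

```java
import java.util.function.DoubleSupplier;

// Minimal micro-benchmark harness sketch: warm up first, consume every result
// so the JIT cannot eliminate the work, then time the measured loop with nanoTime.
public class TinyBench {
    static volatile double blackhole; // consumed results end up here

    static long timeNanos(DoubleSupplier workload, int warmup, int iterations) {
        double sink = 0;
        for (int i = 0; i < warmup; i++) {
            sink += workload.getAsDouble(); // warm-up: let the JIT compile the hot path
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += workload.getAsDouble();
        }
        long elapsed = System.nanoTime() - start;
        blackhole = sink; // publish so the loops are not dead code
        return elapsed;
    }

    public static void main(String[] args) {
        // Stand-in workload; substitute the call under test here.
        long ns = timeNanos(() -> Math.log(Math.sqrt(42.0)), 100_000, 1_000_000);
        System.out.printf("%.1f ns/call%n", ns / 1_000_000.0);
    }
}
```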

The right approach, IMO, is to have some subsystem/class for lazily calculating 
and storing quasi-constants like these, and to use it throughout the codebase 
where it makes sense. I might be interested in contributing something like that 
if you're interested – what's the minimum JVM version you're targeting in 
Commons Math 4.0? I've also subscribed to the Commons Developer ML.
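One shape such a subsystem could take (illustrative only, not an existing Commons Math API) is a small bounded memoizer, which would also address the unbounded-`ConcurrentHashMap` limitation noted in the description. A sketch, assuming a synchronized LRU is acceptable for the access pattern:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of a bounded memoizer for expensive quasi-constants.
// A LinkedHashMap in access order evicts the least-recently-used entry once
// capacity is exceeded; access is synchronized for thread safety.
public class BoundedMemoizer<K, V> {
    private final Map<K, V> cache;
    private final Function<K, V> compute;

    public BoundedMemoizer(int capacity, Function<K, V> compute) {
        this.compute = compute;
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // evict LRU entry beyond capacity
            }
        };
    }

    public synchronized V get(K key) {
        return cache.computeIfAbsent(key, compute);
    }

    public synchronized int size() { return cache.size(); }

    public static void main(String[] args) {
        // Hypothetical expensive function; the real one would be the t quantile.
        BoundedMemoizer<Integer, Double> m =
                new BoundedMemoizer<>(2, n -> Math.sqrt(n));
        m.get(1); m.get(2); m.get(3); // capacity 2: the entry for 1 is evicted
        System.out.println(m.size()); // prints 2
    }
}
```

A production version would probably want a lock-free design or a library cache (e.g. bounded via Caffeine), but the eviction bound is the essential point.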

I'm also attaching a picture of my VisualVM's Sampler with my little benchmark 
running, at the end.

Oh, and lastly, thanks for looking at the issue and taking a stab at the 
solution! Always nice, and sometimes a pleasant surprise, to work with people 
who care about their open-source software.

---

!visualvm screenshot.png!


[jira] [Commented] (MATH-1441) SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on every call

2018-01-26 Thread Gilles (JIRA)

[ 
https://issues.apache.org/jira/browse/MATH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341802#comment-16341802
 ] 

Gilles commented on MATH-1441:
--

I reverted the change in commit 60bcdd450d6950da981d70b8b08379bdd744c82d 
because I couldn't notice any performance improvement.
Please post the result of your benchmark on the current code ("master" branch). 
Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MATH-1441) SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on every call

2018-01-26 Thread Gilles (JIRA)

[ 
https://issues.apache.org/jira/browse/MATH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341280#comment-16341280
 ] 

Gilles commented on MATH-1441:
--

Could you please check the solution included in commit 
1f07a0bf70049cb0a2f6b52ea7c261da5adb1abb in the "master" branch?



[jira] [Commented] (MATH-1441) SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on every call

2018-01-23 Thread Gilles (JIRA)

[ 
https://issues.apache.org/jira/browse/MATH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335710#comment-16335710
 ] 

Gilles commented on MATH-1441:
--

{quote}What are your thoughts?
{quote}
If it's a common use-case, it is certainly worth looking at how caching should 
be implemented without jeopardizing robustness.

As it happens, we are creating new components out of "Commons Math"; work on 
["Commons 
Statistics"|https://git1-us-west.apache.org/repos/asf?p=commons-statistics.git] 
has just started.

Please file another report on the associated [issue-tracking 
system|https://issues.apache.org/jira/projects/STATISTICS/issues] with a link 
to this one.

You are most welcome to join in the discussions on the "dev" ML.
The move to a new component is an excellent opportunity to try and improve the 
design.
