[ https://issues.apache.org/jira/browse/MATH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341976#comment-16341976 ]
Max Aller commented on MATH-1441:
---------------------------------

I concur that constructing the TDistribution object isn't the bottleneck here – it's the `inverseCumulativeProbability` calculation immediately after that's expensive. So an effective caching solution will also need to encompass that calculation – though generating 1M+ TDistribution objects, as in my test case, doesn't help either.

Benchmark-wise, after dropping the executions from 3M to 1M for my sanity's sake, execution time improved from 9.873s to 9.600s with caching – a modest ~3% improvement. So I agree rolling back was the right call.

The right approach, IMO, is some subsystem/class for lazily calculating and storing quasi-constants like these, to be used throughout the codebase where it makes sense. I might be interested in contributing something like that if you're interested – what's the minimum JVM version you're targeting in Commons Math 4.0? I've also subscribed to the Commons Developer ML.

I'm also attaching a picture of VisualVM's Sampler with my little benchmark running, at the end.

Oh, and lastly, thanks for looking at the issue and taking a stab at a solution! Always nice, and sometimes a pleasant surprise, to work with people who care about their open-source software.

---

!visualvm screenshot.png!

> SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on
> every call
> -------------------------------------------------------------------------------------
>
>                 Key: MATH-1441
>                 URL: https://issues.apache.org/jira/browse/MATH-1441
>             Project: Commons Math
>          Issue Type: Improvement
>    Affects Versions: 3.6.1
>        Environment: Java 8, Linux x64.
>            Reporter: Max Aller
>            Priority: Minor
>              Labels: performance
>        Attachments: visualvm screenshot.png
>
>
> SimpleRegression#getSlopeConfidenceInterval, when called a lot (on the order
> of 100k or 1M times), is surprisingly slow - 3M calls, on my 3rd gen i7
> machine, takes *75 seconds*.
> This is primarily because recalculating the inverse cumulative probability,
> and reconstructing the TDistribution object itself, are somewhat expensive,
> entailing a lot of `log` and `sqrt` calls and iteration and all that stuff.
>
> The confidence interval is based on two values - *n* and *alpha*. I'd posit
> that alpha will _usually_ be one of a very small set of values, and n, well,
> at least in my case, I'm always calling this method with the same number of
> data points - a moving window of time-series data. But I recognize n might be
> all over the place for some users.
>
> In any event, I strongly believe some level of caching would greatly benefit
> the users of Commons Math. I've forked my own version of
> getSlopeConfidenceInterval that uses caching. Here's a test case
> demonstrating that:
> {code:java}
> class SlowRegressionsTest {
>     @Test
>     void slow() {
>         SimpleRegression simpleRegression = new SimpleRegression();
>         simpleRegression.addData(0.0, 0.0);
>         simpleRegression.addData(1.0, 1.0);
>         simpleRegression.addData(2.0, 2.0);
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < 3_000_000; i++) {
>             simpleRegression.getSlopeConfidenceInterval();
>         }
>         long duration = System.currentTimeMillis() - start;
>         System.out.printf("`slow` took %dms%n", duration);
>     }
>
>     @Test
>     void fast() {
>         SimpleRegression simpleRegression = new SimpleRegression();
>         simpleRegression.addData(0.0, 0.0);
>         simpleRegression.addData(1.0, 1.0);
>         simpleRegression.addData(2.0, 2.0);
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < 3_000_000; i++) {
>             SimpleRegressionUtilsKt.fastGetSlopeConfidenceInterval(simpleRegression);
>         }
>         long duration = System.currentTimeMillis() - start;
>         System.out.printf("`fast` took %dms%n", duration);
>     }
> }{code}
> which prints out
> {noformat}
> `fast` took 159ms
> `slow` took 75304ms{noformat}
> Nearly 500x faster - 53ns/call.
> My particular solution is written in Kotlin for Java 8, so not of direct
> relevance to you, but here it is:
> {code:java}
> package math
>
> import org.apache.commons.math3.distribution.TDistribution
> import org.apache.commons.math3.exception.OutOfRangeException
> import org.apache.commons.math3.exception.util.LocalizedFormats
> import org.apache.commons.math3.stat.regression.SimpleRegression
> import java.util.concurrent.ConcurrentHashMap
>
> @JvmOverloads
> fun SimpleRegression.fastGetSlopeConfidenceInterval(alpha: Double = 0.05): Double {
>     if (n < 3) {
>         return Double.NaN
>     }
>     if (alpha >= 1 || alpha <= 0) {
>         throw OutOfRangeException(
>             LocalizedFormats.SIGNIFICANCE_LEVEL,
>             alpha, 0, 1
>         )
>     }
>     // No advertised NotStrictlyPositiveException here - will return NaN above
>     // PATCH: use cached inverse cumulative probability
>     return slopeStdErr * getInverseCumulativeProbability(n, alpha)
> }
>
> private val cache = ConcurrentHashMap<Key, Double>()
>
> private data class Key(val n: Long, val alpha: Double)
>
> private fun getInverseCumulativeProbability(n: Long, alpha: Double): Double =
>     cache.getOrPut(Key(n, alpha)) {
>         TDistribution((n - 2).toDouble()).inverseCumulativeProbability(1.0 - alpha / 2.0)
>     }
> {code}
> Limitations: 1. it's Kotlin, 2. the ConcurrentHashMap is unbounded here.
>
> I don't know how/if Commons Math does caching elsewhere, but it'd sure be
> handy here, I believe. What are your thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
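On the unbounded-map limitation mentioned above: a bounded memoizer is one way to keep such a cache of quasi-constants from growing without limit. Here is a minimal Java sketch of that idea; the `BoundedCache` name and API are hypothetical illustrations, not part of Commons Math.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch of a small "quasi-constant" cache: a bounded,
 * thread-safe memoizer for expensive values such as t-distribution
 * critical values keyed by (n, alpha). Not part of Commons Math.
 */
final class BoundedCache<K, V> {
    private final int maxEntries;
    private final Function<K, V> compute;
    private final Map<K, V> map;

    BoundedCache(int maxEntries, Function<K, V> compute) {
        this.maxEntries = maxEntries;
        this.compute = compute;
        // Access-ordered LinkedHashMap: once the bound is exceeded, the
        // least recently used entry is evicted, so memory use stays fixed.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > BoundedCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached value for the key, computing it at most once
     *  while the entry remains in the cache. */
    synchronized V get(K key) {
        V value = map.get(key);
        if (value == null) { // assumes compute never returns null
            value = compute.apply(key);
            map.put(key, value);
        }
        return value;
    }

    /** Current number of cached entries (never exceeds the bound). */
    synchronized int size() {
        return map.size();
    }
}
```

For this issue, the key would be the (n, alpha) pair and `compute` would wrap the `TDistribution(n - 2).inverseCumulativeProbability(1 - alpha / 2)` call; the LRU bound trades a rare recomputation for a fixed memory ceiling.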