[ https://issues.apache.org/jira/browse/MATH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341976#comment-16341976 ]
Max Aller commented on MATH-1441:
---------------------------------

I concur that constructing the TDistribution object isn't the bottleneck here – it's the `inverseCumulativeProbability` calculation immediately after that's expensive. So an effective caching solution will also need to encompass that calculation – though generating 1M+ TDistribution objects, as in my test case, doesn't help either.

Benchmark-wise, after dropping the executions from 3M to 1M for my sanity's sake, execution time improved from 9.873s to 9.600s with caching – a modest ~3% improvement. So I agree rolling back was the right call.

The right approach, IMO, is some subsystem/class for lazily calculating and storing quasi-constants like these, to be used throughout the codebase where it makes sense. I might be interested in contributing something like that if you're interested – what's the minimum JVM version you're targeting in Commons Math 4.0? I've also subscribed to the Commons Developer ML.

I'm also attaching a picture of VisualVM's Sampler with my little benchmark running, at the end.

Oh, and lastly, thanks for looking at the issue and taking a stab at a solution! Always nice, and sometimes a pleasant surprise, to work with people who care about their open-source software.

---

!visualvm screenshot.png!

> SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on
> every call
> -------------------------------------------------------------------------------------
>
>                 Key: MATH-1441
>                 URL: https://issues.apache.org/jira/browse/MATH-1441
>             Project: Commons Math
>          Issue Type: Improvement
>    Affects Versions: 3.6.1
>        Environment: Java 8, Linux x64.
>            Reporter: Max Aller
>            Priority: Minor
>              Labels: performance
>        Attachments: visualvm screenshot.png
>
>
> SimpleRegression#getSlopeConfidenceInterval, when called a lot (on the order
> of 100k or 1M times), is surprisingly slow - 3M calls, on my 3rd gen i7
> machine, takes *75 seconds*.
> This is primarily because recalculating the inverse cumulative probability,
> and reconstructing the TDistribution object itself, are somewhat expensive,
> entailing a lot of `log` and `sqrt` calls and iteration and all that stuff.
>
> The confidence interval is based on two values - *n* and *alpha*. I'd posit
> that alpha will _usually_ be one of a very small set of values, and n, well,
> at least in my case, I'm always calling this method with the same number of
> data points - a moving window of time-series data. But I recognize n might be
> all over the place for some users.
>
> In any event, I strongly believe some level of caching would greatly benefit
> the users of Commons Math. I've forked my own version of
> getSlopeConfidenceInterval that uses caching. Here's a test case
> demonstrating that:
> {code:java}
> class SlowRegressionsTest {
>     @Test
>     void slow() {
>         SimpleRegression simpleRegression = new SimpleRegression();
>         simpleRegression.addData(0.0, 0.0);
>         simpleRegression.addData(1.0, 1.0);
>         simpleRegression.addData(2.0, 2.0);
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < 3_000_000; i++) {
>             simpleRegression.getSlopeConfidenceInterval();
>         }
>         long duration = System.currentTimeMillis() - start;
>         System.out.printf("`slow` took %dms%n", duration);
>     }
>
>     @Test
>     void fast() {
>         SimpleRegression simpleRegression = new SimpleRegression();
>         simpleRegression.addData(0.0, 0.0);
>         simpleRegression.addData(1.0, 1.0);
>         simpleRegression.addData(2.0, 2.0);
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < 3_000_000; i++) {
>             SimpleRegressionUtilsKt.fastGetSlopeConfidenceInterval(simpleRegression);
>         }
>         long duration = System.currentTimeMillis() - start;
>         System.out.printf("`fast` took %dms%n", duration);
>     }
> }{code}
> which prints out
> {noformat}
> `fast` took 159ms
> `slow` took 75304ms{noformat}
> Nearly 500x faster - 53ns/call.
> My particular solution is written in Kotlin for Java 8, so not of direct
> relevance to you, but here it is:
> {code:java}
> package math
>
> import org.apache.commons.math3.distribution.TDistribution
> import org.apache.commons.math3.exception.OutOfRangeException
> import org.apache.commons.math3.exception.util.LocalizedFormats
> import org.apache.commons.math3.stat.regression.SimpleRegression
> import java.util.concurrent.ConcurrentHashMap
>
> @JvmOverloads
> fun SimpleRegression.fastGetSlopeConfidenceInterval(alpha: Double = 0.05): Double {
>     if (n < 3) {
>         return Double.NaN
>     }
>     if (alpha >= 1 || alpha <= 0) {
>         throw OutOfRangeException(
>             LocalizedFormats.SIGNIFICANCE_LEVEL,
>             alpha, 0, 1
>         )
>     }
>     // No advertised NotStrictlyPositiveException here - will return NaN above
>     // PATCH: use cached inverse cumulative probability
>     return slopeStdErr * getInverseCumulativeProbability(n, alpha)
> }
>
> private val cache = ConcurrentHashMap<Key, Double>()
>
> private data class Key(val n: Long, val alpha: Double)
>
> private fun getInverseCumulativeProbability(n: Long, alpha: Double): Double =
>     cache.getOrPut(Key(n, alpha)) {
>         TDistribution((n - 2).toDouble()).inverseCumulativeProbability(1.0 - alpha / 2.0)
>     }
> {code}
> Limitations: 1. it's Kotlin, 2. the ConcurrentHashMap is unbounded here.
>
> I don't know how/if Commons Math does caching elsewhere, but it'd sure be
> handy here, I believe. What are your thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
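On the unbounded-map limitation mentioned above: a bounded memoizer is one way to keep such a cache of quasi-constants from growing without limit. Here is a minimal Java sketch of that idea; the `BoundedCache` name and API are hypothetical illustrations, not part of Commons Math.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch of a small "quasi-constant" cache: a bounded,
 * thread-safe memoizer for expensive values such as t-distribution
 * critical values keyed by (n, alpha). Not part of Commons Math.
 */
final class BoundedCache<K, V> {
    private final int maxEntries;
    private final Function<K, V> compute;
    private final Map<K, V> map;

    BoundedCache(int maxEntries, Function<K, V> compute) {
        this.maxEntries = maxEntries;
        this.compute = compute;
        // Access-ordered LinkedHashMap: once the bound is exceeded, the
        // least recently used entry is evicted, so memory use stays fixed.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > BoundedCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached value for the key, computing it at most once
     *  while the entry remains in the cache. */
    synchronized V get(K key) {
        V value = map.get(key);
        if (value == null) { // assumes compute never returns null
            value = compute.apply(key);
            map.put(key, value);
        }
        return value;
    }

    /** Current number of cached entries (never exceeds the bound). */
    synchronized int size() {
        return map.size();
    }
}
```

For this issue, the key would be the (n, alpha) pair and `compute` would wrap the `TDistribution(n - 2).inverseCumulativeProbability(1 - alpha / 2)` call; the LRU bound trades a rare recomputation for a fixed memory ceiling.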