RE: [statistics] Pull request for GLSMultipleLinearRegression

Ben Nguyen Thu, 23 May 2019 07:25:55 -0700

Hello,

The commons-math-stat libraries are currently being transitioned to the new 
commons-statistics library. I am working on regression-related design for my 
Google Summer of Code project. I am a new contributor and would love to work 
with people who have used these tools extensively, to gain more insight.


The transition is mostly in the design stage. We are still working out 
fundamental questions, such as which linear algebra library to use (not the one 
from commons-math, since it's outdated), and designing a better, more flexible 
interface.

I have not yet looked into GLS in as much depth as OLS or the new 
LogisticRegression component, so perhaps you can help contribute to the GLS 
component to ensure your needs are met. Our goal is also to maximize 
efficiency in all areas, using Java 8 features such as the Streams API 
where they would improve performance.

Issue for the regression component (please post your insights there as well): 
https://issues.apache.org/jira/browse/STATISTICS-8
GitHub Repo: https://github.com/apache/commons-statistics

Thank you for your post,
Cheers,
-Ben Nguyen

From: Елена Картышева
Sent: Thursday, May 23, 2019 8:44 AM
To: dev
Subject: [statistics] Pull request for GLSMultipleLinearRegression

Hello.

I would like to propose a pull request implementing an option to use a variance 
vector instead of a covariance matrix. It lets users avoid unnecessary memory 
usage and excessive computation when the errors are uncorrelated but 
heteroscedastic, making it possible to work with huge input matrices. Using a 
variance vector in such cases reduces time complexity from O(N^2) to just O(N) 
(where N is the number of observations) and dramatically reduces memory usage. 
For example, in my work I needed to train a generalized linear model. The 
iteratively reweighted least squares (IRLS) algorithm requires weighted 
regression with more than a million observations. The current implementation 
would require approximately 12 terabytes of memory, while the patched version 
needs only 8 megabytes. Since IRLS is an iterative algorithm, a million-fold 
reduction in complexity is also pretty handy.
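
To illustrate the idea, here is a minimal sketch of GLS for the special case of 
a diagonal covariance matrix supplied as a variance vector. It is not the code 
from the pull request and does not use the commons-math API: the class name 
DiagonalGlsSketch and the method fit are hypothetical, and it solves the 
weighted normal equations with plain Gaussian elimination, which is simpler but 
less numerically robust than a QR-based solver. The point is only that each 
observation contributes through a single weight 1/variance, so no N x N matrix 
is ever allocated.

```java
import java.util.Arrays;

/** Sketch only: GLS with a diagonal covariance given as a variance vector. */
final class DiagonalGlsSketch {

    /**
     * Fits y = X * beta by weighted least squares.
     * x has N rows and p columns (add a column of ones for an intercept);
     * variances holds one positive error variance per observation.
     * Extra memory is O(p^2); one pass over the data costs O(N * p^2).
     */
    static double[] fit(double[][] x, double[] y, double[] variances) {
        int n = x.length;
        int p = x[0].length;
        // Augmented normal equations [X'WX | X'Wy] with W = diag(1 / variance).
        double[][] a = new double[p][p + 1];
        for (int i = 0; i < n; i++) {
            double w = 1.0 / variances[i];
            for (int j = 0; j < p; j++) {
                double wxj = w * x[i][j];
                for (int k = 0; k < p; k++) {
                    a[j][k] += wxj * x[i][k];
                }
                a[j][p] += wxj * y[i];
            }
        }
        return solve(a);
    }

    /** Gaussian elimination with partial pivoting on a p x (p + 1) system. */
    private static double[] solve(double[][] a) {
        int p = a.length;
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            if (a[col][col] == 0.0) {
                throw new ArithmeticException("singular normal equations");
            }
            for (int r = col + 1; r < p; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= f * a[col][c];
                }
            }
        }
        // Back substitution.
        double[] beta = new double[p];
        for (int r = p - 1; r >= 0; r--) {
            double s = a[r][p];
            for (int c = r + 1; c < p; c++) {
                s -= a[r][c] * beta[c];
            }
            beta[r] = s / a[r][r];
        }
        return beta;
    }

    public static void main(String[] args) {
        // Data lie exactly on y = 1 + 2x, so any positive variances
        // should give coefficients close to 1 and 2.
        double[][] x = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] y = {1, 3, 5, 7};
        double[] variances = {1.0, 4.0, 0.25, 9.0};
        System.out.println(Arrays.toString(fit(x, y, variances)));
    }
}
```

In an IRLS loop, a routine like fit would be called once per iteration with an 
updated variance vector, so the per-iteration cost stays linear in N.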

 
-- 
Sincerely yours, Elena Kartysheva.

