[GSoC][STATISTICS][Regression] Architecture Implementation Suggestions

Ben Nguyen Thu, 16 May 2019 01:02:25 -0700

Hello,

I have some broad ideas about how the regression module should be 
structured, as outlined briefly with UML diagrams in my proposal.
This is the current implementation inside commons-math-stat-regression:





This is my proposed idea. The structure was partly inspired by SuanShu, 
since it supports multiple types of regression (including logistic):
https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear

Disclaimer: I have only studied some econometrics and second-year computer 
science at university, so I have no professional data engineering experience, 
but I am excited to start learning with this project. I don’t yet know the 
exact needs of data engineers with regard to this module and am learning as 
I go, which is why I would very much appreciate any input on the kinds of 
requirements data engineers would have for this regression module. 

From someone who has used the current implementation or will use this new 
implementation:
- What would make your life easier? 
- What should definitely be kept? 
- What should be added/improved?
- Any specific features or design criteria? 
- Any changes or radically different approaches to the following idea?
Note: OLS, GLS and logistic regression are the first to be implemented, with 
a focus on architectural support for further additions. Changes will make 
use of new Java 8 features, specifically the Java Streams API, to improve 
performance and readability.
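
As a rough illustration of the Streams point, here is a minimal sketch of a 
residual sum of squares written with IntStream. The class and method names 
are hypothetical and are not taken from the current implementation or from 
my proposal.

```java
import java.util.stream.IntStream;

/** Hypothetical helper; not part of the current or proposed API. */
final class ResidualStatsSketch {

    private ResidualStatsSketch() { }

    /** Residual sum of squares: sum over i of (observed[i] - fitted[i])^2. */
    static double residualSumOfSquares(double[] observed, double[] fitted) {
        if (observed.length != fitted.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        return IntStream.range(0, observed.length)
                .mapToDouble(i -> observed[i] - fitted[i]) // residual for row i
                .map(r -> r * r)                           // squared residual
                .sum();
    }
}
```

Whether a stream is actually faster than a plain loop would have to be 
measured case by case; the main gain in a sketch like this is readability.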



Updates to the proposed implementation UML in my proposal:
- “statistics-regression-reqLinearMath” will be replaced with EJML, as 
  suggested by Mr. Eric Barnhill
  o This will include a custom matrix class, StatisticsMatrix, extended from 
    EJML’s SimpleBase
  o So if we decide to use an Apache Commons implementation of matrices later 
    on, only this class should need to change internally.
- Abstract classes should have interfaces above them, or perhaps just be 
  interfaces if a simpler approach is implemented (i.e. minimal OOP)
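
To make the "only this class should change" idea concrete, here is a minimal 
sketch under stated assumptions. It does not use EJML: a plain double[][] 
stands in for the backend so the example has no dependencies, and every 
class and method name is hypothetical. The point is only that regression 
code would depend on the wrapper, never on the backend directly.

```java
import java.util.Arrays;

/**
 * Hypothetical stand-in for the proposed StatisticsMatrix. A plain double[][]
 * plays the role of the EJML backend here; swapping the backend would only
 * touch this class.
 */
final class StatisticsMatrixSketch {

    private final double[][] data; // backend detail, hidden from callers

    private StatisticsMatrixSketch(double[][] data) {
        this.data = data;
    }

    /** Copies the input so later changes to the caller's array have no effect. */
    static StatisticsMatrixSketch of(double[][] rows) {
        if (rows.length == 0 || rows[0].length == 0) {
            throw new IllegalArgumentException("Matrix must not be empty");
        }
        double[][] copy = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            if (rows[i].length != rows[0].length) {
                throw new IllegalArgumentException("Rows must have equal length");
            }
            copy[i] = Arrays.copyOf(rows[i], rows[i].length);
        }
        return new StatisticsMatrixSketch(copy);
    }

    int numRows() { return data.length; }

    int numCols() { return data[0].length; }

    double get(int row, int col) { return data[row][col]; }

    StatisticsMatrixSketch transpose() {
        double[][] t = new double[numCols()][numRows()];
        for (int i = 0; i < numRows(); i++) {
            for (int j = 0; j < numCols(); j++) {
                t[j][i] = data[i][j];
            }
        }
        return new StatisticsMatrixSketch(t);
    }

    StatisticsMatrixSketch multiply(StatisticsMatrixSketch other) {
        if (numCols() != other.numRows()) {
            throw new IllegalArgumentException("Inner dimensions must match");
        }
        double[][] p = new double[numRows()][other.numCols()];
        for (int i = 0; i < numRows(); i++) {
            for (int j = 0; j < other.numCols(); j++) {
                for (int k = 0; k < numCols(); k++) {
                    p[i][j] += data[i][k] * other.data[k][j];
                }
            }
        }
        return new StatisticsMatrixSketch(p);
    }
}
```

A regression class could then form, for example, x.transpose().multiply(x) 
without knowing which matrix library does the work underneath.
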
Notes about this proposed implementation:
- AbstractVariables and its child classes may not be necessary, i.e. just 
  Estimators and Residuals classes
- Or perhaps it’s best to follow the current implementation’s example and have 
  a single class per regression type to keep the hierarchy simple (at the risk 
  of redundancy)?
- I have not looked into specific data members or individual methods yet. So 
  far I am just taking notes from the current implementation and SuanShu
- The “statistics-regression-updating” components have quite complex 
  algorithms, which will take me a lot of time to understand completely
  o So for now, I see myself making minimal changes to them and prioritizing 
    the new “stored” components.
- RegressionDataLoader’s purpose is to: 
  o provide a clean input interface 
  o ensure that data from, say, a double[ ][ ] is converted to its working 
    form as a StatisticsMatrix object only once 
    • while allowing multiple types of regression to be calculated via a 
      universal form, 
    • which could become a challenge once the details are worked out.
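
The RegressionDataLoader idea could be sketched roughly as follows. This is 
only an illustration under my own assumptions: all class, interface and 
method names are hypothetical, plain arrays stand in for StatisticsMatrix, 
and the two "regressions" are deliberately trivial (a one-regressor OLS and 
an intercept-only model) so the shared-data pattern stays visible.

```java
import java.util.Arrays;

/** Hypothetical loader: validates and copies the raw input exactly once. */
final class RegressionDataLoaderSketch {

    private final double[][] x; // regressors, one row per observation
    private final double[] y;   // responses

    private RegressionDataLoaderSketch(double[][] x, double[] y) {
        this.x = x;
        this.y = y;
    }

    static RegressionDataLoaderSketch load(double[][] x, double[] y) {
        if (x.length == 0 || x.length != y.length) {
            throw new IllegalArgumentException("x and y need the same non-zero length");
        }
        double[][] xCopy = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            if (x[i].length == 0 || x[i].length != x[0].length) {
                throw new IllegalArgumentException("Rows must be non-empty and equal in length");
            }
            xCopy[i] = Arrays.copyOf(x[i], x[i].length);
        }
        return new RegressionDataLoaderSketch(xCopy, Arrays.copyOf(y, y.length));
    }

    int size() { return y.length; }

    double x(int row, int col) { return x[row][col]; }

    double y(int row) { return y[row]; }
}

/** Hypothetical "universal form": every regression type accepts loaded data. */
interface RegressionSketch {
    double[] fit(RegressionDataLoaderSketch data);
}

/** Ordinary least squares for y = a + b * x, using the first column only. */
final class SimpleOlsSketch implements RegressionSketch {
    @Override
    public double[] fit(RegressionDataLoaderSketch data) {
        int n = data.size();
        double sumX = 0;
        double sumY = 0;
        for (int i = 0; i < n; i++) {
            sumX += data.x(i, 0);
            sumY += data.y(i);
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        double sxy = 0;
        double sxx = 0;
        for (int i = 0; i < n; i++) {
            double dx = data.x(i, 0) - meanX;
            sxy += dx * (data.y(i) - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("Regressor has no variation");
        }
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope}; // {intercept, slope}
    }
}

/** Intercept-only model: the fitted coefficient is the mean of y. */
final class MeanOnlySketch implements RegressionSketch {
    @Override
    public double[] fit(RegressionDataLoaderSketch data) {
        double sumY = 0;
        for (int i = 0; i < data.size(); i++) {
            sumY += data.y(i);
        }
        return new double[] {sumY / data.size()};
    }
}
```

Both models are fitted from the same loaded object, so the conversion from 
the raw arrays happens once no matter how many regression types are run.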

This is the current state of my plan. With your input, I will move on to the 
next steps: planning more details and starting on the software flowchart.

Thank you in advance for any advice/suggestions,
-Ben Nguyen
