Hello,

I have some broad ideas about how the regression module should be structured, as briefly outlined with UML diagrams in my proposal.

This is the current implementation inside commons-math-stat-regression:
This is my proposed idea, where the structure was partly inspired by SuanShu, since it supports multiple types of regression (including logistic):
https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear

Disclaimer: I have only studied some econometrics and second-year computer science at university, so I have zero professional data engineering experience, but I am excited to start learning with this project. I don't currently know the exact needs of data engineers with regard to this module and am learning as I go, which is why I would very much appreciate any input on the kinds of requirements data engineers would have for this regression module.

From someone who has used the current implementation or will use this new implementation:
- What would make your life easier?
- What should definitely be kept?
- What should be added/improved?
- Any specific features or design criteria?
- Any changes or radically different approaches to the following idea?

Note: OLS, GLS and logistic regression are the first to be implemented, with a focus on architectural support for further additions. Changes will make use of new Java 8 features, specifically the Java Streams API, to improve performance and readability.

Updates to the proposed implementation UML in my proposal:
- "statistics-regression-reqLinearMath" will be replaced with EJML, as suggested by Mr. Eric Barnhill
  o This will include a custom matrix class extended from EJML's SimpleBase -> StatisticsMatrix
  o So if we decide to use an Apache Commons implementation of matrices later on, only this class should need to change internally.
- Abstract classes should have interfaces above them, or perhaps just be interfaces if a simpler approach is implemented (i.e. minimal OOP)

Notes about this proposed implementation:
- AbstractVariables and its child classes may not be necessary, i.e. just Estimators and Residuals classes
- Or perhaps it's best to follow the current implementation's example and have a single class per regression type to keep the hierarchy simple (at the risk of redundancy)?
- I have not looked into specific data members or individual methods yet. So far I am just taking notes from the current implementation and SuanShu
- The "statistics-regression-updating" components have quite complex algorithms, which will take me a lot of time to understand completely
  o So for now, I see myself making minimal changes to them, prioritizing the new "stored" components.
- RegressionDataLoader's purpose is to:
  o provide a clean input interface
  o and ensure that data from, say, double[][] is only converted to working form as a StatisticsMatrix object once
    • while allowing multiple types of regression to be calculated via a universal form
    • which could become a challenge once details are in order.

So this is the current state of my plan. With your input, I will move on to the next steps, plan more details, and start creating the software flowchart.

Thank you in advance for any advice/suggestions,
-Ben Nguyen
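
P.S. To make the RegressionDataLoader idea a little more concrete, here is a rough sketch of what I have in mind. It is not a settled design. StatisticsMatrix below is only a stand-in that wraps a plain array (in the proposal it would extend EJML's SimpleBase), and every method name, the Regression interface and the SimpleOls class are hypothetical names I made up for illustration. SimpleOls only handles a single regressor with an intercept, to keep the example short.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical stand-in for the proposed StatisticsMatrix. In the proposal it
// would extend EJML's SimpleBase; a plain array keeps this sketch dependency-free.
final class StatisticsMatrix {
    private final double[][] data;

    StatisticsMatrix(double[][] rows) {
        // Defensive copy so later changes to the caller's array do not leak in.
        this.data = Arrays.stream(rows).map(row -> row.clone()).toArray(double[][]::new);
    }

    int numRows() { return data.length; }

    double get(int row, int col) { return data[row][col]; }
}

// Hypothetical loader: converts the raw arrays to working form exactly once,
// then hands the same matrices to any number of regression types.
final class RegressionDataLoader {
    private final StatisticsMatrix x;
    private final StatisticsMatrix y;

    RegressionDataLoader(double[][] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same number of rows");
        }
        this.x = new StatisticsMatrix(x);
        // Store y as an n-by-1 column.
        this.y = new StatisticsMatrix(
            Arrays.stream(y).mapToObj(v -> new double[] {v}).toArray(double[][]::new));
    }

    StatisticsMatrix getX() { return x; }

    StatisticsMatrix getY() { return y; }
}

// Hypothetical common interface: the "universal form" each regression type accepts.
interface Regression {
    // Returns the estimated coefficients, intercept first.
    double[] estimate(RegressionDataLoader data);
}

// Illustrative only: closed-form OLS for one regressor plus an intercept.
final class SimpleOls implements Regression {
    @Override
    public double[] estimate(RegressionDataLoader data) {
        StatisticsMatrix x = data.getX();
        StatisticsMatrix y = data.getY();
        int n = x.numRows();
        double xBar = IntStream.range(0, n).mapToDouble(i -> x.get(i, 0)).average().orElse(Double.NaN);
        double yBar = IntStream.range(0, n).mapToDouble(i -> y.get(i, 0)).average().orElse(Double.NaN);
        // Sums of cross-products and squares around the means.
        double sxy = IntStream.range(0, n)
            .mapToDouble(i -> (x.get(i, 0) - xBar) * (y.get(i, 0) - yBar)).sum();
        double sxx = IntStream.range(0, n)
            .mapToDouble(i -> (x.get(i, 0) - xBar) * (x.get(i, 0) - xBar)).sum();
        double slope = sxy / sxx;
        return new double[] {yBar - slope * xBar, slope};
    }
}
```

The point of the sketch is that a second Regression implementation (GLS or logistic, say) could take the same RegressionDataLoader instance, so the double[][] input would not be converted again.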