Hello.
On Thu, 16 May 2019 at 10:02, Ben Nguyen wrote:
>
> Hello,
>
>
>
> I have some broad general ideas about how the regression module should be
> structured, as outlined in my proposal briefly with UMLs
>
> This is the current implementation inside commons-math-stat-regression:
It seems there is/was an image here but I don't see it.
For this kind of information, please use JIRA (and provide the link here).
>
>
> This is my proposed idea, where the structure was partly inspired by SuanShu
> since it supported multiple types of regression (including logistic):
>
> https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear
>
>
>
> Disclaimer: I have only studied some econometrics and second year computer
> science in university, so I have zero professional data engineering
> experience, but am excited to start learning with this project. So, I don’t
> currently know the exact needs of data engineers with regard to this module
> and am learning as I go, which is why I would very much appreciate any input
> on the kinds of requirements data engineers would want from this regression
> module.
Basing a design on use-cases is very useful.
You should collect a range of them (small/large datasets, in-memory/stream,
dense/sparse) in order to figure out which parts of the code can be common
and which require specialization.
>
> From someone who has used the current implementation or will use this new
> implementation:
>
> What would make your life easier?
> What should definitely be kept?
> What should be added/improved?
> Any specific features or design criteria?
> Any changes or radically different approaches to the following idea?
Good questions!
What are your answers? ;-)
> Note: OLS, GLS and Logistic regression are the first to be implemented, with
> a focus on architectural support for further additions. Changes will make
> use of new Java 8 features, specifically the Java Streams API to improve
> performance and readability.
>
+1
I'd suggest selecting one and starting to code, without worrying that you'll
probably have to change a lot of it as more use-cases are collected.
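As an idea of how small a first step can be, here is a minimal sketch of
single-regressor OLS written with Java 8 streams. The class name
`SimpleOlsSketch` and its methods are hypothetical and are not the existing
Commons Math API; it only illustrates the stream style and makes no claim
about performance.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Hypothetical illustration only; not an existing Commons class. */
final class SimpleOlsSketch {
    final double intercept;
    final double slope;

    private SimpleOlsSketch(double intercept, double slope) {
        this.intercept = intercept;
        this.slope = slope;
    }

    /** Fits y = intercept + slope * x by ordinary least squares. */
    static SimpleOlsSketch fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Need at least two paired observations");
        }
        final double meanX = Arrays.stream(x).average().getAsDouble();
        final double meanY = Arrays.stream(y).average().getAsDouble();
        // Sum of cross-deviations and sum of squared deviations of x.
        final double sxy = IntStream.range(0, x.length)
                .mapToDouble(i -> (x[i] - meanX) * (y[i] - meanY))
                .sum();
        final double sxx = Arrays.stream(x)
                .map(v -> (v - meanX) * (v - meanX))
                .sum();
        if (sxx == 0.0) {
            throw new IllegalArgumentException("All x values are identical");
        }
        final double slope = sxy / sxx;
        return new SimpleOlsSketch(meanY - slope * meanX, slope);
    }

    /** Evaluates the fitted line at x. */
    double predict(double x) {
        return intercept + slope * x;
    }
}
```

Something this small is enough to attach unit tests to, and it will show
quickly which parts need to change once streaming or sparse use-cases come in.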
>
>
> Updates to this proposed implementation UML in my proposal:
>
> “statistics-regression-reqLinearMath” will be replaced with EJML as suggested
> by Mr. Eric Barnhill
>
> This will include a custom matrix class extended from EJML’s SimpleBase ->
> StatisticsMatrix
> So if we decide to use an Apache Commons implementation of matrices later on,
> only this class should be changed internally.
Good precaution; but I doubt that we can include everything in a
single class.
How to best encapsulate the linear algebra (external) library is a
subject on its own, worth its own thread: Cramming many questions
in a single post makes it likely that some will be missed by some
people who might later on question the chosen path. [External
dependencies are a sensitive issue in Commons...]
Also, a reminder that we need to take into account the comparative
benchmarks which I posted recently. [Even if just to conclude that
EJML has overwhelming advantages (which?) that make it more
suitable than its "competitors".]
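On the encapsulation point, a minimal sketch of what hiding the linear
algebra backend behind a small interface could look like. Everything here is
hypothetical: `RegressionMatrix` and `ArrayRegressionMatrix` are names made
up for illustration, the array-backed class merely stands in for whatever
library ends up being chosen, and no EJML call is used or implied.

```java
/** Hypothetical: the only matrix operations the regression code would see. */
interface RegressionMatrix {
    int rows();
    int cols();
    double get(int row, int col);
    RegressionMatrix transpose();
    RegressionMatrix multiply(RegressionMatrix other);
}

/** Hypothetical array-backed stand-in; a library-backed class could replace it. */
final class ArrayRegressionMatrix implements RegressionMatrix {
    private final double[][] data;

    ArrayRegressionMatrix(double[][] values) {
        // Defensive copy so that callers cannot mutate the internal state.
        data = new double[values.length][];
        for (int i = 0; i < values.length; i++) {
            data[i] = values[i].clone();
        }
    }

    public int rows() { return data.length; }

    public int cols() { return data.length == 0 ? 0 : data[0].length; }

    public double get(int row, int col) { return data[row][col]; }

    public RegressionMatrix transpose() {
        double[][] t = new double[cols()][rows()];
        for (int i = 0; i < rows(); i++) {
            for (int j = 0; j < cols(); j++) {
                t[j][i] = data[i][j];
            }
        }
        return new ArrayRegressionMatrix(t);
    }

    public RegressionMatrix multiply(RegressionMatrix other) {
        if (cols() != other.rows()) {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        double[][] p = new double[rows()][other.cols()];
        for (int i = 0; i < rows(); i++) {
            for (int j = 0; j < other.cols(); j++) {
                double sum = 0;
                for (int k = 0; k < cols(); k++) {
                    sum += data[i][k] * other.get(k, j);
                }
                p[i][j] = sum;
            }
        }
        return new ArrayRegressionMatrix(p);
    }
}
```

Even this toy version hints at the difficulty: decompositions, solvers and
sparse storage would each need their own abstraction, which is why a single
wrapper class is unlikely to be enough.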
>
> Abstract classes should have interfaces above them or perhaps just be
> interfaces if a simpler approach is implemented (i.e. minimal OOP)
>
> Notes about this proposed implementation:
>
> AbstractVariables and its child classes may not be necessary, i.e. just
> Estimators and Residuals classes
> Or perhaps it’s best to follow the current implementation’s example and have
> a single class per regression type for hierarchy simplicity (but risking
> redundancies)?
> I have not looked into specific data members or individual methods yet. So
> far just taking notes from the current implementation and SuanShu
> The “statistics-regression-updating” components have quite complex algorithms
> which will require a lot of time for me to understand completely
>
> So for now, I see myself making minimal changes to them, prioritizing the new
> “stored” components.
IMHO, this will be better discussed once an initial implementation is shown
(or perhaps, as Eric suggested, with unit tests).
Again, better to start a new thread for each specific question, possibly backed
with a new JIRA report focussed on a particular task (see "Create sub-tasks"
on JIRA).
>
> RegressionDataLoader’s purpose is to:
>
> provide a clean input interface
> and to ensure that data from say double[ ][ ] is only converted to working
> form as a StatisticsMatrix object once
Until proven wrong, I'm a proponent of separating I/O from "useful"
computations.
I.e. I suggest that we consider on the one hand what API is required for all the
intended functionalities, and on the other (in a *different* "maven module"),
all the conversions that may be implemented for the convenience of users.
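To illustrate the separation I have in mind, a minimal sketch follows. All
names are hypothetical (`RegressionData`, `ArrayConversions`): the interface
would belong to the computation module, while the conversion helper would
live in a separate module that merely depends on it.

```java
/** Hypothetical "core" view of the data; the only thing estimators would see. */
interface RegressionData {
    int observations();
    int regressors();
    double response(int i);
    double regressor(int i, int j);
}

/** Hypothetical convenience conversions, meant for a *different* module. */
final class ArrayConversions {
    private ArrayConversions() {}

    /** Copies the arrays once and exposes them through the core interface. */
    static RegressionData of(double[] y, double[][] x) {
        if (y.length != x.length) {
            throw new IllegalArgumentException("y and x must have the same number of rows");
        }
        final double[] yCopy = y.clone();
        final double[][] xCopy = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            xCopy[i] = x[i].clone();
        }
        return new RegressionData() {
            public int observations() { return yCopy.length; }
            public int regressors() { return xCopy.length == 0 ? 0 : xCopy[0].length; }
            public double response(int i) { return yCopy[i]; }
            public double regressor(int i, int j) { return xCopy[i][j]; }
        };
    }
}
```

With such a split, adding a new input format (a file or a stream, say)
would not touch the computation code at all.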
> while allowing multiple types of regression to be calculated via a universal
> form….
> which could become a