Re: [GSoC][STATISTICS][Regression] Architecture Implementation Suggestions

2019-05-16 Thread Gilles Sadowski
Hello.

On Thu, 16 May 2019 at 10:02, Ben Nguyen wrote:
>
> Hello,
>
>
>
> I have some broad general ideas about how the regression module should be 
> structured, as briefly outlined with UML diagrams in my proposal.
>
> This is the current implementation inside commons-math-stat-regression:

It seems there is/was an image here but I don't see it.

For this kind of information, please use JIRA (and provide the link here).

>
>
> This is my proposed idea, where the structure was partly inspired by SuanShu, 
> since it supported multiple types of regression (including logistic):
>
> https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear
>
>
>
> Disclaimer: I have only studied some econometrics and second-year computer 
> science at university, so I have zero professional data engineering 
> experience, but I am excited to start learning with this project. I don’t 
> currently know the exact needs of data engineers with regard to this module 
> and am learning as I go, which is why I would very much appreciate any input 
> on the kinds of requirements data engineers would have for this regression 
> module.

Basing a design on use-cases is very useful.
You should collect a range of them (small/large datasets, in-memory/stream,
dense/sparse) in order to figure out which parts of the code can be common and
which require specialization.
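
To make the "common vs. specialized" point concrete, here is a minimal
sketch for the simplest case (one regressor). All names (LinearFit,
UpdatingSimpleRegression, StoredSimpleRegression) are hypothetical and are
not taken from the current code or from the proposal. The "stream" variant
keeps only running sums, and the "in-memory" variant reuses it, so the
arithmetic exists in one place only:

```java
/** Hypothetical common view of a fitted line y = intercept + slope * x. */
interface LinearFit {
    double intercept();
    double slope();
}

/** "Stream" use-case: observations arrive one at a time and are not stored. */
final class UpdatingSimpleRegression implements LinearFit {
    private long n;
    private double meanX;
    private double meanY;
    private double sxx; // running sum of (x - meanX)^2
    private double sxy; // running sum of (x - meanX) * (y - meanY)

    void addObservation(double x, double y) {
        n++;
        double dx = x - meanX;    // deviation from the previous mean of x
        meanX += dx / n;
        meanY += (y - meanY) / n;
        sxx += dx * (x - meanX);  // uses the updated mean of x
        sxy += dx * (y - meanY);  // uses the updated mean of y
    }

    @Override
    public double slope() {
        return sxy / sxx;
    }

    @Override
    public double intercept() {
        return meanY - slope() * meanX;
    }
}

/** "In-memory" use-case: delegates to the accumulator above. */
final class StoredSimpleRegression {
    private StoredSimpleRegression() {}

    static LinearFit fit(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        UpdatingSimpleRegression acc = new UpdatingSimpleRegression();
        for (int i = 0; i < x.length; i++) {
            acc.addObservation(x[i], y[i]);
        }
        return acc;
    }
}
```

Whether such sharing survives for the multi-regressor and sparse cases is
exactly what the collected use-cases should tell us.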

>
> From someone who has used the current implementation or will use this new 
> implementation:
>
> What would make your life easier?
> What should definitely be kept?
> What should be added/improved?
> Any specific features or design criteria?
> Any changes or radically different approaches to the following idea?

Good questions!
What are your answers? ;-)

> Note: OLS, GLS and logistic regression are the first to be implemented, with 
> a focus on architectural support for further additions. Changes will make 
> use of new Java 8 features, specifically the Java Streams API, to improve 
> performance and readability.
>

+1
I'd suggest selecting one and starting to code, without worrying that you'll
probably have to change a lot of it as more use-cases are collected.
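
As a possible starting point, here is a rough sketch of single-regressor
OLS written with the Java 8 Streams API. The class and method names
(OlsStreamsSketch, fit, residuals) are made up for illustration; this is
not the proposed design, and whether streams actually help performance
would have to be measured:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class OlsStreamsSketch {
    private OlsStreamsSketch() {}

    /** Returns {intercept, slope} of the least-squares line through (x[i], y[i]). */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double meanX = Arrays.stream(x).average().getAsDouble();
        double meanY = Arrays.stream(y).average().getAsDouble();
        // Centered cross-product and sum of squares.
        double sxy = IntStream.range(0, x.length)
                .mapToDouble(i -> (x[i] - meanX) * (y[i] - meanY))
                .sum();
        double sxx = Arrays.stream(x)
                .map(v -> (v - meanX) * (v - meanX))
                .sum();
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }

    /** Residuals y[i] - (beta[0] + beta[1] * x[i]). */
    static double[] residuals(double[] x, double[] y, double[] beta) {
        return IntStream.range(0, x.length)
                .mapToDouble(i -> y[i] - (beta[0] + beta[1] * x[i]))
                .toArray();
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3};
        double[] y = {1, 3, 5, 7};
        double[] beta = fit(x, y);
        System.out.println("coefficients: " + Arrays.toString(beta));
        System.out.println("residuals:    " + Arrays.toString(residuals(x, y, beta)));
    }
}
```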

>
>
> Updates to this proposed implementation UML in my proposal:
>
> “statistics-regression-reqLinearMath” will be replaced with EJML as suggested 
> by Mr. Eric Barnhill
>
> This will include a custom matrix class extended from EJML’s SimpleBase -> 
> StatisticsMatrix
> So if we decide to use an Apache Commons implementation of matrices later on, 
> only this class should be changed internally.

Good precaution; but I doubt that we can include everything in a
single class.
How to best encapsulate the linear algebra (external) library is a
subject on its own, worth its own thread: cramming many questions
into a single post makes it likely that some will be missed by
people who might later question the chosen path.  [External
dependencies are a sensitive issue in Commons...]
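
Just to illustrate what "encapsulating" could mean, here is a sketch in
which the regression code depends only on a small interface. Every name in
it (RegressionMatrix, ArrayRegressionMatrix, NormalEquations) is
hypothetical, and the plain-array backend merely stands in for whichever
library is eventually chosen; no EJML call is shown or assumed:

```java
/** Hypothetical minimal matrix abstraction seen by the regression code. */
interface RegressionMatrix {
    int rows();
    int cols();
    double get(int row, int col);
    RegressionMatrix transpose();
    RegressionMatrix multiply(RegressionMatrix other);
}

/** Stand-in backend based on plain arrays; a library-backed class would replace it. */
final class ArrayRegressionMatrix implements RegressionMatrix {
    private final double[][] data;

    ArrayRegressionMatrix(double[][] source) {
        data = new double[source.length][];
        for (int i = 0; i < source.length; i++) {
            data[i] = source[i].clone(); // defensive copy
        }
    }

    @Override public int rows() { return data.length; }
    @Override public int cols() { return data.length == 0 ? 0 : data[0].length; }
    @Override public double get(int row, int col) { return data[row][col]; }

    @Override
    public RegressionMatrix transpose() {
        double[][] t = new double[cols()][rows()];
        for (int i = 0; i < rows(); i++) {
            for (int j = 0; j < cols(); j++) {
                t[j][i] = data[i][j];
            }
        }
        return new ArrayRegressionMatrix(t);
    }

    @Override
    public RegressionMatrix multiply(RegressionMatrix other) {
        if (cols() != other.rows()) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double[][] p = new double[rows()][other.cols()];
        for (int i = 0; i < rows(); i++) {
            for (int j = 0; j < other.cols(); j++) {
                for (int k = 0; k < cols(); k++) {
                    p[i][j] += data[i][k] * other.get(k, j);
                }
            }
        }
        return new ArrayRegressionMatrix(p);
    }
}

/** Regression-side code: written against the interface only. */
final class NormalEquations {
    private NormalEquations() {}

    /** Computes X^T X, the left-hand side of the OLS normal equations. */
    static RegressionMatrix gram(RegressionMatrix x) {
        return x.transpose().multiply(x);
    }
}
```

Even this toy version hints that decompositions, solvers and sparse storage
will each need their own abstraction.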

Also, a reminder that we need to take into account the comparative
benchmarks which I posted recently.  [Even if just to conclude that
EJML has overwhelming advantages (which?) that make it more
suitable than its "competitors".]

>
> Abstract classes should have interfaces above them or perhaps just be 
> interfaces if a simpler approach is implemented (i.e. minimal OOP)
>
> Notes about this proposed implementation:
>
> AbstractVariables and its child classes may not be necessary, i.e. just 
> Estimators and Residuals classes
> Or perhaps it’s best to follow the current implementation’s example and have 
> a single class per regression type for hierarchy simplicity (but risking 
> redundancies)?
> I have not looked into specific data members or individual methods yet. So 
> far just taking notes from the current implementation and SuanShu
> The “statistics-regression-updating” components have quite complex algorithms 
> which will require a lot of time for me to understand completely
>
> So for now, I see myself making minimal changes to them, prioritizing the new 
> “stored” components.

IMHO, this will be better discussed once an initial implementation is shown
(or perhaps, as Eric suggested, with unit tests).

Again, better to start a new thread for each specific question, possibly backed
with a new JIRA report focussed on a particular task (see "Create sub-tasks"
on JIRA).

>
> RegressionDataLoader’s purpose is to:
>
> provide a clean input interface
> and to ensure that data from say double[ ][ ] is only converted to working 
> form as a StatisticsMatrix object once

Until proven wrong, I'm a proponent of separating I/O from "useful"
computations.
I.e. I suggest that we consider, on the one hand, what API is required for all
the intended functionality and, on the other (in a *different* "maven module"),
all the conversions that may be implemented for the convenience of users.
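
A small sketch of that separation, with invented names throughout
(Observations, CoreStatistics, Conversions) and both "modules" shown in one
listing only for brevity: the computation sees a single input type, while
array and text handling live elsewhere.

```java
import java.util.Arrays;
import java.util.List;

/** Core module (hypothetical): the only input type the computations know about. */
interface Observations {
    int size();
    double x(int i);
    double y(int i);
}

/** Core module (hypothetical): pure computation, no parsing, no array conventions. */
final class CoreStatistics {
    private CoreStatistics() {}

    static double sampleCovariance(Observations obs) {
        int n = obs.size();
        if (n < 2) {
            throw new IllegalArgumentException("need at least two observations");
        }
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += obs.x(i) / n;
            meanY += obs.y(i) / n;
        }
        double sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (obs.x(i) - meanX) * (obs.y(i) - meanY);
        }
        return sum / (n - 1);
    }
}

/** Convenience module (hypothetical): conversions kept apart from the core. */
final class Conversions {
    private Conversions() {}

    /** Each row is {x, y}; the data is copied once, here. */
    static Observations fromRows(double[][] rows) {
        final double[][] copy = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            copy[i] = Arrays.copyOf(rows[i], 2);
        }
        return new Observations() {
            @Override public int size() { return copy.length; }
            @Override public double x(int i) { return copy[i][0]; }
            @Override public double y(int i) { return copy[i][1]; }
        };
    }

    /** Each line is "x,y"; parsing stays out of the core module. */
    static Observations fromCsvLines(List<String> lines) {
        double[][] rows = new double[lines.size()][];
        for (int i = 0; i < rows.length; i++) {
            String[] fields = lines.get(i).split(",");
            rows[i] = new double[] {
                Double.parseDouble(fields[0].trim()),
                Double.parseDouble(fields[1].trim())
            };
        }
        return fromRows(rows);
    }
}
```

With this layout, adding a new input format touches only the convenience
module.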

> while allowing multiple types of regression to be calculated via a universal 
> form….
> which could become a 

[GSoC][STATISTICS][Regression] Architecture Implementation Suggestions

2019-05-16 Thread Ben Nguyen
Hello,

I have some broad general ideas about how the regression module should be 
structured, as briefly outlined with UML diagrams in my proposal.
This is the current implementation inside commons-math-stat-regression:




This is my proposed idea, where the structure was partly inspired by SuanShu, 
since it supported multiple types of regression (including logistic):
https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear

Disclaimer: I have only studied some econometrics and second-year computer 
science at university, so I have zero professional data engineering experience, 
but I am excited to start learning with this project. I don’t currently know 
the exact needs of data engineers with regard to this module and am learning as 
I go, which is why I would very much appreciate any input on the kinds of 
requirements data engineers would have for this regression module.

From someone who has used the current implementation or will use this new 
implementation:
- What would make your life easier? 
- What should definitely be kept? 
- What should be added/improved?
- Any specific features or design criteria?
- Any changes or radically different approaches to the following idea?
Note: OLS, GLS and logistic regression are the first to be implemented, with 
a focus on architectural support for further additions. Changes will make 
use of new Java 8 features, specifically the Java Streams API, to improve 
performance and readability.



Updates to this proposed implementation UML in my proposal:
- “statistics-regression-reqLinearMath” will be replaced with EJML, as suggested 
  by Mr. Eric Barnhill
  o This will include a custom matrix class extended from EJML’s SimpleBase -> 
    StatisticsMatrix
  o So if we decide to use an Apache Commons implementation of matrices later on, 
    only this class should need to be changed internally.
- Abstract classes should have interfaces above them or perhaps just be 
  interfaces if a simpler approach is implemented (i.e. minimal OOP)

Notes about this proposed implementation:
- AbstractVariables and its child classes may not be necessary, i.e. just 
  Estimators and Residuals classes
- Or perhaps it’s best to follow the current implementation’s example and have 
  a single class per regression type for hierarchy simplicity (but risking 
  redundancies)?
- I have not looked into specific data members or individual methods yet. So 
  far I am just taking notes from the current implementation and SuanShu
- The “statistics-regression-updating” components have quite complex algorithms 
  which will require a lot of time for me to understand completely
  o So for now, I see myself making minimal changes to them, prioritizing the new 
    “stored” components.
- RegressionDataLoader’s purpose is to: 
  o provide a clean input interface 
  o and to ensure that data from, say, double[ ][ ] is only converted to working 
    form as a StatisticsMatrix object once 
    • while allowing multiple types of regression to be calculated via a universal 
      form, 
    • which could become a challenge once details are in order.

So this is the current state of my plan. With your input, I will move on to the 
next steps, plan more details, and start creating the software flowchart.

Thank you in advance for any advice/suggestions,
-Ben Nguyen
