[ 
https://issues.apache.org/jira/browse/STATISTICS-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Nguyen updated STATISTICS-11:
---------------------------------
    Description: 
*LINK TO DEVELOPMENT BRANCH:*

*[https://github.com/BBenNguyenn/commons-statistics/tree/STATISTICS-8_Regression_Module/commons-statistics-regression]*

 

+*[GSoC][STATISTICS][Regression] Architecture Implementation Suggestions*+

Hello,

I have some broad ideas about how the regression module should be 
structured, as outlined briefly in my proposal with UML diagrams.

This is the current implementation inside commons-math-stat-regression:

 
[!https://media.discordapp.net/attachments/550006517747810324/578420187460796417/unknown.png?width=400&height=295!|https://cdn.discordapp.com/attachments/550006517747810324/578420187460796417/unknown.png]

 

 *{color:#ff0000}GILLES SADOWSKI:{color}*

{color:#ff0000}It seems there is/was an image here but I don't see it.{color}

{color:#ff0000}For this kind of information, please use JIRA (and provide the 
link here).{color}

This is my proposed idea, where the structure was partly inspired by SuanShu 
since it supported multiple types of regression (including logistic):

[https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear]

 

Disclaimer: I have only studied some econometrics and second-year computer 
science at university, so I have no professional data engineering experience, 
but I am excited to start learning with this project. I don’t yet know the 
exact needs of data engineers with regard to this module and am learning as 
I go, which is why I would very much appreciate any input on the kinds of 
requirements data engineers would have for this regression module.

 *{color:#ff0000}GILLES SADOWSKI:{color}*

{color:#ff0000}Basing a design on use-cases is very useful.{color}

{color:#ff0000}You should collect a range of them (small/large datasets, in-memory/stream, dense/sparse) in order to figure out which parts of the code can be common and which require specialization.{color}

From someone who has used the current implementation or will use this new 
implementation:
 * What would make your life easier?
 * What should definitely be kept?
 * What should be added/improved?
 * Any specific features or design criteria?
 * Any changes or radically different approaches to the following idea?

 *{color:#ff0000}GILLES SADOWSKI:{color}*

{color:#ff0000}Good questions!{color}

{color:#ff0000}What are your answers? ;){color}

Note: OLS, GLS and logistic regression are the first to be implemented, with 
a focus on architectural support for further additions. Changes will make 
use of new Java 8 features, specifically the Java Streams API, to improve 
performance and readability.

 

 *{color:#ff0000}GILLES SADOWSKI:{color}*

{color:#ff0000}+1{color}

{color:#ff0000}I'd suggest selecting one and starting to code, without fearing that you'll probably have to change a lot of it as more use-cases are collected.{color}
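
To make the note about OLS and the Java Streams API concrete, here is a minimal sketch of a single-regressor ordinary least squares fit written with Java 8 streams. The class name `SimpleOlsSketch` and its `fit` method are hypothetical and are not taken from the development branch or from commons-math. The sketch only shows what stream-based code could look like; it makes no claim about performance.

```java
import java.util.stream.IntStream;

/** Hypothetical helper for illustration; not part of commons-math or commons-statistics. */
final class SimpleOlsSketch {
    private SimpleOlsSketch() {}

    /**
     * Fits y = intercept + slope * x by ordinary least squares.
     *
     * @return a two-element array {intercept, slope}
     */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equal-length samples of size >= 2");
        }
        final int n = x.length;
        final double meanX = IntStream.range(0, n).mapToDouble(i -> x[i]).sum() / n;
        final double meanY = IntStream.range(0, n).mapToDouble(i -> y[i]).sum() / n;
        // Centered cross-product and sum of squares.
        final double sxy = IntStream.range(0, n)
                .mapToDouble(i -> (x[i] - meanX) * (y[i] - meanY))
                .sum();
        final double sxx = IntStream.range(0, n)
                .mapToDouble(i -> (x[i] - meanX) * (x[i] - meanX))
                .sum();
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x has zero variance");
        }
        final double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }
}
```

Calling `SimpleOlsSketch.fit(x, y)` on points that lie exactly on a line recovers that line's intercept and slope up to rounding. A multiple-regression version would need the linear algebra layer discussed in the updates list.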

[!https://media.discordapp.net/attachments/550006517747810324/578420230850740225/unknown.png?width=219&height=300!|https://cdn.discordapp.com/attachments/550006517747810324/578420230850740225/unknown.png]

*+Updates to this proposed implementation UML in my proposal:+*
 * “statistics-regression-reqLinearMath” will be replaced with EJML as 
suggested by Mr. Eric Barnhill
 * This will include a custom matrix class extended from EJML’s SimpleBase -> 
StatisticsMatrix
 * So if we decide to use an Apache Commons implementation of matrices later 
on, only this class should be changed internally.
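
The isolation idea in the bullets above can be sketched as a facade whose callers never see the backing library. This is an assumption-laden illustration: the class name `StatisticsMatrixSketch` is hypothetical, and a plain `double[][]` stands in for EJML so that the example runs without any dependency. It does not extend EJML's SimpleBase and does not reproduce the branch's actual `StatisticsMatrix`.

```java
/** Hypothetical facade for illustration; the double[][] backend stands in for EJML. */
final class StatisticsMatrixSketch {
    // Only this class touches the backing storage; swapping the backend
    // would change this field and the method bodies, not the callers.
    private final double[][] data;

    private StatisticsMatrixSketch(double[][] data) {
        this.data = data;
    }

    /** Copies a rectangular, non-empty array into the working form. */
    static StatisticsMatrixSketch of(double[][] rows) {
        if (rows.length == 0 || rows[0].length == 0) {
            throw new IllegalArgumentException("empty input");
        }
        final int cols = rows[0].length;
        final double[][] copy = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            if (rows[i].length != cols) {
                throw new IllegalArgumentException("ragged input at row " + i);
            }
            copy[i] = rows[i].clone();
        }
        return new StatisticsMatrixSketch(copy);
    }

    int rows() { return data.length; }

    int cols() { return data[0].length; }

    double get(int i, int j) { return data[i][j]; }

    StatisticsMatrixSketch transpose() {
        final double[][] t = new double[cols()][rows()];
        for (int i = 0; i < rows(); i++) {
            for (int j = 0; j < cols(); j++) {
                t[j][i] = data[i][j];
            }
        }
        return new StatisticsMatrixSketch(t);
    }

    StatisticsMatrixSketch multiply(StatisticsMatrixSketch other) {
        if (cols() != other.rows()) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        final double[][] p = new double[rows()][other.cols()];
        for (int i = 0; i < rows(); i++) {
            for (int j = 0; j < other.cols(); j++) {
                for (int k = 0; k < cols(); k++) {
                    p[i][j] += data[i][k] * other.data[k][j];
                }
            }
        }
        return new StatisticsMatrixSketch(p);
    }
}
```

Regression code written against `transpose()` and `multiply()` alone would not need to change if the storage were replaced. Whether a single class can cover every required operation (decompositions, solvers) is left open by this sketch.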

{color:#ff0000} *GILLES SADOWSKI:*{color}

{color:#ff0000}Good precaution; but I doubt that we can include everything in a single class.{color}

{color:#ff0000}How to best encapsulate the linear algebra (external) library is a subject on its own, worth its own thread: cramming many questions into a single post makes it likely that some will be missed by some people who might later on question the chosen path. [External dependencies are a sensitive issue in Commons...]{color}

{color:#ff0000}Also, a reminder that we need to take into account the comparative benchmarks which I posted recently. [Even if just to conclude that EJML has overwhelming advantages (which?) that make it more suitable than its "competitors".]{color}
 * Abstract classes should have interfaces above them or perhaps just be 
interfaces if a simpler approach is implemented (i.e. minimal OOP)

*+Notes about this proposed implementation:+*
 * AbstractVariables and its child classes may not be necessary, i.e. just 
Estimators and Residuals classes
 * Or perhaps it’s best to follow the current implementation’s example and have 
a single class per regression type for hierarchy simplicity (but risking 
redundancies)?
 * I have not looked into specific data members or individual methods yet. So 
far I am just taking notes from the current implementation and SuanShu
 * The “statistics-regression-updating” components have quite complex 
algorithms which will require a lot of time for me to understand completely
 * So for now, I see myself making minimal changes to them, prioritizing the 
new “stored” components.

{color:#ff0000} *GILLES SADOWSKI:*{color}

{color:#ff0000}IMHO, this will be better discussed once an initial implementation is shown (or perhaps, as Eric suggested, with unit tests).{color}

{color:#ff0000}Again, better to start a new thread for each specific question, possibly backed with a new JIRA report focussed on a particular task (see "Create sub-tasks" on JIRA).{color}
 * RegressionDataLoader’s purpose is to:
 * provide a clean input interface
 * and to ensure that data from say double[ ][ ] is only converted to working 
form as a StatisticsMatrix object once

{color:#ff0000} *GILLES SADOWSKI:*{color}

{color:#ff0000}Until proven wrong, I'm a proponent of separating I/O from "useful" computations.{color}

{color:#ff0000}I.e. I suggest that we consider on the one hand what API is required for all the intended functionalities, and on the other (in a *different* "maven module"), all the conversions that may be implemented for the convenience of users.{color}
 * while allowing multiple types of regression to be calculated via a universal 
form...
 * which could become a challenge once details are in order.
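
The RegressionDataLoader role described in the bullets above (convert the raw input once, then let several regression types read the same working form) might look roughly like the following. Everything here is a hypothetical sketch: the class name `RegressionDataLoaderSketch`, the `{y, x1, ..., xk}` row convention, and the `fit` entry point are assumptions for illustration, and plain arrays stand in for a StatisticsMatrix.

```java
import java.util.Arrays;
import java.util.function.Function;

/** Hypothetical loader for illustration: validates and converts raw input once, then is read-only. */
final class RegressionDataLoaderSketch {
    private final double[] y;    // response values
    private final double[][] x;  // regressors, one row per observation

    private RegressionDataLoaderSketch(double[] y, double[][] x) {
        this.y = y;
        this.x = x;
    }

    /** Assumed input convention: each row is {y, x1, ..., xk}. */
    static RegressionDataLoaderSketch fromRows(double[][] rows) {
        if (rows.length == 0 || rows[0].length < 2) {
            throw new IllegalArgumentException("each row needs a response and at least one regressor");
        }
        final int k = rows[0].length - 1;
        final double[] y = new double[rows.length];
        final double[][] x = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            if (rows[i].length != k + 1) {
                throw new IllegalArgumentException("ragged input at row " + i);
            }
            y[i] = rows[i][0];
            x[i] = Arrays.copyOfRange(rows[i], 1, k + 1);
        }
        return new RegressionDataLoaderSketch(y, x);
    }

    int observations() { return y.length; }

    int regressors() { return x[0].length; }

    double response(int i) { return y[i]; }

    double regressor(int i, int j) { return x[i][j]; }

    /** Any regression type consumes the same converted data through one entry point. */
    <R> R fit(Function<RegressionDataLoaderSketch, R> regression) {
        return regression.apply(this);
    }
}
```

Because the converted data is exposed only through read accessors, several estimators can be passed to `fit` in turn without repeating the conversion. Format-specific readers (files, streams) could live in a separate module and simply produce the `double[][]` that `fromRows` accepts, which keeps I/O apart from the computation.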

 

So this is the current state of my plan. With your input, I will move on to the 
next steps, plan more details and start creating the software flowchart.

 

Thank you in advance for any advice/suggestions,

{color:#ff0000} *GILLES SADOWSKI:*{color}

{color:#ff0000}To summarize, my main suggestion is to split this post into more manageable chunks.{color}

{color:#ff0000}Regards,{color}

{color:#ff0000}Gilles{color}

-Ben Nguyen


> OVERALL-TASK (not yet split): Designing Robust Class Structure and 
> Architecture
> -------------------------------------------------------------------------------
>
>                 Key: STATISTICS-11
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-11
>             Project: Apache Commons Statistics
>          Issue Type: Sub-task
>            Reporter: Ben Nguyen
>            Priority: Major
>         Attachments: Current-Implementation.png, Proposed Detailed UML.png, 
> Proposed-Implementation.png, image-2019-05-30-17-38-29-722.png, 
> image-2019-05-30-17-39-47-225.png, image-2019-05-30-17-41-07-980.png, 
> image-2019-05-30-17-41-50-998.png
>
>   Original Estimate: 840h
>  Remaining Estimate: 840h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
