[ https://issues.apache.org/jira/browse/STATISTICS-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Nguyen updated STATISTICS-11:
---------------------------------
    Description:
*LINK TO DEVELOPMENT BRANCH:*
*[https://github.com/BBenNguyenn/commons-statistics/tree/STATISTICS-8_Regression_Module/commons-statistics-regression]*

+*[GSoC][STATISTICS][Regression] Architecture Implementation Suggestions*+

Hello,

I have some broad ideas about how the regression module should be structured, as briefly outlined with UML diagrams in my proposal.

This is the current implementation inside commons-math-stat-regression:

[!https://media.discordapp.net/attachments/550006517747810324/578420187460796417/unknown.png?width=400&height=295!|https://cdn.discordapp.com/attachments/550006517747810324/578420187460796417/unknown.png]

*{color:#ff0000}GILLES SADOWSKI:{color}*
{color:#ff0000}It seems there is/was an image here but I don't see it. For this kind of information, please use JIRA (and provide the link here).{color}

This is my proposed idea, whose structure was partly inspired by SuanShu, since it supports multiple types of regression (including logistic):
[https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear]

Disclaimer: I have only studied some econometrics and second-year computer science at university, so I have no professional data engineering experience, but I am excited to start learning with this project. I don’t yet know the exact needs of data engineers with regard to this module and am learning as I go, which is why I would very much appreciate any input on the kinds of requirements data engineers would have for this regression module.
*{color:#ff0000}GILLES SADOWSKI:{color}*
{color:#ff0000}Basing a design on use-cases is very useful. You should collect a range of them (small/large datasets, in-memory/stream, dense/sparse) in order to figure out what parts of the code can be common and what requires specialization.{color}

From someone who has used the current implementation or will use this new implementation:
* What would make your life easier?
* What should definitely be kept?
* What should be added/improved?
* Any specific features or design criteria?
* Any changes or radically different approaches to the following idea?

*{color:#ff0000}GILLES SADOWSKI:{color}*
{color:#ff0000}Good questions! What are your answers? ;){color}

Note: OLS, GLS and logistic regression are the first to be implemented, with a focus on architectural support for further additions. Changes will make use of new Java 8 features, specifically the Java Streams API, to improve performance and readability.

*{color:#ff0000}GILLES SADOWSKI:{color}*
{color:#ff0000}+1{color}
{color:#ff0000}I'd suggest selecting one and starting to code, without fearing that you'll probably have to change a lot of it as more use-cases are collected.{color}

[!https://media.discordapp.net/attachments/550006517747810324/578420230850740225/unknown.png?width=219&height=300!|https://cdn.discordapp.com/attachments/550006517747810324/578420230850740225/unknown.png]

*+Updates to this proposed implementation UML in my proposal:+*
* “statistics-regression-reqLinearMath” will be replaced with EJML, as suggested by Mr. Eric Barnhill
* This will include a custom matrix class extended from EJML’s SimpleBase -> StatisticsMatrix
* So if we decide to use an Apache Commons implementation of matrices later on, only this class should need to change internally.
{color:#ff0000}*GILLES SADOWSKI:*{color}
{color:#ff0000}Good precaution; but I doubt that we can include everything in a single class.{color}
{color:#ff0000}How to best encapsulate the linear algebra (external) library is a subject on its own, worth its own thread: cramming many questions into a single post makes it likely that some will be missed by some people who might later on question the chosen path. [External dependencies are a sensitive issue in Commons...]{color}

{color:#ff0000}Also, I'd remind you that we need to take into account the comparative benchmarks which I posted recently. [Even if just to conclude that EJML has overwhelming advantages (which?) that make it more suitable than its "competitors".]{color}

* Abstract classes should have interfaces above them, or perhaps just be interfaces if a simpler approach is implemented (i.e. minimal OOP)

*+Notes about this proposed implementation:+*
* AbstractVariables and its child classes may not be necessary, i.e. just Estimators and Residuals classes
* Or perhaps it’s best to follow the current implementation’s example and have a single class per regression type for hierarchy simplicity (but risking redundancies)?
* I have not looked into specific data members or individual methods yet. So far I am just taking notes from the current implementation and SuanShu
* The “statistics-regression-updating” components have quite complex algorithms which will require a lot of time for me to understand completely
* So for now, I see myself making minimal changes to them, prioritizing the new “stored” components.
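
The encapsulation idea discussed above (a project-owned matrix type standing in front of the external linear algebra library) could be sketched roughly as follows. This is only an illustration under assumptions: here StatisticsMatrix is shown as an interface rather than a class extended from SimpleBase, its methods and the array-backed ArrayStatisticsMatrix are hypothetical, and no EJML call is used. A library-backed implementation would be a second class behind the same interface.

```java
/** Hypothetical facade: regression code would depend only on this interface. */
interface StatisticsMatrix {
    int rows();
    int cols();
    double get(int row, int col);
    StatisticsMatrix transpose();
    StatisticsMatrix multiply(StatisticsMatrix other);
}

/** Illustrative plain-array backend (assumes rectangular input). */
final class ArrayStatisticsMatrix implements StatisticsMatrix {
    private final double[][] data;

    ArrayStatisticsMatrix(double[][] values) {
        // Defensive copy so callers cannot mutate the matrix afterwards.
        data = new double[values.length][];
        for (int i = 0; i < values.length; i++) {
            data[i] = values[i].clone();
        }
    }

    @Override
    public int rows() {
        return data.length;
    }

    @Override
    public int cols() {
        return data.length == 0 ? 0 : data[0].length;
    }

    @Override
    public double get(int row, int col) {
        return data[row][col];
    }

    @Override
    public StatisticsMatrix transpose() {
        double[][] t = new double[cols()][rows()];
        for (int i = 0; i < rows(); i++) {
            for (int j = 0; j < cols(); j++) {
                t[j][i] = data[i][j];
            }
        }
        return new ArrayStatisticsMatrix(t);
    }

    @Override
    public StatisticsMatrix multiply(StatisticsMatrix other) {
        if (cols() != other.rows()) {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        double[][] p = new double[rows()][other.cols()];
        for (int i = 0; i < rows(); i++) {
            for (int j = 0; j < other.cols(); j++) {
                double sum = 0;
                for (int k = 0; k < cols(); k++) {
                    sum += data[i][k] * other.get(k, j);
                }
                p[i][j] = sum;
            }
        }
        return new ArrayStatisticsMatrix(p);
    }
}
```

With this shape, an expression such as x.transpose().multiply(x) in an estimator never names the backing library, so swapping the backend would touch only the implementing class. Whether a single type can really cover everything needed is the open question raised in the reply above.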
{color:#ff0000}*GILLES SADOWSKI:*{color}
{color:#ff0000}IMHO, this will be better discussed once an initial implementation is shown (or perhaps, as Eric suggested, with unit tests).{color}
{color:#ff0000}Again, better to start a new thread for each specific question, possibly backed with a new JIRA report focussed on a particular task (see "Create sub-tasks" on JIRA).{color}

* RegressionDataLoader’s purpose is to:
* provide a clean input interface
* and to ensure that data from, say, double[ ][ ] is only converted to working form as a StatisticsMatrix object once

{color:#ff0000}*GILLES SADOWSKI:*{color}
{color:#ff0000}Until proven wrong, I'm a proponent of separating I/O from "useful" computations.{color}
{color:#ff0000}I.e. I suggest that we consider on the one hand what API is required for all the intended functionalities, and on the other (in a *different* "maven module"), all the conversions that may be implemented for the convenience of users.{color}

* while allowing multiple types of regression to be calculated via a universal form...
* which could become a challenge once details are in order.

So this is the current state of my plan. With your input, I will move on to the next steps, plan more details and start creating the software flowchart.
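
The split suggested above, with conversion in one place and computation in another, could look roughly like the sketch below. Everything in it is an assumption made for illustration: RegressionData, ArrayDataLoader and SimpleOls are hypothetical names, the working form is a pair of arrays rather than a StatisticsMatrix, and the estimator covers only ordinary least squares with one regressor and an intercept, using IntStream for the sums.

```java
import java.util.stream.IntStream;

/** Hypothetical immutable working form, built once and shared by estimators. */
final class RegressionData {
    private final double[] xs;
    private final double[] ys;

    RegressionData(double[] xs, double[] ys) {
        if (xs.length != ys.length || xs.length < 2) {
            throw new IllegalArgumentException("Need two or more paired observations");
        }
        this.xs = xs.clone();
        this.ys = ys.clone();
    }

    int size() {
        return xs.length;
    }

    double x(int i) {
        return xs[i];
    }

    double y(int i) {
        return ys[i];
    }
}

/** Conversion side: turns rows of {x, y} into the working form. */
final class ArrayDataLoader {
    private ArrayDataLoader() {}

    static RegressionData fromRows(double[][] rows) {
        double[] xs = new double[rows.length];
        double[] ys = new double[rows.length];
        for (int i = 0; i < rows.length; i++) {
            xs[i] = rows[i][0];
            ys[i] = rows[i][1];
        }
        return new RegressionData(xs, ys);
    }
}

/** Computation side: least squares fit of y = intercept + slope * x. */
final class SimpleOls {
    private SimpleOls() {}

    /** Returns {intercept, slope}. */
    static double[] fit(RegressionData d) {
        int n = d.size();
        double meanX = IntStream.range(0, n).mapToDouble(d::x).average().orElse(Double.NaN);
        double meanY = IntStream.range(0, n).mapToDouble(d::y).average().orElse(Double.NaN);
        // Sums of centered cross-products and squares.
        double sxy = IntStream.range(0, n)
                .mapToDouble(i -> (d.x(i) - meanX) * (d.y(i) - meanY))
                .sum();
        double sxx = IntStream.range(0, n)
                .mapToDouble(i -> (d.x(i) - meanX) * (d.x(i) - meanX))
                .sum();
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }
}
```

A call such as SimpleOls.fit(ArrayDataLoader.fromRows(rows)) converts the input exactly once. Loaders for other sources could live in a separate module without the estimator changing.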
Thank you in advance for any advice/suggestions,

{color:#ff0000}*GILLES SADOWSKI:*{color}
{color:#ff0000}To summarize, my main suggestion is to split this post in more manageable chunks.{color}
{color:#ff0000}Regards,{color}
{color:#ff0000}Gilles{color}

-Ben Nguyen

> OVERALL-TASK (not yet split): Designing Robust Class Structure and Architecture
> -------------------------------------------------------------------------------
>
>                 Key: STATISTICS-11
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-11
>             Project: Apache Commons Statistics
>          Issue Type: Sub-task
>            Reporter: Ben Nguyen
>            Priority: Major
>         Attachments: Current-Implementation.png, Proposed Detailed UML.png, Proposed-Implementation.png, image-2019-05-30-17-38-29-722.png, image-2019-05-30-17-39-47-225.png, image-2019-05-30-17-41-07-980.png, image-2019-05-30-17-41-50-998.png
>
>   Original Estimate: 840h
>  Remaining Estimate: 840h

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)