[ https://issues.apache.org/jira/browse/STATISTICS-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849166#comment-16849166 ]
Alex D Herbert commented on STATISTICS-11:
------------------------------------------

My main point was that once the data is loaded from somewhere, the number of methods required by all the downstream regression classes is small. The details of the concrete class are not relevant to the downstream classes, hence the interface to specify the data that is available.

Separating the builder class from an object that holds the data would decouple functionality and follow a more standard design pattern. The data class could be immutable.

The design requires more familiarity with the algorithms than is captured in your UML (and I do not know the details). For instance, what is different about the input for OLSRegressionData and LogisticRegressionData?

> OVERALL-TASK (not yet split): Designing Robust Class Structure and
> Architecture
> -------------------------------------------------------------------------------
>
> Key: STATISTICS-11
> URL: https://issues.apache.org/jira/browse/STATISTICS-11
> Project: Apache Commons Statistics
> Issue Type: Sub-task
> Reporter: Ben Nguyen
> Priority: Major
> Attachments: Current-Implementation.png, Proposed Detailed UML.png,
> Proposed-Implementation.png
>
> Original Estimate: 840h
> Remaining Estimate: 840h
>
> +*[GSoC][STATISTICS][Regression] Architecture Implementation Suggestions*+
> Hello,
> I have some broad general ideas about how the regression module should be
> structured, as outlined in my proposal briefly with UMLs
> This is the current implementation inside commons-math-stat-regression:
>
> [!https://media.discordapp.net/attachments/550006517747810324/578420187460796417/unknown.png?width=400&height=295!|https://cdn.discordapp.com/attachments/550006517747810324/578420187460796417/unknown.png]
>
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}It seems there is/was an image here but I don't see it.{color}
> {color:#FF0000}For this kind of information, please use JIRA (and provide the
> link here).{color}
> This is
my proposed idea, where the structure was partly inspired by SuanShu
> since it supported multiple types of regression (including logistic):
> [https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear]
>
> Disclaimer: I have only studied some econometrics and second year computer
> science in university, so I have zero professional data engineering
> experience, but am excited to start learning with this project. So, I don’t
> currently know the exact needs of data engineers with regard to this module
> and am learning as I go, which is why I would very much appreciate any input
> on the kinds of requirements data engineers would want from this regression
> module.
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}Basing a design on use-cases is very useful.{color}
> {color:#FF0000}You should collect a range of them (small/large datasets,
> in-memory/stream,{color}
> {color:#FF0000}dense/sparse) in order to figure out what parts of the code can be
> common and{color}
> {color:#FF0000}what requires specialization.{color}
> From someone who has used the current implementation or will use this new
> implementation:
> * What would make your life easier?
> * What should definitely be kept?
> * What should be added/improved?
> * Any specific features or design criteria?
> * Any changes or radically different approaches to the following idea?
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}Good questions!{color}
> {color:#FF0000}What are your answers? ;-){color}
> Note: OLS, GLS and Logistic regression are the first to be implemented, with
> a focus on architectural support for further additions. Changes will make
> use of new Java 8 features, specifically the Java Streams API, to improve
> performance and readability.
>
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}+1{color}
> {color:#FF0000}I'd suggest selecting one and starting to code, without fearing
> that you'll{color}
> {color:#FF0000}probably have to change a lot of it as more use-cases are
> collected.{color}
> [!https://media.discordapp.net/attachments/550006517747810324/578420230850740225/unknown.png?width=219&height=300!|https://cdn.discordapp.com/attachments/550006517747810324/578420230850740225/unknown.png]
> *+Updates to this proposed implementation UML in my proposal:+*
> * “statistics-regression-reqLinearMath” will be replaced with EJML as
> suggested by Mr. Eric Barnhill
> * This will include a custom matrix class extended from EJML’s SimpleBase ->
> StatisticsMatrix
> * So if we decide to use an Apache Commons implementation of matrices later
> on, only this class should be changed internally.
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}Good precaution; but I doubt that we can include everything in
> a{color}
> {color:#FF0000}single class.{color}
> {color:#FF0000}How to best encapsulate the linear algebra (external) library
> is a{color}
> {color:#FF0000}subject on its own, worth its own thread: Cramming many
> questions{color}
> {color:#FF0000}in a single post makes it likely that some will be missed by
> some{color}
> {color:#FF0000}people who might later on question the chosen path.
> [External{color}
> {color:#FF0000}dependencies are a sensitive issue, in Commons...]{color}
> {color:#FF0000} {color}
> {color:#FF0000}Also, I'd remind you that we need to take into account the
> comparative{color}
> {color:#FF0000}benchmarks which I posted recently. [Even if just to conclude
> that{color}
> {color:#FF0000}EJML has overwhelming advantages (which?)
that make it
> more{color}
> {color:#FF0000}suitable than its "competitors".]{color}
> * Abstract classes should have interfaces above them or perhaps just be
> interfaces if a simpler approach is implemented (i.e. minimal OOP)
> *+Notes about this proposed implementation:+*
> * AbstractVariables and its child classes may not be necessary, i.e. just
> Estimators and Residuals classes
> * Or perhaps it’s best to follow the current implementation’s example and
> have a single class per regression type for hierarchy simplicity (but risking
> redundancies)?
> * I have not looked into specific data members or individual methods yet. So
> far just taking notes from the current implementation and SuanShu
> * The “statistics-regression-updating” components have quite complex
> algorithms which will require a lot of time for me to understand completely
> * So for now, I see myself making minimal changes to them, prioritizing the
> new “stored” components.
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}IMHO, this will be better discussed once an initial
> implementation is shown{color}
> {color:#FF0000}(or perhaps, as Eric suggested, with unit tests).{color}
> {color:#FF0000}Again, better to start a new thread for each specific
> question, possibly backed{color}
> {color:#FF0000}with a new JIRA report focussed on a particular task (see
> "Create sub-tasks"{color}
> {color:#FF0000}on JIRA).{color}
> * RegressionDataLoader’s purpose is to:
> * provide a clean input interface
> * and to ensure that data from, say, double[][] is only converted to working
> form as a StatisticsMatrix object once
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}Until proven wrong, I'm a proponent of separating I/O from
> "useful"{color}
> {color:#FF0000}computations.{color}
> {color:#FF0000}I.e.
I suggest that we consider on the one hand what API is
> required for all the{color}
> {color:#FF0000}intended functionalities, and on the other (in a *different*
> "maven{color}
> {color:#FF0000}module"), all the{color}
> {color:#FF0000}conversions that may be implemented for the convenience of
> users.{color}
> * while allowing multiple types of regression to be calculated via a
> universal form...
> * which could become a challenge once details are in order.
>
> So this is the current state of my plan. With your input, I will move to the
> next steps, plan more details and start creating the software flowchart.
>
> Thank you in advance for any advice/suggestions,
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}To summarize, my main suggestion is to split this post into
> more{color}
> {color:#FF0000}manageable chunks.{color}
> {color:#FF0000}Regards,{color}
> {color:#FF0000}Gilles{color}
> -Ben Nguyen

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
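
To make the suggestion in the comment at the top of this message a bit more concrete, here is a minimal Java sketch of a small read-only interface, an immutable data holder, and a separate builder. All names (RegressionData, ArrayRegressionData, RegressionDataBuilder, MeanResponse) and method signatures are hypothetical and are not taken from the UML or from any existing Commons code. It only illustrates the decoupling and does not say what the regression classes actually need.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical read-only view; downstream regression classes would depend only on this. */
interface RegressionData {
    int getNumberOfObservations();
    int getNumberOfRegressors();
    double getResponse(int row);
    double getRegressor(int row, int column);
}

/** Hypothetical immutable holder: arrays are copied on construction and never exposed. */
final class ArrayRegressionData implements RegressionData {
    private final double[] y;
    private final double[][] x;

    ArrayRegressionData(double[] y, double[][] x) {
        if (y.length != x.length) {
            throw new IllegalArgumentException("Row count mismatch");
        }
        this.y = y.clone();
        this.x = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            this.x[i] = x[i].clone();
        }
    }

    @Override public int getNumberOfObservations() { return y.length; }
    @Override public int getNumberOfRegressors() { return x.length == 0 ? 0 : x[0].length; }
    @Override public double getResponse(int row) { return y[row]; }
    @Override public double getRegressor(int row, int column) { return x[row][column]; }
}

/** Hypothetical mutable builder, kept separate from the data object it produces. */
final class RegressionDataBuilder {
    private final List<Double> responses = new ArrayList<>();
    private final List<double[]> rows = new ArrayList<>();

    RegressionDataBuilder addObservation(double response, double... regressors) {
        if (!rows.isEmpty() && rows.get(0).length != regressors.length) {
            throw new IllegalArgumentException("Inconsistent number of regressors");
        }
        responses.add(response);
        rows.add(regressors.clone());
        return this;
    }

    RegressionData build() {
        double[] y = new double[responses.size()];
        for (int i = 0; i < y.length; i++) {
            y[i] = responses.get(i);
        }
        return new ArrayRegressionData(y, rows.toArray(new double[0][]));
    }
}

/** Example downstream consumer: it sees the interface only, not the concrete class. */
final class MeanResponse {
    private MeanResponse() {}

    static double of(RegressionData data) {
        double sum = 0;
        for (int i = 0; i < data.getNumberOfObservations(); i++) {
            sum += data.getResponse(i);
        }
        return sum / data.getNumberOfObservations();
    }
}
```

With this split, a loader for files or streams would only have to drive the builder, and a different storage backend (for example one wrapping a matrix library) could implement the same interface without touching the consumers.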