[ https://issues.apache.org/jira/browse/STATISTICS-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852332#comment-16852332 ]
Ben Nguyen commented on STATISTICS-11:
--------------------------------------

It's 3 am here, can't sleep, forgive me if this is a bad idea/if I fail to see something, but since there will only be one RegressionRawData container class, which holds just X, Y, hasIntercept, and maybe weights as well, shouldn't it just be a class rather than an interface? Of course, if there were multiple variations of RawData, then an interface would be needed, but in this case, I don't see the need... unless we want to strictly define something and have a single concrete class implementing it?

Thank you for your input.
-Ben

> OVERALL-TASK (not yet split): Designing Robust Class Structure and
> Architecture
> -------------------------------------------------------------------------------
>
> Key: STATISTICS-11
> URL: https://issues.apache.org/jira/browse/STATISTICS-11
> Project: Apache Commons Statistics
> Issue Type: Sub-task
> Reporter: Ben Nguyen
> Priority: Major
> Attachments: Current-Implementation.png, Proposed Detailed UML.png,
> Proposed-Implementation.png
>
> Original Estimate: 840h
> Remaining Estimate: 840h
>
> +*[GSoC][STATISTICS][Regression] Architecture Implementation Suggestions*+
> Hello,
> I have some broad general ideas about how the regression module should be
> structured, as outlined briefly in my proposal with UMLs.
> This is the current implementation inside commons-math-stat-regression:
>
> [!https://media.discordapp.net/attachments/550006517747810324/578420187460796417/unknown.png?width=400&height=295!|https://cdn.discordapp.com/attachments/550006517747810324/578420187460796417/unknown.png]
>
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}It seems there is/was an image here but I don't see it.{color}
> {color:#FF0000}For this kind of information, please use JIRA (and provide the
> link here).{color}
> This is my proposed idea, where the structure was partly inspired by SuanShu
> since it supported multiple types of regression (including logistic):
> [https://github.com/aaiyer/SuanShu/tree/master/src/main/java/com/numericalmethod/suanshu/stats/regression/linear]
>
> Disclaimer: I have only studied some econometrics and second year computer
> science in university, so I have zero professional data engineering
> experience, but am excited to start learning with this project. So, I don’t
> currently know the exact needs of data engineers with regard to this module
> and am learning as I go, which is why I would very much appreciate any input
> on the kinds of requirements data engineers would want from this regression
> module.
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}Basing a design on use-cases is very useful.{color}
> {color:#FF0000}You should collect a range of them (small/large datasets,
> in-memory/stream,{color}
> {color:#FF0000}dense/sparse) in order to figure out what parts of the code can be
> common and{color}
> {color:#FF0000}what requires specialization.{color}
> From someone who has used the current implementation or will use this new
> implementation:
> * What would make your life easier?
> * What should definitely be kept?
> * What should be added/improved?
> * Any specific features or design criteria?
> * Any changes or radically different approaches to the following idea?
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}Good questions!{color}
> {color:#FF0000}What are your answers? ;-){color}
> Note: OLS, GLS and Logistic regression are the first to be implemented, with
> a focus on architectural support for further additions. Changes will make
> use of new Java 8 features, specifically the Java Streams API to improve
> performance and readability.
>
> *{color:#FF0000}GILLES SADOWSKI:{color}*
> {color:#FF0000}+1{color}
> {color:#FF0000}I'd suggest to select one and start coding, without fearing
> that you'll{color}
> {color:#FF0000}probably have to change a lot of it as more use-cases are
> collected.{color}
> [!https://media.discordapp.net/attachments/550006517747810324/578420230850740225/unknown.png?width=219&height=300!|https://cdn.discordapp.com/attachments/550006517747810324/578420230850740225/unknown.png]
> *+Updates to this proposed implementation UML in my proposal:+*
> * “statistics-regression-reqLinearMath” will be replaced with EJML as
> suggested by Mr. Eric Barnhill
> * This will include a custom matrix class extended from EJML’s SimpleBase ->
> StatisticsMatrix
> * So if we decide to use an Apache Commons implementation of matrices later
> on, only this class should be changed internally.
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}Good precaution; but I doubt that we can include everything in
> a{color}
> {color:#FF0000}single class.{color}
> {color:#FF0000}How to best encapsulate the linear algebra (external) library
> is a{color}
> {color:#FF0000}subject on its own, worth its own thread: Cramming many
> questions{color}
> {color:#FF0000}in a single post makes it likely that some will be missed by
> some{color}
> {color:#FF0000}people who might later on question the chosen path.
> [External{color}
> {color:#FF0000}dependencies is a sensitive issue, in Commons...]{color}
> {color:#FF0000} {color}
> {color:#FF0000}Also, I remind that we need to take into account the
> comparative{color}
> {color:#FF0000}benchmarks which I posted recently. [Even if just to conclude
> that{color}
> {color:#FF0000}EJML has overwhelming advantages (which?)
that make it
> more{color}
> {color:#FF0000}suitable than its "competitors".]{color}
> * Abstract classes should have interfaces above them or perhaps just be
> interfaces if a simpler approach is implemented (i.e. minimal OOP)
> *+Notes about this proposed implementation:+*
> * AbstractVariables and its child classes may not be necessary, i.e. just
> Estimators and Residuals classes
> * Or perhaps it’s best to follow the current implementation’s example and
> have a single class per regression type for hierarchy simplicity (but risking
> redundancies)?
> * I have not looked into specific data members or individual methods yet. So
> far just taking notes from the current implementation and SuanShu
> * The “statistics-regression-updating” components have quite complex
> algorithms which will require a lot of time for me to understand completely
> * So for now, I see myself making minimal changes to them, prioritizing the
> new “stored” components.
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}IMHO, this will be better discussed once an initial
> implementation is shown{color}
> {color:#FF0000}(or perhaps, as Eric suggested, with unit tests).{color}
> {color:#FF0000}Again, better to start a new thread for each specific
> question, possibly backed{color}
> {color:#FF0000}with a new JIRA report focussed on a particular task (see
> "Create sub-tasks"{color}
> {color:#FF0000}on JIRA).{color}
> * RegressionDataLoader’s purpose is to:
> * provide a clean input interface
> * and to ensure that data from say double[][] is only converted to working
> form as a StatisticsMatrix object once
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}Until proven wrong, I'm a proponent of separating I/O from
> "useful"{color}
> {color:#FF0000}computations.{color}
> {color:#FF0000}I.e.
I suggest that we consider on the one hand what API is
> required for all the{color}
> {color:#FF0000}intended functionalities, and on the other (in a *different*
> "maven{color}
> {color:#FF0000}module"), all the{color}
> {color:#FF0000}conversions that may be implemented for the convenience of
> users.{color}
> * while allowing multiple types of regression to be calculated via a
> universal form…
> * which could become a challenge once details are in order.
>
> So this is the current state of my plan; with your input, I will move to the
> next steps, plan more details and start creating the software flowchart.
>
> Thank you in advance for any advice/suggestions,
> {color:#FF0000} *GILLES SADOWSKI:*{color}
> {color:#FF0000}To summarize, my main suggestion is to split this post into
> more{color}
> {color:#FF0000}manageable chunks.{color}
> {color:#FF0000}Regards,{color}
> {color:#FF0000}Gilles{color}
> -Ben Nguyen

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
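
For illustration only, the single concrete container raised in the comment at the top of this message could look roughly like the sketch below. The fields X, Y, hasIntercept, and optional weights come from that comment; the accessor names, the validation, and the defensive copies are assumptions made for this example and are not part of any existing Commons Statistics API.

```java
import java.util.Arrays;

/**
 * Hypothetical immutable container for regression input data.
 * A sketch only: names, validation, and copying policy are assumptions.
 */
final class RegressionRawData {
    private final double[][] x;         // predictors, one row per observation
    private final double[] y;           // responses, one entry per observation
    private final boolean hasIntercept; // whether the model includes an intercept term
    private final double[] weights;     // optional; null means unweighted

    public RegressionRawData(double[][] x, double[] y, boolean hasIntercept, double[] weights) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same number of observations");
        }
        if (weights != null && weights.length != y.length) {
            throw new IllegalArgumentException("weights must have one entry per observation");
        }
        // Defensive copies so later changes to the caller's arrays do not leak in.
        this.x = copy(x);
        this.y = y.clone();
        this.hasIntercept = hasIntercept;
        this.weights = weights == null ? null : weights.clone();
    }

    private static double[][] copy(double[][] m) {
        return Arrays.stream(m).map(double[]::clone).toArray(double[][]::new);
    }

    public double[][] getX() { return copy(x); }
    public double[] getY() { return y.clone(); }
    public boolean hasIntercept() { return hasIntercept; }
    public boolean isWeighted() { return weights != null; }
    public double[] getWeights() { return weights == null ? null : weights.clone(); }
    public int getObservationCount() { return y.length; }
}
```

With only one variant of the data holder, a final class like this keeps the hierarchy flat. If a second variant appeared later (for example a streaming one), an interface could be extracted from the accessors at that point.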