[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-20468: ------------------------------------ Assignee: Apache Spark > Refactor the ALS code > --------------------- > > Key: SPARK-20468 > URL: https://issues.apache.org/jira/browse/SPARK-20468 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib > Affects Versions: 2.1.0 > Reporter: Daniel Li > Assignee: Apache Spark > Priority: Minor > Labels: documentation, readability, refactoring > > The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is > quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all > in one file. Here are some things I think could improve the clarity and > maintainability of the code: > * The file can be split into more manageable parts. In particular, {{ALS}}, > {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files > for better readability. > * Certain parts can be encapsulated or moved to clarify the intent. For > example: > ** The {{ALS.train}} method is currently defined in the {{ALS}} companion > object, and it seems to take 12 individual parameters that are all members of > the {{ALS}} class. This method can be made an instance method. > ** The code that creates in-blocks and out-blocks in the body of > {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods > in the {{ALS}} companion object, can be moved into a separate case class that > holds the blocks. This has the added benefit of allowing us to write > specific Scaladoc to explain the logic behind these block objects, as their > usage is certainly nontrivial yet is fundamental to the implementation. > ** The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be > hidden within {{UncompressedInBlock}} to clarify the scope of their usage. > ** Various builder classes could be encapsulated in the companion objects of > the classes they build. > * The code can be formatted more clearly. For example: > ** Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be > formatted more clearly and have comments added explaining the reasoning > behind key parts. That these methods form the core of the ALS logic makes > this doubly important for maintainability. > ** Parts of the code that use {{while}} loops with manually incremented > counters can be rewritten as {{for}} loops. > ** Where non-idiomatic Scala code is used that doesn't improve performance > much, clearer code can be substituted. (This in particular should be done > very carefully if at all, as it's apparent the original author spent much > time and pains optimizing the code to significantly improve its runtime > profile.) > * The documentation (both Scaladocs and inline comments) can be clarified > where needed and expanded where incomplete. This is especially important for > parts of the code that are written imperatively for performance, as these > parts don't benefit from the intuitive self-documentation of Scala's > higher-level language abstractions. Specifically, I'd like to add > documentation fully explaining the key functionality of the in-block and > out-block objects, their purpose, how they relate to the overall ALS > algorithm, and how they are calculated in such a way that new maintainers can > ramp up much more quickly. > The above is not a complete enumeration of improvements but a high-level > survey. All of these improvements will, I believe, add up to make the code > easier to understand, extend, and maintain. This issue will track the > progress of this refactoring so that going forward, authors will have an > easier time maintaining this part of the project. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org