Daniel Li created SPARK-20468: --------------------------------- Summary: Refactor the ALS code Key: SPARK-20468 URL: https://issues.apache.org/jira/browse/SPARK-20468 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Daniel Li Priority: Minor
The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all in one file. Here are some things I think could improve the clarity and maintainability of the code: * The file can be split into more manageable parts. In particular, {{ALS}}, {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files for better readability. * Certain parts can be encapsulated or moved to clarify the intent. For example: ** The {{ALS.train}} method is currently defined in the {{ALS}} companion object, and it seems to take 12 individual parameters that are all members of the {{ALS}} class. This method can be made an instance method. ** The code that creates in-blocks and out-blocks in the body of {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods in the {{ALS}} companion object, can be moved into a separate case class that holds the blocks. This has the added benefit of allowing us to write specific Scaladoc to explain the logic behind these block objects, as their usage is certainly nontrivial yet is fundamental to the implementation. ** The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be hidden within {{UncompressedInBlock}} to clarify the scope of their usage. ** Various builder classes could be encapsulated in the companion objects of the classes they build. * The code can be formatted more clearly. For example: ** Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be formatted more clearly and have comments added explaining the reasoning behind key parts. That these methods form the core of the ALS logic makes this doubly important for maintainability. ** Parts of the code that use {{while}} loops with manually incremented counters can be rewritten as {{for}} loops. ** Where non-idiomatic Scala code is used that doesn't improve performance much, clearer code can be substituted. (This in particular should be done very carefully if at all, as it's apparent the original author spent much time and pains optimizing the code to significantly improve its runtime profile.) * The documentation (both Scaladocs and inline comments) can be clarified where needed and expanded where incomplete. This is especially important for parts of the code that are written imperatively for performance, as these parts don't benefit from the intuitive self-documentation of Scala's higher-level language abstractions. Specifically, I'd like to add documentation fully explaining the key functionality of the in-block and out-block objects, their purpose, how they relate to the overall ALS algorithm, and how they are calculated in such a way that new maintainers can ramp up much more quickly. The above is not a complete enumeration of improvements but a high-level survey. All of these improvements will, I believe, add up to make the code easier to understand, extend, and maintain. This issue will track the progress of this refactoring so that going forward, authors will have an easier time maintaining this part of the project. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org