Daniel Li created SPARK-20468:
---------------------------------

             Summary: Refactor the ALS code
                 Key: SPARK-20468
                 URL: https://issues.apache.org/jira/browse/SPARK-20468
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 2.1.0
            Reporter: Daniel Li
            Priority: Minor


The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all in 
one file.  Here are some things I think could improve the clarity and 
maintainability of the code:

*  The file can be split into more manageable parts.  In particular, {{ALS}}, 
{{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
for better readability.

*  Certain parts can be encapsulated or moved to clarify the intent.  For 
example:
**  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
object, and it seems to take 12 individual parameters that are all members of 
the {{ALS}} class.  This method can be made an instance method.
**  The code that creates in-blocks and out-blocks in the body of 
{{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
in the {{ALS}} companion object, can be moved into a separate case class that 
holds the blocks.  This has the added benefit of allowing us to write specific 
Scaladoc to explain the logic behind these block objects, as their usage is 
certainly nontrivial yet is fundamental to the implementation.
**  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be hidden 
within {{UncompressedInBlock}} to clarify the scope of their usage.
**  Various builder classes could be encapsulated in the companion objects of 
the classes they build.

*  The code can be formatted more clearly.  For example:
**  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
formatted more clearly and have comments added explaining the reasoning behind 
key parts.  That these methods form the core of the ALS logic makes this doubly 
important for maintainability.
**  Parts of the code that use {{while}} loops with manually incremented 
counters can be rewritten as {{for}} loops.
**  Where non-idiomatic Scala code is used that doesn't improve performance 
much, clearer code can be substituted.  (This in particular should be done very 
carefully if at all, as it's apparent the original author spent much time and 
pains optimizing the code to significantly improve its runtime profile.)

*  The documentation (both Scaladocs and inline comments) can be clarified 
where needed and expanded where incomplete.  This is especially important for 
parts of the code that are written imperatively for performance, as these parts 
don't benefit from the intuitive self-documentation of Scala's higher-level 
language abstractions.  Specifically, I'd like to add documentation fully 
explaining the key functionality of the in-block and out-block objects, their 
purpose, how they relate to the overall ALS algorithm, and how they are 
calculated in such a way that new maintainers can ramp up much more quickly.

The above is not a complete enumeration of improvements but a high-level 
survey.  All of these improvements will, I believe, add up to make the code 
easier to understand, extend, and maintain.  This issue will track the progress 
of this refactoring so that going forward, authors will have an easier time 
maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to