[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985888#comment-15985888
 ] 

Daniel Li commented on SPARK-20468:
-----------------------------------

Thanks for the guidance, [~srowen].  I've read the Contributing guide now; my 
apologies for not reading it through sooner.

I've closed the PR and created a few new Jira issues to propose documentation 
additions and internal code clarity improvements: SPARK-20484, SPARK-20485, and 
SPARK-20486.  How do you want to proceed?

> Refactor the ALS code
> ---------------------
>
>                 Key: SPARK-20468
>                 URL: https://issues.apache.org/jira/browse/SPARK-20468
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: Daniel Li
>            Priority: Minor
>              Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to