GitHub user danielyli opened a pull request: https://github.com/apache/spark/pull/17767
Als refactor ## What changes were proposed in this pull request? This is a non-feature-changing refactoring of the ALS code (specifically, the `org.apache.spark.ml.recommendation` package), done to improve code maintainability and to add significant documentation to the existing code. My motivation for this PR is that I've been working on an online streaming ALS implementation [[SPARK-6407](https://issues.apache.org/jira/browse/SPARK-6407)] (PR coming soon), and I've been refactoring the package to help me understand the existing code before adding to it. I've also tried my best to include a fair bit of Scaladocs and inline comments where I felt they would have helped when I was reading the code. I've done a fair bit of rebasing and sausage making to make the commits easy to follow, since no one likes to stare at a 2,700-line PR. Please let me know if I can make anything clearer. I'd be happy to answer any questions. In a few places, you'll find a `PLEASE_ADVISE(danielyli):` tag in the code. These are questions I had in the course of the refactoring. I'd appreciate if the relevant folks could help me with these. Thanks. ## How was this patch tested? As this is a non-feature-changing refactoring, existing tests were used. All existing ALS tests pass. You can merge this pull request into a Git repository by running: $ git pull https://github.com/danielyli/spark als-refactor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17767.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17767 ---- commit deca4db3f234ea60c1494265d4f3ac9375869dd6 Author: Daniel Li <d...@danielyli.com> Date: 2017-04-05T21:55:21Z Split `ALS.scala` into multiple files This commit moves the classes `ALS` and `ALSModel` and the traits `ALSParams` and `ALSModelParams` into their own files. commit 4086bc9d0c7689e0d2047ac17ada29fe236eb6e6 Author: Daniel Li <d...@danielyli.com> Date: 2017-04-05T22:20:32Z Move solver classes into their own file This commit puts the classes `LeastSquaresNESolver`, `CholeskySolver`, `NNLSSolver`, and `NormalEquation` into a mixin in a separate file in order to reduce the size and improve the readability of `ALS.scala`. commit 8aaa533df6f3c9a4b4e8c5d5023f831daf06fa9e Author: Daniel Li <d...@danielyli.com> Date: 2017-04-05T22:30:38Z Minor cleanup of imports * import java.util.Arrays * import scala.collection.mutable.ArrayBuilder commit b68680025e71ebd422087ef95d5ecb7af40fa26d Author: Daniel Li <d...@danielyli.com> Date: 2017-04-05T22:48:50Z Create a package object to hold small type and class definitions commit 83f849ee45fd7c80a1a50fcf12da1eb99d8b6346 Author: Daniel Li <d...@danielyli.com> Date: 2017-04-06T02:17:14Z Refactor `RatingBlock`-related code This commit moves the following classes and methods into new files, separating and encapsulating them as appropriate: * RatingBlock * RatingBlockBuilder * UncompressedInBlock * UncompressedInBlockBuilder * KeyWrapper * UncompressedInBlockSort * LocalIndexEncoder * partitionRatings * makeBlocks In the course of this refactoring we create a new class, `RatingBlocks`, to hold the user/item in/out block data and associated logic. commit 819a00f7fe7384e588ce78cb65e9413ac6588401 Author: Daniel Li <d...@danielyli.com> Date: 2017-04-06T07:08:43Z Pull out `RatingBlock` from `RatingBlocks` into its own file This commit puts the `RatingBlock` class into a mixin for the `RatingBlocks` companion object to extend. This is done purely to increase readability by reducing the file size of `RatingBlocks.scala`. commit 56d10ba1fa627f343e67525e2a3b08e7287bfe2f Author: Daniel Li <d...@danielyli.com> Date: 2017-04-06T08:50:06Z Tighten access modifiers where appropriate and make case classes `final` commit b861d18784ba4ce688d3eacaea10169c9ce2d091 Author: Daniel Li <d...@danielyli.com> Date: 2017-04-06T09:38:54Z Improve code hygiene of `RatingBlocks` Among other things, `while` loops that used manually incremented counters have been changed to `for` loops to increase readability. Performance should be nominally affected. commit 5dfee79a1280d0a72bbe7b8596cdf86654fa0fbc Author: Daniel Li <d...@danielyli.com> Date: 2017-04-06T09:57:11Z Spruce up `ALS#fit` This commit adds vertical whitespace to improve readability. commit 056d6d0ecc962f94c83f43a6384607bf8833d083 Author: Daniel Li <d...@danielyli.com> Date: 2017-04-25T23:31:54Z Mark `RatingBlocks` constructor as `private` commit 34df11247ec3fdcf29e220bedcdf28b58d1ac4ec Author: Daniel Li <d...@danielyli.com> Date: 2017-04-25T23:32:44Z Add Scaladocs to `RatingBlocks.scala` commit 31b0dcd843d8edc16c6d2bf982d3de753b6dc066 Author: Daniel Li <d...@danielyli.com> Date: 2017-04-06T09:59:24Z Spruce up `ALS.train` and `ALS.initFactors` * Add descriptive Scaladoc giving high-level overview of `ALS.train` algorithm * Refactor out duplicated code into variables * Add comments to clarify code where appropriate * Change `while` loops that used manually incremented counters to `for` loops to increase readability. Performance should be nominally affected. * Clarify names of variables where appropriate * Add vertical whitespace to improve readability commit 42243fef420de6d7db28194f0dc804284182dc64 Author: Daniel Li <d...@danielyli.com> Date: 2017-04-06T10:11:46Z Spruce up `ALS.computeFactors` * Refactor out duplicated code into variables * Change `while` loops that used manually incremented counters to `for` loops to increase readability. Performance should be nominally affected. * Clarify names of variables where appropriate * Add vertical whitespace to improve readability commit ac01f7c19513a35d838df1db11c91a0db6abb00f Author: Daniel Li <d...@danielyli.com> Date: 2017-04-25T08:02:40Z Update Scaladoc for `ALS` class ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org