Github user danielyli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17793#discussion_r115114733
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -910,26 +944,143 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
       private type FactorBlock = Array[Array[Float]]
     
       /**
    -   * Out-link block that stores, for each dst (item/user) block, which src (user/item) factors to
    -   * send. For example, outLinkBlock(0) contains the local indices (not the original src IDs) of the
    -   * src factors in this block to send to dst block 0.
    +   * A mapping of the columns of the items factor matrix that are needed when calculating each row
    +   * of the users factor matrix, and vice versa.
    +   *
    +   * Specifically, when calculating a user factor vector, since only those columns of the items
    +   * factor matrix that correspond to the items that that user has rated are needed, we can avoid
    +   * having to repeatedly copy the entire items factor matrix to each worker later in the algorithm
    +   * by precomputing these dependencies for all users, storing them in an RDD of `OutBlock`s.  The
    +   * items' dependencies on the columns of the users factor matrix are computed similarly.
    +   *
    +   * =Example=
    +   *
    +   * Using the example provided in the `InBlock` Scaladoc, `userOutBlocks` would look like the
    +   * following:
    +   *
    +   * {{{ userOutBlocks.collect() == Seq(
    +   *       0 -> Array(Array(0, 1), Array(0, 1)),
    +   *       1 -> Array(Array(0), Array(0))) }}}
    +   *
    +   * The data structure encodes the following information:
    --- End diff --
    
    Updated, though I still don't like it very much.  Honestly, reading either 
of our versions would make my head spin if I weren't already acquainted with 
the encoding; I'd still have to dive into the actual code and work out an 
example for myself before I'd feel familiar with it.  Should we just leave it 
as-is?
    
    Alternatively, if you feel you can write it more clearly, please don't hesitate to change the PR directly.  (If you do update it, note that the user IDs are not random but are sorted in ascending order within each partition.)

