Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17793#discussion_r114893509
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -910,26 +944,143 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
       private type FactorBlock = Array[Array[Float]]
     
       /**
    -   * Out-link block that stores, for each dst (item/user) block, which src (user/item) factors to
    -   * send. For example, outLinkBlock(0) contains the local indices (not the original src IDs) of the
    -   * src factors in this block to send to dst block 0.
    +   * A mapping of the columns of the items factor matrix that are needed when calculating each row
    +   * of the users factor matrix, and vice versa.
    +   *
    +   * Specifically, when calculating a user factor vector, since only those columns of the items
    +   * factor matrix that correspond to the items that that user has rated are needed, we can avoid
    +   * having to repeatedly copy the entire items factor matrix to each worker later in the algorithm
    +   * by precomputing these dependencies for all users, storing them in an RDD of `OutBlock`s.  The
    +   * items' dependencies on the columns of the users factor matrix is computed similarly.
    +   *
    +   * =Example=
    +   *
    +   * Using the example provided in the `InBlock` Scaladoc, `userOutBlocks` would look like the
    +   * following:
    +   *
    +   * {{{ userOutBlocks.collect() == Seq(
    +   *       0 -> Array(Array(0, 1), Array(0, 1)),
    +   *       1 -> Array(Array(0), Array(0))) }}}
    +   *
    +   * The data structure encodes the following information:
    --- End diff --
    
    This is all correct, but I still found it really confusing. Personally, I think the
    following is clearer, but if you don't, feel free to leave it out.
    
    ````scala
      /**
       * Each user block contains a subset of users in a fixed, but typically random, order.
       *
       * User block 0  User block 1
       *  ________      _______
       * | user12 |    | user4 |
       * | user5  |    | user2 |
       * | user33 |    |       |
       * |________|    |_______|
       *
       * Out block 0                       Out block 1
       *
       * Array(                            Array(
       *   Array(0, 2), // item block 0     Array(0),    // item block 0 
       *   Array(1, 2), // item block 1     Array(0, 1), // item block 1 
       *   Array(1))    // item block 2     Array())     // item block 2
       *
       * For out blocks, the index in the outer array corresponds to the item block. So the first inner
       * array is item block 0, the second item block 1, and so on. The values in each array correspond
       * to the "local indices" of the user factors in this block that need to be shipped to that item
       * block. So for out block 0, we know that user factors at index 0 and 2 must be shipped to item
       * block 0. That means that the user factors for user12 and user33 need to go to item block 0.
       * And for out block 1, we know that user4 must go to item blocks 0 and 1 and user2 must go to
       * item block 1. None of the users in user block 1 need to go to item block 2.
       */
    ````
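
To make the lookup concrete, here is a minimal sketch (not code from this PR) of how the local indices in an out block resolve to users. `OutBlockSketch`, `usersToShip`, and the string user IDs are hypothetical names used only for illustration; the sketch assumes an out block is an `Array[Array[Int]]`, which is how I understand the `OutBlock` type in ALS.scala.

```scala
object OutBlockSketch {
  // Assumption: mirrors the shape of ALS's OutBlock (outer index = item block,
  // inner values = local indices of user factors within this user block).
  type OutBlock = Array[Array[Int]]

  /**
   * For each item block, resolve the local indices stored in `outBlock`
   * to the user IDs held at those positions in `userIds`.
   * Hypothetical helper, not part of ALS.scala.
   */
  def usersToShip(userIds: Array[String], outBlock: OutBlock): List[List[String]] =
    outBlock.toList.map(localIndices => localIndices.toList.map(i => userIds(i)))

  // The two user blocks from the diagram above, in their fixed local order.
  val userBlock0: Array[String] = Array("user12", "user5", "user33")
  val userBlock1: Array[String] = Array("user4", "user2")

  // The two out blocks from the diagram above.
  val outBlock0: OutBlock = Array(Array(0, 2), Array(1, 2), Array(1))
  val outBlock1: OutBlock = Array(Array(0), Array(0, 1), Array.empty[Int])
}
```

With these values, `usersToShip(userBlock0, outBlock0)` gives `List(List(user12, user33), List(user5, user33), List(user5))`, and the last entry of `usersToShip(userBlock1, outBlock1)` is empty because no user in user block 1 needs to go to item block 2.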

