Copying the first part from the scaladoc:
"
This is a blocked implementation of the ALS factorization algorithm that
groups the two sets of factors (referred to as "users" and "products") into
blocks and reduces communication by only sending one copy of each user
vector to each product block on each iteration, and only for the product
blocks that need that user's feature vector. This is achieved by
pre-computing some information about the ratings matrix to determine the
"out-links" of each user (which blocks of products it will contribute to)
and "in-link" information for each product (which of the feature vectors it
receives from each user block it will depend on). This allows us to send
only an array of feature vectors between each user block and product block,
and have the product block find the users' ratings and update the products
based on these messages.
"

Basically, the number of blocks can be tweaked to reduce shuffling, making
the application more efficient in both run time and disk space usage. For
example, if you have a high number of product ratings per user ratio (1
user rating 100s of products), you may choose a smaller product block
number so that that user's ratings get sent to a fewer number of
partitions, which would lead to less shuffling.

However, if you choose your number of blocks to be low, then you may run
into OOMEs.

Hope that helps.

Burak

On Thu, Dec 17, 2015 at 3:17 AM, Roberto Pagliari <roberto.pagli...@asos.com
> wrote:

> What is the meaning of the ‘blocks’ input argument in mllib ALS
> implementation, and how does that relate to the number of executors and/or
> size of the input data?
>
>
> Thank you,
>
>

Reply via email to