[ 
https://issues.apache.org/jira/browse/FLINK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516644#comment-15516644
 ] 

ASF GitHub Bot commented on FLINK-4613:
---------------------------------------

Github user gaborhermann commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    We have not yet measured performance against Spark or other 
implementations. Such a comparison would mostly reflect the performance of the 
Flink ALS implementation in general, as there is not much difference between 
the implicit and explicit implementations.
    
    Instead, we compared the implicit case with the explicit case in the Flink 
implementation on the same datasets, to make sure the implicit case does not 
decrease the performance significantly. (Of course, we expected the implicit 
case to be slower due to the extra precomputation and broadcasting of `Xt * X`.)
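
    The reason a single precomputed `Xt * X` is enough can be sketched in a few 
lines. In the formulation of Hu et al. (linked in the issue description), each 
observation `r` gets a confidence `c = 1 + alpha * r`, so the weighted product 
splits into the shared `Xt * X` plus a correction that only touches the items a 
user actually interacted with. The snippet below is a plain-Python illustration 
of that identity, not code from the PR; the toy data, the `ALPHA` value and all 
function names are assumptions made up for this example.

```python
# Hypothetical toy data: 4 items with 2 latent factors each (rows of X).
X = [[1.0, 0.0],
     [0.5, 2.0],
     [1.5, 1.0],
     [0.0, 3.0]]

ALPHA = 40.0  # confidence scaling; an assumed value for illustration


def gram(rows):
    """Return rows^T * rows as a k x k nested list."""
    k = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(k)]
            for i in range(k)]


def weighted_gram_direct(rows, conf):
    """Compute X^T * C * X naively, iterating over every item."""
    k = len(rows[0])
    return [[sum(c * r[i] * r[j] for r, c in zip(rows, conf))
             for j in range(k)]
            for i in range(k)]


def weighted_gram_with_precompute(rows, xtx, observed):
    """Compute X^T * C * X as X^T * X + X^T * (C - I) * X.

    Only the observed items contribute to the second term, because
    c - 1 = ALPHA * r is zero for every unobserved item.
    """
    k = len(xtx)
    out = [row[:] for row in xtx]  # start from the shared X^T * X
    for item, r in observed.items():
        extra = ALPHA * r  # c - 1, with c = 1 + ALPHA * r
        for i in range(k):
            for j in range(k):
                out[i][j] += extra * rows[item][i] * rows[item][j]
    return out


# One user who interacted with items 1 and 3 only.
observed = {1: 2.0, 3: 0.5}
conf = [1.0 + ALPHA * observed.get(i, 0.0) for i in range(len(X))]

xtx = gram(X)  # computed once per iteration, then shared with all workers
direct = weighted_gram_direct(X, conf)
fast = weighted_gram_with_precompute(X, xtx, observed)
```

    Both `direct` and `fast` hold the same matrix, but the second variant does 
per-user work proportional to the number of observed items only, at the cost of 
computing and distributing `xtx` once per iteration.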
    
    ```
    size     expl (ms)  impl (ms)
    100           8885       9196
    1000          7879      11282
    10000         8839       9220
    100000        7102      10998
    1000000       7543      10680
    ```
    
    The numbers in the left column indicate the size of the training set (I'm 
not sure about the unit, but @jfeher can tell more about it). The other two 
columns are the training times in milliseconds in the explicit and implicit 
cases, respectively. We did the measurements on a small cluster of 3 nodes.
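
    As a quick check of the table above, the relative slowdown of the implicit 
case can be computed per training set size. This is a small illustrative Python 
sketch, not part of the PR; the timings are copied from the table, and the 
variable names are made up here.

```python
# Training times in milliseconds, copied from the table above:
# training set size -> (explicit, implicit)
timings = {
    100: (8885, 9196),
    1000: (7879, 11282),
    10000: (8839, 9220),
    100000: (7102, 10998),
    1000000: (7543, 10680),
}

# Ratio of implicit to explicit training time for each size.
slowdowns = {size: impl / expl for size, (expl, impl) in timings.items()}

for size, ratio in slowdowns.items():
    print(f"{size:>8}  implicit/explicit = {ratio:.2f}")
```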
    
    It seems there is a large constant overhead, but the implicit case is not 
significantly slower.
    We could do further, more thorough measurements if needed, but maybe that 
would be another issue. Benchmarking more and optimizing both the original ALS 
algorithm and the specific `Xt * X` computation in the implicit case could be a 
separate PR.
    
    What are your thoughts on this?


> Extend ALS to handle implicit feedback datasets
> -----------------------------------------------
>
>                 Key: FLINK-4613
>                 URL: https://issues.apache.org/jira/browse/FLINK-4613
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Gábor Hermann
>            Assignee: Gábor Hermann
>
> The Alternating Least Squares implementation should be extended to handle 
> _implicit feedback_ datasets. These datasets do not contain explicit ratings 
> by users; rather, they are built by collecting user behavior (e.g. user 
> listened to artist X for Y minutes), and they require a slightly different 
> optimization objective. See [Hu et 
> al|http://dx.doi.org/10.1109/ICDM.2008.22] for details.
> We do not need to modify much in the original ALS algorithm. See [Spark ALS 
> implementation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala],
>  which could be a basis for this extension. Only the factor update step is 
> modified, and most of the changes are in the local parts of the algorithm 
> (i.e. UDFs). In fact, the only non-local modification is 
> precomputing the matrix product Y^T * Y and broadcasting it to all the nodes, 
> which we can do with broadcast DataSets. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
