[ https://issues.apache.org/jira/browse/FLINK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647631#comment-15647631 ]

ASF GitHub Bot commented on FLINK-4613:
---------------------------------------

Github user jfeher commented on the issue:

    https://github.com/apache/flink/pull/2542
  
    Hi, we have measured the training time of ALS and iALS on the given 
dataset.
    After filtering the data to unique user-item pairs we got approximately 64 
million ratings.
    
    We measured on a four-node cluster running on YARN. Each node had 
16 GB of memory. 
    The taskmanagers got 12 GB and the jobmanager got 2 GB.
    We had four taskmanagers, one for each node.
    After some testing it looked like block numbers between 100 and 1500 
perform best,
    and between 100 and 300 the running times were consistently low.
    
    **For iALS we got the following measurements:**
    
    The average time with 1 iteration for block numbers between 100 and 1500: 
2000.33 s
    
    The average time with 1 iteration for block numbers between 100 and 300: 
1729.44 s
    
    More detailed results by block sizes on the diagram: 
http://imgur.com/LjJavti
    
    **For ALS with the same configuration we got the following measurements:**
    
    The average time with 1 iteration for block numbers between 100 and 1500: 
1694.04 s
    
    The average time with 1 iteration for block numbers between 100 and 300: 
1465.77 s
    
    So the iALS version was about 300 s slower than ALS on this data.
    
    When we increased the iteration count to 10, the time difference stayed 
under 1000 s, which is less than ten times 300 s.
    This is because the fixed cost of the whole training run is large.
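    A simple linear cost model makes this observation concrete: if the total time is a one-off fixed cost plus a per-iteration cost, the ~300 s gap at 1 iteration growing to less than 1000 s at 10 iterations means most of the gap is fixed overhead, not per-iteration work. The decomposition below (225 s fixed + 75 s per iteration) is purely illustrative, not a measured breakdown:

```python
# Illustrative linear cost model: extra_time(n) = fixed_extra + per_iter_extra * n.
# The split 225 s / 75 s is a hypothetical decomposition consistent with the
# reported gaps (~300 s at 1 iteration, < 1000 s at 10 iterations).

def gap(fixed_extra, per_iter_extra, iterations):
    """Extra running time of iALS over ALS under the linear model, in seconds."""
    return fixed_extra + per_iter_extra * iterations

assert gap(225, 75, 1) == 300     # matches the ~300 s gap at 1 iteration
assert gap(225, 75, 10) < 1000    # 975 s, under the observed bound at 10 iterations
```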



> Extend ALS to handle implicit feedback datasets
> -----------------------------------------------
>
>                 Key: FLINK-4613
>                 URL: https://issues.apache.org/jira/browse/FLINK-4613
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Gábor Hermann
>            Assignee: Gábor Hermann
>
> The Alternating Least Squares implementation should be extended to handle 
> _implicit feedback_ datasets. These datasets do not contain explicit ratings 
> by users, they are rather built by collecting user behavior (e.g. user 
> listened to artist X for Y minutes), and they require a slightly different 
> optimization objective. See details by [Hu et 
> al|http://dx.doi.org/10.1109/ICDM.2008.22].
> We do not need to modify much in the original ALS algorithm. See [Spark ALS 
> implementation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala],
>  which could be a basis for this extension. Only the updating factor part is 
> modified, and most of the changes are in the local parts of the algorithm 
> (i.e. UDFs). In fact, the only modification that is not local is 
> precomputing the matrix product Y^T * Y and broadcasting it to all the 
> nodes, which we can do with broadcast DataSets. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
