[ 
https://issues.apache.org/jira/browse/SPARK-33487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-33487:
---------------------------------
    Priority: Minor  (was: Major)

> Let ML ALS recommend for BOTH subsets - users and items
> -------------------------------------------------------
>
>                 Key: SPARK-33487
>                 URL: https://issues.apache.org/jira/browse/SPARK-33487
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 3.0.1
>            Reporter: Rose Aysina
>            Priority: Minor
>
> Currently, ALS in Spark ML supports the following methods for getting recommendations:
>  * {{recommendForAllUsers(numItems: Int): DataFrame}}
>  * {{recommendForAllItems(numUsers: Int): DataFrame}}
>  * {{recommendForUserSubset(dataset: Dataset[_], numItems: Int): DataFrame}}
>  * {{recommendForItemSubset(dataset: Dataset[_], numUsers: Int): DataFrame}}
>  
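> For reference, a minimal sketch of how the one-sided subset API is used today 
> (illustrative only: it assumes a fitted {{ALSModel}} named {{model}} and a 
> DataFrame {{activeUsers}} holding the user IDs):
> {code:scala}
> // Illustrative: restricts WHICH users get recommendations, but the
> // candidate pool is still every item seen during training.
> import org.apache.spark.ml.recommendation.ALSModel
> import org.apache.spark.sql.DataFrame
>
> def topKForActiveUsers(model: ALSModel, activeUsers: DataFrame, k: Int): DataFrame =
>   model.recommendForUserSubset(activeUsers, k)
> {code}
>  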
> *Feature request:* add a method that recommends a subset of items for a subset 
> of users, i.e. both users and items are drawn from provided subsets. 
> *Why it is important:* in real-time recommender systems you usually predict 
> only for the currently active users (that is why we need a subset of users), 
> and you cannot simply recommend every item you have: only items that satisfy 
> some business filters are allowed (that is why we need a subset of items). 
> *For example:* consider a real-time news recommender system. Prediction is 
> done for a small subset of users (say, the visitors from the last minute), but 
> it is not allowed to recommend old news, news unrelated to the user's country, 
> etc., so at each prediction we have a whitelist of items.
> That is why it would be extremely useful to control *BOTH* which users 
> receive recommendations *AND* which items may appear in those recommendations. 
> *Related issues:* -SPARK-20679-, but it covers subsets of either users 
> *OR* items, not both at once. 
> *What we do now:* apply additional filtering after the 
> {{recommendForUserSubset}} call, but this approach has a significant cost: we 
> must request recommendations over all items, i.e. *{{numItems = # all available 
> items}}*, then filter, and only then select the top-k among them.
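> A minimal sketch of this workaround (names are illustrative; it assumes the 
> default column names "user", "item", and "rating", and that {{allowedItems}} 
> has a single column "item"):
> {code:scala}
> import org.apache.spark.ml.recommendation.ALSModel
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.{col, explode, row_number}
>
> def recommendWithItemFilter(model: ALSModel, users: DataFrame,
>                             allowedItems: DataFrame,
>                             numAllItems: Int, k: Int): DataFrame = {
>   // Costly step: ranks ALL items for every user before filtering is possible.
>   val all = model.recommendForUserSubset(users, numAllItems)
>   // Flatten the recommendations array into (user, item, rating) rows.
>   val flat = all
>     .select(col("user"), explode(col("recommendations")).as("rec"))
>     .select(col("user"), col("rec.item").as("item"), col("rec.rating").as("rating"))
>   // Keep only whitelisted items, then re-select the top-k per user.
>   val byScore = Window.partitionBy("user").orderBy(col("rating").desc)
>   flat.join(allowedItems, "item")
>     .withColumn("rank", row_number().over(byScore))
>     .filter(col("rank") <= k)
> }
> {code}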
> *Why it is bad:* the subset of items currently allowed for recommendation is 
> usually much smaller than the number of all items seen in the original data 
> (in my real dataset it is 220k vs 500). 
> *Design:* I am sorry, I am not familiar with Spark internals, so I offer a 
> solution based just on my human logic :) 
> {code:scala}
> // Mirrors recommendForUserSubset / recommendForItemSubset, but restricts
> // both sides at once before computing top-k recommendations.
> def recommendForUserItemSubsets(userDataset: Dataset[_],
>                                 itemDataset: Dataset[_],
>                                 numItems: Int): DataFrame = {
>   // Keep only the latent factors of the requested users and items.
>   val userFactorSubset = getSourceFactorSubset(userDataset, userFactors, $(userCol))
>   val itemFactorSubset = getSourceFactorSubset(itemDataset, itemFactors, $(itemCol))
>   // Score the restricted cross product and take the top-k items per user.
>   recommendForAll(userFactorSubset, itemFactorSubset, $(userCol), $(itemCol),
>     numItems, $(blockSize))
> }
> {code}
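> A hypothetical call site, assuming the method above is added to {{ALSModel}} 
> and reusing the illustrative {{activeUsers}} / {{allowedItems}} DataFrames 
> from the workaround sketch:
> {code:scala}
> // Both sides restricted up front: only the allowed items are ever scored.
> val recs = model.recommendForUserItemSubsets(activeUsers, allowedItems, 10)
> {code}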
>  
> I will be glad to receive feedback on whether this is a reasonable request, 
> and on possibly more efficient workarounds. 
>  
> Thanks!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
