Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    I think there is a fair bit of difference between cross-validating the 
model and scoring in production.
    
    In most practical live-scoring situations, there may be multiple levels of 
fallbacks / defaults for the cold-start case (e.g. "most popular", "newest", 
content-based methods, etc.). There may also be various post-processing steps 
applied to the results. I don't think it's feasible to re-create live behaviour 
perfectly for cross-validation scenarios (especially as these systems are often 
entirely separate from Spark).
    
    Even for offline bulk scoring, again there may be many different options 
for cold start. Do we intend to support all of them within Spark? Again I don't 
think that's feasible, though as discussed on the JIRA we can certainly support 
a few useful options, such as "average user" which could indeed serve for both 
CV and live scoring purposes.
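
    As a rough illustration of the "average user" idea mentioned above (a
hypothetical sketch, not Spark's implementation), a cold-start user can be
scored by substituting the element-wise mean of all known user latent-factor
vectors into the usual ALS dot-product prediction. The factor values below
are made up for the example:

```python
def average_user_vector(user_factors):
    """Element-wise mean of a list of user latent-factor vectors."""
    k = len(user_factors[0])
    n = len(user_factors)
    return [sum(u[i] for u in user_factors) / n for i in range(k)]

def predict(user_vec, item_vec):
    """ALS-style prediction: dot product of user and item factors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

# Toy latent factors (rank 2) for three known users and one item.
user_factors = [[0.2, 1.0], [0.6, 0.4], [0.4, 0.1]]
item_vec = [1.0, 0.5]

avg = average_user_vector(user_factors)  # approximately [0.4, 0.5]
score = predict(avg, item_vec)           # approximately 0.65
```

This kind of fallback produces a real number for any unseen user, which is
why it could plausibly serve for both CV and live scoring.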
    
    I actually think `NaN` for live scoring is "better" than, say, `0`, 
because it makes very clear that the value is a missing data point (which the 
system can choose how to handle) rather than a genuine prediction of `0`.
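
    The point about `NaN` as an unambiguous sentinel can be sketched as
follows (a hypothetical serving-layer fragment; the fallback score is an
assumption for the example):

```python
import math

MOST_POPULAR_SCORE = 3.9  # assumed popularity-based fallback for this sketch

def serve_score(raw_prediction):
    """Route cold-start sentinels to a fallback; pass real predictions through."""
    if math.isnan(raw_prediction):
        return MOST_POPULAR_SCORE  # missing data point: system chooses a fallback
    return raw_prediction          # genuine model output (which may itself be 0)
```

With `0` as the sentinel instead, `serve_score` could not distinguish a
cold-start user from a user the model genuinely scores at `0`.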
    
    For CV, I'd expect that predicting `0` would have a dramatic negative 
impact on RMSE. So for CV I'd say the `drop` option is more reasonable.
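
    A small numeric sketch of why dropping is preferable to coercing to `0`
when evaluating (the ratings below are made up): with one cold-start pair in
the test set, filtering it out leaves RMSE reflecting the model's real
accuracy, while scoring it as `0` lets a single term dominate the error.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (prediction, actual) pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

nan = float("nan")
preds = [(4.1, 4.0), (2.9, 3.0), (nan, 5.0)]  # last pair is a cold-start user

# "drop"-style evaluation: filter out cold-start predictions first.
dropped = [(p, a) for p, a in preds if not math.isnan(p)]
# Naive alternative: coerce NaN to 0 and keep the pair.
coerced = [(0.0 if math.isnan(p) else p, a) for p, a in preds]

rmse_drop = rmse(dropped)   # approximately 0.1
rmse_zero = rmse(coerced)   # dominated by the (0, 5.0) term, roughly 2.9
```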
    
    This is not arguing against other reasonable options (average rating, 
average user vectors, and so on) - we can add those later based on user demand. 
This is just a start.

