Github user MLnick commented on the issue: https://github.com/apache/spark/pull/12896

I think there is a fair bit of difference between cross-validating the model and scoring in production. In most practical live-scoring situations there may be multiple levels of fallbacks / defaults for the cold-start case (e.g. "most popular", "newest", content-based methods, and so on). There may also be various post-processing steps applied to the results. I don't think it's feasible to re-create live behaviour perfectly for cross-validation scenarios, especially as these serving systems are often entirely separate from Spark.

Even for offline bulk scoring, there may again be many different options for cold start. Do we intend to support all of them within Spark? I don't think that's feasible either, though as discussed on the JIRA we can certainly support a few useful options, such as "average user", which could serve for both CV and live-scoring purposes.

I actually think `NaN` for live scoring is "better" than, say, `0`, because it makes very clear that this is a missing data point (which the downstream system can choose how to handle) rather than an actual prediction of `0`. For CV, I'd expect that predicting `0` would have a dramatic negative impact on RMSE, so for CV I'd say the `drop` option is more reasonable.

This is not arguing against other reasonable options (average rating, average user vectors and so on) - we can add those later on user demand. This is just a start.
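To illustrate the RMSE point, here's a minimal sketch with made-up ratings: zero-filling unscoreable cold-start pairs inflates RMSE badly, while dropping them (as the `drop` strategy would) evaluates only what the model could actually score. The data and helper are hypothetical, not from the PR.

```python
import math

# Hypothetical CV fold: (actual rating, predicted rating); None marks a
# cold-start user/item pair the model could not score (NaN in practice).
pairs = [(4.0, 3.8), (5.0, 4.6), (3.0, 3.1), (4.5, None), (2.0, None)]

def rmse(scored):
    return math.sqrt(sum((a - p) ** 2 for a, p in scored) / len(scored))

# "drop": evaluate only on pairs the model produced a real prediction for.
rmse_drop = rmse([(a, p) for a, p in pairs if p is not None])

# Zero-fill: treat every missing prediction as a predicted rating of 0.
rmse_zero = rmse([(a, p if p is not None else 0.0) for a, p in pairs])

print(rmse_drop)  # small: predictions are close to actuals
print(rmse_zero)  # dominated by the two artificial 0 predictions
```

With these numbers the drop-based RMSE is ~0.26 while the zero-filled RMSE jumps to ~2.21, so the metric would be driven by the cold-start handling rather than by model quality.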