Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-65682412 @akopich Thanks for the responses! Follow-ups: (1) Users implementing their own regularizers You're right that this would be nice to have. If we include it, then perhaps we can spend more time to have a clean API for regularizers which can be re-used in other algorithms which try to optimize an objective. (E.g., ideally, Dirichlet regularization would be implemented such that any algorithm which needed a Dirichlet regularizer (or prior) could reuse your code.) Here are some thoughts on that: * All of the regularizers here operate on each Matrix element individually. If that will be the case for all useful regularizers, then the regularizer API could operate per-element (to be simpler), and the code using the regularizers could iterate over all elements as needed. * The regularizer API should use general terminology, such as: * penalty(param: Double): Double * gradient(param: Double): Double Alternatively, we could use this development path to avoid having to decide on the API right now: * In this PR, keep the regularizer types public, but make all of their internals private[mllib]. This way, the API choices are kept private for now. * In a later PR, the regularizer API can be refined and made public so that users can implement their own pluggable types. (2) Regular and Robust in the same class You should be able to extract the functionality you need. E.g., if the newIteration method knows that it has a DocumentParameters instance, it can call getNewTheta(). If the actual type of the instance is RobustDocumentParameters, then RobustDocumentParameters.getNewTheta() will be called. But I would recommend having both classes implement getNewTheta() with the same visibility (private/public), and also to use the âoverrideâ keyword in RobustDocumentParameters. You should be able to abstract any other needed functionality similarly. (3) PLSA and RobustPLSA code duplication Looking more closely, I think youâre right about it being hard to abstract further. (Iâll let you know if I have ideas.) (4) Float vs. Double True, it may be worth the trouble to use Float to save on memory and communication. I donât know enough about PLSA to know how important numerical precision is in general. Your approach sounds reasonable then. One alternative would be to user Breeze matrices (but not in public APIs!), but Iâd only suggest that if it will simplify or shorten code.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org