Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67083325 @akopich Thanks for the updates! (Much easier to see the diff now) The decision about setters vs. constructor arguments was from [this JIRA (design doc linked from JIRA)](https://issues.apache.org/jira/browse/SPARK-3530). The main issues were binary compatibility (adding a new parameter to an algorithm requires adding a new constructor when we don't use setters) and problems with Java not supporting named parameters and default arguments. For testing, I agree it's tough to test "accuracy" for topic models. It would be great to see comparisons on the same dataset, just to get a sense of whether perplexity decreases similarly and is in the same ballpark. Same with scalability & speed. As far as other implementations, your other non-Spark implementation would be helpful if it's been tested before. It would be great to see comparisons with other Spark LDA implementations. E.g., [https://github.com/apache/spark/pull/2388] is LDA using GraphX, and it would be very helpful to see how performance varies. (It might also take a bit more work though.) What sounds reasonable to you? I'm recommending this testing as separate from unit tests. (A unit test for perplexity sounds a bit unstable.) Perhaps the test code could become an example instead of a unit test.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org