Dear Spark developers,

Are there any best practices or guidelines for machine learning unit tests in Spark? After taking a brief look at the unit tests in ML and MLlib, I have found that each algorithm is tested in a different way. There are a few kinds of tests:

1) Partial checks of internal algorithm correctness. These can be anything.
2) Generate test data with a distribution specific to the algorithm, run the algorithm, and check the outcomes. This is also very specific to each algorithm.
3) Compare the parameters (weights) of the trained model with parameters from existing implementations, such as R or SciPy. This looks like the most useful kind of test, because it gives confidence that the algorithm produces the same result other people get with other software.
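To make kind (3) concrete, here is a minimal sketch of such a reference test: fit a linear regression on a tiny fixed dataset and compare the learned coefficients against values that would be computed once with another tool (e.g. R's lm). The dataset, tolerance, and expected coefficients below are illustrative placeholders I made up, not actual Spark test code or real reference values.

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object LinearRegressionReferenceCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ReferenceCheck")
      .getOrCreate()
    import spark.implicits._

    // Tiny deterministic dataset (label, features); labels follow label = 1 + x1.
    val data = Seq(
      (1.0, Vectors.dense(0.0, 0.5)),
      (2.0, Vectors.dense(1.0, 0.1)),
      (3.0, Vectors.dense(2.0, 0.7)),
      (4.0, Vectors.dense(3.0, 0.2))
    ).toDF("label", "features")

    val model = new LinearRegression()
      .setMaxIter(100)
      .setRegParam(0.0)
      .fit(data)

    // Hypothetical reference coefficients, as they might be obtained from R's lm().
    val expected = Array(1.0, 0.0)
    val tol = 1e-4

    // The actual assertion: learned weights must match the reference within tolerance.
    model.coefficients.toArray.zip(expected).foreach { case (c, e) =>
      assert(math.abs(c - e) <= tol, s"coefficient $c differs from reference $e")
    }
    println(s"coefficients = ${model.coefficients} match the reference values")

    spark.stop()
  }
}

In a real test suite this would live in a ScalaTest suite with a shared local SparkSession, but the pattern is the same: fixed data, fixed reference values, and a tolerance.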
After googling a bit, I found the following guidelines rather relevant: http://blog.mpacula.com/2011/02/17/unit-testing-statistical-software/

I am wondering whether we should come up with specific guidelines for machine learning tests, such as guaranteeing that the user gets the expected result from each algorithm. This might also be considered an additional benefit of Spark: standardized, well-tested ML.

Best regards,
Alexander