Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-23 Thread Matei Zaharia
Yes, that makes sense, but just to be clear, using the same seed does *not* 
imply that the algorithm will produce “equivalent” results, under whatever 
definition of equivalence, if you change the input data. For example, in SGD, 
the random seed might be used to select the next minibatch of examples, but if 
you reorder the data or change the labels, this will result in a different 
gradient being computed. Just because the dataset transformation seems to 
preserve the ML problem at a high abstraction level does not mean that even a 
deterministic ML algorithm (MLlib with a fixed seed) will give the same result. 
Maybe other libraries do, but that doesn’t necessarily mean that MLlib is doing 
something wrong here.
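To make this concrete, here is a minimal NumPy sketch (illustration only, not 
MLlib code; the toy data, batch size, and update rule are made up) of how a 
fixed seed still yields a different first minibatch, and therefore a different 
gradient step, once the rows are reordered:

    import numpy as np

    # Toy data: 8 examples, 2 features, binary labels.
    X = np.arange(16, dtype=float).reshape(8, 2)
    y = np.array([0, 1, 0, 1, 1, 0, 1, 0], dtype=float)

    def first_sgd_step(X, y, seed, batch_size=4, lr=0.1):
        """One SGD step for logistic regression, starting from w = 0."""
        rng = np.random.RandomState(seed)
        idx = rng.permutation(len(X))[:batch_size]  # the seed picks minibatch *positions*
        Xb, yb = X[idx], y[idx]
        w = np.zeros(X.shape[1])
        p = 1.0 / (1.0 + np.exp(-Xb @ w))           # sigmoid
        grad = Xb.T @ (p - yb) / batch_size
        return w - lr * grad

    w_original = first_sgd_step(X, y, seed=42)

    # Same seed, reordered rows: the seed selects the same positions, which now
    # hold different examples, so the gradient (and the trained model) differs.
    perm = np.random.RandomState(0).permutation(len(X))
    w_reordered = first_sgd_step(X[perm], y[perm], seed=42)

    print(w_original, w_reordered)  # generally not equal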

Basically, I’m just saying that as an ML library developer I wouldn’t be super 
concerned about these particular test results (especially if just a few 
instances change classification). I would be much more interested, however, in 
results like the following:

- The differences in the algorithm’s evaluation metrics (loss, accuracy, etc.) 
are statistically significant when you change these properties of the data. 
This probably requires running the algorithm multiple times with different 
seeds. 
- MLlib’s evaluation metrics for a problem differ in a statistically 
significant way from those of other ML libraries, for algorithms configured 
with equivalent hyperparameters. (Though libraries sometimes define their 
hyperparameters differently.)

The second one is definitely something we’ve tested for informally in the past, 
though it is not in unit tests as far as I know.
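To illustrate the first kind of check, here is a rough sketch using 
scikit-learn only, so it stays self-contained (the dataset, the 
feature-reordering transformation, and the choice of Welch’s t-test are 
placeholders, not a prescription for how MLlib’s own tests should do it):

    import numpy as np
    from scipy.stats import ttest_ind
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    def accuracy_over_seeds(X_tr, y_tr, X_te, y_te, seeds):
        """Train once per seed and collect test accuracies."""
        accs = []
        for s in seeds:
            clf = RandomForestClassifier(n_estimators=50, random_state=s)
            accs.append(clf.fit(X_tr, y_tr).score(X_te, y_te))
        return np.array(accs)

    seeds = range(20)
    acc_original = accuracy_over_seeds(X_tr, y_tr, X_te, y_te, seeds)

    # Metamorphic transformation: reorder the feature columns.
    perm = np.random.RandomState(1).permutation(X_tr.shape[1])
    acc_reordered = accuracy_over_seeds(X_tr[:, perm], y_tr, X_te[:, perm], y_te, seeds)

    # Flag a failure only if the accuracy distributions differ significantly.
    t, p = ttest_ind(acc_original, acc_reordered, equal_var=False)
    print(f"mean acc {acc_original.mean():.3f} vs {acc_reordered.mean():.3f}, p = {p:.3f}")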

Matei

> On Aug 23, 2018, at 5:14 AM, Steffen Herbold  
> wrote:
> 
> Dear Matei,
> 
> thanks for the feedback!
> 
> I used the setSeed option for all randomized classifiers and always used the 
> same seeds for training, in the hope that this deals with the 
> non-determinism. I did not run any significance tests, because I was 
> considering this from a functional perspective, assuming that the 
> non-determinism would be dealt with once I fixed the seed values. The test 
> results contain how many instances were classified differently. Sometimes 
> these are only 1 or 2 out of 100 instances, i.e., almost certainly not 
> significant. Other cases seem more interesting. For example, 20/100 instances 
> were classified differently by the linear SVM for informative, uniformly 
> distributed data when we added 1 to each feature value.
> 
> I know that these problems should sometimes be expected. However, I was 
> actually not sure what to expect, especially after I started to compare the 
> results across different ML libraries. The random forests are a good example. 
> I expected them to be dependent on feature/instance order. However, they are 
> not in Weka, only in scikit-learn and Spark MLlib. There are more such 
> examples, such as logistic regression, which exhibits different behavior in 
> all three libraries. Thus, I decided to just give my results to the people 
> who know what to expect from their implementations, i.e., the devs.
> 
> I will probably expand my test generator in the future to allow more detailed 
> specifications of what to expect from each algorithm. This seems to be a 
> "must" if projects are to use it productively. Relaxing the assertions so 
> that they only react if the differences are significant would be another 
> possible change. This could be a command line option that allows different 
> levels of test strictness.
> 
> Best,
> Steffen
> 
> 
> Am 22.08.2018 um 23:27 schrieb Matei Zaharia:
>> Hi Steffen,
>> 
>> Thanks for sharing your results about MLlib — this sounds like a useful 
>> tool. However, I wanted to point out that some of the results may be 
>> expected for certain machine learning algorithms, so it might be good to 
>> design those tests with that in mind. For example:
>> 
>>> - The classification of LogisticRegression, DecisionTree, and RandomForest 
>>> were not inverted when all binary class labels are flipped.
>>> - The classification of LogisticRegression, DecisionTree, GBT, and 
>>> RandomForest sometimes changed when the features are reordered.
>>> - The classification of LogisticRegression, RandomForest, and LinearSVC 
>>> sometimes changed when the instances are reordered.
>> All of these things might occur because the algorithms are nondeterministic. 
>> Were the effects large or small? Or, for example, was the final difference 
>> in accuracy statistically significant? Many ML algorithms are trained using 
>> randomized algorithms like stochastic gradient descent, so you can’t expect 
>> exactly the same results under these changes.
>> 
>>> - The classification of NaïveBayes and the LinearSVC sometimes changed if 
>>> one is added to each feature value.
>> This might be due to nondeterminism as above, but it might also be due to 
>> regularization or nonlinear effects for some algorithms. For example, some 
>> algorithms might look at the relative values of features, in which case 
>> adding 1 to each feature value transforms the data. Other algorithms might 
>> require that data be centered around a mean of 0 to work best.

Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-23 Thread Erik Erlandson
Behaviors at this level of detail, across different ML implementations, are
highly unlikely to ever align exactly. Seemingly small differences in logic,
such as "<" versus "<=", or differences in random number generators (to say
nothing of different implementation languages), will accumulate over training
to yield different models, even if their overall performance is similar.
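As a toy sketch of the "<" versus "<=" point (not code from any of the
libraries): two tree stumps that learned the same split value but treat ties
differently already disagree on every instance sitting exactly on the
threshold, and during training such differences change which examples each
child node sees, so they compound as the trees grow.

    # Two tree stumps with the same learned threshold, differing only in the
    # comparison operator used for the left branch.
    def stump_strict(x, threshold=0.5):
        return "left" if x < threshold else "right"

    def stump_inclusive(x, threshold=0.5):
        return "left" if x <= threshold else "right"

    for x in (0.3, 0.5, 0.7):
        print(x, stump_strict(x), stump_inclusive(x))
    # 0.5 -> 'right' vs 'left': the two models already disagree on that point.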

> The random forests are a good example. I expected them to be dependent on
> feature/instance order. However, they are not in Weka, only in scikit-learn
> and Spark MLlib. There are more such examples, such as logistic regression,
> which exhibits different behavior in all three libraries.


Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-23 Thread Steffen Herbold

Dear Matei,

thanks for the feedback!

I used the setSeed option for all randomized classifiers and always used 
the same seeds for training, in the hope that this deals with the 
non-determinism. I did not run any significance tests, because I was 
considering this from a functional perspective, assuming that the 
non-determinism would be dealt with once I fixed the seed values. The test 
results contain how many instances were classified differently. Sometimes 
these are only 1 or 2 out of 100 instances, i.e., almost certainly not 
significant. Other cases seem more interesting. For example, 20/100 
instances were classified differently by the linear SVM for informative, 
uniformly distributed data when we added 1 to each feature value.
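
For reference, a minimal PySpark sketch of this kind of check, with a tiny 
hypothetical dataset standing in for the generated one (RandomForestClassifier 
is used only because it exposes a seed; the actual tests cover more classifiers 
and transformations):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.master("local[2]").appName("metamorphic-sketch").getOrCreate()

    # Hypothetical toy data; the real tests use generated datasets.
    raw = [([0.0, 1.0], 0.0), ([1.0, 0.0], 1.0), ([0.2, 0.9], 0.0), ([0.9, 0.1], 1.0)]
    original = spark.createDataFrame(
        [(Vectors.dense(f), l) for f, l in raw], ["features", "label"])
    # Metamorphic follow-up: add 1 to every feature value.
    shifted = spark.createDataFrame(
        [(Vectors.dense([v + 1.0 for v in f]), l) for f, l in raw], ["features", "label"])

    rf = RandomForestClassifier(numTrees=10, seed=42)  # same seed for both runs
    pred_a = rf.fit(original).transform(original).select("prediction").collect()
    pred_b = rf.fit(shifted).transform(shifted).select("prediction").collect()

    # Count instances classified differently (row order is stable for this tiny local data).
    n_diff = sum(a.prediction != b.prediction for a, b in zip(pred_a, pred_b))
    print(f"{n_diff} of {len(raw)} instances classified differently")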


I know that these problems should sometimes be expected. However, I was 
actually not sure what to expect, especially after I started to compare 
the results across different ML libraries. The random forests are a good 
example. I expected them to be dependent on feature/instance order. 
However, they are not in Weka, only in scikit-learn and Spark MLlib. 
There are more such examples, such as logistic regression, which exhibits 
different behavior in all three libraries. Thus, I decided to just give 
my results to the people who know what to expect from their 
implementations, i.e., the devs.


I will probably expand my test generator in the future to allow more 
detailed specifications of what to expect from each algorithm. This seems 
to be a "must" if projects are to use it productively. Relaxing the 
assertions so that they only react if the differences are significant 
would be another possible change. This could be a command line option 
that allows different levels of test strictness.
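
A possible sketch of such a relaxed assertion (the tolerance, alpha, and 
--strictness flag are hypothetical; the one-sided exact binomial test is just 
one way to decide whether the observed change rate is significant):

    from scipy.stats import binomtest

    def assert_metamorphic(n_changed, n_total, strictness="strict",
                           tolerance=0.05, alpha=0.01):
        """Fail on any change (strict), or only when the observed change rate
        is significantly above `tolerance` (relaxed)."""
        if strictness == "strict":
            assert n_changed == 0, f"{n_changed}/{n_total} instances changed class"
        else:
            p = binomtest(n_changed, n_total, tolerance, alternative="greater").pvalue
            assert p >= alpha, (f"{n_changed}/{n_total} changed (p = {p:.4f}), "
                                f"significantly above tolerance {tolerance}")

    # Exposed by the generator as a command line option (hypothetical):
    #   --strictness strict|relaxed --tolerance 0.05 --alpha 0.01
    assert_metamorphic(2, 100, strictness="relaxed")     # passes: 2/100 is compatible with 5%
    # assert_metamorphic(20, 100, strictness="relaxed")  # would fail: 20/100 is clearly above 5%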


Best,
Steffen


Am 22.08.2018 um 23:27 schrieb Matei Zaharia:

Hi Steffen,

Thanks for sharing your results about MLlib — this sounds like a useful tool. 
However, I wanted to point out that some of the results may be expected for 
certain machine learning algorithms, so it might be good to design those tests 
with that in mind. For example:


- The classification of LogisticRegression, DecisionTree, and RandomForest were 
not inverted when all binary class labels are flipped.
- The classification of LogisticRegression, DecisionTree, GBT, and RandomForest 
sometimes changed when the features are reordered.
- The classification of LogisticRegression, RandomForest, and LinearSVC 
sometimes changed when the instances are reordered.

All of these things might occur because the algorithms are nondeterministic. 
Were the effects large or small? Or, for example, was the final difference in 
accuracy statistically significant? Many ML algorithms are trained using 
randomized algorithms like stochastic gradient descent, so you can’t expect 
exactly the same results under these changes.


- The classification of NaïveBayes and the LinearSVC sometimes changed if one 
is added to each feature value.

This might be due to nondeterminism as above, but it might also be due to 
regularization or nonlinear effects for some algorithms. For example, some 
algorithms might look at the relative values of features, in which case adding 
1 to each feature value transforms the data. Other algorithms might require 
that data be centered around a mean of 0 to work best.

I haven’t read the paper in detail, but basically it would be good to account 
for randomized algorithms as well as various model assumptions, and make sure 
the differences in results in these tests are statistically significant.

Matei



--
Dr. Steffen Herbold
Institute of Computer Science
University of Goettingen
Goldschmidtstraße 7
37077 Göttingen, Germany
mailto. herb...@cs.uni-goettingen.de
tel. +49 551 39-172037





Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-22 Thread Matei Zaharia
Hi Steffen,

Thanks for sharing your results about MLlib — this sounds like a useful tool. 
However, I wanted to point out that some of the results may be expected for 
certain machine learning algorithms, so it might be good to design those tests 
with that in mind. For example:

> - The classification of LogisticRegression, DecisionTree, and RandomForest 
> were not inverted when all binary class labels are flipped.
> - The classification of LogisticRegression, DecisionTree, GBT, and 
> RandomForest sometimes changed when the features are reordered.
> - The classification of LogisticRegression, RandomForest, and LinearSVC 
> sometimes changed when the instances are reordered.

All of these things might occur because the algorithms are nondeterministic. 
Were the effects large or small? Or, for example, was the final difference in 
accuracy statistically significant? Many ML algorithms are trained using 
randomized algorithms like stochastic gradient descent, so you can’t expect 
exactly the same results under these changes.

> - The classification of NaïveBayes and the LinearSVC sometimes changed if one 
> is added to each feature value.

This might be due to nondeterminism as above, but it might also be due to 
regularization or nonlinear effects for some algorithms. For example, some 
algorithms might look at the relative values of features, in which case adding 
1 to each feature value transforms the data. Other algorithms might require 
that data be centered around a mean of 0 to work best.
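
As an illustration of the relative-values point, a small scikit-learn sketch 
with synthetic count data (a stand-in for the NaiveBayes case, not a claim 
about the MLlib implementation): multinomial naive Bayes learns 
class-conditional feature proportions, so adding 1 to every value changes the 
ratios it sees (counts of 1:9 become 2:10), and predictions near the decision 
boundary can flip.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical count data: two features with very different typical magnitudes.
    rng = np.random.RandomState(0)
    X = rng.poisson(lam=[1.0, 9.0], size=(200, 2)).astype(float)
    y = (X[:, 1] > 6).astype(int)

    nb_original = MultinomialNB().fit(X, y)
    nb_shifted = MultinomialNB().fit(X + 1.0, y)   # "add 1 to each feature value"

    pred_original = nb_original.predict(X)
    pred_shifted = nb_shifted.predict(X + 1.0)
    print("instances classified differently:", int((pred_original != pred_shifted).sum()))
    # Algorithms that assume centered data would instead want standardization
    # (mean 0, unit variance) applied consistently before training.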

I haven’t read the paper in detail, but basically it would be good to account 
for randomized algorithms as well as various model assumptions, and make sure 
the differences in results in these tests are statistically significant.

Matei

