> This is exactly the same as your March 19 post to which I replied on
> March 22. Did you see my reply? In it I tried to clarify the ROC concept
> in order to rectify what appeared to be your lack of understanding.
You are right. My newsreader decided to post two old messages again; I
don't know what happened...

> Please read my reply in Google Groups and respond if you disagree or
> don't understand.

Where we disagree is that you say an ROC is created by changing a single
parameter, while I don't think that is necessarily true. What happens if
that one parameter is coded? Suppose x is an integer and y is a fraction.
If I write z = x + y and create an ROC by varying z, I am actually doing
the same as when I change both x and y. Another example is when the
parameter is an index into a list of permutations, used to select the
subset of features the algorithm works with. (I have put a small sketch
of what I mean further down.)

Of course, to see the influence of a parameter on the performance of the
algorithm, you would change just one parameter (and preferably your
algorithm is such that the results do not change 'chaotically'). However,
what I am interested in (and this is how ROCs are often used in the
medical literature) is how good my algorithm is in terms of sensitivity
and specificity. The thing is, I don't know the prevalence or the costs
of false negatives and false positives, and therefore I don't know the
best sensitivity/specificity trade-off.

>> I am having conceptual problems with cross-validating an ROC.

> Me too. I have never heard of such a concept.

You surely did, between March 19 and 22 :-) (Sorry, I couldn't resist.)

>> The thing is
>> that for me the only reason to draw an ROC is to show the individual
>> fpr/tpr pairs, so one can choose the optimal setting for a specific
>> application (depending on prevalence, cost of FP/FN, etc). So, in fact,
>> the ROC just shows various algorithms, and you choose one that suits
>> you best.

> No. [...]
> The result of changing a single parameter setting should *not* be
> referred to as "another algorithm".

If you restrict the parameter to be a cut-off, you have a point. If the
parameter can have a broader meaning (such as the index mentioned above),
I don't agree.

>> The thing with validation is that it is supposed to be done on the
>> final algorithm, not on some intermediate result.

> According to commonly accepted neural network terminology (see the
> comp.ai.neural-net FAQ), validation is performed on multiple
> intermediate designs in order to pick the "best" one. Testing is then
> performed on the chosen algorithm to quantify its generalization
> performance on independent test data that was not used for design or
> validation.

I'm sorry; according to the FAQ, I am apparently not the only one who
confuses validation and test sets.
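To come back to the coded-parameter point for a moment, here is a rough
sketch of what I mean, in Python. The data, the feature subsets and the
cut-offs are all made up, so please read it as an illustration rather
than as anything from my actual application: varying one index changes
several underlying settings at once, yet the sweep still yields a set of
(fpr, tpr) points, i.e. an ROC.

  # One "parameter" that is really an index into a list of configurations.
  # Each configuration bundles a feature subset and a cut-off, so varying
  # the single index changes several underlying settings at once.

  # Toy data: (feature values, true label).
  samples = [
      ({"a": 0.9, "b": 0.2}, 1),
      ({"a": 0.8, "b": 0.7}, 1),
      ({"a": 0.7, "b": 0.6}, 1),
      ({"a": 0.4, "b": 0.9}, 0),
      ({"a": 0.3, "b": 0.1}, 0),
      ({"a": 0.2, "b": 0.8}, 0),
  ]

  # The "coded" parameter: index -> (feature subset, cut-off).
  configurations = [
      (("a",),     0.5),
      (("a", "b"), 0.6),
      (("b",),     0.4),
      (("a", "b"), 0.8),
  ]

  def classify(features, subset, cutoff):
      # Trivial scorer: mean of the selected features against a cut-off.
      score = sum(features[f] for f in subset) / len(subset)
      return 1 if score >= cutoff else 0

  roc_points = []
  for index, (subset, cutoff) in enumerate(configurations):
      tp = fp = tn = fn = 0
      for features, label in samples:
          prediction = classify(features, subset, cutoff)
          if prediction == 1 and label == 1: tp += 1
          elif prediction == 1 and label == 0: fp += 1
          elif prediction == 0 and label == 0: tn += 1
          else: fn += 1
      roc_points.append((index, fp / (fp + tn), tp / (tp + fn)))

  for index, fpr, tpr in sorted(roc_points, key=lambda p: p[1]):
      print("index", index, "fpr", round(fpr, 2), "tpr", round(tpr, 2))

Whether such a sweep is a sensible thing to do is a separate question; the
point is only that a single parameter need not correspond to a single
underlying setting.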
>> More detailed:
>> Consider algorithm A. It tests a number of algorithms (1..N) and
>> chooses the best one (say number i). Even if algorithm A uses
>> cross-validation to train and test all N algorithms, we cannot say that
>> the error rate of algorithm A is the same as the estimated error rate
>> of algorithm i.

> What you have done is to introduce a novel concept: Estimating the error
> of an algorithm (A) that designs and validates other algorithms (1..N).
> I don't know if you meant to do this or you have misinterpreted what is
> done conventionally. If you meant to do this then you are in uncharted
> land. If A chooses i based on the validation set performance (say 25%)
> and testing shows that i also has the best test set performance (say
> 30%), then the error rate of A is 0.
> This does not appear to be a very useful concept, and is probably not
> what you want.

Yes and no. The concept is indeed estimating the error of algorithm A,
but what you describe sounds more like the 'additional' error introduced
by A. The algorithm I was trying to describe is a wrapper in the sense
that it contains all N algorithms. So when I talk about the performance
of algorithm A, I am talking about the performance of the algorithm i it
chooses. Now, A may not be stable: it might not select the same i each
time you feed it the same dataset. That is why it makes sense to
determine the performance of A instead of the performance of i. (See the
P.S. at the end for a small sketch of what I mean.)

Before going deeper into this: Would you say that the error indicated by
the ROC is the generalization error you would actually expect when you
select the corresponding cut-off value? I would say no: selecting a
cut-off value is, in my eyes, the same as tuning any other parameter of a
classification algorithm. It is therefore just as prone to overfitting,
and the generalization error should be estimated on a separate test set.

>> So, we
>> cross-validate algorithm A: We use a data set to train it (and thus to
>> select i)

> This training data set consists of two subsets: a design data set and an
> independent validation data set.
> However, it is not clear how you are distinguishing algorithm i from
> algorithm j.

What do you mean by algorithm j?

>> Now, if we compare this to the ROC, the ROC is like the outcome of all
>> N algorithms.

> No.
> A ROC summarizes the performance of *one* algorithm w.r.t. varying a
> single parameter.
> You are comparing algorithm i with a ROC that comes from where?

I am not comparing ROCs, or comparing an algorithm with an ROC. However,
that does help to make my point: if algorithm 1 has the higher true
positive rate at small false positive rates and algorithm 2 has the
higher true positive rate at larger false positive rates, which one is
better? It depends on the prevalence and the costs; neither algorithm is
better by itself. By creating a wrapper algorithm that chooses algorithm
1 at small false positive rates and algorithm 2 at larger ones, a
combined ROC arises. Somehow, you seem to disagree with that possibility.

>> Based on the application, one would choose the best algorithm.

> Are you now referring to using "the" ROC to choose the best algorithm or
> are you referring to choosing the algorithm that yields the best ROC?

The first: using 'the' ROC to choose the best setting of the parameter,
resulting in your final algorithm. (Again, I consider setting the cut-off
value to be the same as tuning the other parameters of the algorithm.)

>> Cross-validation is therefore not possible before this selection has
>> been made.

> I'm lost.

I'll try again: Estimating the generalization error is therefore not
possible before this selection has been made.

> I will comment on the alternative you propose in the next to last
> paragraph in a separate post.

Regardless of whether we can sort this out: thanks for your comments!

Koen
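P.S. Since it is easier to show than to describe: below is a rough
sketch, in Python, of what I mean by estimating the performance of the
wrapper A rather than of the chosen algorithm i. Everything in it is made
up for the occasion (two toy 'candidate algorithms', synthetic data,
arbitrary split sizes), so treat it as an outline of the idea and nothing
more.

  import random

  random.seed(0)

  # Synthetic data: (feature_a, feature_b, label).  Feature a is mildly
  # informative, feature b is noise.
  def make_sample():
      a, b = random.random(), random.random()
      label = 1 if a + 0.3 * random.random() > 0.7 else 0
      return (a, b, label)

  data = [make_sample() for _ in range(200)]

  # Two toy stand-ins for the N candidate algorithms A chooses between.
  def algo_1(x):
      return 1 if x[0] >= 0.5 else 0   # thresholds feature a

  def algo_2(x):
      return 1 if x[1] >= 0.5 else 0   # thresholds feature b

  candidates = [algo_1, algo_2]

  def error_rate(algo, dataset):
      return sum(algo(x) != x[2] for x in dataset) / len(dataset)

  # Algorithm A, the wrapper: split its training data into a design part
  # and a validation part, and return whichever candidate validates best.
  # (These toy candidates need no fitting on the design part; real ones
  # would be trained on it.)
  def wrapper_A(train):
      random.shuffle(train)
      n_val = max(1, len(train) // 4)
      validation = train[:n_val]
      return min(candidates, key=lambda algo: error_rate(algo, validation))

  # Cross-validating A itself: every fold reruns the whole selection.
  # Different folds may pick different candidates, which is exactly why
  # the resulting estimate describes A and not any single chosen i.
  def cross_validate_A(data, k=5):
      folds = [data[i::k] for i in range(k)]
      test_errors = []
      for i in range(k):
          test = folds[i]
          train = [x for j, fold in enumerate(folds) if j != i for x in fold]
          chosen = wrapper_A(train)
          test_errors.append(error_rate(chosen, test))
      return sum(test_errors) / k

  print("estimated error of A:", round(cross_validate_A(data), 3))

The point of the outer loop is only that the selection step is repeated
inside every fold, so the averaged test error is a property of the
selection procedure A, not of whichever i happens to win most often.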
