Hi,

I have a strange one for the group.

We have a system that predicts the probability of a binary outcome using a fairly standard SVM (the e1071 package).

The input data is generated by a Perl script that calculates a bunch of things, fetches data from a database, etc.

We train the system on 30,000 examples and then test the system on an unseen set of 5,000 records.
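
For context, here is roughly what the modelling side looks like. This is a minimal sketch, not our real code: "train" and "test" are the data frames built by the Perl script, and "outcome" is a placeholder name for the binary target.

library(e1071)

## train/test are data frames from the perl script; "outcome" is the binary target
fit <- svm(outcome ~ ., data = train, probability = TRUE)

pred <- predict(fit, newdata = test, probability = TRUE)
prob <- attr(pred, "probabilities")   # per-class probability matrix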

The "real world" results on the test set looked VERY good. We were really happy with our model.

Then, we noticed that there was a big error in our data generation script: one of the values (an average of sorts) was being calculated incorrectly. (The Perl script failed to clear two iterators, so they both grew with every record.)
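
To make the bug concrete, here is a toy reconstruction in R of what we think was happening (the real script is Perl, and the numbers here are invented): because the two running values are never cleared, the "average" drifts with the record's position in the file rather than reflecting that record alone.

run_sum <- 0; run_n <- 0                 # the two "iterators" that never get cleared
vals <- list(c(2, 4), c(9, 11), c(1))    # values belonging to each record

correct <- sapply(vals, mean)            # 3, 10, 1  (reset per record)
buggy <- sapply(vals, function(v) {
  run_sum <<- run_sum + sum(v)
  run_n   <<- run_n   + length(v)
  run_sum / run_n                        # cumulative, order-dependent "average"
})
## buggy: 3, 6.5, 5.4 -- each value depends on every earlier record, not just this one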

As a quick experiment, we removed that item from our data set and re-ran the process. The results were noticeably worse: perhaps 75% as good as training with the "wrong" factor included.
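
Concretely, the re-run was just the same pipeline with the suspect column dropped (again a sketch; "bad_avg" is a placeholder name for that column):

fit2  <- svm(outcome ~ . - bad_avg, data = train, probability = TRUE)
pred2 <- predict(fit2, newdata = test)
mean(pred2 == test$outcome)   # accuracy without the buggy column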

So, this is really a philosophical question. Do we:
1) Shrug and say, "who cares": the SVM figured it out and likes that bad data item for some inexplicable reason, or
2) Tear into the math and try to figure out WHY the SVM is predicting more accurately?

Any opinions??

Thanks!
