Hi folks! FYI, the Scriven excerpt below...

Excerpted from: Michael Scriven (1993). The Validity of Student Ratings. In: Teacher Evaluation, Evaluation & Development Group, AERA.

-------------------------->> ....Begin longish Scriven excerpt ...

9. The existence of a positive correlation (even a correlation of 1.0) between the scores on several forms does not show the presence of a common property; there must also be logical or theoretical grounds for the identification, and usually also further factual evidence for it. See "Fallacies of Statistical Substitution" by the present author in Argumentation, 1987, pp. 333-349, D. Reidel.
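[A minimal illustrative sketch, not part of Scriven's text, of the point in note 9, using invented data: two hypothetical rating "forms" are built to track different properties (individual attention vs. audibility), yet their scores correlate almost perfectly because both are driven by a shared nuisance factor, here class size. The correlation by itself cannot distinguish this situation from a genuine common property; that takes the logical or theoretical grounds Scriven mentions.]

    # Illustrative sketch only: all names and numbers are hypothetical.
    import random
    import statistics

    random.seed(1)

    def pearson_r(xs, ys):
        # Plain Pearson correlation using population statistics.
        mx, my = statistics.mean(xs), statistics.mean(ys)
        sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
        return cov / (sx * sy)

    class_sizes = [random.randint(10, 200) for _ in range(100)]

    # Form A: rating of "individual attention received" (falls as class size grows).
    form_a = [5.0 - 0.020 * n + random.gauss(0, 0.1) for n in class_sizes]
    # Form B: rating of "audibility at the back of the room" (also falls with size).
    form_b = [4.8 - 0.018 * n + random.gauss(0, 0.1) for n in class_sizes]

    # Prints a value close to 1.0, although the two forms were built to
    # measure different properties of the same classes.
    print(round(pearson_r(form_a, form_b), 3))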
10. (i) Comparative ratings of teachers can't support the claim that the worst are bad or the best good, and without such conclusions few personnel actions can be supported. (ii) Friends with similar interests may be committed to other programs because of different career choices, hence no recommendation would be appropriate (there are other weaknesses in this question). (iii) Course preferences are not usually relevant when evaluating teachers.

11. More than that will stand up in legal hearings at the moment, because the law has hardly started on questions of validity in personnel actions, only questions of due process. But the barrier is eroding, as Judge Rebell points out in his contribution to the Millman and Darling-Hammond volume, New Handbook of Teacher Evaluation (Sage, 1990). A 'serious' hearing is one in which the state of the art in evaluation is combined with serious ethical analysis; the law is often well behind the leading edge of these issues, but we would presumably want our fellow-teachers to receive the benefit of the best investigation we can mount, not just one that meets the minimum standards the law requires.

12. For example, where the teacher is trying to match a certain, perfectly acceptable but not obligatory, model of teaching such as the Socratic or non-directive (questions-only) model.

13. Apart from validity, long forms massively increase processing costs and raise serious problems about dilution of impact.

14. Common examples of questions about optional courses to which students want to know the answer include: (i) how heavy the work load is, compared to other courses; (ii) whether grading is easy; (iii) whether it is really necessary to buy all the 'required' texts; (iv) what style of teaching is employed (discussion vs. lecturing, for example); (v) whether the course is 'relevant' or 'too academic'. In the limit, failure to attend to these concerns can lead student government to set up a duplicate system, with consequent waste of resources, especially classroom time, and can lead to refusals to fill in the official form, or fatigue effects from doing both.

15. The usual ones concern: (i) expected grade in this course; (ii) overall grade in school to date; (iii) whether the student has been required to take the course. These are inappropriate, partly because the evidence is that the answers are not reliable (determined at the college level by comparing the registrar's figures with the class reports, and also by asking students to say, anonymously, whether they lie on such questions), but mainly because they encourage faculty in the entirely improper response of disregarding the complaints of the weaker students.

16. The obvious example is requesting their names. One still hears faculty arguing that if students haven't the courage to sign their evaluations, the evaluations should not be taken seriously. This is reminiscent of dictators who say their door is always open for dissidents who wish to complain. But it is also extremely unprofessional unless one has very well-thought-out procedures to ensure that one's grading or letters of recommendation will not be influenced by complaints or praise, and has proven to the satisfaction of the students that those procedures will be enforced. Applications for the prize for the first person to meet those conditions are welcomed.

17. It's not enough just to say this. To mean it entails that the following kinds of evidence or procedures be guaranteed: (i) there has to be space on the form for suggestions about how to improve the forms and the evaluation process; (ii) student government must have some input into procedures and content; (iii) the results are at least sometimes used to improve the course whose students are asked to fill in the forms, e.g. by running a mid-term version for improvement as well as the end-of-term one for the record; (iv) the students are informed about how their ratings are weighted in the faculty evaluation process, and student government is assisted in verifying any such claims. Absent a policy which addresses these considerations, one must face the problem that students have good reasons for running their own ratings system. This involves a considerable duplication of production resources and of class time, and a reduction of student interest. It may involve open hostility.

18. There is no great reason to object to teachers distributing the forms and appointing a student to take them in to the department head or, better, the central office, as long as students are informed in some other way about the importance of the process. But having an administrative assistant or secretary, or the staff of a faculty development center, do the whole business is in general preferable. If teachers are to do it, students must be independently counseled and asked about possible abuses of the system, perhaps by means of some remarks on the form itself, and provided with space on the form to register a belief that there has been an attempt to use inappropriate influence; sob stories about family circumstances are one of the problems. It is usually not necessary to ask staff to absent themselves from the room while the forms are being filled in, but it is one way to keep students informed about changes in the process, and to provide them with a chance for asking questions without risk.

19. There must of course be a warning to faculty that this is unprofessional behavior which will be treated very seriously in personnel evaluation.

20. This was a serious problem in Berkeley in the late sixties, when some of the radical left got control of the student evaluation process. There are various ways to detect and control such conspiracies, but the mere possibility of them is a strong reason for allowing appeals against student ratings.

21. In a sophisticated system, there could be reasons for unannounced visits, but in general it is better to encourage the presence of those who wish to put in rating forms.

22. This requires that the evaluators: (i) know the current enrollment figure; (ii) ensure that those present do fill out the form, if they have the authority to do this; (iii) do something about absentee ballots. My inclination is not to accept less than 95% return rates, and I have found this to be achievable; certainly anything below 80% is very hard on validity.
23. Too early rules out reactions to grading procedures and feedback; too late loses input from those who drop the course because they think it's bad. I prefer the following arrangement. Distribute forms on the first day, with envelopes, and request that those dropping the course fill them in before or after doing so, using campus mail to return them. Request returns after the mid-term tests have been handed back, and use them to improve the course. Arrange for the final to be given on the last day of class. Require attendance at the time scheduled for the final in the exam period, as a condition of getting a grade. At that session, return the marked exams for immediate study, with comments, or at least a demonstration or examples of answers that would have received an A. Allow questions and then protests about the grades. Collect the exams, for the archives, and distribute the rating forms. Collect the forms, and check off that every student has turned one in, as well as returned an exam. This procedure has a salutary effect on any headroom problems; but it is also a good way to avoid wasting the final exam as a learning experience.

24. Bi-modal distributions tell a very different story from bell-shaped curves with the same mean, so at least the standard deviation is required; but in fact a graph is considerably more informative, especially but not only for those lacking statistical sophistication.
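[Again a minimal illustrative sketch, not part of Scriven's text, of the point in note 24, using invented ratings: a bimodal class (split between 1s and 5s) and a lukewarm class (clustered around 3) have exactly the same mean, so the mean alone hides the difference; the standard deviation separates them, and even a crude text histogram shows more still.]

    # Illustrative sketch only: the rating data are invented.
    import statistics
    from collections import Counter

    bimodal = [1] * 20 + [5] * 20            # half the class loves it, half hates it
    unimodal = [2] * 5 + [3] * 30 + [4] * 5  # most of the class is lukewarm

    for name, data in [("bimodal", bimodal), ("unimodal", unimodal)]:
        counts = Counter(data)
        # Same mean (3.00) for both, but very different standard deviations.
        print(f"{name}: mean={statistics.mean(data):.2f}  sd={statistics.pstdev(data):.2f}")
        # A rough text histogram: the 'graph' that is more informative still.
        for rating in range(1, 6):
            print(f"  {rating}: {'#' * counts.get(rating, 0)}")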
25. Most of the other alleged differences between difficult technical courses and the humanities courses, and between required and optional courses, turn out to be either non-existent or less than a serious threat in a well-run system.

26. An early version of this was in "Duties-Based Teacher Evaluation" in Vol. 1, No. 3 of the Journal of Personnel Evaluation in Education, 1987. This list readily generates student rating forms to match the parts of it that students can rate, and converts easily to cover post-secondary duties. A considerably revised later version is in another chapter of this book.

27. It might be mentioned that one rarely sees post-secondary institutions using the hard-won discipline of program evaluation to evaluate their programs. Even the use of alumni to help with the needs assessment is extremely rare. This is a replay of colleges not using what was long known about how to evaluate teaching (and quite like their limited application of what is known about how to evaluate student performance). The track record is quite different on admissions; a cynic would presumably say the explanation is that the latter exercise does not require the institutions to look critically at themselves.

28. I do not here discuss most of the very serious technical objections to the use of gain scores that have been raised by measurement specialists (including, for example, the problem of regression to the mean) and special problems of reliability and validity that arise with any tests. The problems discussed here are conceptual problems that apply even if the technical problems can be dismissed, as is possible in cases of massive or near-zero gains.

29. The 'Harvard fallacy' is the fallacy of supposing that the teaching at Harvard must be good because its graduates do very well in later life, in proportion to their numbers. All that one can infer from that data is that Harvard does not usually inflict permanent brain damage. The rest of the trick lies in selecting a talented entering class and not getting in the way of their use of the library, labs, peer tutoring, and family influence, to which brand-name reverence adds a good deal. The contribution from the faculty, if any, is the residue after factoring out the non-faculty influences on the academic side, and the effect of the 'old boys network' and brand recognition on job selection and promotion. While Harvard is demonstrably a great university, it is certainly not demonstrably a great teaching university, just a well-equipped one.

30. Self-evaluation is, of course, not evaluation without input from others, but evaluation that is self-initiated and directed; the use of anonymous student ratings is a sine qua non of any self-evaluation by teachers. Naturally, any systematic process for the evaluation of faculty should reward serious self-evaluation and systematic self-development based on it. In teaching as in any profession, the combination of these two practices is the hallmark of professionalism, the minimum standard for social tolerance of the practitioner.

31. That is, all the teachers being compared may be weak, or the worst of them may be very good and the others better still, etc. Student ratings, especially in upper secondary and post-secondary contexts, are based on a much wider range of comparisons than this, which provides a better approximation to criterion-referencing.

32. At earlier grade levels, student evaluations need to be supported by a substantial effort at prior training for the students (something of considerable educational merit in its own right), and would not be sound well down into lower primary. But we need more experimentation in that area, as well as legitimation by leadership use.

33. There are also conceptions of teaching which make them less important, even in extreme cases completely inappropriate, for example the conception of teaching as creating a climate for learning rather than as transmitting it.

34. In a common college situation, there is one visit selected from 30-120 sessions; at the school level, it is one or two, or rarely as many as five, visits, usually for less than a complete period, out of 500-1000 class meetings. Given the way in which individual class meetings vary, as every instructor knows, both for idiosyncratic reasons and as the term goes on, as the first or the final test looms, or as the topics vary in interest, or as visitors are present, this cannot be thought of as an adequate sample. One should also take into account the way in which visitors' ratings can change as they come to 'see through' the teacher's style, a process which may continue over a large number of visits. (Studies at the University of California at Davis make clear that this effect can be very substantial, and I know of none that found it to be small; Wilson et al., College Professors and Their Impact on Students, Wiley, 1975.)

35. There is some reduction of the impact of these criticisms if we videotape all sessions of a course and select a substantial random sample of these to evaluate. But the cost and connotations of this approach are worrying, and we lack experience with it.

36. There are some extreme cases where the visitor can make a reliable judgement of pedagogical skill. The validity of these judgements is skewed along the merit axis; it is easier to identify deep trouble than great merit. But of course the students can make the same judgement with the same or greater validity, so the visit is unnecessary.
37. The teaching materials and the test or project work done by the students will better serve that purpose.

38. Students are similarly in a uniquely strong position to rate the presence of immediately identifiable benefits from the material and skills acquired from a teacher, but this is arguably not crucial in evaluating teaching merit. (Technically, it's relevant to rating worth, not merit; one can't blame the Latin teacher because the subject isn't seen as immediately beneficial.) However, this kind of rating can be useful for formative evaluation, telling you whether you should be spending more time persuading the class of the importance of the subject to them, if you believe that it is. It is also significant for many discussions of a department's curriculum. Hence it should be considered for inclusion on a 'general purpose' form. This is a different matter from rating the eventual or long-term value of the course to the student's needs (e.g. professional needs), about which the students are usually not in a good position to pass judgement.

39. Particularly in light of this point, it seems sensible to use student rating forms in a two-stage procedure. In the first stage, a good summative-valid form is used, administered in a summative-valid way (security procedures, etc.). Only if someone does so badly on that stage as to jeopardize their job, or offend their own sense of satisfaction with the quality of their work, should they then move to the use of a second form. That second form can simply call for a more detailed analysis of the duties (expanding on the type of questions mentioned in D). But it could also, if the teacher wishes, ask the student to answer questions about style (as in E), so that the information provided by the style literature as to models that have worked well for many teachers can be applied.

40. The usual distinction here is between merit and worth (or suitability). Both are legitimate in the evaluation of faculty, within limits. It is worth and not just merit that leads to a position being advertised in the first place, and to the verdict of redundancy. Worth can sometimes be used, properly, to justify gender preference; and it is often used, improperly, to rationalize political discrimination. A longer discussion of it is provided in a paper on teacher selection in the New Handbook of Teacher Evaluation: Assessing Elementary and Secondary School Teachers, J. Millman and L. Darling-Hammond, eds. (Sage, 1990).

41. For example, in "Validity in Personnel Evaluation" in the Journal of Personnel Evaluation in Education, vol. 1, no. 1, 1987; and in earlier chapters in this book.

42. This viewpoint was expressed in the late sixties, by the radical left, in the bitter phrase 'the student as nigger'.

43. Provided, of course, that the results of the student evaluations do carry weight in the decisions made. A monitor from student government on the committee is desirable here, and appropriate controls of anonymity are possible.
44. A prompt is simply a hint as to something that the respondent may wish to underscore as significant, or comment on, or simply take into account when selecting an answer to the One Big Question, but which does not require a response. We get more than 50 prompts in small print on our one-page, four-question rating form; but the average time to complete is still around 3 minutes, compared to 10-15 for a 50-question form. Other advantages are: (i) coding for summative evaluation is simpler (and it's no more difficult for formative evaluation); (ii) the integration of multiple considerations is done by the respondent, not by the evaluator, who lacks good reasons for any particular relative weighting; (iii) it uses relatively little paper, time, and computer processing. Readers are welcome to a copy of the form we use upon request.

45. Which are based on the idea that if someone is not doing well at teaching, they must be 'going about it the wrong way', i.e., they need to have their teaching style improved. The correct approach would be to see, first, if their discharge of teaching obligations needed improvement. There are many such obligations that need improvement in most teachers, and they can easily be improved, as one can immediately show. You can't prove that the usual recommendations for improving style actually improve teaching in a given individual (without massive testing with follow-ups). A common sign of this is that we usually can't get good agreement between two independent observers as to the best style; but even if we could, the size of the benefits is not demonstrable.

46. Bad temper is an obvious example, but excessive repetition, reading from texts, and the failure to ask questions except in the presence of visitors are others.

47. Certainly, since the visitor cannot avoid seeing the style features, and hence cannot guarantee not being influenced by them, visitors cannot be used for input on personnel decisions.

48. As well as possible benefits for campus morale, as already mentioned.

49. The usual problem of initial faculty opposition to weighting student ratings sometimes turns into the opposite one; after a few years' use, there is a tendency towards overweighting student evaluations. A major source of benefit comes from improved faculty self-evaluation, resulting from the need to face up to and discuss the student ratings of their work, which in other institutions will only occur for those who actively and independently undertake to get their performance rated by students.

50. Some suggestions about such a system are provided in "Summative Teacher Evaluation" in Handbook of Teacher Evaluation, ed. J. Millman (Sage, 1981).

---------------------------------------->> ....End Scriven excerpt ...

....John C. Damron, PhD
....Douglas College, DLC
....P.O. Box 2503
....New Westminster, British Columbia
....Canada V3L 5B2
....FAX: (604) 527-5969
....e-mail: [EMAIL PROTECTED]

Re: Student Ratings......

"It cannot be emphasized strongly enough that the evaluation questionnaires of the type that we are discussing here measure only the attitudes of the students towards the class and the instructor. They do not measure the amount of learning which has taken place...." -- From: CAUT (Canadian Association of University Teachers).

---------------------------------------------------------------->>
http://www.douglas.bc.ca/
http://www.douglas.bc.ca/psychd/index.html

Student Ratings Critique:
http://www.mankato.msus.edu/dept/psych/Damron_politics.html