Re: [ccp4bb] very informative - Trends in Data Fabrication

James Holton Sun, 08 Apr 2012 08:49:06 -0700

On 4/2/2012 6:03 AM, herman.schreu...@sanofi.com wrote:

If James Holton had been involved, the fabrication would not have been
discovered.
Herman


Uhh.  Thanks.  I think?

Apologies for remaining uncharacteristically quiet. I have been keeping up with the discussion, but not sure how much difference one more "vote" would make on the various issues. Especially since most of this has come up before. I agree that fraud is sick and wrong. I think backing up your data is a good idea, etc. etc. However, I seem to have been declared a leading "expert" on fake data, so I suppose I ought to say something about that. Not quite sure I want to volunteer to be the Defense Against The Dark Arts Teacher (they always seem to end badly). But, here goes:

I think the core of the "fraud problem" lies in our need for models, and I mean "models" in the general scientific sense, not just PDB files. Fundamental to the practice of science is coming up with a "model" that explains the observations you made, preferably to within experimental error. One is also generally expected to estimate what the experimental error was. That is, if you plot a bunch of points on a graph, you need to fit some sort of curve to them, and that curve had better fit to "within the error bars", or you have some explaining to do. Protein structures are really nothing more than a ~50,000 parameter curve fit to ~50,000 data points. So, given that the technology for constructing "models" is widely available (be it gnuplot or refmac), as is the technology for estimating errors and generating random numbers, all the hard work a would-be fraud needs to make a plausible forgery has already been done. This is not something unique to crystallography! It is a general property of any mature science.
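
To make the "curve fit to within the error bars" point concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from refmac, gnuplot or MLFSOM: the straight-line model, the function names (simulate, fit_line, reduced_chi_square) and the numbers are all made up for the example. It generates simulated points with a known Gaussian error, fits a line by ordinary least squares, and reports the reduced chi-square, which should come out near 1 when the model explains the data to within the stated error.

```python
import random


def simulate(xs, slope, intercept, sigma, seed=0):
    """Simulated 'observations': a straight line plus Gaussian noise of width sigma."""
    rng = random.Random(seed)
    return [slope * x + intercept + rng.gauss(0.0, sigma) for x in xs]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def reduced_chi_square(xs, ys, slope, intercept, sigma, n_params=2):
    """Sum of squared residuals in units of sigma, per degree of freedom."""
    chi2 = sum(((y - (slope * x + intercept)) / sigma) ** 2
               for x, y in zip(xs, ys))
    return chi2 / (len(xs) - n_params)


# Hypothetical numbers chosen only for illustration.
xs = [float(i) for i in range(200)]
ys = simulate(xs, slope=2.0, intercept=1.0, sigma=0.5)
slope, intercept = fit_line(xs, ys)
chi2_red = reduced_chi_square(xs, ys, slope, intercept, sigma=0.5)
print(f"slope={slope:.4f} intercept={intercept:.3f} reduced chi2={chi2_red:.2f}")
```

Note that the same few lines serve as both the "experiment" and the check on it, which is the point: nothing in the fit itself can tell simulated points from measured ones.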

Indeed, "fake data" is not only a common tool in science but an inextricable part of it. Simulated diffraction images appear in the literature at least as early as Arndt and Wonacott (1976), and I'm sure even Moseley and Darwin (1913) made some "fake data" when trying to figure out all the sources of systematic error they were dealing with when measuring reflected x-ray beams. At its heart, fake data is a "control". Remember "controls" from science class? They come in two flavors: positive and negative, and you are supposed to have both. In fact, all a fraud really is is someone who in some way, shape or form takes a positive control and calls it their "experiment". Pasting gel lanes together is an example of this. I think this is why fraud is so hard to prevent in science. You can't do science without controls, but anyone who has "access to the technology" for doing a control can also use it for evil. The labels are everything.

Personally, I classify fraud as an "intentionally incorrect" result. This separates it from "unintentionally incorrect" results (mistakes), which are far more common. Validation is meant to catch the "incorrect" part, but can never be expected to establish intent! In fact, I expect a mildly clever fraud might actually plan to hide behind the "we made a mistake in the deposition/figure/paper but now can't find the original data" defense. The case at hand (Zaborsky et al. 2010) may be a very good example of this. A new validation procedure (Rupp 2012) drew attention to the fabricated 3k78 structure as well as real structures where Fcalc was accidentally deposited instead of Fobs (there are a number of these). Rupp's follow-up on 3k78 found troubling irregularities, but could it still be a mistake? If there is a combination of buttons in some GUI somewhere that "lets you" do this then I imagine at least one idiot may have "discovered" it. Perhaps even pleased with themselves for finding a "new way" to get their R factor down. The best evidence that Fobs simply does not exist for 3k78 was in the response (Zaborsky et al. 2012).

The same validation procedure also drew attention to other cases. Two of them, 1n0r and 1n0q (Mosavi et al. 2002), were from my beamline (ALS 8.3.1), so finding the original images was simply a matter of flipping through the books of old DVDs I have in my office. They cost us $0.25 each in 2002. Yes, I do back up every image, primarily because figuring out which ones were "worth backing up" was actually a more expensive proposition. Even in adjusted dollars, I think the cost of the whole archive is still cheaper than what it would have cost Dan to re-grow his crystals and collect the data again in 2012. It is also nice to be able to say that the data for 1n0r were collected on Jan 30 2002 from 9:47 pm to 11:48 pm and 1n0q was collected on Mar 15 2002 from 12:52 pm until 3:48 pm. I was there! I saw the whole thing! Yes, I know, since I am "the guy who can fake images" I am not the best "witness" (the Defense Against the Dark Arts Teacher never is), but for whatever it is worth I DO recommend keeping your old images around. You never know when a forgotten slip of the mouse when using AutoDep ten years ago will come back to haunt you.

I think it very important to point out here that validation and peer review are not arbitrary gauntlets set up to prevent the unworthy from achieving the nirvana of "publication". They are services meant to help keep you from embarrassing yourself afterward. In the end, the responsibility for the veracity and validity of your paper lies with you, the author. Not the journal, not the reviewers, and definitely not the PDB. They are a repository, not a police force. Annotators will strongly encourage you to deal with validation issues, but they will, in the end, deposit whatever you give them. What they won't do is let you take it back! So before you make 10,000 copies of your paper and deposit your coordinates into the irrevocable memory of the PDB, it is a good idea to seek out the harshest critic you can find and listen to what they have to say. You don't have to DO everything they say, but listening is a good idea. Even a hard-working and diligent scientist who eats all his vegetables can still do something dumb, like put the protein and water on different origins just before deposition. Not that I would know anything about that (1rb1).

I also think it important to point out that it is not possible to build some kind of automated "fraud catcher", nor would it be advisable. It would only lull us into a false sense of security. Even branches of science that don't do a lot of curve-fitting (such as archaeology) still have "models" inasmuch as people have a picture in their heads of how they think all their data "should" fit together. All a fraud need do is create some artwork (be it a stone tool or a diffraction image) that is consistent with that picture, and no alarm bells will be raised. Perhaps not for years. Long enough to get a job anyway. And therein lies the incentive. Watching "The Apprentice" one might think that firing someone is easy, but it's not. Anyone who has been in a management role long enough will tell you that giving someone a job is a lot easier than taking it away. Add to that the fact that the institution that hired the fraud is embarrassed about being so easily fooled, as is the institution that "trained" him/her. I imagine the funding agency that paid for the whole thing has some interesting PR to do as well. The sad truth of any fraud case is there are a lot of people who have a strong incentive to keep it as "quiet" as possible. Most of these people are not scientists. On the other hand, the damage done by the fraud is diluted over a very large number of people, most of whom are far away. They will blog on the internet about it, but few will take any real action. Was there ever an angry mob outside Hendrik Schön's house? Does anyone even know where he is now?

Now, before all you Tom Riddles out there start downloading my software, ordering a copy of "The Prince" on Amazon and picking a "structure" that will land you your Dream Job, let me tell you why this will not work. Are there secret catches in MLFSOM identifying the images it produces as "fake"? ... Maybe. But far far more important than any of that is the step that comes after fitting a curve that explains your "data" to within experimental error: making a prediction. Do you really think you are that smart? It is one thing to build a model that is consistent with all the biochemistry, mutagenesis, and homologous structures of a particular molecule, but can you predict all the future results other people will get? All of them? There is a reason why real scientists collect data. As one great man said: "... even the very wise cannot see all ends".

The problem with fraud as a career option is that you must either produce a "result" so insignificant and boring that nobody will ever check it or try to build upon it, or you must be very very lucky and actually fake something that turns out to be true. I suppose the latter vanity is the reasoning behind some of the more infamous frauds. In fact, I'm sure your average con artist might consider themselves very clever indeed to be able to fool all those smart scientist people. Such is the price we pay for the unparalleled level of trust that the members of the worldwide scientific community have for one another. I mean, really, is there another group of people who so readily take the "word" of someone they have never met that they actually did do an experiment and are not just making stuff up? In a way, it is amazing we don't have more fraud in science. Why is that? Part of it is because fraud really does end your career. I'm sure HMK Murthy has a job now somewhere, but I doubt it has anything to do with science. Unless he changed his name. But most of all I think it is because our faith in the connection between truth and observation is not misplaced. Eventually, all scientific frauds will either be exposed or prove simply inconsequential.

I think the biggest problem with fraud is not that having wrong results in the literature could lead us down the wrong path. There is no shortage of unintentionally incorrect crap out there already. I think the biggest problem is the breakdown of trust, which makes us behave in "unprofessional" ways. The combination of an ill-defined and virtually undetectable menace (intent) and a public outcry to "do something" is always a recipe for disaster. We do NOT want the "best strategy" for dealing with a mistake to be trying to protect yourself. I suppose as social animals we like to think we can trust and be trusted, but I think as a scientist one must always maintain a healthy and professional skepticism about any source of information. After all, the people who wrote the paper you are reading don't trust you that much either (otherwise they would have their images available on the web), and the molecules and equipment you work with definitely don't trust you. Not even a little bit.


-James Holton
MAD Scientist
