Since I was the person who started "a public outcry to 'do something'", I shall 
explain myself to my critics. Like all of you, I do not care much about 
those few instances of structure fabrication. I may have put too much emphasis on 
them to initiate the discussion, but they are, indeed, only tiny blips in the 
ocean of science. But could they be the tips of a huge iceberg? That was my 
concern. I believe that the enormous competition in science that we experience 
nowadays makes many of us desperate, and desperation forces people to cheat. 
Is the current validation system at the PDB good enough to catch the various forms of 
data cheating? Is there a simple but efficient way to make cheating more difficult 
and, hence, less desirable? 

Good athletes (in terms of sporting ability) sometimes get caught taking 
performance enhancers. I bet everyone would do it if drug testing did not 
exist. Many athletes would do it against their will, simply because there would be no 
other way to win. Don't you think a similar situation could develop in science? 

> I suppose as social animals we like to think we can trust and be trusted
Well, I suppose that these two antagonistic abilities of social animals (trust 
and cheating) developed in parallel as means of promoting evolution. In a 
very hierarchical society with no legitimate way to change one's social status, 
cheating has been an important tool for passing on one's genes. 
Socially unjust societies still exist, and their members may have a slightly 
different view of the morality of cheating than those from just ones. 
Moreover, the ability to cheat often correlates with intellect. Couldn't it be 
called cheating when someone is told to do something one way but does it 
his own way because he believes it is more efficient? When a scientist 
feels he is right about the validity of his results, but they do not look good 
enough to be "sold" to validators, he is supposed to do more research. But if he 
is out of time, why not hide the weak spots of the work when he knows that the 
major conclusions are RIGHT? Even if someone redoes the work later, those conclusions 
will be reproduced, right? In my opinion, this is the major motive for cheating 
in science.

What I suggested with respect to PDB data validation was depositing some 
additional information that would allow independent validation of parameters 
such as resolution and data quality (catching model fabrication 
would be a byproduct of this process). Does the current system allow those 
parameters to be overestimated? I believe so (but I might be wrong; correct 
me!). Periodically, people ask on ccp4bb how to determine the resolution of 
their data, but some "idiot" may decide to do it on his own and add 30% 
noise to his structure factors. As James mentioned, one does not need to be 
extremely smart to do so; moreover, such an "idiot" would have fewer restraints 
than an educated crystallographer, because the "idiot" believes that nobody 
would notice his cheating. His moral principles are not corrupted, because he 
thinks the model is correct and no harm is done. But the harm is still 
there, because people are led to trust the model more than it deserves.  
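
To illustrate the kind of opinion-independent check I have in mind, here is a 
toy sketch in Python/NumPy. Everything in it is invented (a Wilson-like 
intensity falloff, made-up noise levels, arbitrary resolution shells); it is 
not a real crystallographic calculation, only a cartoon of how a shell-wise 
correlation between two half-datasets collapses in shells that contain no 
signal, no matter what resolution limit the depositor claims:

import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": 20000 reflections with a Wilson-like intensity falloff
# and two independently measured half-datasets with resolution-dependent noise.
# All numbers here are invented for illustration only.
n = 20000
d = rng.uniform(1.8, 20.0, n)                        # resolution (Angstrom) of each reflection
true_I = np.exp(-15.0 / d**2) * rng.chisquare(2, n)  # signal dies off at high resolution
sigma = 0.05 + 0.3 * np.exp(-d / 3.0)                # noise grows toward high resolution

half1 = true_I + rng.normal(0.0, sigma)
half2 = true_I + rng.normal(0.0, sigma)

# Shell-wise correlation between the two half-datasets (a CC1/2-style statistic).
# Shells where it collapses toward zero contain essentially no information,
# regardless of the nominal resolution cutoff claimed on deposition.
shells = [20.0, 4.0, 3.0, 2.5, 2.2, 2.0, 1.8]
for lo, hi in zip(shells[:-1], shells[1:]):
    sel = (d <= lo) & (d > hi)
    cc = np.corrcoef(half1[sel], half2[sel])[0, 1]
    print("%5.1f - %4.1f A  CC = %5.2f  (%d reflections)" % (lo, hi, cc, sel.sum()))

Of course, a determined cheater could fabricate the unmerged half-datasets 
too, but every extra piece of raw information that has to be forged 
consistently makes cheating harder and, I hope, less desirable.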

The question remains open to me: what percentage of PDB structures 
overestimate their data quality in terms of resolution? Is it possible to make 
this assessment less dependent on the opinion of the person submitting the data? 
We all have such different opinions about everything...  

People invented laws to create conditions in which they can trust each other. 
Sociopaths who do not follow the rules get caught and excluded from society, 
which maintains the trust. But when trust is abused, it quickly disappears. 
Many of those who wrote on this matter expressed a strong opinion that the 
system is not broken and that we should continue trusting each other. Great! I do 
not mind the status quo. 

Regards,
Alex Aleshin

On Apr 8, 2012, at 8:48 AM, James Holton wrote:

> On 4/2/2012 6:03 AM, herman.schreu...@sanofi.com wrote:
>> If James Holton had been involved, the fabrication would not have been
>> discovered.
>> Herman
> 
> Uhh.  Thanks.  I think?
> 
> Apologies for remaining uncharacteristically quiet.  I have been keeping
> up with the discussion, but not sure how much difference one more "vote"
> would make on the various issues.  Especially since most of this has
> come up before.  I agree that fraud is sick and wrong.  I think backing
> up your data is a good idea, etc. etc.  However, I seem to have been
> declared a leading "expert" on fake data, so I suppose I ought to say
> something about that.  Not quite sure I want to volunteer to be the
> Defense Against The Dark Arts Teacher (they always seem to end badly).
> But, here goes:
> 
> I think the core of the "fraud problem" lies in our need for models, and
> I mean "models" in the general scientific sense not just PDB files.
> Fundamental to the practice of science is coming up with a "model" that
> explains the observations you made, preferably to within experimental
> error.  One is also generally expected to estimate what the experimental
> error was.  That is, if you plot a bunch of points on a graph, you need
> to fit some sort of curve to them, and that curve had better fit to
> "within the error bars", or you have some explaining to do.  Protein
> structures are really nothing more than a ~50,000 parameter curve fit to
> ~50,000 data points.  So, given that the technology for constructing
> "models" is widely available (be it gnuplot or refmac), as is the
> technology for estimating errors and generating random numbers, all the
> hard work a would-be fraud needs to make a plausible forgery has already
> been done.  This is not something unique to crystallography!  It is a
> general property of any mature science.
> 
> Indeed, "fake data", is not only a common tool in science but an
> inextricable part of it.  Simulated diffraction images appear in the
> literature at least as early as Arndt and Wonacott (1976), and I'm sure
> even Moseley and Darwin (1913) made some "fake data" when trying to
> figure out all the sources of systematic error they were dealing with
> measuring reflected x-ray beams.  At its heart, fake data is a
> "control".  Remember "controls" from science class?  They come in two
> flavors: positive and negative, and you are supposed to have both.  In
> fact, all a fraud really is is someone who in some way, shape or form
> takes a positive control and calls it their "experiment".  Pasting gel
> lanes together is an example of this.  I think this is why fraud is so
> hard to prevent in science.  You can't do science without controls, but
> anyone who has "access to the technology" for doing a control can also
> use it for evil.  The labels are everything.
> 
>   Personally, I classify fraud as an "intentionally incorrect" result.
> This separates it from "unintentionally incorrect" results (mistakes),
> which are far more common.  Validation is meant to catch the "incorrect"
> part, but can never be expected to establish intent!  In fact, I expect
> a mildly clever fraud might actually plan to hide behind the "we made a
> mistake in the deposition/figure/paper but now can't find the original
> data" defense.  The case at hand (Zaborsky et al. 2010) may be a very
> good example of this.  A new validation procedure (Rupp 2012) drew
> attention to the fabricated 3k78 structure as well as real structures
> where Fcalc was accidentally deposited instead of Fobs (there are a number
> of these).  Rupp's follow-up on 3k78 found troubling irregularities, but
> could it still be a mistake?  If there is a combination of buttons in
> some GUI somewhere that "lets you" do this then I imagine at least one
> idiot may have "discovered" it.  Perhaps even pleased with themselves
> for finding a "new way" to get their R factor down. The best evidence
> that Fobs simply does not exist for 3k78 was in the response (Zaborsky
> et al. 2012).
> 
> The same validation procedure also drew attention to other cases.  Two
> of them, 1n0r and 1n0q (Mosavi et al. 2002), were from my beamline (ALS
> 8.3.1), so finding the original images was simply a matter of flipping
> through the books of old DVDs I have in my office.  They cost us $0.25
> each in 2002.  Yes, I do back up every image, primarily because figuring
> out which ones were "worth backing up" was actually a more expensive
> proposition.  Even in adjusted dollars, I think the cost of the whole
> archive is still cheaper than what it would have cost Dan to re-grow his
> crystals and collect the data again in 2012.  It is also nice to be able
> to say that the data for 1n0r were collected on Jan 30 2002 from 9:47 pm
> to 11:48 pm and 1n0q was collected on Mar 15 2002 from 12:52 pm until
> 3:48 pm.  I was there!  I saw the whole thing!  Yes, I know, since I am
> "the guy who can fake images" I am not the best "witness" (the Defense
> Against the Dark Arts Teacher never is), but for whatever it is worth I
> DO recommend keeping your old images around.  You never know when a
> forgotten slip of the mouse when using AutoDep ten years ago will come
> back to haunt you.
> 
>     I think it very important to point out here that validation and
> peer review are not arbitrary gauntlets set up to prevent the unworthy
> from achieving the nirvana of "publication".  What they are are services
> meant to help keep you from embarrassing yourself afterward.  In the
> end, the responsibility for the veracity and validity of your paper lies
> with you, the author.  Not the journal, not the reviewers, and
> definitely not the PDB.  They are a repository, not a police force.
> Annotators will strongly encourage you to deal with validation issues,
> but they will, in the end, deposit whatever you give them.  What they
> won't do is let you take it back!  So before you make 10,000 copies of
> your paper and deposit your coordinates into the irrevocable memory of
> the PDB, it is a good idea to seek out the harshest critic you can find
> and listen to what they have to say.  You don't have to DO everything
> they say, but listening is a good idea.  Even a hard-working and
> diligent scientist who eats all his vegetables can still do something
> dumb, like put the protein and water on different origins just before
> deposition.  Not that I would know anything about that (1rb1).
> 
>    I also think it important to point out that it is not possible to
> build some kind of automated "fraud catcher", nor would it be
> advisable.  It would only lull us into a false sense of security.  Even
> branches of science that don't do a lot of curve-fitting (such as
> archaeology) still have "models" inasmuch as people have a picture in
> their heads of how they think all their data "should" fit together.  All
> a fraud need do is create some artwork (be it a stone tool or a
> diffraction image) that is consistent with that picture, and no alarm
> bells will be raised.  Perhaps not for years.  Long enough to get a job
> anyway.  And therein lies the incentive.  Watching "The Apprentice" one
> might think that firing someone is easy, but it's not.  Anyone who has
> been in a management role long enough will tell you that giving someone
> a job is a lot easier than taking it away.  Add to that the fact that
> the institution who hired the fraud is embarrassed about being so easily
> fooled, as is the institution that "trained" him/her.  I imagine the
> funding agency who paid for the whole thing has some interesting PR to
> do as well.  The sad truth of any fraud case is there are a lot of
> people who have a strong incentive to keep it as "quiet" as possible.
> Most of these people are not scientists.  On the other hand, the damage
> done by the fraud is diluted over a very large number of people, most of
> whom are far away.  They will blog on the internet about it, but few
> will take any real action.  Was there ever an angry mob outside Hendrik
> Schön's house?  Does anyone even know where he is now?
> 
> Now, before all you Tom Riddles out there start downloading my software,
> ordering a copy of "The Prince" on Amazon and picking a "structure" that
> will land you your Dream Job, let me tell you why this will not work.
> Are there secret catches in MLFSOM identifying the images it produces as
> "fake"? ... Maybe.  But far far more important than any of that is the
> step that comes after fitting a curve that explains your "data" to
> within experimental error: making a prediction.  Do you really think you
> are that smart?  It is one thing to build a model that is consistent
> with all the biochemistry, mutagenesis, and homologous structures of a
> particular molecule, but can you predict all the future results other
> people will get?  All of them?  There is a reason why real scientists
> collect data.  As one great man said: "... even the very wise cannot see
> all ends".
> 
>   The problem with fraud as a career option is that you must either
> produce a "result" so insignificant and boring that nobody will ever
> check it or try to build upon it, or you must be very very lucky and
> actually fake something that turns out to be true.  I suppose the latter
> vanity is the reasoning behind some of the more infamous frauds.  In
> fact, I'm sure your average con artist might consider themselves very
> clever indeed to be able to fool all those smart scientist people.  Such
> is the price we pay for the unparalleled level of trust that the
> worldwide scientific community has for one another.  I mean, really, is
> there another group of people who so readily take the "word" of someone
> they have never met that they actually did do an experiment and are not
> just making stuff up?  In a way, it is amazing we don't have more fraud
> in science.  Why is that?  Part of it is because fraud really does end
> your career.  I'm sure HMK Murthy has a job now somewhere, but I doubt
> it has anything to do with science.  Unless he changed his name.  But
> most of all I think it is because our faith in the connection between
> truth and observation is not misplaced.  Eventually, all scientific
> frauds will either be exposed or are simply inconsequential.
> 
>    I think the biggest problem with fraud is not that having wrong
> results in the literature could lead us down the wrong path.  There is
> no shortage of unintentionally incorrect crap out there already.  I
> think the biggest problem is the breakdown of trust, which makes us
> behave in "unprofessional" ways.  The combination of an ill-defined and
> virtually undetectable menace (intent) and a public outcry to "do
> something" is always a recipe for disaster.  We do NOT want the "best
> strategy" for dealing with a mistake to be trying to protect yourself.
> I suppose as social animals we like to think we can trust and be
> trusted, but I think as a scientist one must always maintain a healthy
> and professional skepticism about any source of information.  After all,
> the people who wrote the paper you are reading don't trust you that much
> either (otherwise they would have their images available on the web),
> and the molecules and equipment you work with definitely don't trust
> you.  Not even a little bit.
> 
> -James Holton
> MAD Scientist
