Re: [ccp4bb] The importance of USING our validation tools
ZO has a good point - it is a pain trying to get decent simulated material - maybe there is an employment opportunity here? Eleanor

Zbyszek Otwinowski wrote:
James Holton wrote: How MUCH do you want to bet? ;)
Any amount, as long as we are talking about real diffraction images corresponding to the deposited file with observed structure factors. I doubt that simulated diffraction images will be shown, because they are easily recognized as such. Independently, I value the possibility of data simulation in methods development and for teaching purposes. Zbyszek Otwinowski UT Southwestern Medical Center 5323 Harry Hines Blvd., Dallas, TX 75390-8816 (214) 645 6385 (phone) (214) 645 6353 (fax) [EMAIL PROTECTED]
Re: [ccp4bb] The importance of USING our validation tools
Think this bounced last time I tried to mail it in; a simulator exists at: http://fable.sourceforge.net/index.php/Farfield_Simulation Jon
Re: [ccp4bb] The importance of USING our validation tools
I'm going to agree with Raji's observations, and fan the flames of the point a little. I count myself as lucky that I have had access, during my crystallographic training, to certain people who had a good understanding of the theory behind crystallography (hopefully I have exploited this luck sufficiently). Despite their tutelage, I will hold my hands up and admit that certain technical discussions on the bb occasionally leave me a little confused... However, I have seen what Raji described going on around me, and it is pretty prevalent. Structures are sometimes pushed through without the PhD student really knowing quite what's happened. I cut my teeth on a few structures that didn't have the sort of pressure on them that others had, and this allowed me to get to grips with what was going on. I also had a few real pigs of projects - you tend to learn a lot more when stuff goes wrong. If you bang your data through program X and get textbook maps and stats, you haven't really learnt anything; if you've struggled with molecular replacement, your SeMet won't crystallise and your heavy atoms won't stick, then you tend to learn how to make Phaser run the last half yard, etc. - and that half yard often comes about from thinking about your problems in the right way. Having an old-school crystallographer to bang ideas off can be invaluable at this point. Learning the theory is not always encouraged, and, given that doing it properly takes some time and application, it is often towards the bottom of the priority list. It would take a ballsy student to say to their boss, "No, I can't do experiments X, Y and Z until I have read and understood this paper on maximum likelihood!" However, in a system in which there exists a fair amount of pressure and competition (exacerbated in the US system, I think), the temptation to hand off ALL data to a structure-solver can be great.
However, if this practice continues, as Raji suggests, there will be a lack of properly trained crystallographers - and mistakes will be more likely to occur. The suggestion of explicitly stating in a paper that X crystallised, Y collected data, and Z phased and refined is a good one, and some journals (e.g. Nature) like an author contributions section. However, if a group is willing to 'overlook' problems in their data, as recently seen, maybe they cannot be trusted to make these statements accurately. I think that the only water-tight way of preventing such mistakes again is to have every paper that contains a structure reviewed by at least one properly trained crystallographer, and to have the data (PDB files and structure factors) made available to them. Just my lunchtime ramble... Dave

On 29/08/2007, Raji Edayathumangalam [EMAIL PROTECTED] wrote: I would like to mention some other issues now that Ajees et al. has stirred all sorts of discussions. I hope I haven't opened Pandora's box. From what I have learned around here, very often there seems to be little time allowed or allocated to actually learn - a bit beyond the surface - some of the crystallography, or what the crystallographic software is doing during the structure solution process. A good deal of the postdocs and students here are under incredible pressure to get the structure DONE asap. For some of them, it is their first time solving a crystal structure. Yes, the same heapful of reasons: because it's hot, competitive, grant deadline, PI tenure pressure, etc. Learning takes the backseat, and this is total rubbish and very scary, in my biased personal opinion. Although I think it is the person's responsibility to take the time and initiative to learn, I also see that the pressure is often insurmountable.
Often, the PI and/or assigned structure solver in the lab pretty much takes charge at some early stage of structure determination and solves the structure with much less contribution from the scientist in training (student/postdoc). All that slog to clone, purify, crystallize and optimize diffraction, only to realize someone else will come along, process the data and finish up the structure for you. Such 'training' (or lack thereof) is a recipe for generating 'bad' structures in the future, and part of the reason for this endless thread. I think it is NOT as common for someone else to, say, run all the Western blots for you, maintain your tissue cell lines for you, or do your protein preps for you. Is it because it is much easier to upload someone else's crystallographic data on one's machine and solve the structure (since this does not demand the same kind of physical labor and effort, and is also a lot of fun) that this happens? I understand when the PI or structure solver does the above as part of teamwork and allows the person in question to learn. But often, I see the person is somewhat left overwhelmed and clueless in the end. I bring this issue to the forum since I do not know if this phenomenon is ubiquitous. If this practice is a rampant weed, can we as a
Re: [ccp4bb] The importance of USING our validation tools
Wow! Those are two pretty amazing structures. For those of you who haven't had a look, the ordered molecules are in layers with *huge* gaps in between, much greater than in 2hr0. And yet both of these structures were solved with experimental phasing (SIRAS), unlike 2hr0, and the data extend to higher resolution. Mark J. van Raaij wrote: With regards to our structures 1H6W (1.9A) and 1OCY (1.5A), rather than faith, I think the structure is held together by a real mechanism, which however I can't explain. Like in the structure Axel Brunger mentioned, there is appreciable diffuse scatter, which imo deserves to be analysed by someone expert in the matter (to whom, or anyone else, I would gladly supply the images which I should still have on a tape or CD in the cupboard...). For a low-res version of one image see http://web.usc.es/~vanraaij/diff45kd.png and http://web.usc.es/~vanraaij/diff45kdzoom.png two possibilities I have been thinking about:
Re: [ccp4bb] The importance of USING our validation tools
In general, I think we should be careful about too-strong statements: while structures with high solvent content generally diffract to low resolution, there are a few examples where they diffract to high resolution. Obviously, high solvent content means fewer crystal contacts - but what if those few are very stable? Similarly, there are probably a few structures with a high percentage of Ramachandran outliers which are real, and similarly for all other structural quality indicators. However, combinations of several of these probably do not exist, and in any case every unusual feature like this should be described and an attempt made to explain/analyse it - which in the case of the Nature paper that started this thread was apparently not done, apart from the rebuttal later (and perhaps in unpublished replies to the referees?). With regards to our structures 1H6W (1.9A) and 1OCY (1.5A), rather than faith, I think the structure is held together by a real mechanism, which however I can't explain. Like in the structure Axel Brunger mentioned, there is appreciable diffuse scatter, which imo deserves to be analysed by someone expert in the matter (to whom, or to anyone else, I would gladly supply the images, which I should still have on a tape or CD in the cupboard...). For a low-res version of one image see http://web.usc.es/~vanraaij/diff45kd.png and http://web.usc.es/~vanraaij/diff45kdzoom.png Two possibilities I have been thinking about: 1. Only a few of the tails are ordered - rather like a stack of identical tables in which four legs hold the table surfaces stably together - but the few ordered tails/legs do not contribute much to the diffraction. This raises the question why some tails should be stiff and others not; perhaps traces of a metal or other small molecule stabilise some tails (although crystal optimisation trials did not show up such a molecule)? 2. Three-fold disorder, either individually or in microdomains too small to have been resolved by the beam used.
For this I have been told to expect better density than observed, but maybe this is not true. We did try integrating in lower space groups (P3, P2 instead of P321) with no improvement of the density; we tried a RT dataset to see if freezing caused the disorder; and we tried improving the phases by MAD on the mercury derivative, but with no improvement in the density for the tail. Mark J. van Raaij, Unidad de Bioquímica Estructural, Dpto de Bioquímica, Facultad de Farmacia and Unidad de Rayos X, Edificio CACTUS, Universidad de Santiago, 15782 Santiago de Compostela, Spain http://web.usc.es/~vanraaij/

On 24 Aug 2007, at 03:01, Petr Leiman wrote: - Original Message - From: Jenny Martin [EMAIL PROTECTED] To: CCP4BB@JISCMAIL.AC.UK Sent: Thursday, August 23, 2007 5:46 PM Subject: Re: [ccp4bb] The importance of USING our validation tools My question is, how could crystals with 80% or more solvent diffract so well? The best of the three is 1.9A resolution with I/sigI 48 (top shell 2.5). My experience is that such crystals diffract very weakly. You must be thinking about Mark van Raaij's T4 short tail fibre structures. Yes, the disorder in those crystals is extreme. There are ~100-150 A thick disordered layers between the ~200 A thick layers of ordered structure. The diffraction pattern does not show any anomalies (as far as I can remember from 6 years ago). The spots are round, there are virtually no spots not covered by predictions, and the crystals diffract to 1.5A resolution. The disordered layers are perpendicular to the threefold axis of the crystal. The molecule is a trimer and sits on the threefold axis. It appears that the ordered layers somehow know how to position themselves across the disordered layers. I agree here with Michael Rossmann that in these crystals the ordered layers are held together by faith. Mark integrated the dataset in lower space groups, but the disordered stuff was not visible anyway. He will probably add more to the discussion. Petr Any thoughts?
Cheers, Jenny
Re: [ccp4bb] The importance of USING our validation tools
Mischa, I don't think that the field of nanotechnology crumbled when the allegations against Jan Hendrik Schon (21 papers withdrawn, 15 in Science/Nature) turned out to be true. I don't think that nobody trusts biologists anymore because of Eric Poehlman (17 falsified grants, 10 papers with fabricated data, 12 months in prison). We are still excited to hear about stem cell research despite what Hwang Woo-suk did or didn't do. What recent events demonstrate is that in macromolecular crystallography (and in science in general) mistakes, deliberate or not, will be discovered. Ed.

Mischa Machius wrote: Due to these recent, highly publicized irregularities and the ample (snide) remarks I hear about them from non-crystallographers, I am wondering if trust in macromolecular crystallography is beginning to erode. It is often very difficult even for experts to distinguish fakery or wishful thinking from reality. Non-crystallographers will have no chance at all, and will consequently not rely on our results as much as we are convinced they could and should. If that is indeed the case, something needs to be done, and rather sooner than later. Best - MM Mischa Machius, PhD Associate Professor UT Southwestern Medical Center at Dallas 5323 Harry Hines Blvd.; ND10.214A Dallas, TX 75390-8816; U.S.A. Tel: +1 214 645 6381 Fax: +1 214 645 6353 -- Edwin Pozharski, PhD, Assistant Professor University of Maryland, Baltimore -- When the Way is forgotten, duty and justice appear; then knowledge and wisdom are born along with hypocrisy. When harmonious relationships dissolve, then respect and devotion arise; when a nation falls to chaos, then loyalty and patriotism are born. - Lao Tse
Re: [ccp4bb] The importance of USING our validation tools
Dear colleagues,

1) I think Ajees et al. should make available the raw diffraction images of the structure in the paper that has caused so much literary commotion, unless they have already done so. Perhaps simply put them on an open ftp server? As I imagine, unless I have missed something, these diffraction images were obtained with grant money, so shouldn't they be available to the community? This would allow other scientists to evaluate them as much as they wanted and publish many more papers about the validity or falsehood of the conclusions drawn in the original and (now) infamous Ajees et al. paper. That's how science - in my opinion - ought to work.

2) I agree that depositing raw images in the PDB or elsewhere would be a great thing for everybody - I usually and happily deposit all the structure factors that I've used to obtain and refine a structure. However, raw images are becoming larger and larger with the newer and fancier detectors, and this trend might not stabilize for quite a while. Although disk space is also getting cheaper as time goes by, I think the ratio between these two factors still makes huge storage projects impractical, unless a major development in data storage comes along. As an anecdote: during a trip to a synchrotron in the American Midwest, a single dataset (1 degree x 360) came to something like 27 GB of raw images!!! We managed to collect 1.5 TB of data in about 2 days (having to run - of course always in a hurry - to the nearest computer store to get a few more external hard drives to back up and take with us all our data). So, although it would be a great option for many of us, I insist: I cannot imagine the burden that storing so much data would be for the PDB or any public database. Not only for the amount of disk space or storage support required; as people have mentioned here, taking care of the data (curating it - since disks do crash, as we know, and optical media get irremediably scratched) would be a tremendous and likely expensive endeavor.

3) Perhaps we should responsibly store the data ourselves, in media that allow us to retrieve it after many years (quite a task by itself already; forget the clay tablets, though), as probably many of us have done for quite some time, and, when asked, send the data to anyone who is interested. But... don't they already have problems accessing the tapes from the first lunar landing?

4) In any case, we should not forget the subject of storing crystallographic raw images in a public database and making them accessible. Perhaps more journals should accept open letters about this subject, which is as important as it is complicated, and create a much larger discussion than this one.

All the best, Jordi __ Jordi Benach, PhD MX Beamline Scientist ALBA Synchrotron Light Facility Edifici Ciències. Mòdul C-3 Central Campus Universitat Autònoma de Barcelona 08193 Bellaterra, Barcelona, SPAIN Phone: +34 93 592 4333 FAX: +34 93 592 4302 E-mail: [EMAIL PROTECTED] __
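As a back-of-the-envelope check on the scale Jordi describes, the arithmetic can be sketched as follows. The deposition rate and datasets-per-structure figures below are illustrative guesses, not sourced numbers; only the 27 GB per dataset comes from the anecdote above.

```python
# Rough estimate of raw-image archive growth.
gb_per_dataset = 27.0         # from the synchrotron anecdote (1 degree x 360 frames)
datasets_per_structure = 3    # native + derivatives/MAD: an assumption
structures_per_year = 7000    # very rough PDB deposition rate circa 2007: a guess

tb_per_year = gb_per_dataset * datasets_per_structure * structures_per_year / 1024.0
print(f"~{tb_per_year:.0f} TB of raw images per year")  # -> ~554 TB per year
```

Even with generous compression, that is the kind of sustained growth a public archive would have to curate indefinitely, which is Jordi's point.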
Re: [ccp4bb] The importance of USING our validation tools
I've been reading the contributions on this topic with much interest. It's been very timely in that I've been giving 3rd year u/g lectures on protein X-ray structures and their validation over the past week. As part of the preparation for the lectures, I searched the PDB for structures with high solvent content. To my surprise, I found 376 crystal structures with solvent content >75% (about 1% of all crystal structures) and 120 structures with solvent content >80% (about 0.3% of all crystal structures). However, there were only 3 other structures that (like 2HR0) had >80% solvent AND Rcryst and Rfree less than 20%. All three structures are solved to better than 3A resolution. One is from a weak data set from a virus crystal; the other two PDB files report very strong crystallographic data. The Rmerge values are more typical than for 2HR0, and none of the three appear to have the geometry or crystal contact problems of 2HR0. My question is, how could crystals with 80% or more solvent diffract so well? The best of the three is 1.9A resolution with I/sigI of 48 (top shell 2.5). My experience is that such crystals diffract very weakly. There are another 15 structures with solvent content 75-80% and Rcryst/Rfree <20%. I didn't check them in any detail, just to see that the structure was consistent with a high solvent content. Any thoughts? Cheers, Jenny
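For readers less familiar with where those solvent percentages come from: they follow from the Matthews coefficient, Vm = V_cell / (Z x MW). A minimal sketch (the function name and the example numbers are ours; 1.230 A^3/Da is the commonly used constant derived from the typical partial specific volume of protein, ~0.74 cm^3/g):

```python
def solvent_fraction(cell_volume_A3, mw_da, z):
    """Solvent content via the Matthews coefficient Vm = V / (Z * MW).

    cell_volume_A3: unit-cell volume in cubic Angstrom
    mw_da:          molecular weight of the asymmetric-unit molecule, Daltons
    z:              number of molecules in the unit cell
    """
    vm = cell_volume_A3 / (z * mw_da)   # A^3 per Dalton
    return 1.0 - 1.230 / vm

# Hypothetical example: a 40 kDa protein, 6 copies in a 1.2e6 A^3 cell
print(f"solvent ~ {solvent_fraction(1.2e6, 40_000, 6):.0%}")  # -> solvent ~ 75%
```

A Vm around 5 A^3/Da or above corresponds to the very-high-solvent cases Jenny found in the PDB.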
Re: [ccp4bb] The importance of USING our validation tools
In the cases you list, it is clearly recognized that the fault lies with the investigator and not the method. In most of the cases where serious problems have been identified in published models, the authors have stonewalled by saying that the method failed them: "The methods of crystallography are so weak that we could not detect (for years) that our program was swapping F+ and F-." "The scattering of X-rays by bulk solvent is a contentious topic." "We should have pointed out that the B factors of the peptide are higher than those of the protein." It appears that the problems occurred because these authors were not following established procedures in this field, yet they are, as near as I can tell, somehow immune from the consequences of their errors. Usually the paper isn't even retracted when the model is clearly wrong. They can dump blame on the technique and escape personal responsibility. This is what upsets so many of us. It would be so refreshing to read in one of these responses: "We were under a great deal of pressure to get our results out before our competitors and cut corners that we shouldn't have, and that choice resulted in our failure to detect the obvious errors in our model." If we did see papers retracted, if we did see nonrenewal of grants, if we did see people get fired, if we did see prison time (when the line between carelessness and fraud is crossed), then we could be comforted that there is practical incentive to perform quality work. Dale Tronrud
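One of the sanity checks mentioned above - comparing the refined B factors of a bound peptide against those of the protein - is mechanical enough to script. A minimal sketch over PDB-format coordinate lines (the helper name and the toy records are ours; column positions follow the fixed-width PDB format, with the B factor in columns 61-66 and the chain ID in column 22):

```python
from collections import defaultdict

def mean_b_by_chain(pdb_lines):
    """Mean isotropic B factor per chain from PDB ATOM/HETATM records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")):
            chain = line[21]              # chain identifier, column 22
            b = float(line[60:66])        # temperature factor, columns 61-66
            sums[chain] += b
            counts[chain] += 1
    return {ch: sums[ch] / counts[ch] for ch in sums}

# Toy records: protein in chain A, a 'peptide' in chain P
pdb = [
    "ATOM      1  CA  ALA A   1      11.104  13.207   2.100  1.00 20.00           C",
    "ATOM      2  CA  GLY A   2      12.560  14.100   3.330  1.00 30.00           C",
    "ATOM      3  CA  SER P   1       8.250   9.400   5.120  1.00 60.00           C",
]
print(mean_b_by_chain(pdb))  # -> {'A': 25.0, 'P': 60.0}
```

A ligand or peptide chain whose mean B is far above the protein's is exactly the kind of red flag that should be reported, not explained away.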
Re: [ccp4bb] The importance of USING our validation tools
Another example of a structure with intervening layers of weak electron density at 1.75 A resolution is Pb2+-bound calmodulin, which Mark Wilson solved in my laboratory: M.A. Wilson and A.T. Brunger, Acta Cryst. D59, 1782-1792 (2003), PDB ID 1NOY. The intervening layers are not entirely disordered, since Pb2+ positions show up in difference maps in these layers, so this could indicate motion around these positions rather than complete disorder. However, apart from the Pb2+ positions, the electron density in these layers is weak and uninterpretable. Apart from the weak layers, the structure behaves completely normally, i.e., we observe the expected bulk solvent contribution at low resolution, and the B-factor distributions are as expected. Axel
-- Axel T. Brunger, Investigator, Howard Hughes Medical Institute; Professor of Molecular and Cellular Physiology, Stanford University. Web: http://atb.slac.stanford.edu Email: [EMAIL PROTECTED] Phone: +1 650-736-1031 Fax: +1 650-745-1463
Re: [ccp4bb] The importance of USING our validation tools
Dear Alex, Of course a simplified one-page summary would not be the last word, but I think that it would be a big step in the right direction. For example, a value of Rfree that is 'too good' because the reflection set for it has been chosen wrongly can be detected statistically (Tickle et al., Acta D56 (2000) 443-450). And it would not be too difficult to distinguish between three possible causes of incomplete data: (a) there is a dead cone of data because it was a single scan of a low-symmetry crystal, (b) a large number of 'overloads' were rejected (they would all have fairly low resolution and high Fc values), or (c) the missing reflections are fairly randomly distributed because they have been removed by hand to improve the R-values. I think that there is a very good case for making this information available to referees in an easily comprehensible form. George Prof. George M. Sheldrick FRS Dept. Structural Chemistry, University of Goettingen, Tammannstr. 4, D37077 Goettingen, Germany Tel. +49-551-39-3021 or -3068 Fax. +49-551-39-2582

On Sun, 19 Aug 2007, Alexander Aleshin wrote: I do not think the small-molecule approach proposed by George Sheldrick is sufficient for validation of protein structures, as misrepresentation of experimental statistics/resolution is hard to detect with it, and these factors appear to play a crucial role in deciding the fate of many hot structures. Bad statistics hurt publication more than mistakes in a model, and improving the experiment is often too hard. "I know my structure is right. Why should I spend another year growing better crystals only to make the statistics look right?" sounds like a strong argument for a desperate researcher. Making up an artificial data set is overkill for the task; there are easier and less amoral ways, such as rejection of outliers and incorrect assignment of the Rfree test set.
Ironically, an undereducated crystallographer may not recognize wrongdoing in such data treatment, which makes it even more likely to occur. Do I sound paranoid? And please do not suggest that I have shared personal experiences. Alex Aleshin

On Sat, 18 Aug 2007, George M. Sheldrick wrote: There are good reasons for preserving frames, but most of all for the crystals that appeared to diffract but did not lead to a successful structure solution, publication, and PDB deposition. Maybe in the future there will be improved data processing software (for example, to integrate non-merohedral twins) that will enable good structures to be obtained from such data. At the moment most such data are thrown away. However, forcing everyone to deposit their frames each time they deposit a structure with the PDB would be a thorough nuisance and a major logistic hassle. It is also a complete illusion to believe that the reviewers for Nature etc. would process or even look at frames, even if they could download them with the manuscript. For small molecules, many journals require an 'ORTEP plot' to be submitted with the paper. As older readers who have experienced Dick Harlow's 'ORTEP of the year' competition at ACA Meetings will remember, even a viewer with little experience of small-molecule crystallography can see from the ORTEP plot within seconds if something is seriously wrong, and many non-crystallographic referees for e.g. the journal Inorganic Chemistry can even make a good guess as to what is wrong (e.g. the wrong element assigned to an atom). It would be nice if we could find something similar for macromolecules that the author would have to submit with the paper. One immediate bonus is that the authors would look at it carefully themselves before submitting, which could lead to an improvement in the quality of structures being submitted. My suggestion is that the wwPDB might provide, say, a one-page diagnostic summary when they allocate each PDB ID that could be used for this purpose.
A good first pass at this would be the output that the MolProbity server http://molprobity.biochem.duke.edu/ sends when it is given a PDB file. It starts with a few lines of summary in which bad things are marked red and the structure is assigned to a percentile: a percentile of 6% means that 94% of the structures in the PDB with a similar resolution are 'better' and 6% are 'worse'. This summary can be understood with very little crystallographic background, and a similar summary can of course be produced for NMR structures. The summary is followed by diagnostics for each residue; normally, if the summary looks good, it would not be necessary for the editor or referee to look at the rest. Although this server was intended to help us improve our structures rather than detect manipulated or fabricated data, I asked it for a report on 2HR0 to see what it would do (probably many other people were trying to do exactly the
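The incompleteness fingerprinting George suggests - overload losses clustered at low resolution versus hand-pruned reflections scattered everywhere - could be surfaced with a per-shell completeness table. A hedged sketch (the function and the binning choices are ours, not the actual diagnostic of any existing program):

```python
def shell_completeness(observed_d, expected_d, n_shells=10):
    """Completeness per resolution shell.

    observed_d / expected_d: d-spacings in Angstrom for the measured and the
    theoretically possible unique reflections. Binning on 1/d^3 gives shells
    containing roughly equal numbers of reflections.
    """
    s3_max = max(1.0 / d**3 for d in expected_d)

    def shell(d):
        # Map a d-spacing to a shell index, clamped to the last shell.
        return min(int((1.0 / d**3) / s3_max * n_shells), n_shells - 1)

    expected = [0] * n_shells
    observed = [0] * n_shells
    for d in expected_d:
        expected[shell(d)] += 1
    for d in observed_d:
        observed[shell(d)] += 1
    return [o / e if e else 0.0 for o, e in zip(observed, expected)]
```

A deficit confined to the lowest-resolution shells would point at rejected overloads (George's case b); a small, uniform deficit across all shells is the pattern that hand-editing to improve R-values would leave (case c).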
Re: [ccp4bb] The importance of USING our validation tools
I worry a bit about some of this discussion, in that I wouldn't like the free-R-factor police to get too powerful. I imagine that many of us have struggled with datasets which are sub-optimal for all sorts of reasons (all crystals are multiple/split/twinned; substantial disordered regions; low resolution, etc) - and it is not possible to get better data. I have certainly fought hard to get free-R below (the magic) 30%, when I know the structure is _essentially_ right, but the details are a little blurred in places, even when I have done the best I can. Anyway the important things are not the statistics, but the maps. Does this make the structure unpublishable? No, provided that we remember a basic tenet of science, that the conclusions drawn should be supported by the evidence available. With limited data, the conclusions may be more limited, but still often illuminate the biology, which is the reason for solving the structure in the first place. The evidence should be available to readers referees, so deposition at least structure factors should be compulsory (why isn't it already?). Unmerged data or images would be nice, but I doubt that many people would use them (great for developers though) Phil On 20 Aug 2007, at 08:24, George M. Sheldrick wrote: Dear Alex, Of course a simplified one page summary would not be the last word, but I think that it would be a big step in the right direction. For example a value of Rfree that is 'too good' because the reflection set for it has been chosen wrongly can be detected statistically (Tickle et al., Acta D56 (2000) 443-450). 
And it would not be too difficult to distinguish between three possible causes of incomplete data: (a) there is a dead cone of data because only a single scan was collected from a low-symmetry crystal, (b) a large number of 'overloads' were rejected (they would all have fairly low resolution and high Fc values), or (c) the missing reflections are fairly randomly distributed because they have been removed by hand to improve the R-values. I think that there is a very good case for making this information available to referees in an easily comprehensible form. George Prof. George M. Sheldrick FRS Dept. Structural Chemistry, University of Goettingen, Tammannstr. 4, D37077 Goettingen, Germany Tel. +49-551-39-3021 or -3068 Fax. +49-551-39-2582 On Sun, 19 Aug 2007, Alexander Aleshin wrote: [quoted message snipped]
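The patterns of incomplete data described above could even be screened for automatically from the set of unobserved reflections. A rough sketch of the idea for cases (b) and (c) only (the function name and thresholds are invented for illustration, not part of any existing CCP4 tool; detecting a dead cone, case (a), needs the reflection indices and goniostat geometry and is omitted here):

```python
def classify_missing(reflections):
    """reflections: list of (d_spacing_A, fcalc, observed) tuples covering
    the complete sphere to the stated resolution limit."""
    missing = [(d, fc) for d, fc, obs in reflections if not obs]
    if not missing:
        return "complete"
    all_fc = sorted(fc for _, fc, _ in reflections)
    median_fc = all_fc[len(all_fc) // 2]
    strong = [m for m in missing if m[1] > 2.0 * median_fc]   # high Fc
    low_res = [m for m in missing if m[0] > 4.0]              # d > 4 A
    # Case (b): rejected overloads cluster at low resolution with high Fc.
    if len(strong) > 0.8 * len(missing) and len(low_res) > 0.8 * len(missing):
        return "likely overloads"
    # Case (c): reflections culled by hand look random in both d and Fc.
    return "random gaps -- inspect by hand"
```

The cutoffs (4 A, 2x the median Fc, 80%) are placeholders; a real implementation would compare the resolution and Fc distributions of the missing set against the observed set statistically.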
Re: [ccp4bb] The importance of USING our validation tools
PS. A completely unimportant correction to my comment on the MolProbity output for 2HR0: every residue is indeed an outlier in at least one test, but in three cases it is only the CB-deviation test, not the other three tests that I mentioned. George On Sat, 18 Aug 2007, George M. Sheldrick wrote: There are good reasons for preserving frames, but most of all for the crystals that appeared to diffract but did not lead to a successful structure solution, publication, and PDB deposition. Maybe in the future there will be improved data-processing software (for example, to integrate non-merohedral twins) that will enable good structures to be obtained from such data. At the moment most such data are thrown away. However, forcing everyone to deposit their frames each time they deposit a structure with the PDB would be a thorough nuisance and a major logistic hassle. It is also a complete illusion to believe that the reviewers for Nature etc. would process or even look at frames, even if they could download them with the manuscript. For small molecules, many journals require an 'ORTEP plot' to be submitted with the paper. As older readers who experienced Dick Harlow's 'ORTEP of the year' competition at ACA Meetings will remember, even a viewer with little experience of small-molecule crystallography can see from the ORTEP plot within seconds if something is seriously wrong, and many non-crystallographic referees for e.g. the journal Inorganic Chemistry can even make a good guess as to what is wrong (e.g. the wrong element assigned to an atom). It would be nice if we could find something similar for macromolecules that the author would have to submit with the paper. 
One immediate bonus is that the authors would look at it carefully themselves before submitting, which could lead to an improvement in the quality of structures being submitted. My suggestion is that the wwPDB might provide, say, a one-page diagnostic summary when they allocate each PDB ID that could be used for this purpose. A good first pass at this would be the output that the MolProbity server (http://molprobity.biochem.duke.edu/) sends when it is given a PDB file. It starts with a few lines of summary in which bad things are marked red and the structure is assigned a percentile: a percentile of 6% means that roughly 94% of the structures in the PDB with a similar resolution are 'better' and 6% are 'worse'. This summary can be understood with very little crystallographic background, and a similar summary can of course be produced for NMR structures. The summary is followed by diagnostics for each residue; normally, if the summary looks good, it would not be necessary for the editor or referee to look at the rest. Although this server was intended to help us to improve our structures rather than to detect manipulated or fabricated data, I asked it for a report on 2HR0 to see what it would do (probably many other people were trying to do exactly the same, as the server was slower than usual). Although the structure got poor marks on most tests, MolProbity generously assigned it overall to the 6th percentile; I suppose that this is about par for structures submitted to Nature (!). However, there was one feature that was unlike anything I have ever seen before, although I have fed the MolProbity server some pretty ropey PDB files in the past: EVERY residue, including EVERY WATER molecule, either made at least one bad contact, was a Ramachandran outlier, or was a rotamer outlier (or more than one of these). This surely would ring all the alarm bells! 
So I would suggest that the wwPDB could coordinate, with the help of the validation experts, software to produce a short summary report that would be automatically provided in the same email that allocates the PDB ID. This email could make the strong recommendation that the report file be submitted with the publication, and maybe in the fullness of time even the Editors of high-profile journals would require this report for the referees (or even read it themselves!). To gain acceptance for such a procedure the report would have to be short and comprehensible to non-crystallographers; the MolProbity summary is an excellent first pass in this respect, but (partly with a view to detecting manipulation of the data) a couple of tests could be added based on the data statistics as reported in the PDB file (or, even better, the reflection data if submitted). Most of the necessary software already exists, much of it produced by regular readers of this bb; it just needs to be adapted so that the results can be digested by referees and editors with little or no crystallographic experience. And most important, a PDB ID should
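One cheap test "based on the data statistics" of the kind suggested above is the second moment of the intensities: for acentric reflections from a real, untwinned crystal, <I^2>/<I>^2 should be close to 2.0 in each resolution shell (about 1.5 for a perfect twin), and data derived from calculated intensities with no realistic experimental noise often drift away from this. A minimal sketch, with the resolution-shell bookkeeping omitted (the function name is mine):

```python
def second_moment(intensities):
    """<I^2>/<I>^2 for a set of acentric intensities: ~2.0 for real
    untwinned data, ~1.5 for a perfect twin; values well away from
    these are a warning sign worth a closer look."""
    n = float(len(intensities))
    mean_i = sum(intensities) / n
    mean_i2 = sum(i * i for i in intensities) / n
    return mean_i2 / (mean_i * mean_i)
```

In practice this would be computed per resolution shell, as standard intensity-statistics analyses already do; the point here is only that such a check is a few lines of code and easy to put on a one-page report.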
Re: [ccp4bb] The importance of USING our validation tools
Curiously enough, when I recently submitted a coordinates file to the RCSB with this MolProbity summary (as REMARK 42; it is appended to the analyzed file by the MolProbity program), it was deleted by the RCSB team. Boaz - Original Message - From: George M. Sheldrick [EMAIL PROTECTED] Date: Saturday, August 18, 2007 15:27 Subject: Re: [ccp4bb] The importance of USING our validation tools To: CCP4BB@JISCMAIL.AC.UK [quoted message snipped]
Re: [ccp4bb] The importance of USING our validation tools
I do not think the small-molecule approach proposed by George Sheldrick is sufficient for validation of protein structures, as misrepresentation of experimental statistics/resolution is hard to detect with it, and these factors appear to play a crucial role in defining the fate of many hot structures. Bad statistics hurt publication more than mistakes in a model, and improving the experiment is often too hard. 'I know my structure is right. Why should I spend another year growing better crystals only to make the statistics look right?' sounds like a strong argument for a desperate researcher. Making up an artificial data set is overkill; there are easier and less amoral ways, such as rejection of outliers and incorrect assignment of the Rfree test set. Ironically, an undereducated crystallographer may not recognize the wrongdoing in such data treatment, which makes it even more likely to occur. Do I sound paranoid? And please do not suggest that I have shared personal experiences. Alex Aleshin On Sat, 18 Aug 2007, George M. Sheldrick wrote: [quoted message snipped]
Re: [ccp4bb] The importance of USING our validation tools
Hi Mischa, I think you are right about ligand structures: it would be very difficult, if not impossible, to distinguish between real measured data and faked data. You just need to run a docking program, dock the ligand, calculate new structure factors, add some noise, and combine that with your real data of the unliganded structure. I'm not an expert, but how would one be able to detect, for say a molecule on the order of 300-600 Da within an average protein of perhaps 40 kDa, whether it's true data or faked + noise? In Germany we have to keep data (data meaning everything, from clones, scans of gels and sizing profiles to X-ray diffraction images etc.) for 10 years. Not sure how this is in the US. Juergen Mischa Machius wrote: I agree. However, I am personally not so much worried about entire protein structures being wrong or fabricated. I am much more worried about co-crystal structures. Capturing a binding partner, a reaction intermediate or a substrate in an active site is often as spectacular an achievement as determining a novel membrane-protein structure. The threshold for over-interpreting densities for ligands is rather low, and wishful thinking can turn into model bias much more easily than for a protein structure alone; not to mention making honest mistakes. Just for plain and basic scientific purposes, it would be helpful every now and then to have access to the original images. As to the matter of fabricating ligand densities, I surmise that it is much easier than fabricating entire protein structures. The potential rewards (in terms of high-profile publications and obtaining grants) are just as high. There is enough incentive to apply lax scientific standards. If a simple means exists, beyond what is available today, that can help tremendously in identifying honest mistakes, and perhaps a rare fabrication, I think it should seriously be considered. Best - MM On Sat, 18 Aug 2007, George M. 
Sheldrick wrote: [quoted message snipped]
Re: [ccp4bb] The importance of USING our validation tools
To complete your analogy to the ORTEP of the year, the summary page could be accompanied by a backbone ribbon drawing of the macromolecule, with a red sphere at each residue that has an error. You could get fancy and scale the sphere according to the severity of the error. -Tom -Original Message- From: CCP4 bulletin board on behalf of George M. Sheldrick Sent: Sat 8/18/2007 6:26 AM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] The importance of USING our validation tools [quoted message snipped]
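Tom's red-sphere summary is straightforward to prototype by generating a PyMOL command file from a per-residue error table; a sketch (the input format and the severity-to-radius scaling are my own invention, not an existing tool):

```python
def pymol_error_script(errors, obj="model"):
    """errors: {(chain, resi): severity in (0, 1]}.
    Returns PyMOL commands that draw the backbone as a cartoon and put
    a red CA sphere, scaled by severity, on each flagged residue."""
    lines = ["hide everything, " + obj, "show cartoon, " + obj]
    for (chain, resi), severity in sorted(errors.items()):
        sel = "%s and chain %s and resi %d and name CA" % (obj, chain, resi)
        lines.append("show spheres, (%s)" % sel)
        lines.append("color red, (%s)" % sel)
        # Map severity onto sphere radius: 0.5 A (minor) to 2.0 A (severe).
        lines.append("set sphere_scale, %.2f, (%s)" % (0.5 + 1.5 * severity, sel))
    return "\n".join(lines)
```

The output would be saved as a .pml file and run inside PyMOL; the per-residue severities themselves would come from whatever validation report is adopted.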
Re: [ccp4bb] The importance of USING our validation tools
The literature already contains quite a few papers discussing ligand-protein interactions derived from low-resolution data, noisy data, etc. It's relatively easy to take a low-quality map, dock the molecule willy-nilly into the poorly defined 'blobule' of density, and derive spectacular conclusions. However, in order for such conclusions to be credible one needs to support them with orthogonal data such as biological assay results, mutagenesis, etc. This is not limited to crystallography as such, and it's the referee's job to be thorough in such cases. To the authors' credit, in *most* cases the questionable crystallographic data are supported by biological data of high quality. So, even with the images, etc., it's still quite possible to be honestly misled, which is why we value biological data. Consequently, if one's conclusions are wrong, this will inevitably show up later in the results of other experiments (such as SAR inconsistencies, for example). Science tends to be self-correcting: our errors (whether honest or malicious) are not going to withstand the test of time. Assuming that the proportion of deliberate faking in the scientific literature is quite small (and we really have no reason to think otherwise!), I really see no reason to worry too much about the ligand-protein interactions. Any referee evaluating ligand-based structural papers can ask to see an omit map (or a difference density map before any ligand was built) and a decent biological data set supporting the structural conclusions. In the case of *sophisticated deliberate faking*, there is not much a reviewer can do except trying to actually reproduce the claimed results. On the other hand, the 'wholesale' errors can be harder to catch, since the dataset and the resulting structure are typically the *only* evidence available. If both are suspect, the reviewer needs to rely on something else to make a judgement, which is where a one-page summary would come in handy. 
Artem -Original Message- From: CCP4 bulletin board [mailto:[EMAIL PROTECTED] On Behalf Of Juergen Bosch Sent: Saturday, August 18, 2007 12:20 PM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] The importance of USING our validation tools [quoted message snipped]
Re: [ccp4bb] The importance of USING our validation tools
Dear all, I agree with MM about the ligand and complex structures. Even in the most honest circumstances, it is easy to get carried away with hopes and excitement. My personal embarrassing experience was some years ago. It involved a protein that I had crystallized in a different space group in the presence of an inhibitor (2.5 Å data). The MR model had some gaps a moderate distance from the binding pocket. Lo and behold, some new, very rough density appeared very, very close to a binding site - close enough to get my hopes up. I communicated my elation to the PI, handed over pictures of the rough blobs of density, and started trying to build the ligand in. I should have moderated my emotions in light of the early state of the refinement. After finding a somewhat plausible fit in the density, I ran several rounds of the Wonderful Amazing Revealer of Proteindensity program. By the end I was almost in tears. The difference density began to take on a helical shape, and then the connections started growing, leading all the way up to one of the gaps. Side chains too, so I had no trouble with the register. The R-factors didn't change too much, but the geometries and maps in the area started looking really nice. Or should I say, proper. Very nice silver platter (that my head was on when it was handed back to me). Lisa
Re: [ccp4bb] The importance of USING our validation tools
Dominika is entirely correct, the F and (especially) sigma(F) values are clearly inconsistent with my naive suggestion that columns could have been swapped accidentally in an mtz file. George Prof. George M. Sheldrick FRS Dept. Structural Chemistry, University of Goettingen, Tammannstr. 4, D37077 Goettingen, Germany Tel. +49-551-39-3021 or -3068 Fax. +49-551-39-2582 On Thu, 16 Aug 2007, Dominika Borek wrote: There are several issues under current discussion. We outline a few of these below, in order of importance. The structure 2hr0 is unambiguously fake. Valid arguments have already been published in a Brief Communication by Janssen et al. (Nature, 448:E1-E2, 9 August 2007). However, to an unfamiliar reader the published response from the authors of the questioned deposit may make this sound like a genuine scientific controversy. There are many additional independent signs of intentional data fabrication in this case, above and beyond those already mentioned. One diagnostic is related to the fact that fabricated data will not show the proper features of proteins with respect to disorder. The reported case has a very high ratio of “Fobs” to atomic parameters, thus the phase uncertainty is small. In real structures, fully solvent-exposed chains without stabilizing interactions display intrinsically high disorder, yet in this structure these residues (e.g., Arg932B, Met1325B, Glu1138B, Arg459A, etc.) are impossibly well ordered. The second set of diagnostics is the observation of perfect electron density around impossible geometries. For example, the electron density is perfect (visible even at the 4 sigma level in a 2Fo-Fc map) with no significant negative peaks in an Fo-Fc map around the guanidinium group of Arg1112B, which is in outrageously close contact with carbon atoms of Lys1117B. This observation appears in many other places in the map as well. 
The issue is not the presence of bad contacts, but the lack of disorder (high B-factors) or of negative peaks in an Fo-Fc map in this region that could explain why the bad contacts remain in the model. The third set of diagnostics comprises statistics that do not occur in real structures. The ones mentioned previously are already very convincing (moments, B-factor plots, bulk solvent issues, etc.). We can add more evidence from a round of Refmac refinement of the deposited model versus the deposited structure factors. The anisotropic scaling tensor obtained is unreasonable for a structure in a low-symmetry space group such as C2, which inherently lacks constraints from packing symmetry (particularly in view of the problems with lattice contacts already mentioned). The values from a Refmac refinement for a typical structure in space group C2 are: B11 = 0.72, B22 = 1.15, B33 = -2.12, B12 = 0.00, B13 = -1.40, B23 = 0.00 (B12 and B23 are zero due to C2 space group symmetry). For structure 2hr0: B11 = -0.02, B22 = 0.00, B33 = 0.02, B12 = 0.00, B13 = 0.01, B23 = 0.00. Statistical reasoning leads to P-values on the order of 10^-6 for such values being produced by chance in a real structure, but they are highly likely in a fabricated case. The fourth set of diagnostics is the significant inconsistencies in the published methods, e.g. the authors claim that they collected data from four crystals, yet their data merging statistics show an R-merge = 0.11 in the last resolution shell. It is simply impossible to get such values, particularly when I/sigma(I) for the last resolution shell was stated as 1.32. Moreover, the overall I/sigma(I) for all data is 5.36 and the overall R-merge is 0.07 – values highly inconsistent with the reported data resolution, quality of map and high data completeness (97.3%). Overall, this is just a short list of problems; the indicators of data fabrication/falsification are plentiful and if needed can easily be provided to interested parties. 
We fully support Randy Read's excellent comments with our view of retraction and public discussion of this problem: “Originally I expected that the publication of our Brief Communication in Nature would stimulate a lot of discussion on the bulletin board, but clearly it hasn't. One reason is probably that we couldn't be as forthright as we wished to be. For its own good reasons, Nature did not allow us to use the word fabricated. Nor were we allowed to discuss other structures from the same group, if they weren't published in Nature.” One needs to address this policy with publishers in cases of intentional fraud that can be proven simply by an analysis of the published results. At this point the article needs to be retracted by Nature after Nature's internal investigation with input from the crystallographic community, rather than after obtaining the results of any potential administrative investigation of fraud. “Another reason is an understandable reluctance to make allegations in public, and the CCP4 bulletin board probably isn't the
Re: [ccp4bb] The importance of USING our validation tools
While the topic of fabrication is still hot, I thought I too could add a few thoughts. Our Mathematician friends always make fun of us (Biologists/ Biochemists/ crystallographers!) that our papers are accepted within 4-8 weeks of submission. This is not to talk of Science/ Nature/ Cell, where even more rapid reviews are the norm. In the Mathematics world it is customary to have a one-year review of manuscripts, and prior announcements of the work on respective web sites. The one-year review, and the prior announcements on web sites, allow others to review the results independently. That perhaps brings in the required rigor in the results. Consequently, there are not as many retractions in Mathematics as what we see in our area. It is perhaps not possible in our (crystallographic) world to have every structure checked independently by others. Yet, a longer review along with access to raw data might allow reviewers to check the finer details of the structures. I would strongly suggest that raw data be made available to reviewers, and that reviewers should check the structures before the papers are accepted. For any error in the final published structures, blame should also lie partially with the reviewer. The back-to-back controversies are bound to hurt the crystallographic community as a whole, and the IUCr should ponder better checks for the future. Shekhar Mande Hyderabad, INDIA -REPLY TO- Date:Thu Aug 16 21:22:20 GMT+08:00 2007 FROM: Randy J. Read [EMAIL PROTECTED] To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] The importance of USING our validation tools On Aug 16 2007, Eleanor Dodson wrote: The weighting in REFMAC is a function of SigmA ( plotted in log file). For this example it will be nearly 1 for all resolutions ranges so the weights are pretty constant. 
There is also a contribution from the experimental sigma, which in this case seems to be proportional to |F| Originally I expected that the publication of our Brief Communication in Nature would stimulate a lot of discussion on the bulletin board, but clearly it hasn't. One reason is probably that we couldn't be as forthright as we wished to be. For its own good reasons, Nature did not allow us to use the word fabricated. Nor were we allowed to discuss other structures from the same group, if they weren't published in Nature. Another reason is an understandable reluctance to make allegations in public, and the CCP4 bulletin board probably isn't the best place to do that. But I think the case raises essential topics for the community to discuss, and this is a good forum for those discussions. We need to consider how to ensure the integrity of the structural databases and the associated publications. So here are some questions to start a discussion, with some suggestions of partial answers. 1. How many structures in the PDB are fabricated? I don't know, but I think (or at least hope) that the number is very small. 2. How easy is it to fabricate a structure? It's very easy, if no-one will be examining it with a suspicious mind, but it's extremely difficult to do well. No matter how well a structure is fabricated, it will violate something that is known now or learned later about the properties of real macromolecules and their diffraction data. If you're clever enough to do this really well, then you should be clever enough to determine the real structure of an interesting protein. 3. How can we tell whether structures in the PDB are fabricated, or just poorly refined? The current standard validation tools are aimed at detecting errors in structure determination or the effects of poor refinement practice. None of them are aimed at detecting specific signs of fabrication because we assume (almost always correctly) that others are acting in good faith. 
The more information that is available, the easier it will be to detect fabrication (because it is harder to make up more information convincingly). For instance, if the diffraction data are deposited, we can check for consistency with the known properties of real macromolecular crystals, e.g. that they contain disordered solvent and not vacuum. As Tassos Perrakis has discovered, there are characteristic ways in which the standard deviations depend on the intensities and the resolution. If unmerged data are deposited, there will probably be evidence of radiation damage, weak effects from intrinsic anomalous scatterers, etc. Raw images are probably even harder to simulate convincingly. If a structure is fabricated by making up a new crystal form, perhaps a complex of previously-known components, then the crystal packing interactions should look like the interactions seen in real crystals. If it's fabricated by homology modelling, then the internal packing is likely to be suboptimal. I'm told by David Baker (who knows a thing or two about this) that it is extremely difficult to make a homology model that both obeys what
Re: [ccp4bb] The importance of USING our validation tools
Storing all the images *is* expensive but it can be done - the JCSG do this and make available a good chunk of their raw diffraction data. The cost is, however, in preparing this to make the data useful for the person who downloads it. If we are going to store and publish the raw experimental measurements (e.g. the images), which I think would be spectacular, we will also need to define a minimum amount of metadata which should be supplied with this to allow a reasonable chance of reproduction of the results. This is clearly not trivial, but there is probably enough information in the harvest and log files from e.g. CCP4, HKL2000, Phenix to allow this. The real problem will be in getting people to dig out that tape / dvd with the images on, prepare the required metadata and deposit this information somewhere. Actually storing it is a smaller challenge, though this is a long way from being trivial. On an aside - FireWire disks are indeed a very cheap way of storing the data. There is a good reason why they are much cheaper than the equivalent RAID array. They fail. Ever lost 500GB of data in one go? Ouch. ;o) Just MHO. Cheers, Graeme -Original Message- From: CCP4 bulletin board [mailto:[EMAIL PROTECTED] On Behalf Of Phil Evans Sent: 16 August 2007 15:13 To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] The importance of USING our validation tools What do you count as raw data? Rawest are the images - everything beyond that is modelling - but archiving images is _expensive_! Unmerged intensities are probably more manageable Phil On 16 Aug 2007, at 15:05, Ashley Buckle wrote: Dear Randy These are very valid points, and I'm so glad you've taken the important step of initiating this. For now I'd like to respond to one of them, as it concerns something I and colleagues in Australia are doing: The more information that is available, the easier it will be to detect fabrication (because it is harder to make up more information convincingly). 
For instance, if the diffraction data are deposited, we can check for consistency with the known properties of real macromolecular crystals, e.g. that they contain disordered solvent and not vacuum. As Tassos Perrakis has discovered, there are characteristic ways in which the standard deviations depend on the intensities and the resolution. If unmerged data are deposited, there will probably be evidence of radiation damage, weak effects from intrinsic anomalous scatterers, etc. Raw images are probably even harder to simulate convincingly. After the recent Science retractions we realised that it's about time raw data was made available. So, we have set about creating the necessary IT and software to do this for our diffraction data, and are encouraging Australian colleagues to do the same. We are about a week away from launching a web-accessible repository for our recently published (e.g. deposited in PDB) data, and this should coincide with an upcoming publication describing a new structure from our labs. The aim is that publication occurs simultaneously with release in the PDB as well as of raw diffraction data on our website. We hope to house as much of our data as possible, as well as data from other Australian labs, but obviously the potential dataset will be huge, so we are trying to develop, and make freely available to the community, software tools that allow others to easily set up their own repositories. After brief discussion with the PDB, the plan is that the PDB include links from coordinates/SFs to the raw data using a simple handle that can be incorporated into a URL. We would hope that we can convince the journals that raw data must be made available at the time of publication, in the same way as coordinates and structure factors. Of course, we realise that there will be many hurdles along the way, but we are convinced that simply making the raw data available ASAP is a 'good thing'. 
We are happy to share more details of our IT plans with the CCP4BB, such that they can be improved, and look forward to hearing feedback cheers
Re: [ccp4bb] The importance of USING our validation tools
Dear colleagues, the recent discussion on the necessity and feasibility of storing raw data for all our structures raises a second point, I think. For the current discussion it is only a matter of storage space that has to be assigned somehow to make Fobs, unmerged data, or raw images available to everybody who wants to download them, but there are other science fields out there as well. Do we also want to collect gels, plots, plasmids, bacterial strains, mice, dollies, at some central place? Or should rather the scientific ethics bind all of us to practice good science and to be objective reviewers when asked? The usefulness for software developers and future experiments with our data is a completely different issue, of course. Just wanting to raise this point. Manuel Than -- ** Dr. Manuel E. Than Protein Crystallography Group Leibniz Institute for Age Research - Fritz Lipmann Institute (FLI) Beutenbergstraße 11 D-07745 Jena Germany Tel.: ++49 3641 65 6170 Fax.: ++49 3641 65 6335 e-mail: [EMAIL PROTECTED] http://www.fli-leibniz.de/groups/than.php
Re: [ccp4bb] The importance of USING our validation tools
Hi Martin, On Fri, Aug 17, 2007 at 11:09:28AM +0200, Martin Walsh wrote: For 2006 at BM14 we and our users generated 266997 images/frames from our MAR225 CCD (18 MB files), or in other words ~4.8 TB (if you have the patience to do so, then bzip2 will reduce these raw images to between 5.5 and 7 MB, depending on how many diffraction spots per image). Looking at http://www.esrf.eu/exp_facilities/BM14/publications/publications-new.html it seems that 56 papers were published in 2006 using BM14 data (directly). Let's say (for argument's sake) that each paper deposited 2 structures (and structure factors) into the PDB: this would mean about 2400 images/frames per structure (and about 40 GB of data per structure). There must be a large amount of junk in there not directly related to the deposited structure factors (images from screening or test crystals, basically useless crystals etc). I don't think anyone would want all images from every beamline deposited in a public database. I think if only the images related to the deposited structure factors are deposited, the data from BM14 would be at least a factor of 10 smaller (4 GB or 240 images per dataset). So this would mean 480 GB of BM14 data for 2006 - or 54 TB for all 115 PX beamlines ... if they all were as productive as BM14! Anyway, compared to astronomy and other fields it is fairly small (as Peter Keller mentioned in his post). If we think it is necessary (and I think we should) it will need to be done. It doesn't need to be perfect - but compared to e.g. the currently deposited structure factors, at least diffraction images have headers with useful information in them (even if the beam-centre, distance or wavelength etc are often wrong: but there are ways of getting at the correct values ... even if it is by trial and error). Cheers Clemens -- *** * Clemens Vonrhein, Ph.D. vonrhein AT GlobalPhasing DOT com * * Global Phasing Ltd. 
* Sheraton House, Castle Park * Cambridge CB3 0AX, UK *-- * BUSTER Development Group (http://www.globalphasing.com) ***
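Clemens's back-of-envelope arithmetic above is easy to reproduce. A minimal sketch (all figures are taken from his email; the 2-structures-per-paper assumption is his, and the 6.25 MB compressed size is just the midpoint of his quoted bzip2 range):

```python
# Back-of-envelope sketch of the BM14-for-2006 storage estimate.
# All inputs come from the email above; nothing here is a new measurement.

images_per_year = 266_997          # MAR225 CCD frames generated at BM14 in 2006
raw_image_mb = 18                  # one uncompressed frame
compressed_image_mb = 6.25         # midpoint of the quoted 5.5-7 MB bzip2 range

raw_tb = images_per_year * raw_image_mb / 1_000_000
papers = 56
structures = papers * 2            # assumed: 2 deposited structures per paper
images_per_structure = images_per_year / structures

# Keeping only the ~240 images actually behind each deposited dataset:
kept_gb_per_dataset = 240 * raw_image_mb / 1000
bm14_total_tb = kept_gb_per_dataset * structures / 1000

print(f"raw data per year:    ~{raw_tb:.1f} TB")
print(f"images per structure: ~{images_per_structure:.0f}")
print(f"kept per dataset:     ~{kept_gb_per_dataset:.1f} GB")
print(f"BM14 total for 2006:  ~{bm14_total_tb:.2f} TB")
```

This reproduces the numbers in the email: ~4.8 TB of raw frames, roughly 2400 images per structure, and about half a terabyte per year once only deposition-relevant images are kept.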
Re: [ccp4bb] The importance of USING our validation tools
On Aug 17, 2007, at 8:36, George M. Sheldrick wrote: Dominika is entirely correct, the F and (especially) sigma(F) values are clearly inconsistent with my naive suggestion that columns could have been swapped accidentally in an mtz file. Since the sigma(F) issue has been raised, let me elaborate on that. Faking observations is difficult. Faking the experimental uncertainties is even more difficult. If one were to fake a dataset, there would almost always be an implicit imprint of the procedure. I am told for example that some journals now use a company that claims they can spot gels and pictures that were 'photo-shopped'. That is - I am told by friends - the reason that some journals ask for 400 dpi pictures, while the Nature printers can do about 120 dpi in real life. Thus, I analyzed the distribution of the experimental sigmas in three structures: 1E3M and two structures of mine at the same resolution (1CTN, 1E3M) The results are in: http://xtal.nki.nl/nature-debate/ That's also a response to Tom Hurley's email ... I think we are obliged to look at this case and show to all crystallographers that read the board what the evidence is. This has no lawful consequences. I think the debate is healthy and I have not seen anyone asking to lynch or crucify anybody. As long as the discussion is about evidence and not about passing ethical or other judgement, I think it's good to go on. Also it's a good lesson for everybody to learn: === *** Keep your images, your gels, your logbooks. It's your obligation. Make sure all your colleagues do so. === (especially if you are the PI, you carry the primary responsibility for all primary data that support your publication to be available on request) If you do not keep to that principle, some mean mob might lynch you, even if you are right. So, be correct in your approaches. 
I am making the web site with my analysis public so that people can see one more piece of evidence that there are doubts and that Murthy et al should provide primary data, as many others have said. Statements of certain innocence or certain guilt should indeed not be public. So, I will wait now for the data - as simple as that. Tassos
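The kind of imprint Tassos describes can be illustrated with a toy check (this is not his actual analysis; the data below are entirely synthetic, and the "real-like" error model is a deliberately simplified counting-statistics assumption). In real data, the relative uncertainty sigma/F varies strongly across the intensity range, whereas a fake built with sigma proportional to |F| shows essentially no spread at all:

```python
# Toy illustration: sigma values exactly proportional to |F| leave a detectable imprint.
import math
import random

random.seed(0)

def sigma_ratio_spread(pairs):
    """Standard deviation of sigma/F over all reflections.

    Near zero if sigma is exactly proportional to F; clearly non-zero for
    realistic error models."""
    ratios = [s / f for f, s in pairs]
    mean = sum(ratios) / len(ratios)
    return math.sqrt(sum((r - mean) ** 2 for r in ratios) / len(ratios))

# "Real-like" synthetic data: counting-statistics term (~sqrt(F)) plus a floor.
real_like = [(f, math.sqrt(f) + 5.0)
             for f in (random.uniform(100, 4000) for _ in range(2000))]
# "Faked" synthetic data: sigma is a fixed 5% of F.
faked = [(f, 0.05 * f)
         for f in (random.uniform(100, 4000) for _ in range(2000))]

print("spread (real-like):", sigma_ratio_spread(real_like))  # clearly non-zero
print("spread (faked):    ", sigma_ratio_spread(faked))      # essentially zero
```

The actual diagnostics on real depositions are of course subtler (they look at how sigma depends on intensity and resolution together), but the principle is the same: a fabricated error model is usually too tidy.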
Re: [ccp4bb] The importance of USING our validation tools
While all of the comments on this situation have been entertaining, I've been most impressed by comments from Bill Scott, Gerard Bricogne and Kim Hendricks. I think due process is called for in considering problem structures that may or may not be fabricated. Public discussion of technical or craftsmanship issues is fine, but questions of intent, etc. are best discussed in private or in more formal settings. We owe that to all involved. Gerard's comments concerning publishing in journals/magazines like Nature and Science are correct. The pressure to publish there is not consistent with careful, well-documented science. For many years, we've been teaching our graduate students about some of the problems with short papers in those types of journals. The space limitations and the need for relevance force the omission of important details, so it's very hard to judge the merit of those papers. But don't assume that other real journals do much better with this. There's a lot of non-reproducible science in the journals. Much of it comes from not recognizing or reporting important experimental or computational details, but some of it is probably simply false. Kim's comments about the technical aspects of archiving data make a lot of sense to me. The costs of making safe and secure archives are not insignificant. And we need to ask if the added value of such archives is worth the added costs. I'm not yet convinced of this. The comments about Richard Reid, shoes, and air-travel are absolutely true. We should be very careful about requiring yet more information for submitted manuscripts. Publishing a paper is becoming more and more like trying to get through a crowded air-terminal. Every time you turn around, there's another requirement for some additional detail about your work. In the vast majority of cases, those details won't matter at all. In a few cases, a very careful and conscientious referee might figure out something significant based on that little detail. 
But is the inconvenience for most of us worth that little benefit? Clearly, enough information was available to Read et al. for making the case that the original structure has problems. What evidence is there that additional data, like raw data images, would have made any difference to the original referees and reviewers? Refereeing is a human endeavor of great importance, but it is not going to be error-free. And nothing can make it error-free. You simply need to trust that people will be honest and do the best job possible in reviewing things. And that errors that make it through the process and are deemed important enough will be corrected by the next layer of reviewers. I believe this current episode, just like those in the past, is a terrific indicator that our science is strong and functioning well. If other fields aren't reporting and correcting problems like these, maybe it's because they simply haven't found them yet. That statement might be a sign of my crystallographic arrogance, but it might also be true. Ron Stenkamp
Re: [ccp4bb] The importance of USING our validation tools
It seems that a public discussion with points and counterpoints presented openly and fairly is in complete adherence to the ideals of due process. Since this discussion is not deciding the criminal fate of any individual, it does not seem necessary to defer it to any political government. Also, were any criminal charges ever brought forth, one might think an innocent defendant would appreciate the benefit of the world's experts pondering the facts in an open forum. James William Scott wrote: But I agree, it is important to keep in mind that the proper venue for determining guilt or innocence in the case of fraud is the court system. Until fairly recently, the ideas of presumed innocence and the right to cross-examine accusers and witnesses have been considered fundamental to civil society. The case certainly sounds compelling, but this is all the more reason to adhere to these ideals. -- James Stroud UCLA-DOE Institute for Genomics and Proteomics Box 951570 Los Angeles, CA 90095 http://www.jamesstroud.com/
Re: [ccp4bb] The importance of USING our validation tools
I believe that is so. In this case the R-factor against the deposited data is low. The question to be addressed is whether the deposited data is of acceptable quality. There are some poor distances but not many - the asymmetric unit is very empty. The Ramachandran plot is not good, and an author would be queried about that. However you can choose to ignore their warnings. Eleanor Gina Clayton wrote: I thought that when a structure is deposited the databank does run its own refinement validation and geometry checks and gives you back what it finds, i.e. distance problems etc and the R-factor? Quoting Eleanor Dodson [EMAIL PROTECTED]: The weighting in REFMAC is a function of SigmA ( plotted in log file). For this example it will be nearly 1 for all resolutions ranges so the weights are pretty constant. There is also a contribution from the experimental sigma, which in this case seems to be proportional to |F| Yesterday I attached the wrong TRUNCATE log file - here is the correct one, and if you look at the plot Amplitude Analysis against resolution it also includes a plot of F SigF Eleanor Dominika Borek wrote: There are many more interesting things about this structure - an obvious fake - refined against fabricated data. After running refmac I have noticed discrepancies between R and weighted R-factors. However, I do not know how the weights are calculated and applied - it could maybe help to find out how these data were created. Could you help? 
M(4SSQ/LL) NR_used  %_obs M(Fo_used) M(Fc_used) Rf_used WR_used NR_free M(Fo_free) M(Fc_free) Rf_free WR_free
0.005        2205   98.77    3800.5     3687.2    0.12    0.30     121     4133.9     4042.7    0.12    0.28
0.015        3952   99.90    1932.9     1858.7    0.20    0.60     197     2010.5     1880.5    0.21    0.40
0.025        5026   99.81    1577.9     1512.3    0.23    0.62     283     1565.0     1484.6    0.26    0.54
0.034        5988   99.76    1598.0     1541.5    0.23    0.61     307     1625.7     1555.6    0.23    0.42
0.044        6751   99.79    1521.2     1481.6    0.18    0.41     338     1550.3     1523.8    0.18    0.61
0.054        7469   99.81    1314.5     1291.2    0.14    0.29     391     1348.3     1337.7    0.15    0.27
0.064        8078   99.87        .5     1089.1    0.16    0.36     465     1096.1     1077.9    0.18    0.42
0.073        8642   99.84     976.7      959.2    0.15    0.32     488      995.3      988.4    0.16    0.50
0.083        9255   99.88     866.4      848.0    0.16    0.36     490      856.8      846.0    0.17    0.38
0.093        9778   99.88     747.6      731.4    0.16    0.36     515      772.8      747.3    0.18    0.38
0.103       10225   99.86     662.6      649.1    0.17    0.38     547      658.9      643.6    0.20    0.36
0.113       10768   99.83     597.2      584.7    0.18    0.42     538      593.4      590.0    0.20    0.49
0.122       11121   99.86     535.5      521.9    0.19    0.48     607      556.2      542.0    0.20    0.47
0.132       11692   99.85     489.3      479.2    0.19    0.46     607      476.4      467.3    0.23    0.42
0.142       11999   99.83     453.9      443.1    0.19    0.48     621      455.3      440.6    0.22    0.55
0.152       12463   99.79     419.2      407.3    0.19    0.44     655      435.3      424.3    0.22    0.53
0.162       12885   99.78     384.0      373.9    0.20    0.53     632      384.1      376.1    0.22    0.43
0.171       12698   95.96     357.2      348.5    0.21    0.57     686      353.9      338.6    0.24    0.51
0.181       11926   87.78     332.0      323.3    0.21    0.66     590      333.4      322.6    0.24    0.57
0.191       11204   80.39     309.9      299.6    0.22    0.59     600      302.1      296.3    0.26    0.77
Eleanor Dodson wrote: There is a correspondence in last week's Nature commenting on the disparities between three C3b structures. These are: 2icf solved at 4.0 A resolution, 2i07 at 4.1 A resolution, and 2hr0 at 2.26 A resolution. The A chains of all 3 structures agree closely, with each other and with other deposited structures. The B chains of 2icf and 2i07 are in reasonable agreement, but there are enormous differences to the B chain of 2hr0. This structure is surprisingly out of step, and by many criteria likely to be wrong. 
Many articles have been written on validation, and it seems worth reminding crystallographers of some of the tests which make 2hr0 suspect. 1) The cell content analysis suggests there is 80% solvent in the asymmetric unit. Such crystals have been observed, but they rarely diffract to 2.26 A. 2) Data analysis: The reflection data has been deposited so it can be analysed. The plots provided by TRUNCATE showing intensity statistics are not compatible with such a high solvent ratio. They are too perfect; the moments are perfectly linear, unlikely with such large volumes of the crystal containing solvent, and there is absolutely no evidence of anisotropy, again unlikely with high solvent content. 3) Structure analysis: a) The Ramachandran plot is very poor (84% allowed) with many residues in disallowed regions. b) The distribution of residue B values is quite unrealistic. There is a very low spread, which is most unusual for a structure with long stretches of exposed chain. The baverage log file is attached. c) There do not seem to be enough contacts to maintain the crystalline
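The solvent figure in point 1 comes from a standard Matthews-coefficient calculation. A minimal sketch of that arithmetic (the cell volume, Z and molecular weight below are invented illustration values, not the actual 2hr0 numbers):

```python
# Sketch of the cell-content (Matthews) analysis behind point 1 above.
# The cell volume, Z and molecular weight are made-up illustration values.

def solvent_fraction(cell_volume_a3, z, mw_daltons):
    """Estimate the solvent fraction of a crystal from its Matthews coefficient.

    Vm = V / (Z * MW), in A^3/Da; the standard estimate of the solvent
    fraction is then 1 - 1.23/Vm (the 1.23 constant folds in the typical
    protein partial specific volume of ~0.74 cm^3/g)."""
    vm = cell_volume_a3 / (z * mw_daltons)
    return 1.0 - 1.23 / vm

# Hypothetical cell: V = 2.5e6 A^3 containing 4 copies of a 100 kDa chain.
frac = solvent_fraction(2.5e6, 4, 100_000)
print(f"estimated solvent content: {frac:.0%}")
```

A value around 80% is extreme: such loosely packed crystals exist, but as the email notes they very rarely diffract to 2.26 A, which is what makes the combination suspicious.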
Re: [ccp4bb] The importance of USING our validation tools
By raw data I mean images. We think this is only manageable using a distributed data grid model (e.g. universities/institutions set up their own repositories using open standards, and the PDB aggregates the links to them. URL persistence will be a hurdle, I admit). You are right that a single-repository solution would be impractical. We would hope that the PDB could store the unmerged intensities. cheers ashley On 17/08/2007, at 12:13 AM, Phil Evans wrote: What do you count as raw data? Rawest are the images - everything beyond that is modelling - but archiving images is _expensive_! Unmerged intensities are probably more manageable Phil On 16 Aug 2007, at 15:05, Ashley Buckle wrote: Dear Randy These are very valid points, and I'm so glad you've taken the important step of initiating this. For now I'd like to respond to one of them, as it concerns something I and colleagues in Australia are doing: The more information that is available, the easier it will be to detect fabrication (because it is harder to make up more information convincingly). For instance, if the diffraction data are deposited, we can check for consistency with the known properties of real macromolecular crystals, e.g. that they contain disordered solvent and not vacuum. As Tassos Perrakis has discovered, there are characteristic ways in which the standard deviations depend on the intensities and the resolution. If unmerged data are deposited, there will probably be evidence of radiation damage, weak effects from intrinsic anomalous scatterers, etc. Raw images are probably even harder to simulate convincingly. After the recent Science retractions we realised that it's about time raw data was made available. So, we have set about creating the necessary IT and software to do this for our diffraction data, and are encouraging Australian colleagues to do the same. 
We are about a week away from launching a web-accessible repository for our recently published (eg deposited in PDB) data, and this should coincide with an upcoming publication describing a new structure from our labs. The aim is that publication occurs simultaneously with release in PDB as well as raw diffraction data on our website. We hope to house as much of our data as possible, as well as data from other Australian labs, but obviously the potential dataset will be huge, so we are trying to develop, and make available freely to the community, software tools that allow others to easily setup their own repositories. After brief discussion with PDB the plan is that PDB include links from coordinates/SF's to the raw data using a simple handle that can be incorporated into a URL. We would hope that we can convince the journals that raw data must be made available at the time of publication, in the same way as coordinates and structure factors. Of course, we realise that there will be many hurdles along the way but we are convinced that simply making the raw data available ASAP is a 'good thing'. We are happy to share more details of our IT plans with the CCP4BB, such that they can be improved, and look forward to hearing feedback cheers *NOTE* My new tel. no: (03) 9902 0269 Ashley Buckle Ph.D NHMRC Senior Research Fellow The Department of Biochemistry and Molecular Biology School of Biomedical Sciences, Faculty of Medicine Victorian Bioinformatics Consortium (VBC) Monash University, Clayton, Vic 3800 Australia http://www.med.monash.edu.au/biochem/staff/abuckle.html iChat/AIM: blindcaptaincat skype: ashley.buckle Tel: (613) 9902 0269 (office) Tel: (613) 9905 1653 (lab) Fax : (613) 9905 4699
Re: [ccp4bb] The importance of USING our validation tools
I don't think archiving images would be that expensive. For one, I have found that most formats can be compressed quite substantially using simple, standard procedures like bzip2. If optimized, raw images won't take up that much space. Also, initially, only those images that have been used to obtain phases and to refine finally deposited structures could be archived. If the average structure takes up 20GB of space, 5,000 structures would be 1TB, which fits on a single hard drive for less than $400. If the community thinks this is a worthwhile endeavor, money should be available from granting agencies to establish a central repository (e.g., at the RCSB). Imagine what could be done with as little as $50,000. For large detectors, binning could be used, but given current hard drive prices and future developments, that won't be necessary. Best - MM On Aug 16, 2007, at 9:13 AM, Phil Evans wrote: What do you count as raw data? Rawest are the images - everything beyond that is modelling - but archiving images is _expensive_! Unmerged intensities are probably more manageable Phil On 16 Aug 2007, at 15:05, Ashley Buckle wrote: Dear Randy These are very valid points, and I'm so glad you've taken the important step of initiating this. For now I'd like to respond to one of them, as it concerns something I and colleagues in Australia are doing: The more information that is available, the easier it will be to detect fabrication (because it is harder to make up more information convincingly). For instance, if the diffraction data are deposited, we can check for consistency with the known properties of real macromolecular crystals, e.g. that they contain disordered solvent and not vacuum. As Tassos Perrakis has discovered, there are characteristic ways in which the standard deviations depend on the intensities and the resolution. If unmerged data are deposited, there will probably be evidence of radiation damage, weak effects from intrinsic anomalous scatterers, etc. 
Raw images are probably even harder to simulate convincingly. After the recent Science retractions we realised that it's about time raw data was made available. So, we have set about creating the necessary IT and software to do this for our diffraction data, and are encouraging Australian colleagues to do the same. We are about a week away from launching a web-accessible repository for our recently published (e.g. deposited in PDB) data, and this should coincide with an upcoming publication describing a new structure from our labs. The aim is that publication occurs simultaneously with release in PDB as well as raw diffraction data on our website. We hope to house as much of our data as possible, as well as data from other Australian labs, but obviously the potential dataset will be huge, so we are trying to develop, and make available freely to the community, software tools that allow others to easily set up their own repositories. After brief discussion with PDB the plan is that PDB include links from coordinates/SF's to the raw data using a simple handle that can be incorporated into a URL. We would hope that we can convince the journals that raw data must be made available at the time of publication, in the same way as coordinates and structure factors. Of course, we realise that there will be many hurdles along the way but we are convinced that simply making the raw data available ASAP is a 'good thing'. We are happy to share more details of our IT plans with the CCP4BB, such that they can be improved, and look forward to hearing feedback cheers Mischa Machius, PhD Associate Professor UT Southwestern Medical Center at Dallas 5323 Harry Hines Blvd.; ND10.214A Dallas, TX 75390-8816; U.S.A. Tel: +1 214 645 6381 Fax: +1 214 645 6353
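Mischa's bzip2 point is easy to sanity-check. The sketch below is purely illustrative - the detector size, background level and spot count are made-up assumptions, not real beamline numbers - but it shows the mechanism: a frame that is mostly low-entropy background with a sparse scattering of strong pixels compresses well with stdlib bzip2, for the same reason real diffraction images do.

```python
import bz2
import random
import struct

# Hypothetical 1 Mpixel, 16-bit detector frame (illustrative numbers only):
# a near-flat background of ~10 counts plus a few thousand bright "spots".
WIDTH, HEIGHT = 1024, 1024
random.seed(0)

pixels = [10 + random.randint(0, 3) for _ in range(WIDTH * HEIGHT)]
for _ in range(3000):
    pixels[random.randrange(len(pixels))] = random.randint(500, 65000)

# Pack as little-endian unsigned 16-bit integers (2 bytes per pixel).
raw = struct.pack("<%dH" % len(pixels), *pixels)
compressed = bz2.compress(raw, 9)

print("raw: %.1f MB" % (len(raw) / 1e6))
print("bz2: %.1f MB (%.0f%% of original)"
      % (len(compressed) / 1e6, 100 * len(compressed) / len(raw)))
```

Real frames won't compress quite this well (real background has more entropy than this toy model), but the direction of the effect is the same.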
Re: [ccp4bb] The importance of USING our validation tools
Hmm - I think I miscalculated, by a factor of 100 even!... need more coffee. In any case, I still think it would be doable. Best - MM On Aug 16, 2007, at 9:30 AM, Mischa Machius wrote: [quoted message trimmed - see above] Mischa Machius, PhD Associate Professor UT Southwestern Medical Center at Dallas 5323 Harry Hines Blvd.; ND10.214A Dallas, TX 75390-8816; U.S.A. Tel: +1 214 645 6381 Fax: +1 214 645 6353
Re: [ccp4bb] The importance of USING our validation tools
Hello all, I started to write a response to this thread yesterday. I thought the title was great, and the content of Eleanor's email was very helpful. What I didn't like was the indictment in the next-to-last paragraph. This has been followed up with the word fabrication by others. No one knows definitively whether this was fabricated. You have your suspicions, but you don't know. Fabrication suggests malicious wrong-doing. I actually don't think this was the case. I'm probably a bit biased because the work comes from an office down the hall from my own. I'd like to think that if the structure is wrong, it could be chalked up to inexperience rather than malice. To me, this scenario of inexperience seems like one that could become more and more prevalent as our field opens up to more and more scientists doing structural work who are not dedicated crystallographers. Having said that, I think Eleanor started an extremely useful thread as a way of avoiding the pitfalls of crystallography, whether you are a novice or an expert. There's no question that this board is the best way to advance one's knowledge of crystallography. I actually gave a homework assignment that was simply to sign up for the ccp4bb. In reference to the previously mentioned work, I'd also like to hear discussion, concurring or not, about the response letter, some of which seems plausible to me. I hope I don't ruffle anyone's feathers with my email, but I just thought that it should be said. Cheers- Todd -Original Message- From: CCP4 bulletin board on behalf of Randy J. Read Sent: Thu 8/16/2007 8:22 AM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] The importance of USING our validation tools On Aug 16 2007, Eleanor Dodson wrote: The weighting in REFMAC is a function of SigmA (plotted in log file). For this example it will be nearly 1 for all resolution ranges, so the weights are pretty constant. 
There is also a contribution from the experimental sigma, which in this case seems to be proportional to |F| Originally I expected that the publication of our Brief Communication in Nature would stimulate a lot of discussion on the bulletin board, but clearly it hasn't. One reason is probably that we couldn't be as forthright as we wished to be. For its own good reasons, Nature did not allow us to use the word fabricated. Nor were we allowed to discuss other structures from the same group, if they weren't published in Nature.
Re: [ccp4bb] The importance of USING our validation tools
Dear all, With regards to the possible fabrication of the 2hr0 structure, why would the authors have deposited the structure factors if this is not required by the journal? Also, why would they have fabricated a structure with gaps along c if they could have done so without the gap? A few years ago, I had to cope with two structures with gaps along c, pdb codes 1h6w and 1ocy. For those of you who are interested, structure factors are available from the pdb; unmerged intensities/raw images I will look for and provide if requested... Without further evidence, I suspect their structure is real, perhaps not optimally refined and treated, though; but then again, this seems commonplace in Nature structures, perhaps due to lack of time/experience and, in some cases, putting too much pressure on the PhD students/postdocs involved instead of mentoring and checking them. I hope the authors provide the raw diffraction images to dispel any doubts, and would be curious to learn about the other structures of the same group - does anyone have a comprehensive, annotated list of them? Greetings, Mark J. van Raaij Unidad de Bioquímica Estructural Dpto de Bioquímica, Facultad de Farmacia and Unidad de Rayos X, Edificio CACTUS Universidad de Santiago 15782 Santiago de Compostela Spain http://web.usc.es/~vanraaij/ On 16 Aug 2007, at 15:22, Randy J. Read wrote: On Aug 16 2007, Eleanor Dodson wrote: The weighting in REFMAC is a function of SigmA (plotted in log file). For this example it will be nearly 1 for all resolution ranges, so the weights are pretty constant. There is also a contribution from the experimental sigma, which in this case seems to be proportional to |F| Originally I expected that the publication of our Brief Communication in Nature would stimulate a lot of discussion on the bulletin board, but clearly it hasn't. One reason is probably that we couldn't be as forthright as we wished to be. For its own good reasons, Nature did not allow us to use the word fabricated. 
Nor were we allowed to discuss other structures from the same group, if they weren't published in Nature. Another reason is an understandable reluctance to make allegations in public, and the CCP4 bulletin board probably isn't the best place to do that. But I think the case raises essential topics for the community to discuss, and this is a good forum for those discussions. We need to consider how to ensure the integrity of the structural databases and the associated publications. So here are some questions to start a discussion, with some suggestions of partial answers. 1. How many structures in the PDB are fabricated? I don't know, but I think (or at least hope) that the number is very small. 2. How easy is it to fabricate a structure? It's very easy, if no-one will be examining it with a suspicious mind, but it's extremely difficult to do well. No matter how well a structure is fabricated, it will violate something that is known now or learned later about the properties of real macromolecules and their diffraction data. If you're clever enough to do this really well, then you should be clever enough to determine the real structure of an interesting protein. 3. How can we tell whether structures in the PDB are fabricated, or just poorly refined? The current standard validation tools are aimed at detecting errors in structure determination or the effects of poor refinement practice. None of them are aimed at detecting specific signs of fabrication because we assume (almost always correctly) that others are acting in good faith. The more information that is available, the easier it will be to detect fabrication (because it is harder to make up more information convincingly). For instance, if the diffraction data are deposited, we can check for consistency with the known properties of real macromolecular crystals, e.g. that they contain disordered solvent and not vacuum. 
As Tassos Perrakis has discovered, there are characteristic ways in which the standard deviations depend on the intensities and the resolution. If unmerged data are deposited, there will probably be evidence of radiation damage, weak effects from intrinsic anomalous scatterers, etc. Raw images are probably even harder to simulate convincingly. If a structure is fabricated by making up a new crystal form, perhaps a complex of previously-known components, then the crystal packing interactions should look like the interactions seen in real crystals. If it's fabricated by homology modelling, then the internal packing is likely to be suboptimal. I'm told by David Baker (who knows a thing or two about this) that it is extremely difficult to make a homology model that both obeys what we know about torsion angle preferences and is packed as well as a real protein structure. I'm very interested in hearing about new ideas along these lines.
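Checks of the kind Randy alludes to - how sigmas depend on intensities - can be prototyped in a few lines. The sketch below is a toy illustration, not any published algorithm, and all its numbers are invented: for counting-statistics errors sigma grows roughly like sqrt(I), so a log-log slope near 0.5 looks physical, while a sigma set simply proportional to I (or |F|) gives a suspicious slope near 1.

```python
import math
import random

def loglog_slope(intensities, sigmas):
    """Least-squares slope of log(sigma) vs log(I)."""
    xs = [math.log(i) for i in intensities]
    ys = [math.log(s) for s in sigmas]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

random.seed(1)
I = [random.uniform(10, 1e5) for _ in range(2000)]

# "Real" data: Poisson-like errors, sigma ~ sqrt(I + background term).
sig_real = [math.sqrt(i + 50.0) * random.uniform(0.9, 1.1) for i in I]
# Naive fabrication: sigma simply proportional to I.
sig_fake = [0.05 * i * random.uniform(0.9, 1.1) for i in I]

print("slope, counting-statistics model:  %.2f" % loglog_slope(I, sig_real))
print("slope, sigma-proportional model:   %.2f" % loglog_slope(I, sig_fake))
```

A real test would of course have to allow for detector gain, scaling and resolution dependence; this only illustrates the shape of the fingerprint.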
Re: [ccp4bb] The importance of USING our validation tools
On Aug 16, 2007, at 15:22, Randy J. Read wrote: Raw images are probably even harder to simulate convincingly. If I were to fabricate a structure, I would first get 'Fobs', then expand, then generate the images (I am sure one can hack 'strategy' or 'predict' or even 'mosflm' to tell you in which image every reflection is) and then add noise to the images themselves. Then process the images and go on from there ;-) The thing that is certainly stopping me is that it's much more difficult to do that than solving the structure ... but it would admittedly be quite some fun doing it right, if one were to ignore the tiny issue of the ethical side of such activity. About archiving images, I have a feeling that the cost per Gb is the same as it was for structure factors in the early 90's. Last but not least, some EDS data mining we did here agrees with Randy: very very few other structures, if any, appear to have really strange statistics in the subset of the PDB with structure factors (aka EDS...). That is a relief. As for the Nature debate, I am only disappointed and confused by one thing: Randy et al. ask for the images, like one can ask for the dated logbook in any other scientific discipline. For me that leaves only two possible reactions from the group of Murthy: 1. Make the images available and demand a public apology for spoiling their name. 2. Shut up, retract the paper, buy property in Alaska and disappear. The mumbo jumbo of the reply is so tragically irrelevant that I fail to understand how Nature tolerated it. Tassos PS the algorithm for the calculation of the sigmas (assuming they were calculated) does not look that naive actually. Far from a simple linear relationship. They put some thought into it, but let's say that if you want to apply a 2D function to simulate noise, don't do it along the principal axes ;-)
Re: [ccp4bb] The importance of USING our validation tools
On Thu, Aug 16, 2007 at 03:13:29PM +0100, Phil Evans wrote: What do you count as raw data? Rawest are the images - everything beyond that is modelling - but archiving images is _expensive_! Hmmm - not sure: let's say that a typical dataset requires about 180 images of 10Mb each. With the current total of roughly 40,000 X-ray structures in the PDB this is: 40,000 * 180 * 10Mb = ~70 Tb of data. With a simple 1TB external disk at about GBP 200 we get a price of GBP 14,000, i.e. 35 pence per dataset. OK, this is not a proper calculation (more data collected, fine-phi slicing, MAD datasets etc etc), so let's apply a 'safety factor' of 10: but even then I think this is easily doable. As Tassos remarked as well: if we could store/deposit and manage PDB files in the 70s, we should be able to do the same now (30 years later!) with images ... easily. Cheers Clemens [quoted messages from Phil Evans and Ashley Buckle trimmed - see above] -- Clemens Vonrhein, Ph.D. - vonrhein AT GlobalPhasing DOT com - Global Phasing Ltd., Sheraton House, Castle Park, Cambridge CB3 0AX, UK - BUSTER Development Group (http://www.globalphasing.com)
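Clemens's back-of-envelope figure is easy to reproduce. The sketch below simply re-runs his arithmetic with the thread's assumed numbers (180 images per dataset, 10 MB per image, ~40,000 X-ray entries, GBP 200 per 1 TB disk - all 2007-era assumptions from this discussion, not current figures):

```python
# Back-of-envelope archive size and cost, using the thread's assumptions.
n_structures = 40_000       # approximate X-ray entries in the PDB (2007)
images_per_dataset = 180    # assumed typical dataset
mb_per_image = 10           # assumed uncompressed image size, MB
gbp_per_tb = 200            # assumed 1 TB external disk price, GBP

total_tb = n_structures * images_per_dataset * mb_per_image / 1e6
cost_gbp = total_tb * gbp_per_tb
pence_per_dataset = 100 * cost_gbp / n_structures

print("total: %.0f TB" % total_tb)             # ~72 TB (Clemens's ~70 Tb)
print("disk cost: GBP %.0f" % cost_gbp)        # ~GBP 14,400
print("per dataset: %.0f pence" % pence_per_dataset)
```

Even with Clemens's 'safety factor' of 10 on top, the per-dataset cost stays in the pounds-not-thousands range, which is the point of his argument.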
Re: [ccp4bb] The importance of USING our validation tools
Validation aside, access to raw data is also helpful for method development (e.g. integration and scaling algorithms), on which we all rely. Ashley On 17/08/2007, at 1:04 AM, Santarsiero, Bernard D. wrote: Sorry, I think it's a waste of resources to store the raw images. I think we should trust people to be able to at least process their own data set. Besides, you would need to include beamline parameters, beam position, detector distances, etc. that may or may not be correct in the image headers. I'm all for storage and retrieval of a primary intensity data file (I or F^2 with esds). Bernie Santarsiero [earlier messages from Mischa Machius, Phil Evans and Ashley Buckle quoted in full - trimmed; see above]
Re: [ccp4bb] The importance of USING our validation tools
I'm glad that the discussion has finally set in, and would only like to comment on the practicability of storing images. Mischa Machius schrieb: If the average structure takes up 20GB of space, 5,000 structures would be 1TB ... That 20GB is on the high side, I'd say; I would have estimated 1.5 GB (native alone) to 5 GB for e.g. a native and 3 wavelengths (after bzip2). And 5,000 structures of 20GB each would be 100 TB, not 1 TB. If the PDB required all images of a _single_ dataset for molecular-replacement structures or mutant studies, and all images of all wavelengths/derivatives for experimentally phased structures, that would come to roughly (40,000 X-ray structures) * (on average 2 GB per structure) = 80 TB of data. At €250 per TB, that would be 20,000 € - an estimate of what it takes to store all the raw data for _all_ the X-ray structures in the PDB - less than what a single protein cloning/purification/crystallization/structure project costs per year. Archiving images is quite practical even for those data that do not directly correspond to deposited PDB entries. In 1999 we abandoned tape storage of raw data in favor of disk storage. Everything we collected at synchrotrons since then still fits on two 750GB disks. In 2000 we also needed two disks, and we have been upgrading the disks when the old ones were full. To have these data online means that one can easily look at them again, for testing data reduction and phasing programs, and for trying to solve, using new programs, those structures where crystals could never be reproduced. just my 2 cents - Kay Diederichs -- Kay Diederichs http://strucbio.biologie.uni-konstanz.de email: [EMAIL PROTECTED] Tel +49 7531 88 4049 Fax 3183 Fachbereich Biologie, Universitaet Konstanz, Box M647, D-78457 Konstanz
Re: [ccp4bb] The importance of USING our validation tools
This structure (1h6w) provides an interesting comparison; it looks just as I would expect for such an interesting extended fold. There are big peaks on the 3-fold axis; there is wispy density which would be very hard to model - I found an ILE in the wrong rotamer (341A) - (there is ALWAYS something you can improve) - in other words it looks like a real map. And the intensity plots look as expected too. Eleanor Mark J. van Raaij wrote: [message, including the quoted reply from Randy J. Read, trimmed - see above]
Re: [ccp4bb] The importance of USING our validation tools
Hello All, This debacle is actually quite reminiscent of a similar incident that Wayne Hendrickson caught concerning purported tRNA crystals reported in the 1970s. It turned out to be completely fabricated, and the guy's career went down the drain, I think. A good example to tell your trainees. Jacob Keller The refs: 1. True identity of a diffraction pattern attributed to valyl tRNA WAYNE A. HENDRICKSON, BROR E. STRANDBERG, ANDERS LILJAS, L. MARIO AMZEL, EATON E. LATTMAN CONTEXT: SIR - We have examined in detail several publications by H.H. Paradies. One is a report in Nature on 11 April 1970 about single crystals of a valine-specific tRNA from yeast1. We find that the diffraction pattern attributed to valyl tRNA... Nature 303, 195 (19 May 1983) Correspondence 2. A reply from Paradies H.H. PARADIES Nature 303, 196 (19 May 1983) Correspondence
Re: [ccp4bb] The importance of USING our validation tools
On Thu, Aug 16, 2007 at 03:13:29PM +0100, Phil Evans wrote: What do you count as raw data? Rawest are the images - everything beyond that is modelling - but archiving images is _expensive_! Maybe we should contact Google to let them do it for us ;-) http://news.bbc.co.uk/2/hi/technology/6425975.stm I doubt every crystallographer would want access to all raw datasets - but for developers it would be ABSOLUTELY FANTASTIC (similar to things like the JCSG archive). And just imagine all those well collected datasets of 10 years ago and what we could learn from those (and the better structures we could determine) with the modern tools and programs ... Clemens -- * Clemens Vonrhein, Ph.D. vonrhein AT GlobalPhasing DOT com * Global Phasing Ltd., Sheraton House, Castle Park, Cambridge CB3 0AX, UK * BUSTER Development Group (http://www.globalphasing.com)
Re: [ccp4bb] The importance of USING our validation tools
Dear All, Without passing any judgement on the veracity of C3b structure 2hr0, I note that the Ca RMSD of this structure to C3 structure 2a73 was unusually low, compared to the RMSD of 2a73 to the related entries 2a74 and 2i07 by the same group, the bovine C3 structure 2b39, and the C3b and C3c structures 2ice and 2icef. If one took a high-resolution structure as a molecular replacement solution for a new structure at lower resolution this might be expected, but not vice versa? As to whether the structure's problems arise from malfeasance or neglect, I do not understand why the journal did not require that the raw images be made available, given the evidence presented against the published data; isn't that what is done in other fields when such issues are raised? Isn't making the availability of raw data upon request a requirement of publication more practical than trying to set up a vast repository of images when submission to that repository is still a matter of choice? I have several questions regarding the reply that I would like to hear an answer to; perhaps Todd can help obtain them: 1. Could the statement "Statistical disorder resulting in apparent 'gaps' in the lattice has been observed for other proteins" not be supported by citations of the numerous deposited structures, if they indeed exist? 2. I was not convinced that the Z-scores of the PHASER solutions were significant; shouldn't they be greater than 6.0? It didn't look like density at 0.7 sigma was contiguous over the main chain. 3. Can the domain suggested to fill the void in the asymmetric unit be a contaminant when it must be present in stoichiometric ratio in order to provide lattice contacts? Why not present an SDS-PAGE gel of a redissolved crystal? Surely that domain would show up. 4.
I don't understand why the statement "Bulk-solvent modelling is contentious, making many refinements necessary to constrain parameters to obtain acceptable values" was considered an acceptable response to the question of the low-resolution data. Whether one chooses to include low-resolution data with bulk-solvent modelling or to truncate the low-resolution data is a separate issue from the physical effect of solvent on intensities at low resolution. One point in the reply that seemed reasonable is the issue of B-factor variation, because the deposited C3 structures do exhibit a wide range in average B, as well as in resolution, in whether TLS refinement was used, and in how heavily restraints were set. However, that does not really address the issue of seemingly random coil, without other contacts, having such strong contours at 2.5 sigma. I would look forward to learning from people with more experience on these matters. Sincerely, Richard Baxter On Thu, 2007-08-16 at 10:11, Green, Todd wrote: Hello all, I started to write a response to this thread yesterday. I thought the title was great, and the content of Eleanor's email was very helpful. What I didn't like was the indictment in the next-to-last paragraph. This has been followed up with the word fabrication by others. No one knows definitively whether this was fabricated. You have your suspicions, but you don't know. Fabrication suggests malicious wrong-doing. I actually don't think this was the case. I'm probably a bit biased because the work comes from an office down the hall from my own. I'd like to think that if the structure is wrong it could be chalked up to inexperience rather than malice. To me, this scenario of inexperience seems like one that could become more and more prevalent as our field opens up to more and more scientists doing structural work who are not dedicated crystallographers.
Having said that, I think Eleanor started an extremely useful thread as a way of avoiding the pitfalls of crystallography, whether you are a novice or an expert. There's no question that this board is the best way to advance one's knowledge of crystallography. I actually gave a homework assignment that was simply to sign up for the ccp4bb. In reference to the previously mentioned work, I'd also like to hear discussion concurring or not with the response letter, some of which seems plausible to me. I hope I don't ruffle anyone's feathers by my email, but I just thought that it should be said. Cheers- Todd -Original Message- From: CCP4 bulletin board on behalf of Randy J. Read Sent: Thu 8/16/2007 8:22 AM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] The importance of USING our validation tools On Aug 16 2007, Eleanor Dodson wrote: The weighting in REFMAC is a function of SigmaA (plotted in the log file). For this example it will be nearly 1 for all resolution ranges, so the weights are pretty constant. There is also a contribution from the experimental sigma, which in this case seems to be proportional to |F| Originally I expected that the publication of our Brief Communication in Nature would stimulate a lot of discussion on the bulletin board, but clearly
Re: [ccp4bb] The importance of USING our validation tools
I'd like to emphasize that the infamous Table 1 alone should have immediately tipped off any competent reviewer. The last-shell I/sig(I) is 1.3 and Rmerge 0.11 (!). The gap between Rfree and R is extraordinarily low. And all that for a large, purportedly flexible multidomain molecule. Enough to ask more questions, even without initially having the model, data, or frames available. Maybe the infamous Table 1 is still good for something after all. Hiding it in supplemental material does not promote reading it. br From: CCP4 bulletin board [mailto:[EMAIL PROTECTED] On Behalf Of Anastassis Perrakis Sent: Thursday, August 16, 2007 8:13 AM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] The importance of USING our validation tools 1. Make the images available and demand a public apology for spoiling their name. 2. Shut up, retract the paper, buy property in Alaska and disappear.
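For readers wanting the back-of-envelope arithmetic behind this red flag: a common rule of thumb (a sketch, not a rigorous derivation) is that random measurement error alone already gives a shell Rmerge of roughly sqrt(2/pi) / <I/sigma(I)>, since the expected absolute deviation of a Gaussian error is sigma*sqrt(2/pi). With the outer-shell figures quoted in the thread:

```python
import math

def expected_rmerge(i_over_sigma):
    """Rule-of-thumb lower bound on shell Rmerge from random error alone:
    <|eps|> = sigma * sqrt(2/pi), hence Rmerge ~ 0.8 / <I/sigma(I)>."""
    return math.sqrt(2.0 / math.pi) / i_over_sigma

# Outer shell quoted in Table 1: I/sigma(I) = 1.3
r_expected = expected_rmerge(1.3)
print(round(r_expected, 2))   # ~0.61, versus the reported 0.11
```

An Rmerge of 0.11 in a shell where I/sigma(I) is 1.3 is therefore several-fold lower than random error alone would produce, which is exactly the inconsistency flagged here.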
Re: [ccp4bb] The importance of USING our validation tools
The deposited structure 2HR0 shows all the signs of having been refined, deliberately or accidentally, against 'calculated' data. The model used to 'calculate' the data had (almost) constant B-values in a rather empty cell containing no solvent. For example, it could have been a (partial?) molecular replacement solution obtained using real data. It seems to me that it is perfectly possible that two reflection files (or two columns in an mtz file) were carelessly exchanged by a crystallographically inexperienced researcher. This even explains the low CA RMSD to the 2A73 structure, if that had been used as a search fragment; even the suspiciously poor Phaser Z scores can be explained (maybe it was only a partially correct MR solution against the real data). So although my first reaction was that there was overwhelming evidence of fraud, on reflection a relatively benign explanation is still possible. The situation could be clarified fairly quickly if the frames or a crystal or even the original HKL2000 .sca file could be found. What I really don't understand is how the Editors of the revered journal Nature allowed a 'reply' to be printed which made no reference to the request for the essential experimental evidence, i.e. the raw diffraction data, to be produced. Protein crystallography is an experimental science just like any other, even if the results it produces usually stand the test of time better. George Prof. George M. Sheldrick FRS Dept. Structural Chemistry, University of Goettingen, Tammannstr. 4, D37077 Goettingen, Germany Tel. +49-551-39-3021 or -3068 Fax. +49-551-39-2582
Re: [ccp4bb] The importance of USING our validation tools
There are several issues under current discussion. We outline a few of these below, in order of importance. The structure 2hr0 is unambiguously fake. Valid arguments have already been published in a Brief Communication by Janssen et al. (Nature 448, E1-E2, 9 August 2007). However, the published response from the authors of the questioned deposit may sound, to an unfamiliar reader, like a genuine scientific controversy. There are many additional independent signs of intentional data fabrication in this case, above and beyond those already mentioned. One diagnostic is related to the fact that fabricated data will not show the proper features of proteins with respect to disorder. The reported case has a very high ratio of "Fobs" to atomic parameters, thus the phase uncertainty is small. In real structures, fully solvent-exposed chains without stabilizing interactions display intrinsically high disorder, yet in this structure these residues (e.g., Arg932B, Met1325B, Glu1138B, Arg459A, etc.) are impossibly well ordered. The second set of diagnostics is the observation of perfect electron density around impossible geometries. For example, the electron density is perfect (visible even at the 4 sigma level in a 2Fo-Fc map) with no significant negative peaks in an Fo-Fc map around the guanidinium group of Arg1112B, which is in outrageously close contact with carbon atoms of Lys1117B. This observation appears in many other places in the map as well. The issue is not the presence of bad contacts, but the lack of disorder (high B-factors) or negative peaks in an Fo-Fc map in this region that could explain why the bad contacts remain in the model. The third set of diagnostics consists of statistics that do not occur in real structures. The ones mentioned previously are already very convincing (moments, B-factor plots, bulk solvent issues, etc.). We can add more evidence from a round of Refmac refinement of the deposited model versus the deposited structure factors.
The anisotropic scaling factors obtained are unreasonable for a structure in a low-symmetry space group such as C2, which has an inherent lack of constraints on packing symmetry (particularly in view of the problems with lattice contacts already mentioned). The values from a Refmac refinement for a typical structure in space group C2 are: B11 = 0.72, B22 = 1.15, B33 = -2.12, B12 = 0.00, B13 = -1.40, B23 = 0.00 (B12 and B23 are zero due to C2 space-group symmetry). For structure 2hr0: B11 = -0.02, B22 = 0.00, B33 = 0.02, B12 = 0.00, B13 = 0.01, B23 = 0.00. Statistical reasoning can lead to P-values in the range of 10^-6 for such values to be produced by chance in a real structure, but they are highly likely in a fabricated case. The fourth set of diagnostics comprises significant inconsistencies in the published methods, e.g. the authors claim that they collected data from four crystals, yet their data-merging statistics show an R-merge = 0.11 in the last resolution shell. It is simply impossible to get such values, particularly when I/sigma(I) for the last resolution shell was stated as 1.32. Moreover, the overall I/sigma(I) for all data is 5.36 and the overall R-merge is 0.07, values highly inconsistent with the reported data resolution, quality of map and high data completeness (97.3%). Overall, this is just a short list of problems; the indicators of data fabrication/falsification are plentiful and, if needed, can easily be provided to interested parties. We fully support Randy Read's excellent comments with our view of retraction and public discussion of this problem: "Originally I expected that the publication of our Brief Communication in Nature would stimulate a lot of discussion on the bulletin board, but clearly it hasn't. One reason is probably that we couldn't be as forthright as we wished to be. For its own good reasons, Nature did not allow us to use the word fabricated.
Nor were we allowed to discuss other structures from the same group, if they weren't published in Nature." One needs to address this policy with publishers in cases of intentional fraud that can be proven simply by an analysis of the published results. At this point the article needs to be retracted by Nature after Nature's internal investigation, with input from the crystallographic community, rather than after obtaining the results of any potential administrative investigation of fraud. "Another reason is an understandable reluctance to make allegations in public, and the CCP4 bulletin board probably isn't the best place to do that." The discussion of the fraud allegation was initiated by a public reply to a question addressed to a single person, so it happened by chance rather than by intention, but with no complaint from our side. On a different aspect of the discussion, namely data preservation: currently, funding agencies as well as scientific responsibility require authors of any publication to preserve and
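The anisotropic scaling values quoted in this message can be turned into a quick numerical comparison. The sketch below is my own illustration (not part of Refmac; the function name and the flagging threshold are assumptions). It removes the isotropic part of the overall anisotropic B tensor and reports the RMS of what remains: the typical C2 values give a correction of almost 2 A^2, while the 2hr0 values give essentially zero.

```python
import numpy as np

def aniso_magnitude(b):
    """RMS of the trace-free part of the overall anisotropic scaling
    B tensor (in A**2), given as (B11, B22, B33, B12, B13, B23).

    Real low-symmetry crystals almost always need a correction of order
    1 A**2 or more; a tensor that is essentially zero everywhere is one
    more statistical oddity of the kind discussed in this thread.
    """
    b11, b22, b33, b12, b13, b23 = b
    m = np.array([[b11, b12, b13],
                  [b12, b22, b23],
                  [b13, b23, b33]])
    m -= np.eye(3) * np.trace(m) / 3.0          # remove isotropic part
    return float(np.sqrt((m * m).sum() / 3.0))

typical = (0.72, 1.15, -2.12, 0.00, -1.40, 0.00)   # typical C2 values quoted above
deposit = (-0.02, 0.00, 0.02, 0.00, 0.01, 0.00)    # 2hr0 values quoted above
print(aniso_magnitude(typical), aniso_magnitude(deposit))
```

The two magnitudes differ by about two orders of magnitude, which is the disparity the P-value argument above is quantifying.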
Re: [ccp4bb] The importance of USING our validation tools
A few thoughts following on Richard Baxter and George Sheldrick . . . Re: gaps in the lattice, see the tyrosyl-tRNA synthetase structures (1tya for example). Fersht has written a whole book full of insights from these structures. Re: Phaser Z scores. For some MR work with two xtal forms of a structure, I got Z scores of 4.0 and 4.3 for the rotation and translation searches in one form, and 8.7 and 3.5 for the other, using a model with 18% sequence identity. So you don't need great Z scores for the solution to be right. The map calculated with MR phases had a correlation coefficient of 0.22 with the final model. Re: confusing columns in an mtz file. I had the same thought. If the column types were different for experimental versus calculated F's, and refmac only allowed you to refine against an experimental F, could this kind of trouble be avoided? Of course you'd want an option to override the default, for people doing weird things. Dunno about cns or phenix, but didn't we recently see messages about how hard it was to work with cns reflection files, leading to a new conversion program from Kevin? It seems possible to get the wrong column there as well. Re: images. Be careful what you sign - the user agreements with synchrotron facilities in the USA may state that the data are public, not private (as the funding is from the public). Pete
Re: [ccp4bb] The importance of USING our validation tools
Due to these recent, highly publicized irregularities and the ample (snide) remarks I hear about them from non-crystallographers, I am wondering if the trust in macromolecular crystallography is beginning to erode. It is often very difficult even for experts to distinguish fake or wishful thinking from reality. Non-crystallographers will have no chance at all and will consequently not rely on our results as much as we are convinced they could and should. If that is indeed the case, something needs to be done, and sooner rather than later. Best - MM Mischa Machius, PhD Associate Professor UT Southwestern Medical Center at Dallas 5323 Harry Hines Blvd.; ND10.214A Dallas, TX 75390-8816; U.S.A. Tel: +1 214 645 6381 Fax: +1 214 645 6353
Re: [ccp4bb] The importance of USING our validation tools
I'd like to emphasize that the infamous Table 1 alone should have immediately tipped off any competent reviewer. The last-shell I/sig(I) is 1.3 and Rmerge 0.11 (!). And keep in mind that these statistics come from merging data from FOUR different crystals! (That's clearly and unambiguously stated in the Methods section). Dima
Re: [ccp4bb] The importance of USING our validation tools
No one knows definitively if this was fabricated. Well, at least one person does. But I agree, it is important to keep in mind that the proper venue for determining guilt or innocence in a case of fraud is the court system. Until fairly recently, the ideas of presumed innocence and the right to cross-examine accusers and witnesses have been considered fundamental to civil society. The case certainly sounds compelling, but this is all the more reason to adhere to these ideals. Bill Scott
Re: [ccp4bb] The importance of USING our validation tools
On Thu, 16 Aug 2007, Clemens Vonrhein wrote: Maybe we should contact Google to let them do it for us ;-) Better yet, simply download your images to a computer that uses ATT as an internet service provider. All the information will be automatically copied and stored by the NSA. cf: http://www.eff.org/legal/cases/att/faq.php Bill