Dear Martin,

     Thank you for this very clear message about your views on this topic.
There is nothing like well-articulated dissenting views to force a real
assessment of the initial arguments, and you have certainly provided that.

     As your presentation is "modular", I will interleave my comments with
your text, if you don't mind.

--
> Still, after hundreds (?) of emails on this topic, I haven't seen any 
> convincing argument in favor of archiving data. The only convincing 
> arguments are against, and are from Gerard K and Tassos.
>
> Why?
> The question is not what to archive, but still why should we archive all 
> the data.
>
> Because software developers need more data? Should we go to all this 
> effort and cost because 10 developers worldwide need the data for ALL 
> protein structures? Do they really need so much data? Wouldn't it be enough 
> to build a repository of maybe 1000 datasets for development?

     A first impression is that your remark rather looks down on those "10
developers worldwide", a view not out of keeping with that of structural
biologists who have moved away from ground-level crystallography and view
the latter as a "mature technique" - a euphemism for saying that no further
improvements are likely, or even necessary. As Clemens Vonrhein has just
written, it may be the very success of those developers that has given the
benefit of what software can do to users who don't have the faintest idea of
what it does, how it does it, or what its limitations are and how to
overcome them - users who therefore take it for granted.

     Another side of the "mature technique" kiss of death is the underlying
assumption that the demands placed on crystallographic methods are
themselves static, and nothing could be more misleading. We get caught time
and again by rushed shifts in technology without proper precautions in case
the first adaptations of the old methods do not perform as well as they
might later. Let me quote an example: 3x3 CCD detectors. It was too quickly
and hurriedly assumed that, after correcting the images recorded on these
instruments for geometric distortions and flat-field response, one would get
images that could be processed as if they came from image plates (or film).
This turned out to be a mistake: "corner effects" were later diagnosed, that
were partially correctible by a position-dependent modulation factor,
applied for instance by XDS in response to the problem. That correction is
not just detector-dependent and applicable to all datasets recorded on a
given detector, unfortunately, as it is related to a spatial variation in
the point-spread function. - so you really need to reprocess each set of
images to determine the necessary corrections. The tragic thing is that for
a typical resolution limit and detector distance, these corners cut into the
quick of your strongest secondary-structure defining data. If you have kept
your images, you can try and recover from that; otherwise, you are stuck
with what can be seriously sub-optimal data. Imagine what this can do to SAD
anomalous difference when Bijvoet pairs fall on detector positions where
these corner effects are vastly different ... . 
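
     To make concrete what such a position-dependent modulation factor could
look like, here is a minimal sketch in Python. It is purely illustrative and
not the actual XDS algorithm: the tile size, the Gaussian form of the
falloff and all parameter values below are assumptions made up for the
example.

    import numpy as np

    TILE_PX = 1024        # hypothetical tile edge length, in pixels
    CORNER_RADIUS = 40.0  # hypothetical length scale of PSF broadening (px)
    MAX_LOSS = 0.15       # hypothetical maximum fractional loss at a corner

    def corner_modulation(x, y):
        # Multiplicative correction for spot centroids at detector positions
        # (x, y): the factor rises above 1 near the internal tile corners,
        # compensating for intensity spilling out of the integration box
        # where the point-spread function is broader; elsewhere it tends to 1.
        dx = np.minimum(x % TILE_PX, TILE_PX - x % TILE_PX)  # to nearest tile edge in x
        dy = np.minimum(y % TILE_PX, TILE_PX - y % TILE_PX)  # to nearest tile edge in y
        d = np.hypot(dx, dy)                                 # to nearest tile corner
        loss = MAX_LOSS * np.exp(-(d / CORNER_RADIUS) ** 2)  # assumed Gaussian falloff
        return 1.0 / (1.0 - loss)

    # usage: one reflection near an internal corner, one at a tile centre
    x = np.array([1020.0, 512.0])
    y = np.array([1030.0, 512.0])
    print(np.array([1500.0, 1500.0]) * corner_modulation(x, y))

The only point of the sketch is that the size of the correction varies with
position on the detector and with the (dataset-dependent) point-spread
function - which is exactly why the raw images are needed to re-derive it.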

     Another example is that of the recent use of numerous microcrystals,
each giving a very small amount of data, to assemble datasets for solving
GPCR structures. The methods for doing this, for getting the indexing and
integration of such thin slices of data and getting the overall scaling to
behave, are still very rough. It would be pure insanity to throw these
images away and not count on better algorithms coming along to improve the
final data extractable from them. 

--
> Does anyone really believe that our view on the actual problem, the 
> function of the proteins, changes with the analysis of whatever scattering 
> is still in the images but not used by today's software? Crystal 
> structures are static snapshots, obtained under artificial conditions. 
> In solution (still the physiological state) they might look different, not 
> much, but at least far more dynamic. Does it therefore matter whether we 
> know some sidechain positions better (in the crystal structure) when 
> re-analysing the data? In turn, is our current software so bad that we 
> would expect strong differences when re-analysing the data? No. And if the 
> structures change upon reanalysis (more or less), who re-interprets the 
> structures, who re-writes the papers?

     I think that, rather than asking rhetorical questions about people's
beliefs regarding such a general question, one needs testimonies about real
life situations. We have helped a great many academic groups in the last 15
years: in every case, they ended up feeling really overjoyed that they had
kept their images when they had, and immensely regretful when they hadn't. 
I noticed, for example, that your last PDB entry, 1LKX (2002), does not have
structure factor data associated with it. It is therefore impossible for
anyone to do anything about its 248 REMARK 500 records complaining about bad
(PHI,PSI) values; whereas if the structure factors had been deposited, all
our own experience in this area suggests that today's refinement programs
would have helped a great deal towards this.

     Otherwise you invoke the classical arguments about the possible
artificiality of crystal structures because they are "static", etc. Even if
this is the case, it does not diminish the usefulness of characterising what
they enable us to see with the maximum possible care and precision. The
"dynamic" aspect of NMR structure ensembles can hide a multitude of factors
that are inaccuracies rather than a precise characterisation of dynamics.
Shall I dare mention that a favourite reverse acronym for NMR is "Needs More
Resolution"? (Sorry, crystallographer's joke ;-) ...). 

     Finally, even if no one would have the time to write a paper after
having corrected or improved a PDB entry, he/she would still have the
benefit of those corrections or improvements when using that entry for
modelling or for molecular replacement.

--
> There are many, many cases where researchers re-did structures (or did 
> structures closely related to already available ones, like mutants, 
> structures from closely related species, etc.), even after 10 years. I 
> guess they used the latest software in the different cases, thus 
> incorporating all the software development of those 10 years. And are the 
> structures really different (beyond the introduced changes, mutations, 
> etc.)? Different because of the software used?

     Again this is a very broad question, to which the answers would
constitute a large and varied sample. Using the absence of organised,
detailed evidence to justify not doing something is not the best kind of
argument.

--
> The comparison with next-generation sequencing data is useful here, but 
> only in the sense Tassos explained. Well, of course not every position in 
> the genomic sequence is fixed. Therefore it is sometimes useful to look at 
> the original data (the traces, as Gerard B pointed out). But we already 
> know that every single organism is different (especially eukaryotes) and 
> therefore it is absolutely enough to store the computationally reduced and 
> merged data. If one needs better, position-specific data, sequencing and 
> comparing single species becomes necessary, like in the ENCODE project, the 
> sequencing of about 100 Saccharomyces strains, the sequencing of 1000 
> Arabidopsis strains, etc. Discussions about single positions are useless if 
> they are not statistically relevant. They need to be analysed in the 
> context of populations, large cohorts of patients, etc. If we need 
> personalized medicine adapted to personal genomes, we would also need 
> personal sets of protein structures which we cannot provide yet. Therefore, 
> storing the DNA in the freezer is better and cheaper than storing all the 
> sequencing raw data. Do you think a reviewer re-sequences, or re-assembles, 
> or re-annotates a genome, even if access to the raw reads would be 
> available? If you trust these data why don't we trust our structure 
> factors? Do you trust electron microscopy images, movies of GFP-tagged 
> proteins? Do you think what is presented for a single or a few visible 
> cells is also found in all cells?

     I think we are straying into facts and figures completely unrelated to
the initial topic. Characteristically, these come from areas in which
fuzziness is rampant - I do not see why this should deter crystallographers
from treasuring the high level of accurate detail reachable by their own
methods in their own area.

--
> And now, how many of you (if not everybody) use structures from yeast, 
> Drosophila, mouse, etc. as MODELS for human proteins? If we stick to this 
> thinking, who would care about potential minor changes in the structures 
> upon re-analysis (and in the light of this discussion, arguing about 
> specific genomic sequence positions becomes unimportant as well)?
>
> Is any of the archived data useful without manual evaluation upon 
> archiving? This is especially relevant for structures not solved yet. Do 
> the images belong to the structure factors, if only images are available, 
> where is the corresponding protein sequence, has it been sequenced, what 
> has been in the buffer/crystallization condition, what has been used during 
> protein purification, what was the intention for crystallization - e.g. a 
> certain functional state, that the protein was forced to by artificial 
> conditions, etc. etc. Who wants to evaluate that, and how? The question is 
> not whether we could do it. We could do it, but wouldn't it advance science 
> far more if we spent the time and money on new projects rather than on 
> evaluation, administration, etc.?

     There are many ways of advancing science, and perhaps every
specialist's views of this question are biased towards his/her own. We agree
that archiving of images without all the context within which they were
recorded would be futile. Gerard K raised the issue of all the manual work
one might have to contemplate if the PDB staff were to manually check that
the images do belong where they are supposed to. I think this is a false
problem. We need to get the synchrotron beamlines to do a better, more
consistent, more standardised job of keeping interconnected records linking
user projects to sample descriptors to image sets to processing results. The
pharma industry do that successfully: when they file the contents of their
hard disk after a synchrotron trip, they do not rely on extra staff to check
that the images do go with the targets and the ligands, as if they had just
received them in a package delivered in the morning post: consistency is
built into their book-keeping system, which includes the relevant segment of
the process that gets executed at the synchrotron. 
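
     Purely as an illustration of what such interconnected records could
look like - the field names and structure below are hypothetical, and not
taken from any existing beamline or pharma system - a minimal sketch in
Python might be:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ProcessingResult:
        software: str              # name and version of the processing program
        resolution_limit: float    # high-resolution cutoff, in angstroms
        reduced_data_path: str     # where the scaled/merged data were written

    @dataclass
    class ImageSet:
        directory: str             # where the raw images live
        n_images: int
        detector: str
        results: List[ProcessingResult] = field(default_factory=list)

    @dataclass
    class Sample:
        descriptor: str            # target / construct / ligand identifier
        conditions: str            # buffer, crystallisation and cryo conditions
        image_sets: List[ImageSet] = field(default_factory=list)

    @dataclass
    class UserProject:
        project_id: str
        samples: List[Sample] = field(default_factory=list)

With a chain of this kind maintained at the beamline, the consistency checks
that worried Gerard K become a by-product of the book-keeping rather than a
new manual curation task for the PDB staff.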

--
> Be honest: How many of you have really, and completely, reanalysed your own 
> data, that you have deposited 10 years ago, with the latest software? What 
> changes did you find? Did you have to re-write your former discussions in 
> the publications? Do you think that the changes justify the efforts and 
> costs of worldwide archiving of all data?

     OK, good question, but the answer might not be what you expect. It is
the possibility of going back to raw data, if some "auditing" of an old
result is required, that matters most. It is like an insurance policy:
would you ask people "How many of you have made calls on your policies
recently?" and use the small proportion of YESs as an argument for not
getting one?

--
> Well, there are always single cases (some have been mentioned in earlier 
> emails) where these things matter or mattered. But does this 
> really justify all the future efforts and costs to archive the 
> exponentially (!) increasing amount of data? Do we need all this effort for 
> better statistics tables? Do you believe the standard lab biologist will 
> look into all the images at all? Is the effort just for us 
> crystallographers? As long as just a few dozen users would re-analyse the 
> data, it is not worth it.

     I think that here again you are calling upon the argument of the
"market" for structural results among "standard lab biologist(s)". This is
important of course, and laudable efforts are being made by the PDB to make
its contents more approachable and digestible by that audience. That is a
different question, though, from that of continuing to improve the
standards of quality of crystallographic results produced by the community,
and in particular of the software tools produced by methods developers. On
that side of the divide, different criteria apply from those that matter the
most in the "consumer market" of lab biologists. The shift to
maximum-likelihood methods in phasing and refinement, for instance, did not
take place in response to popular demand from that mass market, if I recall
- and yet it made a qualitative difference to the quality and quantity of
the results they now have at their disposal.

--
> I like question marks, and maybe someone can give me an argument for 
> archiving images. At the moment I would vote for not archiving.

     I think that the two examples I gave at the beginning should begin to
answer your "why" question: because each reduced dataset might have fallen
victim to unanticipated shortcomings of the software (and underlying
assumptions) available at the time. It will be hard to convince me that one
can anticipate the absence of unanticipated pitfalls of this kind ;-).


     With best wishes,
     
          Gerard.



> With best regards,
>
> Martin
>
>
> P.S. For the next-gen sequencing data, they have found a new way of 
> transferring the data, called VAN (the newbies might google for it) in 
> analogy to the old-fashioned and slow LAN and WLAN. Maybe we will also 
> adopt this when archiving our data?
>
> -- 
> Priv. Doz. Dr. Martin Kollmar
>
> Max-Planck-Institute for Biophysical Chemistry
> Group Systems Biology of Motor Proteins
> Department NMR-based Structural Biology
> Am Fassberg 11
> 37077 Goettingen
> Deutschland
>
> Tel.: +49 551 2012260 / 2235
> Fax.: +49 551 2012202
>
> www.motorprotein.de (Homepage)
> www.cymobase.org (Database of Cytoskeletal and Motor Proteins)
> www.diark.org (diArk - a resource for eukaryotic genome research)
> www.webscipio.org (Scipio - eukaryotic gene identification)

-- 

     ===============================================================
     *                                                             *
     * Gerard Bricogne                     g...@globalphasing.com  *
     *                                                             *
     * Global Phasing Ltd.                                         *
     * Sheraton House, Castle Park         Tel: +44-(0)1223-353033 *
     * Cambridge CB3 0AX, UK               Fax: +44-(0)1223-366889 *
     *                                                             *
     ===============================================================
