Dear James,

This is technically ingenious stuff. Perhaps it could be applied to help with the 'full archive challenge', i.e. the many data sets that will never lead to publication or database deposition?
However, for the latter (publication/deposition) subset you would surely not 'tamper' with the raw measurements? The 'grey area' between the two clear-cut cases, i.e. where publication/deposition MAY eventually result, then becomes the challenge of whether to compress or not. (I would still prefer no tampering.)

Greetings,
John

Prof John R Helliwell DSc

On 24 Oct 2011, at 22:56, James Holton <jmhol...@lbl.gov> wrote:

> The Pilatus is fast, but for decades now we have had detectors that can read out in ~1 s. This means that you can collect a typical ~100-image dataset in a few minutes (if flux is not limiting). Since there are ~150 beamlines currently operating around the world and they are open about 200 days/year, we should be collecting ~20,000,000 datasets each year.
>
> We're not.
>
> The PDB only gets about 8000 depositions per year, which means either we throw away 99.96% of our images, or we don't actually collect images anywhere near the ultimate capacity of the equipment we have. In my estimation, both of these play about equal roles, with ~50-fold attrition between ultimate data collection capacity and actually collected data, and another ~50-fold attrition between collected data sets and published structures.
>
> Personally, I think this means that the time it takes to collect the final dataset is not rate-limiting in a "typical" structural biology project/paper. This does not mean that the dataset is of little value. Quite the opposite! About 3000x more time and energy is expended preparing for the final dataset than is spent collecting it, and these efforts require experimental feedback. The trick is figuring out how best to compress the "data used to solve a structure" for archival storage. Do the "previous data sets" count? Or should the compression be "lossy" about such historical details? Does the stuff between the spots matter? After all, h,k,l,F,sigF is really just a form of data compression. In fact, there is no such thing as "raw" data. Even "raw" diffraction images are a simplification of the signals that came out of the detector electronics. But we round off and average over a lot of things to remove "noise", largely because "noise" is difficult to compress. The question of how much compression is too much compression depends on which information (aka noise) you think could be important in the future.
>
> When it comes to fine-sliced data, such as that from a Pilatus, the main reason it doesn't compress very well is not the spots but the background, which occupies thousands of times more pixels than the spots. Yes, there is diffuse scattering information in the background pixels, but this kind of data is MUCH smoother than the spot data (by definition), and is therefore optimally stored in larger pixels. Last year, I messed around a bit with applying different compression protocols to the spots and the background, and found that ~30-fold compression can easily be achieved if you apply h264 to the background and store the "spots" with lossless PNG compression:
>
> http://bl831.als.lbl.gov/~jamesh/lossy_compression/
>
> I think these results "speak" to the relative information content of the spots and the pixels between them. Perhaps at least the "online version" of archived images could be in some sort of lossy-background format, with the "real images" in some sort of slower storage (like a room full of tapes that are available upon request)? Would 30-fold compression make the storage of image data tractable enough for some entity like the PDB to be able to afford it?
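(As a rough illustration of the spot/background split described above -- and not the actual protocol behind the numbers at the URL, which used h264 for the background and lossless PNG for the spots -- here is a minimal Python sketch on a purely synthetic image. The 6-sigma threshold, the 4x4 binning of the background, and zlib as the lossless codec are all stand-ins chosen only to keep the example self-contained; real images would need proper per-pixel background estimation and spot finding. The point is just that the smooth background tolerates being stored in larger pixels, while the spot pixels are kept exactly.)

import zlib
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1024x1024 "diffraction image": a smooth background with Poisson
# noise, plus a few thousand bright "spot" pixels.
y, x = np.mgrid[0:1024, 0:1024]
background = 50.0 * np.exp(-((x - 512) ** 2 + (y - 512) ** 2) / (2 * 400.0 ** 2))
image = rng.poisson(background).astype(np.int32)
spots = rng.random(image.shape) < 0.002
image[spots] += rng.integers(500, 5000, size=int(spots.sum()), dtype=np.int32)

# 1) Crude spot mask: anything well above the (here: known) smooth background.
mask = image > background + 6 * np.sqrt(background + 1)

# 2) Lossless part: positions and exact values of the spot pixels.
spot_idx = np.flatnonzero(mask).astype(np.int64)
spot_val = image[mask]
lossless_bytes = len(zlib.compress(spot_idx.tobytes() + spot_val.tobytes(), 9))

# 3) Lossy part: background with spots replaced, stored in 4x4-binned
#    (i.e. larger) pixels, which is adequate for such smooth data.
bg = np.where(mask, background, image).astype(np.float32)
binned = bg.reshape(256, 4, 256, 4).mean(axis=(1, 3)).astype(np.float32)
lossy_bytes = len(zlib.compress(binned.tobytes(), 9))

raw_bytes = image.nbytes
ratio = raw_bytes / (lossless_bytes + lossy_bytes)
print("raw: %d B  spots: %d B  background: %d B  ~%.0f-fold compression"
      % (raw_bytes, lossless_bytes, lossy_bytes, ratio))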
>
> I go to a lot of methods meetings, and it pains me to see the most brilliant minds in the field starved for "interesting" data sets. The problem is that it is very easy to get people to send you data that is so bad it can't be solved by any software imaginable (I've got piles of that!). As a developer, what you really need is a "right answer" so you can come up with better metrics for how close you are to it. Ironically, bad, unsolvable data that is connected to a right answer (aka a PDB ID) is very difficult to obtain. The explanations usually involve protestations about being in the middle of writing up the paper, the student who graduated and left tapes nobody understands how to read, or the RAID that crashed and lost it all, etc. etc. Then again, just finding someone who has a data set with the kind of problem you are interested in is a lot of work! So is figuring out which problem affects the most people, and is therefore "interesting".
>
> Is this not exactly the kind of thing that publicly accessible, centralized scientific databases are created to address?
>
> -James Holton
> MAD Scientist
>
> On 10/16/2011 11:38 AM, Frank von Delft wrote:
>> On the deposition of raw data:
>>
>> I recommend to the committee that before it convenes again, every member should go and collect some data on a beamline with a Pilatus detector [feel free to join us at Diamond]. Because by the time any recommendations are likely to emerge, most beamlines will have one of those (or similar), we'll be generating more data than the LHC, and users will be happy just to have it integrated, never mind worrying about its fate.
>>
>> That's not an endorsement, btw, just an observation/prediction.
>>
>> phx.
>>
>> On 14/10/2011 23:56, Thomas C. Terwilliger wrote:
>>> For those who have strong opinions on what data should be deposited...
>>>
>>> The IUCr is just starting a serious discussion of this subject. Two committees, the "Data Deposition Working Group", led by John Helliwell, and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su), are working on this.
>>>
>>> Two key issues are (1) the feasibility and importance of deposition of raw images and (2) deposition of sufficient information to fully reproduce the crystallographic analysis.
>>>
>>> I am on both committees and would be happy to hear your ideas (off-list). I am sure the other members of the committees would welcome your thoughts as well.
>>>
>>> -Tom T
>>>
>>> Tom Terwilliger
>>> terwilli...@lanl.gov
>>>
>>>>> This is a follow-up (or a digression) to James comparing the test set to missing reflections. I have heard this issue mentioned before but was always too lazy to actually pursue it.
>>>>>
>>>>> So.
>>>>>
>>>>> The role of the test set is to prevent overfitting. Let's say I have the final model, I monitored the Rfree every step of the way, and I can conclude that there is no overfitting. Should I do the final refinement against the complete dataset?
>>>>>
>>>>> IMCO, I absolutely should. The test set reflections contain information, and the "final" model is actually biased towards the working set. Refining against all the data can only improve the accuracy of the model, if only slightly.
>>>>>
>>>>> The second question is practical. Let's say I want to deposit the results of the refinement against the full dataset as my final model. Should I not report the Rfree, and instead insert a remark explaining the situation? If I report the Rfree from before the test set was folded back in, it is certain that every validation tool will report a mismatch. It does not seem that the PDB has a mechanism to deal with this.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Ed.
>>>>>
>>>>> --
>>>>> Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
>>>>> Julian, King of Lemurs
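(Ed's mismatch can be made concrete with a toy calculation. The sketch below is synthetic, not real refinement output: the "observed" amplitudes, the noise levels, and the 5% test-set fraction are all invented, and a real program computes R factors from a full atomic model rather than from amplitudes plus noise. It only illustrates why an Rfree quoted from before the final all-data refinement will not be reproduced when a validation tool recomputes R over the flagged reflections against the deposited model.)

import numpy as np

def r_factor(f_obs, f_calc):
    # Standard crystallographic R = sum|Fo - Fc| / sum Fo
    return np.abs(f_obs - f_calc).sum() / f_obs.sum()

rng = np.random.default_rng(1)
n = 20000
f_obs = rng.gamma(shape=2.0, scale=100.0, size=n)   # stand-in amplitudes
free_flag = rng.random(n) < 0.05                    # ~5% test set

# Model A: refined against the working set only; it fits the held-out
# test reflections a bit worse (Rfree > Rwork).
# Model B: the same model after a final round against ALL reflections;
# the former test reflections are now fit slightly better.
noise_work = rng.normal(0, 40, n)
f_calc_a = np.where(free_flag, f_obs + rng.normal(0, 52, n), f_obs + noise_work)
f_calc_b = np.where(free_flag, f_obs + rng.normal(0, 42, n), f_obs + noise_work)

print("Model refined on working set only:")
print("  Rwork = %.3f" % r_factor(f_obs[~free_flag], f_calc_a[~free_flag]))
print("  Rfree = %.3f" % r_factor(f_obs[free_flag], f_calc_a[free_flag]))
print("After final refinement against all data (the deposited model):")
print("  R over former test set = %.3f  <- no longer matches the reported Rfree"
      % r_factor(f_obs[free_flag], f_calc_b[free_flag]))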