I'll second that... I can't remember anybody on the barricades about "corrected" CCD images, but they've just been so much more practical.

Different kind of problem, I know, but equivalent situation: the people to ask are not the purists, but the ones struggling with the huge volumes of data. I'll take the lossy version any day if it speeds up real-time evaluation of data quality, helps me browse my datasets, and allows me to do remote but intelligent data collection.

phx.



On 08/11/2011 02:22, Herbert J. Bernstein wrote:
Dear James,

    You are _not_ wasting your time.  Even if the lossy compression ends
up only being used to stage preliminary images forward on the net while
full images slowly work their way forward, having such a compression
that preserves the crystallography in the image will be an important
contribution to efficient workflows.  Personally I suspect that
such images will have more important uses, e.g. facilitating
real-time monitoring of experiments using detectors providing
full images at data rates that simply cannot be handled without
major compression.  We are already in that world.  The reason that
the Dectris images use Andy Hammersley's byte-offset compression,
rather than going uncompressed or using CCP4 compression is that
in January 2007 we were sitting right on the edge of a nasty
CPU-performance/disk bandwidth tradeoff, and the byte-offset
compression won the competition.   In that round a lossless
compression was sufficient, but just barely.  In the future,
I am certain some amount of lossy compression will be
needed to sample the dataflow while the losslessly compressed
images work their way through a very back-logged queue to the disk.
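
For readers who have not met it, the idea behind byte-offset compression is
just delta coding with escape codes for large jumps.  A minimal Python sketch
of that idea follows; it is a simplified illustration only, ignoring the wider
escape levels and byte-order details of the real CBF/imgCIF specification as
implemented in CBFlib.

    import struct

    def byte_offset_compress(pixels):
        # Illustrative delta coding with escape codes, in the spirit of the
        # CBF byte-offset scheme; see the imgCIF/CBF spec for the real rules.
        out, last = bytearray(), 0
        for value in pixels:
            delta = value - last
            if -127 <= delta <= 127:
                out += struct.pack('<b', delta)           # 1-byte delta
            elif -32767 <= delta <= 32767:
                out += struct.pack('<b', -128)            # escape to 16 bits
                out += struct.pack('<h', delta)
            else:
                out += struct.pack('<b', -128)            # escape to 32 bits
                out += struct.pack('<h', -32768)
                out += struct.pack('<i', delta)
            last = value
        return bytes(out)

    def byte_offset_decompress(data):
        pixels, last, i = [], 0, 0
        while i < len(data):
            (delta,) = struct.unpack_from('<b', data, i); i += 1
            if delta == -128:
                (delta,) = struct.unpack_from('<h', data, i); i += 2
                if delta == -32768:
                    (delta,) = struct.unpack_from('<i', data, i); i += 4
            last += delta
            pixels.append(last)
        return pixels

    # Round trip: deltas between neighbouring pixels are usually small,
    # so most pixels cost one byte instead of two or four.
    assert byte_offset_decompress(byte_offset_compress([10, 12, 500, 499])) == [10, 12, 500, 499]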

    In the longer term, I can see people working with lossy compressed
images to analyze massive volumes of data and select the
1% to 10% that will be useful in a final analysis, which may then
need to be revisited in lossless mode.  If you can reject 90% of the
images with a fraction of the effort needed to work with the
resulting 10% of good images, you have made a good decision.

    And then there is the inevitable need to work with images on
portable devices with limited storage over cell and WiFi networks. ...

    I would not worry about upturned noses.  I would worry about
the engineering needed to manage experiments.  Lossy compression
can be an important part of that engineering.

    Regards,
      Herbert


At 4:09 PM -0800 11/7/11, James Holton wrote:
So far, all I really have is a "proof of concept" compression algorithm here:
http://bl831.als.lbl.gov/~jamesh/lossy_compression/

Not exactly "portable" since you need ffmpeg and the x264 libraries
set up properly.  The latter seems to be constantly changing things
and breaking the former, so I'm not sure how "future proof" my
"algorithm" is.

Something that caught my eye recently was fractal compression,
particularly since FIASCO has been part of the NetPBM package for
about 10 years now.  It seems to give compression-vs-quality comparable
to x264 (to my eye), but I'm presently wondering if I'd be wasting my
time developing this further?  Will the crystallographic world simply
turn up its collective nose at lossy images?  Even if it means waiting
6 years for "Nielsen's Law" to make up the difference in network
bandwidth?

-James Holton
MAD Scientist

On Mon, Nov 7, 2011 at 10:01 AM, Herbert J. Bernstein
<y...@bernstein-plus-sons.com>  wrote:
  This is a very good question.  I would suggest that both versions
  of the old data are useful.  If what is being done is simple validation
  and regeneration of what was done before, then the lossy compression
  should be fine in most instances.  However, when what is being
  done hinges on the really fine details -- looking for lost faint
  spots just peeking out from the background, looking at detailed
  peak profiles -- then the lossless compression version is the
  better choice.  The annotation for both sets should be the same.
  The difference is in storage and network bandwidth.

  Hopefully the fraud issue will never again rear its ugly head,
  but if it should, then having saved the losslessly compressed
  images might prove to have been a good idea.

  To facilitate experimentation with the idea, if there is agreement
  on the particular lossy compression to be used, I would be happy
  to add it as an option in CBFlib.  Right now all the compressions
  we have are lossless.
  Regards,
   Herbert


  =====================================================
   Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  y...@dowling.edu
  =====================================================

  On Mon, 7 Nov 2011, James Holton wrote:

  At the risk of sounding like another "poll", I have a pragmatic question
  for the methods development community:

  Hypothetically, assume that there was a website where you could download
  the original diffraction images corresponding to any given PDB file,
  including "early" datasets that were from the same project, but because of
  smeary spots or whatever, couldn't be solved.  There might even be datasets
  with "unknown" PDB IDs because that particular project never did work out,
  or because the relevant protein sequence has been lost.  Remember, few of
  these datasets will be less than 5 years old if we try to allow enough time
  for the original data collector to either solve it or graduate (and then
  cease to care).  Even for the "final" dataset, there will be a delay, since
  the half-life between data collection and coordinate deposition in the PDB
  is still ~20 months. Plenty of time to forget.  So, although the images were
  archived (probably named "test" and in a directory called "john"), it may be
  that the only way to figure out which PDB ID is the "right answer" is by
  processing them and comparing to all deposited Fs.  Assume this was done.
  But there will always be some datasets that don't match any PDB.  Are those
  interesting?  What about ones that can't be processed?  What about ones that
  can't even be indexed?  There may be a lot of those!  (hypothetically, of
  course).
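
  As an aside on the "compare to all deposited Fs" step: conceptually it only
  needs a correlation of amplitudes over common reflections, something like
  the hedged sketch below (the dict-keyed-by-hkl input format is an
  assumption, and getting both datasets onto a common indexing convention is
  the hard part in practice).

    import math

    def amplitude_correlation(f_obs, f_pdb):
        # Linear correlation of |F| over common hkl.  f_obs and f_pdb are
        # assumed to be dicts mapping (h, k, l) -> amplitude.
        common = sorted(set(f_obs) & set(f_pdb))
        if len(common) < 2:
            return 0.0
        x = [f_obs[hkl] for hkl in common]
        y = [f_pdb[hkl] for hkl in common]
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0

  A dataset would be declared a "match" when its correlation against one
  deposited entry stands well above the background of correlations against
  unrelated entries.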

  Anyway, assume that someone did go through all the trouble to make these
  datasets "available" for download, just in case they are interesting, and
  annotated them as much as possible.  There will be about 20 datasets for any
  given PDB ID.

  Now assume that for each of these datasets this hypothetical website has
  two links, one for the "raw data", which will average ~2 GB per wedge (after
  gzip compression, taking at least ~45 min to download), and a second link
  for a "lossy compressed" version, which is only ~100 MB/wedge (2 min
  download).  When decompressed, the images will visually look pretty much
  like the originals, and generally give you very similar Rmerge, Rcryst,
  Rfree, I/sigma, anomalous differences, and all other statistics when
  processed with contemporary software.  Perhaps a bit worse.  Essentially,
  lossy compression is equivalent to adding noise to the images.
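
  One way to make the "added noise" picture quantitative is to compare each
  decompressed frame with its original pixel by pixel; a minimal numpy sketch
  (array loading left out, shapes assumed to match):

    import numpy as np

    def effective_added_noise(original, decompressed):
        # RMS pixel difference between an original image and its lossy
        # round trip, in detector counts.
        diff = decompressed.astype(np.float64) - original.astype(np.float64)
        return float(np.sqrt(np.mean(diff ** 2)))

  If that RMS difference is small compared with the photon (Poisson) noise
  already in the data, roughly sqrt(counts) per pixel, the compression error
  is arguably lost in the noise.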

  Which one would you try first?  Does lossy compression make it easier to
  hunt for "interesting" datasets?  Or is it just too repugnant to have
  "modified" the data in any way shape or form ... after the detector
  manufacturer's software has "corrected" it?  Would it suffice to simply
  supply a couple of "example" images for download instead?

  -James Holton
  MAD Scientist

