The spatial responses of neurons in the first stages of the human visual system
are well approximated by Gabor features (the wavelet that arises from
multiplying a spatial sinusoid, i.e. sin(omega * dot(somedirection, [x y])),
by a Gaussian), and the visual system is more sensitive to some frequencies
of these Gabor features than others.  (Interestingly, amblyopic
vision is sensitive to a different set of spatial frequencies.) JPEG, JPEG2000,
and MPEG compression exploit this feature of visual perception by spending more
bits on spatial frequencies that are likely to be perceptually salient, given a
typical viewing distance and pixel density.
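
Here's roughly what that looks like in numpy, in case the formula is too
terse; the particular size, wavelength, and sigma below are just numbers I
pulled out of the air:

    import numpy as np

    def gabor_kernel(size=33, wavelength=8.0, theta=0.0, sigma=4.0):
        # a sinusoid varying along direction theta, windowed by a Gaussian
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        u = x * np.cos(theta) + y * np.sin(theta)  # coordinate along theta
        sinusoid = np.sin(2 * np.pi * u / wavelength)
        gaussian = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
        return sinusoid * gaussian

Convolving an image with a bank of these kernels at various wavelengths and
orientations gives you something like those neural responses.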

(I'm going to stop saying "spatial frequencies".  All the frequencies in this
post are spatial; I'm not going to talk about how many terahertz is red or how
frequently you microsaccade.)

Dithering or halftoning is another sort of lossy data compression problem.  In
a typical case, given a picture with, say, 8 bits per pixel of luminance
information, you want to produce another picture with only 1 bit per pixel, but
which nevertheless appears identical to the original, to a human viewer.  This
is similar to the problem JPEG solves, but with the additional constraint that
you don't have the freedom to specify the "decompression algorithm", because
biology has already determined it for you.
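
The brain-dead baseline, for comparison, is thresholding each pixel
independently, which throws away every gray level and turns smooth gradients
into blobs (assume `img` is a floating-point grayscale array scaled to
[0, 1]):

    import numpy as np

    def threshold(img, level=0.5):
        # 1 bit per pixel, no attempt at preserving apparent gray levels
        return (img >= level).astype(np.uint8)

Dithering does better by trading spatial resolution for apparent gray levels:
it arranges the black and white pixels so that local averages approximate the
original intensities.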

There is clearly a tradeoff here between different frequencies: the bits you
use to encode higher frequencies are not available to encode lower frequencies,
and vice versa.  This implies that there's an unavoidable tradeoff between
precisely locating edges and precisely rendering levels of brightness.  But
that doesn't mean that standard algorithms are Pareto-optimal in these
tradeoffs, and because of the human visual system's varying sensitivity to
errors at different Gabor frequencies, even Pareto-optimal algorithms could
possibly be improved.

A special case where this is a particularly easy tradeoff is when the pixels
are very, very small: say, 600 or 1200 dpi.  At a typical viewing distance of 1
meter, a 600dpi pixel subtends about 9 seconds of arc, and a 1200dpi pixel
about half that.  A foveal cone subtends some
30 arcseconds, and ideal human foveal visual resolution is commonly cited as 20
arcseconds, although "vernier acuity", the ability to visually detect edge
misalignments that vernier scales exploit, goes down to 4-6 arcseconds.  So
modern inkjet printers have the advantage of being able to use spatial
frequencies to which we are not merely neurally insensitive but which we
actually cannot see at all, except perhaps via aliasing and
microsaccade-based subpixel effects.  (Apparently this "visual hyperacuity"
is still not understood, and is
listed on Wikipedia's "Perceptual paradox" page; my bet is on microsaccades and
antialiasing.)  So you can guiltlessly devote none of your bits to representing
the top octave or two of the frequency range, which gives you plenty of bits to
encode the lower frequencies that you can actually see.  That's a tradeoff you
don't even have to think about.  (You just have to be careful that your
halftoning algorithm isn't going to generate artifacts that alias down into the
visual range.)
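
The angular sizes above are just small-angle arithmetic; here's a sketch you
can use to check other dpi/distance combinations:

    import math

    ARCSEC_PER_RADIAN = math.degrees(1) * 3600  # about 206265

    def pixel_arcsec(dpi, viewing_distance_m):
        pixel_m = 0.0254 / dpi                  # pixel pitch in meters
        return ARCSEC_PER_RADIAN * pixel_m / viewing_distance_m

    print(pixel_arcsec(600, 1.0))        # about 8.7 arcseconds
    print(pixel_arcsec(1200, 1.0))       # about 4.4 arcseconds
    print(pixel_arcsec(130, 0.25) / 60)  # about 2.7 arcminutes (next paragraph)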

However, we don't always enjoy the advantage of being able to use 10-arcsecond
pixels.  The pixels on the netbook screen I'm typing this on are about 0.2mm
across, or about 130dpi, which is considerably denser than is typical for a
computer screen.  If I look at the screen from 25cm, each pixel subtends about
2½ minutes of arc.  And I'm thinking about a laser-printed microfilm project
where you'll typically be magnifying the printed pixels, for reading, to about
the same size.  At these resolutions, bilevel images generated using the usual
algorithms (halftone screens, ordered dithering, Floyd-Steinberg error
diffusion; [Cris Luengo wrote an excellent overview][2]) are dramatically
inferior to grayscale images.
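
For concreteness, Floyd-Steinberg error diffusion is only a few lines; this
sketch assumes a floating-point grayscale image in [0, 1] and pushes each
pixel's quantization error onto its unprocessed neighbors with the standard
7/16, 3/16, 5/16, 1/16 weights:

    import numpy as np

    def floyd_steinberg(img):
        out = img.astype(float).copy()
        h, w = out.shape
        for y in range(h):
            for x in range(w):
                old = out[y, x]
                new = 1.0 if old >= 0.5 else 0.0
                out[y, x] = new
                err = old - new
                # diffuse the error to the right and to the row below
                if x + 1 < w:
                    out[y, x + 1] += err * 7 / 16
                if y + 1 < h:
                    if x > 0:
                        out[y + 1, x - 1] += err * 3 / 16
                    out[y + 1, x] += err * 5 / 16
                    if x + 1 < w:
                        out[y + 1, x + 1] += err * 1 / 16
        return out.astype(np.uint8)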

They're also dramatically inferior to the visual quality achieved by stippling,
like the work of Noli Novak and her colleagues in the Wall Street Journal, and
copperplate engraving, which also produce bilevel images with limited
resolution.

This, then, is the background against which I wrote my kragen-hacks post last
month, which crudely [makes images look like you ran them through a
photocopier][3].

This weekend, I was working on rendering text with a very small proportional
pixel font (a followup to [my font rendering engine in DHTML][0],
with application to archival data recording as in [Carving stone tablets at
home in your spare time][1]) and it occurred to me that a really useful thing
to archive in a tiny font would be Wikipedia.  Maybe not all of Wikipedia, but
at least [the most important articles][4].  And it occurred to me that you'd
pretty much need to include at least some of the images; nearly all of the
articles are substantially improved by them (see
<http://en.wikipedia.org/wiki/Claude_Monet>,
<http://en.wikipedia.org/wiki/Richard_Wagner>, and
<http://en.wikipedia.org/wiki/Maize>), but articles about two- or
three-dimensional inventions, such as <http://en.wikipedia.org/wiki/Arch>;
visual arts, such as <http://en.wikipedia.org/wiki/Mona_Lisa>; or anatomy, such
as <http://en.wikipedia.org/wiki/Kidney>, would be totally vitiated by their
loss.

With an eye toward finding a reasonable way to include images in what is
effectively microfilm made on a standard laser printer, I [tried out some
different dithering methods on a photograph of singer Gwen Stefani][5],
including the method from my kragen-hacks post.  I was able to get results
nearly as good as my
photocopier hack by the well-known technique of applying a "sharpen" filter to
the image before dithering it, thus amplifying the high frequencies.  But none
of the results even approached the quality of WSJ stipple portraits.
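
The sharpening step is just an unsharp mask applied before whatever dithering
you like; here's a sketch using scipy's Gaussian blur and the floyd_steinberg
sketch above (the sigma and amount are knobs I made up, not anything
principled):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def sharpen_then_dither(img, sigma=1.5, amount=1.0, dither=floyd_steinberg):
        # unsharp mask: boost the difference between the image and a blurred copy
        blurred = gaussian_filter(img.astype(float), sigma)
        sharpened = np.clip(img + amount * (img - blurred), 0.0, 1.0)
        return dither(sharpened)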

[0]: http://canonical.org/~kragen/dofonts.html
[1]: http://lists.canonical.org/pipermail/kragen-tol/2005-May/000777.html
[2]: http://www.cb.uu.se/~cris/blog/index.php/archives/355
[3]: http://canonical.org/~kragen/sw/byn/ (Contains bare breasts.)
[5]: http://canonical.org/~kragen/sw/stefani-images/ (Contains bare arms.)
[4]: http://en.wikipedia.org/wiki/Wikipedia:Vital_articles

Why was the quality so bad? I looked at the academic literature on dithering
and halftoning using computers, and found a lot of stuff about trying to
minimize the error between a smoothed version of the output image and the
original input image, but nothing about minimizing an approximation of
perceptual error using a Gabor-wavelet-based model of the human visual system.

The only thing I came up with where someone was using Gabor wavelets to measure
and optimize the performance of halftoning algorithms was an article called
"Joint Edge-Preserving and Moire-Suppressing Image Resampling Using the
Discrete Gabor Transform", which was specifically about preventing the creation
of moiré artifacts or "aliasing" in gravure printing --- that is, visible
difference frequencies created by heterodyning through nonlinear interaction
between a sampling grid and high-frequency components of the input image ---
without low-pass filtering the holy living fuck out of the input image with a
single-pole filter, thus blurring the edges of everything, and without just
brick-wall filtering its Fourier transform, which will create all kinds of
poorly localized ringing artifacts.  Instead they use the image's Gabor
transform to figure out which parts of the image to low-pass filter the holy
living fuck
out of.  At least that's what it sounds like from the abstract; I haven't
actually read the whole paper.  You'd think maybe they'd just low-pass filter
it in the Gabor domain instead of the Fourier frequency domain or the spatial
domain, but what do I know?

There's been some dramatic recent work by Chang, Alain, and Ostromoukhov called
"[Structure-Aware Error Diffusion][6]", building on a 2008 paper by Pang, Wong,
Cohen-Or, and Heng called "[Structure-aware Halftoning][7]", which has even
more dramatic results but runs orders of magnitude slower than other halftoning
algorithms.

[6]: http://liris.cnrs.fr/Documents/Liris-4442.pdf
[7]: http://www.cse.cuhk.edu.hk/~ttwong/papers/structurehalftone/paper/texhalftone.pdf

But none of this work is exploiting the bandpass nature of the contrast
sensitivity function (CSF) of the human visual system, a well-documented
phenomenon that has been known for decades among vision scientists.  The HVS is
most sensitive to frequencies between 2 and 5 cycles per degree; frequencies
*on both sides* are less visible.  (Those seem to be the half-power points,
i.e. this is a band-pass filter of just over an octave of bandwidth, with a Q
factor of roughly 1.)  That means that when papers say things like, "It is
accepted that the ideal halftone image has a blue-noise spectrum" ([Pang et
al. 2008][7]), implying that the lower the frequency of the noise, the less
acceptable it is, they are setting themselves an arbitrary goal that cannot be
justified in terms of perceptual quality.
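
The bandwidth and Q numbers fall straight out of those 2 and 5 cycles/degree
half-power points, taking the geometric mean as the center frequency:

    import math

    f_lo, f_hi = 2.0, 5.0                # half-power points, cycles/degree
    octaves = math.log2(f_hi / f_lo)     # about 1.32 octaves
    f_center = math.sqrt(f_lo * f_hi)    # about 3.16 cycles/degree
    q = f_center / (f_hi - f_lo)         # about 1.05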

So here's what I was thinking: halftone by minimizing some vector norm in the
Gabor space between the input and output images, where the Gabor space
components are perceptually weighted.  That is, you do the Gabor transform of
both images, multiply the Gabor components pointwise by a CSF model of the HVS,
subtract corresponding components, and measure the size of the resulting
difference vector, say by the sum of the absolute values of its components or
by their Euclidean norm.  That's what you want to minimize.  The great thing
is that because the Gabor transform is a linear operator, modifications to
the spatial-domain image affect the Gabor-domain image and difference vector
linearly.
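
Here's a sketch of that metric, using a small Gabor filter bank built from
the gabor_kernel sketch near the top of this post; the wavelengths,
orientations, and CSF weights below are placeholders I made up, and a real
implementation would derive the weights from the contrast sensitivity
function at the actual viewing geometry:

    import numpy as np
    from scipy.ndimage import convolve

    WAVELENGTHS = [4.0, 8.0, 16.0]       # pixels per cycle (placeholders)
    CSF_WEIGHTS = [0.5, 1.0, 0.3]        # made-up perceptual weights
    ORIENTATIONS = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]

    BANK = [(w, gabor_kernel(wavelength=lam, theta=th, sigma=lam / 2))
            for lam, w in zip(WAVELENGTHS, CSF_WEIGHTS)
            for th in ORIENTATIONS]

    def perceptual_error(original, halftone):
        # weighted Euclidean norm of the difference of Gabor responses;
        # since convolution is linear, this equals convolving the
        # difference image directly
        total = 0.0
        diff_img = original.astype(float) - halftone.astype(float)
        for w, kernel in BANK:
            diff = convolve(diff_img, kernel)
            total += np.sum((w * diff) ** 2)
        return np.sqrt(total)

The linearity is what makes this tractable: flipping a single pixel changes
each Gabor response map by a fixed, precomputable pattern.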

It would be nice if the Gabor transform were invertible, but the Gabor basis
functions are apparently not orthogonal, which I think means that most sets of
coefficients for the Gabor basis functions do not correspond to an image in the
spatial domain.  If it were invertible, you could tell which pixels in your
image most need to be flipped in order to improve the image quality: they're
the ones with
the highest peaks in the inverse Gabor transform of the motherfucking
difference vector.  Presumably you can make some kind of headway by throwing
out coefficients at random until a solution starts to exist.

There was actually a 1997 SPIE paper, [Image Quality Assessment with a Gabor
Pyramid Model of the Human Visual System][8], by Taylor, Pizlo, Allebach, and
Bouman, which proposed doing exactly this in order to measure the fidelity of
image approximation methods, including, among other things, halftoning.  This
hasn't been completely ignored (Google Scholar finds 34 citations) but you
still find even the best current halftoning work using totally boneheaded
blue-noise criteria instead.  Maybe it's because Taylor was a grad student at
Purdue, where he'd actually just done a bunch of work on halftoning and then
went on to do his Ph.D. thesis on this model, and then went off to teach at a
vocational school in Milwaukee and stopped doing research.

[8]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.2727&rep=rep1&type=pdf

Taylor et al.'s algorithm doesn't let you invert the Gabor transform, but it
does produce for each pixel a probability that a human will notice a difference
between the images at that pixel, which seems like it would be just as good for
the case of halftoning, where it tells you which pixels you most need to flip.
From there, it's gradient descent and boom, Bob's your uncle.
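
Whatever you use for the saliency map (Taylor et al.'s model or the cruder
perceptual_error sketch above), the flipping loop itself is simple.  Here's a
hill-climbing sketch; it proposes flips at random rather than following the
map, and it's hopelessly slow as written because it recomputes the whole
metric per candidate flip instead of updating the Gabor responses
incrementally:

    import numpy as np

    def hill_climb_halftone(original, iterations=10000, seed=0):
        # start from a plain threshold, then keep any random flip that
        # reduces the (placeholder) perceptual error
        rng = np.random.default_rng(seed)
        halftone = (original >= 0.5).astype(float)
        best = perceptual_error(original, halftone)
        h, w = halftone.shape
        for _ in range(iterations):
            y, x = rng.integers(h), rng.integers(w)
            halftone[y, x] = 1.0 - halftone[y, x]
            err = perceptual_error(original, halftone)
            if err < best:
                best = err                             # keep the flip
            else:
                halftone[y, x] = 1.0 - halftone[y, x]  # revert it
        return halftone.astype(np.uint8)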

Or maybe I just had too much Diet Coke and it's 5 AM.  Your call.
-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol
