Re: [music-dsp] A theory of optimal splicing of audio in the time domain.

robert bristow-johnson Sat, 09 Jul 2011 12:53:50 -0700


hi Olli (and others)...

i was reviewing this thread because i wanted to read what Stefan Stenzel had said and realized that you had posted this response, and i don't think i or anyone had responded to it. i don't remember reading it (it must be the cannabis). i hope you're listening Olli - i have a lot of respect for what i have read from you (the pink elephant paper).

since this comes from last December, i reposted (with more corrections) the original "theory" at the bottom.


On Dec 7, 2010, at 5:27 AM, Olli Niemitalo wrote:

RBJ,

I had a look at your theory, and compared it to my approach (dare not
call it a theory, as it was not as rigorously derived). The following
is how I imagine we thought things out.

Both of us wanted to preserve some aspect(s) of the known-to-be-good
constant-voltage crossfade envelopes, and to generalize from those the
envelope functions for arbitrary values of the correlation
coefficient.

You saw that the odd component o(t) determined the shape of the
constant-voltage envelopes. For those, the even component had to be
e(t) = 1/2 to satisfy the symmetry a(t) + a(-t) = 1 required in
constant-voltage crossfades.

it need not be the case that e(t) = 1/2 in the non-constant-voltage crossfades.

So apparently o(t) was capturing the
essential aspects of the crossfade envelope. You showed how to
recalculate e(t) for different values of the correlation coefficient
in such a way that o(t) was preserved.

i wasn't trying to preserve o(t). it's just that it was easier to get a handle on a(t) (and a(-t)) if i split it into e(t) and o(t). and then in the final solution, a square root was involved in solving for either o(t) or e(t). since o(t) *has* to be bipolar, solving for o(t) in terms of e(t) is a little more problematic than vice versa because you *know* that o(t) is necessarily bipolar and you have to deal with the +/- sqrt() issue. but if you specify o(t) and solve for e(t), there is no problem with defining e(t) to be always non-negative.

I, on the other hand, chose that the ratio a(t)/a(-t) (using your
notation) should be preserved for each value of t.

now, i do not understand why you would do that. by "preserved", do you mean constant over all t? even for simple, linear crossfades, you cannot satisfy that.

To accomplish this,
one could first do the crossfade using constant-voltage envelopes and
then apply to the resulting signal a volume envelope to adjust for any
deviation from perfect positive correlation. Or equivalently, the
compensation could be incorporated into a(t), which I showed how to do
in the case of a linear constant-voltage crossfade. Other
constant-voltage crossfade envelopes than linear could be handled by a
time deformation function u(t) which gives the time at which the
linear constant-voltage envelope function reaches the value of the
desired constant-voltage envelope function at time t. u(t) would then
be used instead of t in the formula for a(t) derived for generalization
of the linear crossfade for arbitrary r.

so if a(t)/a(-t) is not "preserved" over different values of t but is preserved over different values of r, i am not sure you want to do that.


what is the fundamental reason for preserving a(t)/a(-t) ?

I believe your requirement for r >= 0 could be relaxed. For example,
if one is creating a drum-loop, then it would probably make most sense
to put the loop points in the more quiet areas between the transients.
And there you might only have noise that is independent between the
two loop points, thus giving values of the correlation coefficient
slightly positive or slightly negative. Because the length of a drum
loop is fixed, there might not be so much choice in placement of the
loop points, and a spot giving a slightly negative r might actually be
the most natural choice. I do not think your formulas will fall apart
just as long as -1 < r <= 1.

but i don't think it is necessary to deal with lags where Rxx(tau) < 0. why splice a waveform to another part of the same waveform that has opposite polarity? that would create an even bigger glitch. you want to find a value of the lag, tau, so that Rxx(tau) is maximum (not including tau around 0) and then your splice is as seamless as it can be. then, if the splice is real good (r=1), you use a constant-voltage crossfade. when your splice is poor (r=0 and it need not be poorer than that), you use a constant-power crossfade.

but i agree that the crossfade theory i presented does not require r >= 0. i just wanted to show that it degenerates to a constant-voltage crossfade when r=1 and a constant-power crossfade when r=0.


--

r b-j                  r...@audioimagination.com

"Imagination is more important than knowledge."




This is a continuation of the thread started by Element Green titled:
Algorithms for finding seamless loops in audio

As far as I know, it is not published anywhere.  A few years ago, I
was thinking of writing this up and publishing it (or submitting it
for publication, probably to JAES), and had let it fall by the
wayside.  I'm "publishing" the main ideas here on music-dsp because of
some possible interest here (and the hope it might be helpful to
somebody), and so that "prior art" is established in case anyone
like IVL is thinking of claiming it as their own.  I really do not
know how useful it will be in practice.  It might not make any
difference.  It's just a theory.

______________________________________________________________________

Section 0:

This is about the generalization of the different ways we can splice
and crossfade audio that has these two extremes:

  (1)  Splicing perfectly coherent and correlated signals
  (2)  Splicing completely uncorrelated signals

I sometimes call the first case the "constant-voltage crossfade"
because the crossfade envelopes of the two signals being spliced add
up to one.  The two envelopes meet when both have a value of 1/2.  In
the second case, we use a "constant-power crossfade": the squares of
the two envelopes add up to one and they meet when both have a value
of sqrt(1/2), which is about 0.707 .

The questions I wanted to answer are: What does one do for cases in
between, and how does one know from the audio, which crossfade
function to use?  How does one quantify the answers to these
questions?  How much can we generalize the answer?

______________________________________________________________________

Section 1: Set up the problem.

We have two continuous-time audio signals, x(t) and y(t), and we want
to splice from one to the other at time t=0.  In pitch-shifting or
time-scaling or any other looping, y(t) can be some delayed or
advanced version of x(t).

   e.g.    y(t) = x(t+P)

   where P is a period length or some other "good" splice
   displacement.  We get that value, P, from an algorithm
   we call a "pitch detector".

Also, it doesn't matter whether x(t) is getting spliced to y(t) or the
other way around; it should work just as well for the audio played in
reverse.  And it should be no loss of generality that the splice
happens at t=0; we define our coordinate system any damn way we damn
well please.

The signal resulting from the splice is

   v(t)  =  a(t)*y(t) + a(-t)*x(t)

By restricting our result to be equivalent if run either forward or
backward in time, we can conclude that the "fade-in" function (say
that's a(t)) is the time-reversed copy of the "fade-out" function,
a(-t).

For the correlated case   (1):   a(t)    +  a(-t)    = 1   for all t

For the uncorrelated case (2):  (a(t))^2 + (a(-t))^2 = 1   for all t

This crossfade function, a(t), has well-defined even and odd symmetry
components:

               a(t)  =  e(t) + o(t)
where

   even part:  e(t) =  e(-t)  =  ( a(t) + a(-t) )/2
   odd part:   o(t) = -o(-t)  =  ( a(t) - a(-t) )/2

And it's clear that

               a(-t)  =  e(t) - o(t)  .


For example, if it's a simple linear crossfade (equivalent to splicing
analog tape with a diagonally-oriented razor blade):

         { 0                 for   t <= -1
         {
  a(t) = { 1/2 + t/2         for  -1 < t < 1
         {
         { 1                 for   t >= 1

This is represented simply, in the even and odd components, as:

  e(t) = 1/2

         { t/2               for  |t| < 1
  o(t) = {
         { sgn(t)/2          for  |t| >= 1


   where  sgn(t) is the "sign function":


           { -1                for   t < 0
           {
  sgn(t) = { 0                 for   t = 0
           {
           { +1                for   t > 0

  a shorthand (valid for t != 0): sgn(t) = t/|t| .

This is a constant-voltage crossfade, appropriate for perfectly
correlated signals, x(t) and y(t).  There is no loss of generality in
defining the crossfade to take place around t=0 and to be two time
units in length.  Both are simply a matter of offset and scaling of
time.

Another constant-voltage crossfade would be what I might call a "Hann
crossfade" (after the Hann window):

  e(t) = 1/2

         { (1/2)*sin(pi/2 * t)     for  |t| < 1
  o(t) = {
         { sgn(t)/2                for  |t| >= 1


Some might like that better because the derivative is continuous
everywhere.  Extending this idea, one more constant-voltage crossfade
is what I might call a "Flattened Hann crossfade":

  e(t) = 1/2

         { (9/16)*sin(pi/2 * t) + (1/16)*sin(3*pi/2 * t) for |t| < 1
  o(t) = {
         { sgn(t)/2                                     for |t| >= 1

This splice is everywhere continuous in the zeroth, first, and second
derivative.  A very smooth crossfade.
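
The three constant-voltage shapes above can be written down directly.
What follows is only a sketch in Python; the function names are made
up for illustration and are not part of the theory.  It defines the
three odd parts and checks that, with e(t) = 1/2, the envelopes add up
to one:

```python
import math

def sgn(t):
    """Sign function: -1 for t < 0, 0 for t == 0, +1 for t > 0."""
    return (t > 0) - (t < 0)

def o_linear(t):
    """Odd part of the linear crossfade."""
    return t / 2 if abs(t) < 1 else sgn(t) / 2

def o_hann(t):
    """Odd part of the "Hann crossfade"."""
    return 0.5 * math.sin(math.pi / 2 * t) if abs(t) < 1 else sgn(t) / 2

def o_flattened_hann(t):
    """Odd part of the "Flattened Hann crossfade"."""
    if abs(t) >= 1:
        return sgn(t) / 2
    return ((9 / 16) * math.sin(math.pi / 2 * t)
            + (1 / 16) * math.sin(3 * math.pi / 2 * t))

def constant_voltage_envelope(o, t):
    """a(t) = e(t) + o(t) with the even part fixed at e(t) = 1/2."""
    return 0.5 + o(t)

for o in (o_linear, o_hann, o_flattened_hann):
    for t in (-1.5, -0.6, 0.0, 0.25, 1.0):
        total = constant_voltage_envelope(o, t) + constant_voltage_envelope(o, -t)
        assert abs(total - 1.0) < 1e-12   # a(t) + a(-t) = 1
```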

As another example, a constant-power crossfade would be the same as
any of the above, but where the above a(t) is square rooted:

         { 0                   for   t <= -1
         {
  a(t) = { sqrt(1/2 + t/2)     for  -1 < t < 1
         {
         { 1                   for   t >= 1

This is what we might use to splice two completely uncorrelated signals
together.  We can separate this into even and odd parts as:


         { (1/2)*(sqrt(1/2 + t/2) + sqrt(1/2 - t/2))   for  |t| < 1
  e(t) = {
         {  1/2                                        for  |t| >= 1


         { (1/2)*(sqrt(1/2 + t/2) - sqrt(1/2 - t/2))   for  |t| < 1
  o(t) = {
         { sgn(t)/2                                    for  |t| >= 1
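
As a quick numerical check of this decomposition, here is a small
Python sketch (the names are hypothetical) that splits the
constant-power envelope into its even and odd parts and confirms that
the squares of the two envelopes add up to one:

```python
import math

def a_constant_power(t):
    """Square root of the linear constant-voltage envelope."""
    if t <= -1:
        return 0.0
    if t >= 1:
        return 1.0
    return math.sqrt(0.5 + t / 2)

def even_odd(a, t):
    """Return (e(t), o(t)) for an envelope function a."""
    e = (a(t) + a(-t)) / 2
    o = (a(t) - a(-t)) / 2
    return e, o

for t in (-2.0, -0.5, 0.0, 0.3, 1.0):
    e, o = even_odd(a_constant_power, t)
    assert abs((e + o) - a_constant_power(t)) < 1e-12     # a(t)  = e + o
    assert abs((e - o) - a_constant_power(-t)) < 1e-12    # a(-t) = e - o
    power = a_constant_power(t) ** 2 + a_constant_power(-t) ** 2
    assert abs(power - 1.0) < 1e-12                       # constant power
```

Note that here the even part is no longer the constant 1/2: at t=0
it is sqrt(1/2), and it only settles to 1/2 for |t| >= 1.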

______________________________________________________________________

Section 2:  Which crossfade function to use?

Now we shall make a definition and an assumption.  We shall define an
inner product of two general signals as:

                                +inf
   <x,y> = <x(t), y(t)>  =  integral{ x(t)*y(t) * w(t) dt}
                                -inf

w(t) is a window function that is symmetrical about t=0 and is
probably wider than the crossfade.  Strictly speaking, if you were
coming at this from out of a graduate course in metric spaces or
functional analysis, one of the components (probably y(t)) should be
complex conjugated, but since x(t) and y(t) are always real, in this
whole theory, I will not bother with that notation.

This inner product is a degenerate case of the more general
cross-correlation, evaluated with a lag of zero:

                                      +inf
   Rxy(tau) = <x(t), y(t+tau)>  = integral{ x(t)*y(t+tau) * w(t) dt}
                                      -inf

If y(t) is a time-offset copy of x(t), then Rxy(tau) is the
autocorrelation of x(t), Rxx(tau), but also accounting for the time
offset in the lag, tau.

   So  <x,y>  =  Rxy(0)

A measure of signal energy or average power is:

                        +inf
   Rxx(0) = <x,x> = integral{ (x(t))^2 * w(t) dt}
                        -inf

Now, the assumption that we are going to toss in here is that the mean
powers of the two signals that we are crossfading, x(t) and y(t), are
equal.

   <x,x> = <y,y>

We are assuming that we're not crossfading a very quiet tone or
sound to a very loud sound that is 60 dB louder.  Similarly, the
resulting spliced sound, v(t), has the same mean power as the two
signals being spliced:

   <v,v> = <x,x> = <y,y>

So, assuming we lined up x(t) and y(t) so that we want to splice from
one to the other at t=0, and scaled x(t) and y(t) so that they have
the same mean power in the neighborhood of t=0, then the inner product
is a measure of how well they are correlated.  We shall define this
normalized measure of correlation as:

   r  =  <x,y>/<x,x>  =  <x,y>/<y,y>

If r = 1, they are perfectly correlated and if r = 0, they are
completely uncorrelated.
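
In sampled form, the inner product and r might be computed as in the
Python sketch below.  Several things here are assumptions and not part
of the theory: the Hann-shaped w(t), the function names, and the
normalization by the geometric mean of <x,x> and <y,y> (which equals
<x,y>/<x,x> when the two powers are equal, but tolerates a small
mismatch between them):

```python
import math

def hann_window(n):
    """Symmetric Hann-shaped w(t) sampled at n points, scaled to sum to 1."""
    w = [0.5 - 0.5 * math.cos(2 * math.pi * (k + 0.5) / n) for k in range(n)]
    total = sum(w)
    return [wk / total for wk in w]

def inner(x, y, w):
    """Discrete stand-in for <x,y> = integral{ x(t)*y(t) * w(t) dt }."""
    return sum(xk * yk * wk for xk, yk, wk in zip(x, y, w))

def normalized_correlation(x, y, w):
    """r = <x,y> / sqrt(<x,x>*<y,y>); equals <x,y>/<x,x> if <x,x> == <y,y>."""
    return inner(x, y, w) / math.sqrt(inner(x, x, w) * inner(y, y, w))

n = 64
w = hann_window(n)
theta = [2 * math.pi * 4 * (k + 0.5) / n for k in range(n)]
x = [math.sin(th) for th in theta]
y = [math.cos(th) for th in theta]       # same tone, a quarter period off
print(normalized_correlation(x, x, w))   # identical signals: r near 1
print(normalized_correlation(x, y, w))   # quadrature signals: r near 0
```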

We will make the additional assumption that our pitch detection
algorithm will find *some* lag, P, where the correlation is at least
zero.  We should not have to deal with splicing *negatively*
correlated audio (that would have quite a "glitch" or a bad splice).
If the two signals, x(t) and y(t), have no DC component, then their
autocorrelations and their cross-correlations to each other must have
no DC component.  That means Rxy(tau) takes on both negative and
positive values as tau varies.  If it was theoretical white noise,
Rxx(tau) would be zero for |tau| > 0 and Rxx(0) would be the noise
variance or power.  But Rxx(tau) cannot be negative for *all* values
of tau, even excluding tau=0.

For the splicing done in a time-domain pitch shifting or time scaling
algorithm, we can find a value of tau so that Rxx(tau) is non-negative,
and we want to choose the lag tau = P that has the highest value of
Rxx(tau).  Then define

   y(t)  =  x(t+P)

and then

   <x,y>  =  Rxy(0)  =  Rxx(P)
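
A brute-force search for that lag could look something like the
following Python sketch.  It is not a real pitch detector; the
rectangular window, the lag range, and the function names are all
assumptions made for the sake of a short example, and it relies on
the equal-power assumption when it divides by Rxx(0):

```python
import math

def rxx(x, start, n, lag):
    """Rxx(lag) over n samples beginning at start, with a rectangular w(t)."""
    return sum(x[start + k] * x[start + k + lag] for k in range(n))

def best_splice_lag(x, start, n, min_lag, max_lag):
    """Return (P, r): the lag in [min_lag, max_lag] that maximizes Rxx,
    and the normalized correlation r = Rxx(P)/Rxx(0)."""
    p = max(range(min_lag, max_lag + 1), key=lambda lag: rxx(x, start, n, lag))
    return p, rxx(x, start, n, p) / rxx(x, start, n, 0)

# Test tone with a period of 50 samples.
period = 50
x = [math.sin(2 * math.pi * k / period) for k in range(400)]
p, r = best_splice_lag(x, start=0, n=200, min_lag=20, max_lag=80)
print(p, r)
```

The min_lag argument is what excludes the lags around tau = 0, where
Rxx(tau) is trivially large.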

Now we shall also assume that the crossfade function, a(t), is
completely uncorrelated and even statistically independent from the
two signals being spliced.  a(t) is a volume control that varies in
time, but is unaffected by anything in x(t) or y(t).

We shall also assume something called "ergodicity".  This means that
*time* averages of x(t) and y(t) (or combinations of x(t) and y(t))
are equal to *statistical* averages.  If this window, w(t) is scaled
(or normalized) so that its integral is 1,

        +inf
    integral{ w(t) dt} = 1
        -inf

then all these inner products (which are time averages) can be related
to "expectation values" (which are statistical averages):

    <x,y> = E{ x(t)*y(t) }

If x(t) and y(t) are thought of as sorta "random" processes (rather
than well defined deterministic functions), the expectation value is
unmoved no matter what t is.  But if the envelope a(t) is considered
deterministic, then it simply scales x(t) or y(t) and is treated as a
constant in the expectation.  So at some particular time t0,

    <a(t0)*x,y>  =  E{ (a(t0)*x(t)) * y(t) }

                 =  a(t0) * E{ x(t) * y(t) }

                 = a(t0) * <x,y>

This is a little sloppy, mathematically, because I am "fixing" t for
a(t) to be t0, but not fixing t for x(t) or y(t) (so that "time
averages" for x(t) and y(t) can be meaningful and equated to
statistical averages).

Recall that

   v(t)  =  a(t)*y(t) + a(-t)*x(t)

Then:

   <v,v> =  <(a(t)*y(t) + a(-t)*x(t)), (a(t)*y(t) + a(-t)*x(t))>

Using identities that we can apply to expectation values

   <v,v> = (a(t))^2*<y,y>  +  2*a(t)*a(-t)*<x,y>  +  (a(-t))^2*<x,x>

Since <v,v> = <x,x> = <y,y>, we can divide by <v,v> and get to the key
equation of this whole theory:

   1  =  (a(t))^2  +  2*r*a(t)*a(-t)  +  (a(-t))^2

Given the normalized correlation measure, we want the above equation
to be true all of the time.  If r=0 (completely uncorrelated), one can
see we get a constant-power crossfade:

   (a(t))^2 + (a(-t))^2  =  1

If r=1 (completely correlated), one can see that we get a constant-
voltage crossfade:

   (a(t))^2 + (a(-t))^2 + 2*a(t)*a(-t)  =  ( a(t) + a(-t) )^2  =  1

or, assuming a(t) is non-negative,

   a(t) + a(-t) = 1 .
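
One concrete consequence of the key equation is the gain at the
middle of the splice, where a(0) = a(-0) = g and so 2*(1+r)*g^2 = 1.
The Python sketch below (function names are hypothetical) evaluates
that, and also shows what the equation says happens if the
constant-voltage midpoint of 1/2 is used on uncorrelated signals:

```python
import math

def relative_power(a_in, a_out, r):
    """<v,v>/<x,x> = a_in^2 + 2*r*a_in*a_out + a_out^2."""
    return a_in ** 2 + 2 * r * a_in * a_out + a_out ** 2

def midpoint_gain(r):
    """Gain g = a(0) at which both envelopes meet: 2*(1 + r)*g^2 = 1."""
    return 1 / math.sqrt(2 * (1 + r))

for r in (0.0, 0.5, 1.0):
    g = midpoint_gain(r)
    print(r, g, relative_power(g, g, r))

# Using the constant-voltage midpoint (1/2) on uncorrelated signals (r = 0)
# halves the power at the middle of the splice, a dip of about 3 dB.
dip_db = 10 * math.log10(relative_power(0.5, 0.5, 0.0))
print(dip_db)
```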

______________________________________________________________________

Section 3:  Generalizing the crossfade function

Recall that

               a(t)   =  e(t) + o(t)

               a(-t)  =  e(t) - o(t)

and substituting into

   (a(t))^2  +  (a(-t))^2  +  2*r*a(t)*a(-t)  =  1

results in

    (e(t) + o(t))^2  +  (e(t) - o(t))^2
         +  2*r*(e(t) + o(t))*(e(t) - o(t))  =  1

Blasting through that gets:

   (1+r)*(e(t))^2  +  (1-r)*(o(t))^2  =  1/2
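
For anyone wanting the intermediate steps, the expansion works out as
follows (writing e and o for e(t) and o(t)):

    (e + o)^2  +  (e - o)^2    =  2*e^2  +  2*o^2

    2*r*(e + o)*(e - o)        =  2*r*e^2  -  2*r*o^2

Adding the two lines and setting the sum equal to 1:

    2*(1+r)*e^2  +  2*(1-r)*o^2  =  1

and dividing both sides by 2 gives the equation above.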


This means that, if r is measured and known (from the correlation
function) we have the freedom to define either one of e(t) or o(t)
arbitrarily (as long as the even or odd symmetry is kept) and solve
for the other.  We can see that square rooting is involved in solving
for either e(t) or o(t) and there is an ambiguity for which sign to
pick.  We shall resolve that ambiguity by adding the additional
assumption that the even-symmetry component, e(t), is non-negative.

   e(t)  =  e(-t)  >=  0

Given a general and bipolar odd-symmetry component function,

   o(t)  =  -o(-t)

then we solve for the even component (picking the non-negative square
root):

   e(t)  =  sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )

The overall crossfade envelope would be

   a(t)  =  e(t)  +  o(t)

         =  sqrt( (1/2)/(1+r) - (1-r)/(1+r)*(o(t))^2 )  +  o(t)
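
Here is a minimal Python sketch of that formula, using the linear
o(t) as the odd part.  The function names are hypothetical; the loop
checks numerically that the resulting a(t) satisfies the key equation
for a few values of r:

```python
import math

def o_linear(t):
    """Odd part of the linear crossfade, clamped to +/- 1/2 outside |t| < 1."""
    return max(-0.5, min(0.5, t / 2))

def envelope(o, r, t):
    """a(t) = e(t) + o(t), with e(t) solved from (1+r)*e^2 + (1-r)*o^2 = 1/2."""
    ot = o(t)
    e_squared = (0.5 - (1 - r) * ot * ot) / (1 + r)
    return math.sqrt(max(e_squared, 0.0)) + ot   # max() only guards rounding error

for r in (0.0, 0.3, 1.0):
    for t in (-1.0, -0.4, 0.0, 0.7, 1.0):
        a_in, a_out = envelope(o_linear, r, t), envelope(o_linear, r, -t)
        key = a_in ** 2 + 2 * r * a_in * a_out + a_out ** 2
        assert abs(key - 1.0) < 1e-12
```

With r=1 this reproduces the linear constant-voltage crossfade.  With
r=0 it gives a constant-power crossfade, but not the sqrt(1/2 + t/2)
one from Section 1, because that envelope has a different odd part.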

______________________________________________________________________

Section 4:  Implementation:

Given a particular form for the odd part, o(t) (linear or Hann or
Flattened Hann or whatever is your heart's desire), a collection of
envelope functions, a(t), is pre-calculated for a variety of values
of r, ranging from r=0 to r=1, and stored in memory.  Then, when
pitch detection or loop matching is done, an optimal splice
displacement is determined.  If autocorrelation of some form is used
as the measure of goodness (or seamlessness, using Element's
language) of that loop splice, that autocorrelation is normalized (by
dividing by Rxx(0)) to get r, and that value of r is used to choose
which pre-calculated a(t) from the above collection is used for the
crossfade in the splice.
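
A rough Python sketch of this table-lookup scheme follows.  The table
resolution, the choice of the Hann odd part, the clamping of r to the
range 0 to 1, the nearest-entry selection, and all of the names are
assumptions made for illustration, not a tested implementation:

```python
import math

def o_hann(t):
    """Odd part of the "Hann crossfade" for |t| <= 1."""
    return 0.5 * math.sin(math.pi / 2 * t)

def make_envelope(r, n):
    """Sample a(t) at n points t_k = -1 + (2k+1)/n, so a(-t_k) is a[n-1-k]."""
    env = []
    for k in range(n):
        o = o_hann(-1 + (2 * k + 1) / n)
        e = math.sqrt(max((0.5 - (1 - r) * o * o) / (1 + r), 0.0))
        env.append(e + o)
    return env

def build_table(n, steps):
    """Pre-calculate envelopes for r = 0, 1/steps, ..., 1."""
    return [make_envelope(i / steps, n) for i in range(steps + 1)]

def pick_envelope(table, r):
    """Choose the stored envelope whose r is nearest the measured r
    (with r clamped to the range 0..1)."""
    steps = len(table) - 1
    return table[round(min(max(r, 0.0), 1.0) * steps)]

def splice(x, y, a):
    """v[k] = a(t_k)*y[k] + a(-t_k)*x[k]: fade x out while fading y in."""
    n = len(a)
    return [a[k] * y[k] + a[n - 1 - k] * x[k] for k in range(n)]

table = build_table(n=64, steps=10)
a = pick_envelope(table, 0.97)            # measured r close to 1
v = splice([1.0] * 64, [1.0] * 64, a)     # two identical constant signals
```

In this example the measured r of 0.97 selects the r=1 entry, which
is the constant-voltage Hann crossfade, so splicing a constant signal
to itself leaves it unchanged.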

______________________________________________________________________


--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp 
links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp
