Thanks for your expanded notes, RBJ. I haven't found anything that I disagree 
with or that contradicts what I was saying earlier - I'm not sure if they were 
intended as expanded context or if there was something you were disagreeing 
with.

On March 8, 2020 7:55 PM Ethan Duni <ethan.d...@gmail.com> wrote:
> 
> Fast FIR is a different thing than an FFT filter bank.
> 
> You can combine the two approaches but I don’t think that’s what is being 
> done here?

The point I'm making here is that overlap-add fast FIR is a special case of 
STFT-domain multiplication and resynthesis. I'm defining the standard STFT 
pipeline here as:

1. slice your signal into frames
2. pointwise-multiply each frame by an analysis window
3. perform `rfft` on each frame to give the STFT domain representation
4. modify the STFT representation
5. perform `irfft` on each frame
6. pointwise-multiply each frame by a synthesis window
7. overlap-add each frame to get the resulting time-domain signal

See below for more.
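
For concreteness, here is a minimal NumPy sketch of those seven steps. The 
names (`stft_process`, `modify`, and the parameter names) are made up for 
illustration and don't come from any library; it is only one of many ways to 
write this. The usage at the bottom assumes a periodic Hann analysis window at 
50% overlap with no synthesis windowing and no modification, in which case the 
output should match the input everywhere except the first half-frame.

```python
import numpy as np

def stft_process(x, frame_len, hop, n_fft, analysis_win, synthesis_win, modify):
    """Run the 7-step pipeline. `modify` maps (frame index, spectrum) -> spectrum."""
    n_frames = int(np.ceil(len(x) / hop))
    # Pad the tail with zeros so that every frame is full length.
    x_pad = np.concatenate([x, np.zeros(n_frames * hop + frame_len - len(x))])
    y = np.zeros((n_frames - 1) * hop + n_fft)
    for m in range(n_frames):
        frame = x_pad[m * hop : m * hop + frame_len]   # 1. slice into frames
        frame = frame * analysis_win                   # 2. analysis window
        spec = np.fft.rfft(frame, n=n_fft)             # 3. rfft (zero-pads to n_fft)
        spec = modify(m, spec)                         # 4. modify the STFT
        out = np.fft.irfft(spec, n=n_fft)              # 5. irfft
        out = out * synthesis_win                      # 6. synthesis window
        y[m * hop : m * hop + n_fft] += out            # 7. overlap-add
    return y

# Identity processing: periodic Hann at 50% overlap sums to 1 across frames.
F = 8
hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(F) / F)
x = np.random.default_rng(0).standard_normal(64)
y = stft_process(x, F, F // 2, F, hann, np.ones(F), lambda m, spec: spec)
print(np.allclose(y[F // 2 : len(x)], x[F // 2 :]))
```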

On Mon, Mar 9, 2020, at 5:44 PM, robert bristow-johnson wrote:
> 
> > On March 9, 2020 10:15 AM Spencer Russell <s...@media.mit.edu> wrote:
> > 
> > 
> > I think we're mostly on the same page, Ethan.
> 
> well, i think that i am on the same page as Ethan.
> 
> > Though even with STFT-domain time-variant filtering (such as with noise 
> > reduction, or mask-based source separation) it would seem you could still 
> > zero-pad each input frame to eliminate any issues due to time-aliasing.
> 
> zero-padding is the sole technique that gets rid of time-aliasing.

Right - we're together here.

> let's say your FIR is of length L.  let's say that your frame hop is H 
> and frame length is F ≥ H and we're doing overlap-add.  then your F 
> samples of input (H samples are *new* samples in the current frame, F-H 
> samples are remaining from the previous frame) are considered 
> zero-padded out to infinity in both directions.  then the length of the 
> result of linear convolution is L+F-1.  now if you can guarantee that 
> the size of the DFT, which we'll call "N" (and most of the time is a 
> power of 2) is at least as large as the non-zero length of the linear 
> convolution, then the result of circular convolution of the zero-padded 
> FIR and the zero-padded frame of samples will be exactly the same.  
> that means
> 
>    N ≥ L + F - 1

You are completely correct, and as far as I can tell we're in agreement here 
(again please correct me if this was meant to be a rebuttal). Specifically I'm 
talking about the case where F=H. You then perform a standard STFT with these 
parameters (Hop size H, rectangular window of size F = H, FFT length H+L-1), 
multiply each frame by the (r)FFT of your filter, then do the standard ISTFT 
with overlap-add.
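
The N ≥ L + F - 1 condition is easy to check numerically on a single frame. 
This is just a sketch with made-up sizes (L = 5, F = 8) and a hypothetical 
helper `fft_convolve`; it compares the FFT-based circular convolution against 
`np.convolve` for one N that is large enough and one that is not.

```python
import numpy as np

rng = np.random.default_rng(0)
L, F = 5, 8                        # hypothetical FIR length and frame length
h = rng.standard_normal(L)
frame = rng.standard_normal(F)
linear = np.convolve(frame, h)     # linear convolution, length L + F - 1 = 12

def fft_convolve(frame, h, N):
    """Length-N circular convolution via rfft/irfft (inputs zero-padded to N)."""
    return np.fft.irfft(np.fft.rfft(frame, n=N) * np.fft.rfft(h, n=N), n=N)

ok = fft_convolve(frame, h, N=16)       # N >= L + F - 1: matches linear convolution
aliased = fft_convolve(frame, h, N=8)   # N <  L + F - 1: the tail wraps around

print(np.allclose(ok[: L + F - 1], linear))
# With N = 8 the last 4 samples of the linear result fold onto the first 4.
print(np.allclose(aliased[:4], linear[:4] + linear[8:12]))
```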

Your STFT will have a height of `N/2+1` (integer division), since that's what 
`rfft` returns for a length-N frame. In the ISTFT the hop size is still H, but 
the frame size is now the full N, so you use a "synthesis window" that's the 
full length N (in practice just taking each chunk with no windowing). The 
`irfft` of each frame is now nonzero for some length longer than H, but not 
more than N (so there's no time aliasing).

That should be exactly the same thing as fast FIR convolution with a chunk size 
of F, but in the framework of STFT->multiply->ISTFT. The only thing that's not 
standard STFT processing is the zero-padding (to remove aliasing due to 
circular convolution, a now much-belabored point).
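
Here is a rough NumPy sketch of that equivalence, written deliberately as 
STFT -> multiply -> ISTFT rather than as a tuned fast-FIR routine. The function 
name `stft_fast_fir` and the sizes (H = 96, L = 33, so N = 128) are just 
assumptions for the example. The result is compared against plain 
`np.convolve`.

```python
import numpy as np

def stft_fast_fir(x, h, H):
    """Overlap-add fast FIR phrased as STFT -> multiply -> ISTFT, with F = H."""
    L = len(h)
    N = H + L - 1                                 # FFT length, so N >= L + F - 1
    h_spec = np.fft.rfft(h, n=N)                  # filter response, N//2 + 1 bins
    n_frames = int(np.ceil(len(x) / H))
    x_pad = np.concatenate([x, np.zeros(n_frames * H - len(x))])
    # STFT: rectangular window, hop H, each frame zero-padded out to N.
    frames = x_pad.reshape(n_frames, H)
    stft = np.fft.rfft(frames, n=N, axis=1)       # shape (n_frames, N//2 + 1)
    stft = stft * h_spec                          # the STFT-domain modification
    # ISTFT: no synthesis windowing, overlap-add the length-N frames at hop H.
    out_frames = np.fft.irfft(stft, n=N, axis=1)
    y = np.zeros((n_frames - 1) * H + N)
    for m in range(n_frames):
        y[m * H : m * H + N] += out_frames[m]
    return y[: len(x) + L - 1]

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
h = rng.standard_normal(33)
y = stft_fast_fir(x, h, H=96)
print(np.allclose(y, np.convolve(x, h)))
```

A streaming version would keep only the current frame and the L-1 sample tail 
of the previous output instead of materializing the whole STFT.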

This is just to make the point that fast FIR is a special case of STFT 
processing. From a compute perspective this should be no less efficient than 
fast FIR (I mean, it's doing the same thing). If you do the whole STFT off-line 
then you wasted some memory materializing the whole STFT, but you could 
consider a streaming version, and at that point the implementation would look 
very similar to what you'd code up for fast FIR.

Are we all together here?

===== Time-Variant Filtering =====

So this seems like it's the really interesting part, and usually why people 
work in the STFT domain in the first place. As RBJ mentioned, padding (ensuring 
N >= L+F-1) completely resolves time-aliasing, and that is true whether the 
filter is stationary or time-varying.

> if it is a rectangular window, the frame length and frame hop are the 
> same, F=H, and the number of generated output samples that are valid is 
> H, and the most you can hope to get is:
> 
>     H = F = N - L + 1

Right, this is the Fast FIR situation I described above.

> <snip>
> if you cut your frame hop size, H, from F to nearly half (F+1)/2 (and 
> use a complementary window such as Hann), it is half as efficient, but 
> the crossfade is even smoother (and the frame rate is faster, so the 
> filter definition can change more often).
> 
> all of this is well-established knowledge regarding frame-by-frame 
> processing with windows and the FFT.

Yep, we're in agreement here as well. Applying a time-varying filter using 
non-overlapping rectangular windows seems like a bad idea.
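
To illustrate the difference, here is a toy sketch that uses the simplest 
possible "filter", a per-frame gain (i.e. a length-1 FIR), on a constant input. 
The helper `frame_gain_ola` and the numbers are made up for the example, and it 
assumes a periodic Hann window with even F and hop F/2 rather than the 
(F+1)/2 hop RBJ describes. With non-overlapping rectangular frames the gain 
change shows up as a step in the output; with overlapped Hann frames it 
becomes a cross-fade.

```python
import numpy as np

def frame_gain_ola(x, gains, F, H, win):
    """Window each frame, scale frame m by gains[m], and overlap-add at hop H."""
    y = np.zeros(len(x))
    for m, g in enumerate(gains):
        seg = x[m * H : m * H + F]               # may be short at the very end
        y[m * H : m * H + len(seg)] += g * win[: len(seg)] * seg
    return y

F = 64
x = np.ones(4 * F)                                        # constant test input
rect = np.ones(F)
hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(F) / F)   # periodic Hann

# Same gain trajectory (1 -> 0.25); the Hann case needs twice as many frames.
y_rect = frame_gain_ola(x, [1, 1, 0.25, 0.25], F, F, rect)
y_hann = frame_gain_ola(x, [1, 1, 1, 1, 0.25, 0.25, 0.25, 0.25], F, F // 2, hann)

# Largest sample-to-sample jump, skipping the Hann fade-in over the first half-frame.
jump_rect = np.max(np.abs(np.diff(y_rect)))
jump_hann = np.max(np.abs(np.diff(y_hann[F // 2 :])))
print(jump_rect, jump_hann)
```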

On Mon, Mar 9, 2020, at 8:41 PM, Ethan Duni wrote:
> > On Mar 9, 2020, at 7:16 AM, Spencer Russell <s...@media.mit.edu> wrote:
> > 
> > 
> > if you have an KxN STFT (K frequency components and N frames) then then 
> > zero-padding each frame by K-1 should still eliminate any time-aliasing 
> > even if your filter has hard edges in the frequency domain, right?
> 
> Right, but if you are using length K FFT and zero-padding by K-1, then 
> the hop size is 1 sample and there are no windows. 

Whoops, this was dumb on my part. I was not referring to a hop size of 1! 
Hopefully my explanation above is clearer.

> This is just applying the raw IDFT of the response as an FIR, which is 
> not appropriate for something estimated in a windowed filterbank 
> domain. Deriving an equivalent FIR from, say, an estimated noise 
> reduction mask is not trivial.

Agreed! I think that the relationship between STFT-domain multiplication and 
applying a time-varying FIR filter is the most interesting part of this 
conversation. You could think of STFT multiplication as applying a different 
FIR filter to each frame and then cross-fading between them, which is clearly 
not the same as continually varying the FIR parameters in the time domain. They 
do seem to have a tight relationship though, and when we do STFT modifications 
it seems that in some contexts we're trying to approximate the time-varying FIR 
filter.
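
The "different FIR per frame, then cross-fade" view can be checked numerically. 
The sketch below assumes each frame's STFT-domain multiplier really is the 
response of a length-L FIR (with N = F + L - 1 so nothing wraps), that the 
cross-fade comes from a periodic Hann analysis window at 50% overlap, and that 
there is no synthesis window. All sizes and variable names are made up. An 
arbitrary mask estimated in the STFT domain generally doesn't correspond to 
such a short FIR, so this only covers the well-behaved case.

```python
import numpy as np

rng = np.random.default_rng(0)
F, H, L = 64, 32, 17                 # hypothetical frame length, hop, FIR length
N = F + L - 1                        # zero-padded FFT length (no time-aliasing)
n_frames = 6
x = rng.standard_normal((n_frames - 1) * H + F)
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(F) / F)   # periodic Hann
firs = rng.standard_normal((n_frames, L))                 # a different FIR per frame

# STFT-domain version: multiply each frame's spectrum by that frame's response.
y_stft = np.zeros(len(x) + L - 1)
for m in range(n_frames):
    spec = np.fft.rfft(win * x[m * H : m * H + F], n=N)
    spec = spec * np.fft.rfft(firs[m], n=N)
    y_stft[m * H : m * H + N] += np.fft.irfft(spec, n=N)

# Time-domain version: filter each windowed frame with its own FIR, then sum.
y_time = np.zeros(len(x) + L - 1)
for m in range(n_frames):
    y_time[m * H : m * H + N] += np.convolve(win * x[m * H : m * H + F], firs[m])

print(np.allclose(y_stft, y_time))
```

Note that neither version is the same as convolving x with a filter whose taps 
change sample by sample; each windowed frame sees a single fixed FIR.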

> > I understand the role of time-domain windowing in STFT processing to be 
> > mostly:
> > 1. Reduce frequency-domain ripple (side-lobes in each band)
> 
> Right, this is the “analysis” aspect, where the window controls the 
> spectral characteristics (frequency selectivity, bandwidth, leakage, 
> etc.)
> 
> > 2. Provide a sort of cross-fade from frame-to-frame to smooth out framing 
> > effects
> 
> And that is the “synthesis” aspect, where the window controls the 
> characteristics of the artifacts introduced by processing. Note that 
> “framing effects” are by definition time-variant: this is a form of 
> aliasing.

Great - we're in agreement on the role of the analysis window, and this is 
starting to get towards the relationship I mentioned above. Can you clarify 
what you mean by a form of aliasing? As mentioned above, with proper 
zero-padding there should be no time-aliasing introduced. Do you mean 
frequency-aliasing? I get that the synthesis window has a smoothing effect, but 
I'm struggling to understand it in terms of aliasing.

-s
_______________________________________________
dupswapdrop: music-dsp mailing list
music-dsp@music.columbia.edu
https://lists.columbia.edu/mailman/listinfo/music-dsp
