Thanks for your expanded notes, RBJ. I haven't found anything that I disagree with or that contradicts what I was saying earlier - I'm not sure if they were intended as expanded context or if there was something you were disagreeing with.

On March 8, 2020 7:55 PM Ethan Duni <ethan.d...@gmail.com> wrote:
>
> Fast FIR is a different thing than an FFT filter bank.
>
> You can combine the two approaches but I don’t think that’s what is being
> done here?

The point I'm making here is that overlap-add fast FIR is a special case of STFT-domain multiplication and resynthesis. I'm defining the standard STFT pipeline here as:

1. slice your signal into frames
2. pointwise-multiply an analysis window by each frame
3. perform `rfft` on each frame to give the STFT domain representation
4. modify the STFT representation
5. perform `irfft` on each frame
6. pointwise-multiply a synthesis window on each frame
7. overlap-add each frame to get the resulting time-domain signal

See below for more.

On Mon, Mar 9, 2020, at 5:44 PM, robert bristow-johnson wrote:
>
> > On March 9, 2020 10:15 AM Spencer Russell <s...@media.mit.edu> wrote:
> >
> >
> > I think we're mostly on the same page, Ethan.
>
> well, i think that i am on the same page as Ethan.
>
> > Though even with STFT-domain time-variant filtering (such as with noise
> > reduction, or mask-based source separation) it would seem you could still
> > zero-pad each input frame to eliminate any issues due to time-aliasing.
>
> zero-padding is the sole technique that gets rid of time-aliasing.

Right - we're together here.

> let's say your FIR is of length L. let's say that your frame hop is H
> and frame length is F ≥ H and we're doing overlap-add. then your F
> samples of input (H samples are *new* samples in the current frame, F-H
> samples are remaining from the previous frame) are considered
> zero-padded out to infinity in both directions. then the length of the
> result of linear convolution is L+F-1.
> now if you can guarantee that
> the size of the DFT, which we'll call "N" (and most of the time is a
> power of 2) is at least as large as the non-zero length of the linear
> convolution, then the result of circular convolution of the zero-padded
> FIR and the zero-padded frame of samples will be exactly the same.
> that means
>
>     N ≥ L + F - 1

You are completely correct, and as far as I can tell we're in agreement here (again, please correct me if this was meant to be a rebuttal).

Specifically I'm talking about the case where F=H. You perform a standard STFT with these parameters (hop size H, rectangular window of size F = H, FFT length N = H+L-1, so each frame is zero-padded from H to N samples), then multiply each frame by the (r)FFT of your filter, also zero-padded to N. Your STFT will have a height of `N/2+1` (integer division).

You then do the standard ISTFT with overlap-add and the same hop size H. The frame size is now the full N, and you use a "synthesis window" that's the full length N (in practice just taking each chunk with no windowing). Within the ISTFT process you take the `irfft` of each frame, which is now nonzero for some length longer than H, but not more than N (so there's no time aliasing).

That should be exactly the same thing as fast FIR convolution with a chunk size of F, but in the framework of STFT->multiply->ISTFT. The only thing that's not standard STFT processing is the zero-padding (to remove aliasing due to circular convolution, a now much-belabored point).

This is just to make the point that fast FIR is a special case of STFT processing. From a compute perspective this should be no less efficient than fast FIR (I mean, it's doing the same thing). If you do the whole STFT off-line then you waste some memory materializing the whole STFT, but you could consider a streaming version, and at that point the implementation would look very similar to what you'd code up for fast FIR.

Are we all together here?
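
For concreteness, here's a minimal NumPy sketch of the F=H construction described above, written as STFT -> multiply -> ISTFT and checked against plain time-domain convolution. The function name `stft_fast_fir`, the random test signal, and the sizes (H=96, L=33, so N=128) are just things I picked for illustration; treat it as a sketch under those assumptions rather than a tuned implementation.

```python
import numpy as np

def stft_fast_fir(x, h, H):
    """Overlap-add fast FIR expressed as STFT -> multiply -> ISTFT.

    Illustrative assumptions: rectangular analysis window of length
    F = H, hop size H, FFT length N = H + L - 1.
    """
    L = len(h)
    N = H + L - 1                      # smallest FFT length with no time-aliasing
    n_frames = -(-len(x) // H)         # ceil(len(x) / H)

    # Steps 1-2: slice into non-overlapping frames (rectangular window is a no-op).
    xp = np.zeros(n_frames * H)
    xp[:len(x)] = x
    frames = xp.reshape(n_frames, H)

    # Step 3: rfft with n=N zero-pads each frame from H to N samples.
    X = np.fft.rfft(frames, n=N, axis=1)      # shape (n_frames, N // 2 + 1)

    # Step 4: multiply every frame by the zero-padded filter's spectrum.
    Y = X * np.fft.rfft(h, n=N)

    # Step 5: back to the time domain; each output frame is N samples long.
    y_frames = np.fft.irfft(Y, n=N, axis=1)

    # Steps 6-7: rectangular synthesis window (no-op), overlap-add with hop H.
    y = np.zeros((n_frames - 1) * H + N)
    for m in range(n_frames):
        y[m * H : m * H + N] += y_frames[m]

    # Trim the tail that only exists because x was padded to whole frames.
    return y[:len(x) + L - 1], X.shape[1]

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
h = rng.standard_normal(33)

y, n_bins = stft_fast_fir(x, h, H=96)
print(n_bins)                                  # number of rfft bins, N // 2 + 1
print(np.allclose(y, np.convolve(x, h)))       # compare with direct convolution
```

The loop at the end is the whole ISTFT: because the synthesis window is rectangular and each frame's output fits in N samples, overlap-add is just summing shifted frames.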

===== Time Variant Filtering =====

So this seems like it's the really interesting part, and usually why people work in the STFT domain in the first place. As RBJ mentioned, padding (ensuring N >= L+F-1) completely resolves time-aliasing, and that is true whether the filter is stationary or time-varying.

> if it is a rectangular window, the frame length and frame hop are the
> same, F=H, and the number of generated output samples that are valid is
> H, and the most you can hope to get is:
>
>     H = F = N - L + 1

Right, this is the Fast FIR situation I described above.

> <snip>
> if you cut your frame hop size, H, from F to nearly half (F+1)/2 (and
> use a complementary window such as Hann), it is half as efficient, but
> the crossfade is even smoother (and the frame rate is faster, so the
> filter definition can change more often).
>
> all of this is well-established knowledge regarding frame-by-frame
> processing with windows and the FFT.

Yep, we're in agreement here as well. Applying a time-varying filter using non-overlapping rectangular windows seems like a bad idea.

On Mon, Mar 9, 2020, at 8:41 PM, Ethan Duni wrote:
> > On Mar 9, 2020, at 7:16 AM, Spencer Russell <s...@media.mit.edu> wrote:
> >
> >
> > if you have an KxN STFT (K frequency components and N frames) then then
> > zero-padding each frame by K-1 should still eliminate any time-aliasing
> > even if your filter has hard edges in the frequency domain, right?
>
> Right, but if you are using length K FFT and zero-padding by K-1, then
> the hop size is 1 sample and there are no windows.

Whoops, this was dumb on my part. I was not referring to a hop size of 1! Hopefully my explanation above is clearer.

> This is just applying the raw IDFT of the response as an FIR, which is
> not appropriate for something estimated in a windowed filterbank
> domain. Deriving an equivalent FIR from, say, an estimated noise
> reduction mask is not trivial.

Agreed!
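
Going back to the zero-padding point at the top of this section, here's a rough NumPy sketch of the time-varying case. It assumes a periodic Hann analysis window at 50% overlap, no synthesis window, and a separate length-L FIR per frame; the function names, the sizes (F=64, H=32, L=17), and the random per-frame filters are all made up for the example. It only checks that, with N >= L+F-1, STFT-domain multiplication matches per-frame linear convolution followed by overlap-add. It says nothing about how you'd get those FIRs from something like a noise-reduction mask.

```python
import numpy as np

def stft_time_varying(x, filters, F, H, window):
    """Filter frame m with filters[m] in the STFT domain, zero-padded to N."""
    n_frames, L = filters.shape
    N = F + L - 1                          # N >= L + F - 1, so no time-aliasing
    y = np.zeros((n_frames - 1) * H + N)
    for m in range(n_frames):
        frame = window * x[m * H : m * H + F]
        Y = np.fft.rfft(frame, n=N) * np.fft.rfft(filters[m], n=N)
        y[m * H : m * H + N] += np.fft.irfft(Y, n=N)   # overlap-add
    return y

def time_domain_reference(x, filters, F, H, window):
    """Same operation without FFTs: linear convolution per frame, then overlap-add."""
    n_frames, L = filters.shape
    y = np.zeros((n_frames - 1) * H + F + L - 1)
    for m in range(n_frames):
        frame = window * x[m * H : m * H + F]
        y[m * H : m * H + F + L - 1] += np.convolve(frame, filters[m])
    return y

F, H, L = 64, 32, 17
n_frames = 20
rng = np.random.default_rng(1)
x = rng.standard_normal((n_frames - 1) * H + F)
filters = rng.standard_normal((n_frames, L))    # hypothetical per-frame FIRs
window = np.hanning(F + 1)[:F]                  # periodic Hann: overlaps sum to 1 at hop F/2

y_stft = stft_time_varying(x, filters, F, H, window)
y_ref = time_domain_reference(x, filters, F, H, window)
print(np.allclose(y_stft, y_ref))
```

If every row of `filters` is a unit impulse, the overlapping Hann windows sum to one and the interior of the output is just the input again, which is a handy sanity check on the framing.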

I think that the relationship between STFT-domain multiplication and applying a time-varying FIR filter is the most interesting part of this conversation. You could think of STFT multiplication as applying a different FIR filter to each frame and then cross-fading between them, which is clearly not the same as continually varying the FIR parameters in the time domain. They do seem to have a tight relationship though, and when we do STFT modifications it seems that in some contexts we're trying to approximate the time-varying FIR filter.

> > I understand the role of time-domain windowing in STFT processing to be
> > mostly:
> > 1. Reduce frequency-domain ripple (side-lobes in each band)
>
> Right, this is the “analysis” aspect, where the window controls the
> spectral characteristics (frequency selectivity, bandwidth, leakage,
> etc.)
>
> > 2. Provide a sort of cross-fade from frame-to-frame to smooth out framing
> > effects
>
> And that is the “synthesis” aspect, where the window controls the
> characteristics of the artifacts introduced by processing. Note that
> “framing effects” are by definition time-variant: this is a form of
> aliasing.

Great - we're in agreement on the role of the analysis window, and this is starting to get towards the relationship I mentioned above. Can you clarify what you mean by a form of aliasing? As mentioned above, with proper zero-padding there should be no time-aliasing introduced. Do you mean frequency-aliasing? I get that the synthesis window has a smoothing effect, but I'm struggling to understand it in terms of aliasing.

-s
_______________________________________________
dupswapdrop: music-dsp mailing list
music-dsp@music.columbia.edu
https://lists.columbia.edu/mailman/listinfo/music-dsp