> So if I want something that works on all 3, I kinda need to roll my own 
> AEC+VAD.

If you insist on running the same audio code on all 3 devices, then yes.

> I'm struggling really hard to extract aec3 out of WebRTC. Whereas VAD was 
> pretty straightforward, aec3 seems to have a dependency on something called 
> abseil.

abseil should be pretty much self-contained. You can get it from here if you 
can't set up the WebRTC dependencies: https://github.com/abseil/abseil-cpp
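
If it helps, here's a sketch of pulling abseil-cpp into a CMake build with 
FetchContent, so the extracted aec3 sources can link against it. The target 
name my_aec3 is a placeholder, and which absl:: targets you need depends on 
what the extracted files actually include:

    include(FetchContent)
    FetchContent_Declare(
      abseil
      GIT_REPOSITORY https://github.com/abseil/abseil-cpp.git
      GIT_TAG        20240722.0          # pin whichever release you like
    )
    set(ABSL_PROPAGATE_CXX_STD ON)       # abseil inherits your C++ standard
    FetchContent_MakeAvailable(abseil)

    # e.g. link the optional and string utilities the aec3 sources pull in:
    target_link_libraries(my_aec3 PRIVATE absl::optional absl::strings)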

I'm afraid I can't address the other questions you've raised about AEC 
algorithms. I would recommend going with the simplest solution, and only start 
digging into research papers if that doesn't work for your purposes.

Regards,
Tamás Zahola

> On 18 Oct 2024, at 16:09, π via Coreaudio-api <[email protected]> 
> wrote:
> 
> Yikes!
> 
> Well, the purpose of my project was to investigate the possibilities of using 
> OpenAI's realtime API on Apple tech, and I'm indeed discovering the gotchas.
> 
> So, IIUC:
> - on macOS I can get beneath the AudioUnit level and go straight to 
> AudioDevice; down to the wire, so to speak. And roll my own AEC & VAD. 
> Alternatively I can use the VoiceProcessingIO AudioUnit, which gives me AEC & 
> VAD, though they don't play nice together; but if I roll my own VAD (using 
> the WebRTC code) I'm good to go.
> 
> - on iOS I can't get at the AudioDevice, but I still have the 
> VoiceProcessingIO technique available as above. Alternatively I could use 
> the RemoteIO AudioUnit and roll my own AEC+VAD. But then I'm not getting 
> system audio-out. Ho hum; liveable with.
> 
> - on watchOS, we don't have AudioDevice OR the VoiceProcessingIO AudioUnit, 
> but we DO still have the RemoteIO AudioUnit.
> 
> So if I want something that works on all 3, I kinda need to roll my own 
> AEC+VAD.
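> 
> (The common-denominator setup I'm picturing, as a sketch: the same C API 
> everywhere, with just the component subtype switched per platform.)
> 
>     #include <AudioToolbox/AudioToolbox.h>
>     #include <TargetConditionals.h>
>     
>     AudioComponentDescription desc = {0};
>     desc.componentType = kAudioUnitType_Output;
>     #if TARGET_OS_OSX
>     desc.componentSubType = kAudioUnitSubType_HALOutput;  // "AUHAL" does input too
>     #else
>     desc.componentSubType = kAudioUnitSubType_RemoteIO;   // iOS / watchOS
>     #endif
>     desc.componentManufacturer = kAudioUnitManufacturer_Apple;
>     
>     AudioUnit io = NULL;
>     AudioComponentInstanceNew(AudioComponentFindNext(NULL, &desc), &io);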
> 
> I'm struggling really hard to extract aec3 out of WebRTC. Whereas VAD was 
> pretty straightforward, aec3 seems to have a dependency on something called 
> abseil.
> 
> It seems AEC is far from a "Solved Problem". I see Microsoft recently (2023) 
> issued a challenge inviting novel AEC solutions (presumably the current AI 
> boom is going to shake loose some new approaches): 
> https://www.microsoft.com/en-us/research/academic-program/acoustic-echo-cancellation-challenge-icassp-2023/
> But as an outsider I don't get to see the submissions; e.g. the winning 
> non-Microsoft entry is behind a paywall at 
> https://ieeexplore.ieee.org/document/10096411 (though maybe it's the same as 
> https://arxiv.org/pdf/2303.06828).
> 
> I wonder whether "cheating" buys much; i.e. emitting a periodic sweep/chirp 
> from the speakers to estimate the impulse-response of the acoustic 
> environment, in order to deduce an inverse-IR. Then I think the AEC is just 
> applying that, possibly together with some delay to compensate for I/O 
> latency.
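> 
> (Concretely, something like this naive sketch: given an impulse-response h 
> measured via the chirp, subtract the speaker signal convolved with h from 
> the mic signal, with a delay term for the I/O latency:)
> 
>     #include <vector>
>     #include <cstddef>
>     
>     // mic: captured input; spk: what we sent to the speaker;
>     // h: estimated room impulse response; delay: output->input latency (samples)
>     std::vector<float> cancelEcho(const std::vector<float>& mic,
>                                   const std::vector<float>& spk,
>                                   const std::vector<float>& h,
>                                   std::size_t delay) {
>         std::vector<float> out(mic.size());
>         for (std::size_t n = 0; n < mic.size(); ++n) {
>             float echo = 0.0f;
>             for (std::size_t k = 0; k < h.size(); ++k)
>                 if (n >= delay + k) echo += h[k] * spk[n - delay - k];
>             out[n] = mic[n] - echo;
>         }
>         return out;
>     }
> 
> (Though I gather a static h like this breaks as soon as anything in the room 
> moves, which is presumably why real AECs adapt the filter continuously.)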
> 
> Does anyone have an intuition whether it's even sensible to be considering 
> realtime AEC on watchOS? Just from a performance PoV it might drain the 
> battery really fast.
> 
> π
> 
> On Fri, 18 Oct 2024 at 11:55, Tamás Zahola via Coreaudio-api 
> <[email protected]> wrote:
>> Hold on a sec, how are you planning to use the AudioDevice VAD on watchOS? 
>> It's a macOS-only API: not available on watchOS, nor on iOS.
>> 
>> Now, considering what Julian wrote, I think your problem might be that 
>> you're using the VPIO unit in conjunction with the AudioDevice VAD. Because 
>> if what Julian wrote is true, that the AudioDevice already has echo 
>> cancellation when VAD is enabled, then what could be happening is that your 
>> output signal is subtracted *twice* from the input: first by the echo 
>> canceller of the AudioDevice, and then by the VPIO unit. So in effect the 
>> VPIO unit ends up re-adding the echo with inverted phase.
>> 
>> I would recommend trying just AudioDevice directly, without the VPIO unit.
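>> 
>> Something like this (an untested sketch, assuming a recent macOS; 
>> MyVADListener is whatever callback you hang off it):
>> 
>>     #include <CoreAudio/CoreAudio.h>
>>     
>>     AudioObjectPropertyAddress addr = {
>>         kAudioHardwarePropertyDefaultInputDevice,
>>         kAudioObjectPropertyScopeGlobal,
>>         kAudioObjectPropertyElementMain };
>>     AudioDeviceID dev = kAudioObjectUnknown;
>>     UInt32 size = sizeof(dev);
>>     AudioObjectGetPropertyData(kAudioObjectSystemObject, &addr, 0, NULL,
>>                                &size, &dev);
>>     
>>     // Enabling VAD is what reportedly also engages the device-level AEC:
>>     UInt32 on = 1;
>>     addr.mSelector = kAudioDevicePropertyVoiceActivityDetectionEnable;
>>     addr.mScope = kAudioDevicePropertyScopeInput;
>>     AudioObjectSetPropertyData(dev, &addr, 0, NULL, sizeof(on), &on);
>>     
>>     // Get notified on speech start/stop; read the ...State property inside:
>>     addr.mSelector = kAudioDevicePropertyVoiceActivityDetectionState;
>>     AudioObjectAddPropertyListener(dev, &addr, MyVADListener, NULL);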
>> 
>> Obviously, this is all macOS-only. On iOS (and I guess watchOS) you only 
>> have AudioUnits, so you must use your own VAD.
>> 
>> Regards,
>> Tamás Zahola
>> 
>>> On 2024. Oct 18., at 12:30, π via Coreaudio-api 
>>> <[email protected]> wrote:
>>> 
>>> 
>>> Thanks for the pointer Tamás!
>>> 
>>> Pulling out VAD from WebRTC worked a treat.
>>> 
>>> I started with https://github.com/daanzu/py-webrtcvad-wheels and knocked 
>>> together a hello.cpp and CMakeLists.txt 
>>> (https://gist.github.com/p-i-/598da13d2a1a1e2a6ec978e15fa7d892)
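>>> 
>>> (For anyone else trying this, the extracted API boils down to a handful of 
>>> calls. A sketch: it only accepts 10/20/30 ms frames of mono 16-bit PCM at 
>>> 8/16/32/48 kHz, and depending on the WebRTC vintage WebRtcVad_Create may 
>>> take an out-parameter instead of returning a pointer.)
>>> 
>>>     #include "webrtc_vad.h"   // from common_audio/vad/include
>>>     #include <stdint.h>
>>>     
>>>     VadInst* vad = WebRtcVad_Create();
>>>     WebRtcVad_Init(vad);
>>>     WebRtcVad_set_mode(vad, 2);          // 0..3; higher = more aggressive
>>>     
>>>     int16_t frame[160];                  // 10 ms at 16 kHz
>>>     /* ... fill `frame` from the mic input callback ... */
>>>     int r = WebRtcVad_Process(vad, 16000, frame, 160);
>>>     // r == 1: speech, r == 0: silence, r == -1: error
>>>     
>>>     WebRtcVad_Free(vad);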
>>> 
>>> I have to say, it feels hella awkward that I cannot control the pipeline 
>>> and use native AudioUnits for this kind of work.
>>> 
>>> Surely it is a mistake on Apple's part to put VAD before AEC, if this is 
>>> really what they're doing... it's going to trigger the VAD callback on 
>>> incoming/remote audio, rather than user-speech.
>>> 
>>> For a low-power usage scenario (say watchOS), I really want to be 
>>> dynamically rerouting -- if there's no audio being sent through the 
>>> speaker, I don't want AEC eating CPU cycles, but I DO want VAD detecting 
>>> user-speech onset. And if audio IS being sent through the speaker, I want 
>>> AEC to be subtracting it, and VAD to be operating on this "cleaned" 
>>> mic-input. I'd love it if the VoiceProcessingIO unit took care of all of 
>>> this.
>>> 
>>> I haven't yet managed to scientifically determine exactly what the 
>>> VoiceProcessingIO unit is actually doing, but if I engage its AEC and VAD 
>>> and play a sine-wave, it disturbs the VAD callbacks, yet successfully 
>>> subtracts the sine-wave from mic-audio. So I strongly suspect they have 
>>> these two subcomponents wired up in the wrong order.
>>> 
>>> If this is indeed the case, is there any likelihood of a future fix? Do 
>>> Apple Core Audio devs listen in on this list?
>>> 
>>> π
>>> 
>>> On Thu, 17 Oct 2024 at 10:24, Tamás Zahola via Coreaudio-api 
>>> <[email protected]> wrote:
>>>> You can extract the VAD algorithm from WebRTC by starting at this file: 
>>>> https://chromium.googlesource.com/external/webrtc/stable/src/+/master/common_audio/vad/vad_core.h
>>>> 
>>>> You'll also need some stuff from the common_audio/signal_processing 
>>>> folder, but otherwise it's self-contained.
>>>> 
>>>>> It's easy for me to get the audio-output-stream for MY app (it just comes 
>>>>> in over the websocket), but I may wish to toggle whether I want my AEC to 
>>>>> be cancelling out any output-audio generated by other processes on my mac.
>>>> 
>>>> From macOS Ventura onwards it is possible to capture system audio with the 
>>>> ScreenCaptureKit framework, although your app will need extra privacy 
>>>> permissions.
>>>> 
>>>>> It must be possible on macOS, as apps like SoundFlower or BlackHole are 
>>>>> able to do it.
>>>> 
>>>> BlackHole and SoundFlower are using an older technique, where they install 
>>>> a virtual loopback audio device on the system (you can see it listed in 
>>>> Audio MIDI Settings as e.g. "BlackHole 2 ch"), and change the system's 
>>>> default output device to that, then capture from the input port of this 
>>>> loopback device. But this requires installing the virtual device in 
>>>> /Library/Audio/Plug-Ins/HAL, which requires admin privileges.
>>>> 
>>>>> But mobile, I'm not so sure. My memory of iPhone audio dev (~2008) is 
>>>>> that it was impossible to access this. But there's now some mention of v3 
>>>>> audio-units being able to process inter-app audio.
>>>> 
>>>> On iOS you must use the voice-processing I/O unit. Normal apps cannot 
>>>> capture the system audio output. Technically there is a way to do it with 
>>>> the ReplayKit framework, but it's a pain in the ass to use, and the 
>>>> primary purpose of that framework is capturing screen content, not audio. 
>>>> If you try e.g. Facebook Messenger on iOS, and initiate screen-sharing in 
>>>> a video call, that's going to use ReplayKit.
>>>> 
>>>> Regards,
>>>> Tamás Zahola 
>>>> 
>>>>> On 17 Oct 2024, at 08:04, π via Coreaudio-api 
>>>>> <[email protected]> wrote:
>>>>> 
>>>>> Thank you for the replies. I am glad to see that this mailing-list is 
>>>>> still alive, despite the dwindling traffic these last few years.
>>>>> 
>>>>> Can I not encapsulate a VPIO unit, and control the input/output 
>>>>> audio-streams by implementing input/render callbacks, or making 
>>>>> connections?
>>>>> 
>>>>> I'm veering towards this approach of manual implementation: just use a 
>>>>> HAL input unit (misnamed, as it's really I/O) on macOS or a RemoteIO 
>>>>> unit on the mobile platforms to access the raw I/O buffers, and write my 
>>>>> own pipeline.
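>>>>> 
>>>>> (The shape of that, as an untested sketch: assuming io is an AUHAL or 
>>>>> RemoteIO instance from AudioComponentFindNext + AudioComponentInstanceNew, 
>>>>> and MyInputCallback/myContext are my own:)
>>>>> 
>>>>>     UInt32 one = 1;
>>>>>     // Bus 1 = input (mic), bus 0 = output (speaker)
>>>>>     AudioUnitSetProperty(io, kAudioOutputUnitProperty_EnableIO,
>>>>>                          kAudioUnitScope_Input, 1, &one, sizeof(one));
>>>>>     // (on macOS, also set kAudioOutputUnitProperty_CurrentDevice here)
>>>>>     
>>>>>     AURenderCallbackStruct cb = { MyInputCallback, myContext };
>>>>>     AudioUnitSetProperty(io, kAudioOutputUnitProperty_SetInputCallback,
>>>>>                          kAudioUnitScope_Global, 0, &cb, sizeof(cb));
>>>>>     
>>>>>     AudioUnitInitialize(io);
>>>>>     AudioOutputUnitStart(io);
>>>>>     
>>>>>     // In MyInputCallback, pull the mic samples with AudioUnitRender()
>>>>>     // into my own AudioBufferList, then feed my AEC -> VAD chain.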
>>>>> 
>>>>> Would it be a good idea to use https://github.com/apple/AudioUnitSDK to 
>>>>> wrap this? My hunch is to minimize the layers/complexity and NOT use this 
>>>>> framework.
>>>>> 
>>>>> And for the AEC/VAD, can anyone offer a perspective? Arshia? The two 
>>>>> obvious candidates I see are WebRTC and Speex. GPT-4o reckons WebRTC 
>>>>> will be the most advanced / best-performing solution, with the downside 
>>>>> that it's a big project (and maybe a more complicated build process), 
>>>>> while Speex is more lightweight and will probably do the job well enough 
>>>>> for my purposes.
>>>>> 
>>>>> And as both are open-source, I may have the option of pulling out the 
>>>>> minimal-dependency files and building just those.
>>>>> 
>>>>> The last question is regarding system-wide audio output. It's easy for 
>>>>> me to get the audio-output-stream for MY app (it just comes in over the 
>>>>> websocket), but I may wish to toggle whether I want my AEC to be 
>>>>> cancelling out any output-audio generated by other processes on my Mac. 
>>>>> e.g. if I am watching a YouTube video, maybe I want my AI to listen to 
>>>>> that, and maybe I want it subtracted. So do I have the option to listen 
>>>>> to SYSTEM-level audio output (so as to feed it into my AEC impl)? It 
>>>>> must be possible on macOS, as apps like SoundFlower or BlackHole are 
>>>>> able to do it. But mobile, I'm not so sure. My memory of iPhone audio 
>>>>> dev (~2008) is that it was impossible to access this. But there's now 
>>>>> some mention of v3 audio-units being able to process inter-app audio.
>>>>> 
>>>>> π
>>>>> 
>>>>> On Wed, 16 Oct 2024 at 19:35, Arshia Cont via Coreaudio-api 
>>>>> <[email protected]> wrote:
>>>>>> Hi π,
>>>>>> 
>>>>>> From my experience that’s not possible. VPIO is an option for the 
>>>>>> lower-level IO device; so is VAD. You don’t have much control over 
>>>>>> their internals, routing, and wiring! Also, from our experience, VPIO 
>>>>>> has different behaviour on different devices. On some iPads we saw 
>>>>>> “gating” instead of actual echo removal (be aware of that!). In the 
>>>>>> end, for a similar use-case we ended up doing our own AEC and activity 
>>>>>> detection.
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Arshia Cont
>>>>>> metronautapp.com
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 15 Oct 2024, at 18:08, π via Coreaudio-api 
>>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>> Dear Audio Engineers,
>>>>>>> 
>>>>>>> I'm writing an app to interact with OpenAI's 'realtime' API 
>>>>>>> (bidirectional realtime audio over websocket with AI serverside).
>>>>>>> 
>>>>>>> To do this, I need to be careful that the AI-speak doesn't make its 
>>>>>>> way out of the speakers, back in through the mic, and back to their 
>>>>>>> server (else it starts to talk to itself, and gets very confused).
>>>>>>> 
>>>>>>> So I need AEC, which I've actually got working by using 
>>>>>>> kAudioUnitSubType_VoiceProcessingIO and setting 
>>>>>>> kAUVoiceIOProperty_BypassVoiceProcessing to false via 
>>>>>>> AudioUnitSetProperty.
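>>>>>>> 
>>>>>>> (Roughly like this, error handling elided:)
>>>>>>> 
>>>>>>>     AudioComponentDescription desc = {
>>>>>>>         kAudioUnitType_Output, kAudioUnitSubType_VoiceProcessingIO,
>>>>>>>         kAudioUnitManufacturer_Apple, 0, 0 };
>>>>>>>     AudioUnit vpio = NULL;
>>>>>>>     AudioComponentInstanceNew(AudioComponentFindNext(NULL, &desc), &vpio);
>>>>>>>     
>>>>>>>     UInt32 bypass = 0;   // 0 = voice processing (incl. AEC) active
>>>>>>>     AudioUnitSetProperty(vpio, kAUVoiceIOProperty_BypassVoiceProcessing,
>>>>>>>                          kAudioUnitScope_Global, 0, &bypass, sizeof(bypass));
>>>>>>>     AudioUnitInitialize(vpio);
>>>>>>>     AudioOutputUnitStart(vpio);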
>>>>>>> 
>>>>>>> Now I also wish to detect when the speaker (me) is speaking or not 
>>>>>>> speaking, which I've also managed to do via 
>>>>>>> kAudioDevicePropertyVoiceActivityDetectionEnable.
>>>>>>> 
>>>>>>> But getting them to play together is another matter, and I'm struggling 
>>>>>>> hard here.
>>>>>>> 
>>>>>>> I've rigged up a simple test 
>>>>>>> (https://gist.github.com/p-i-/d262e492073d20338e8fcf9273a355b4), where 
>>>>>>> a 440Hz sinewave is generated in the render-callback, and mic-input is 
>>>>>>> recorded to file in the input-callback. 
>>>>>>> 
>>>>>>> So the AEC works delightfully, subtracting the sinewave and recording 
>>>>>>> my voice.
>>>>>>> And if I turn the sine-wave amplitude down to 0, the VAD correctly 
>>>>>>> triggers the speech-started and speech-stopped events.
>>>>>>> 
>>>>>>> But if I turn up the sine-wave, it messes up the VAD.
>>>>>>> 
>>>>>>> Presumably the VAD is working on the pre-echo-cancellation audio, 
>>>>>>> which is most undesirable.
>>>>>>> 
>>>>>>> How can I progress here?
>>>>>>> 
>>>>>>> My thought was to create an audio pipeline, using AUGraph, but my 
>>>>>>> efforts have thus far been unsuccessful, and I lack confidence that I'm 
>>>>>>> even pushing in the right direction.
>>>>>>> 
>>>>>>> My thought was to have an IO unit that interfaces with the hardware 
>>>>>>> (mic/spkr), which plugs into an AEC unit, which plugs into a VAD unit.
>>>>>>> 
>>>>>>> But I can't see how to set this up.
>>>>>>> 
>>>>>>> On iOS there's a RemoteIO unit to deal with the hardware, but I can't 
>>>>>>> see any such unit on macOS. It seems the VoiceProcessing unit wants to 
>>>>>>> do that itself.
>>>>>>> 
>>>>>>> And then I wonder: could I make a second VoiceProcessing unit, and 
>>>>>>> have vp1_aec send its bus[1 (mic)] output scope into vp2_vad's bus[1] 
>>>>>>> input scope?
>>>>>>> 
>>>>>>> Can I do this kind of work by routing audio, or do I need to get my 
>>>>>>> hands dirty with input/render callbacks?
>>>>>>> 
>>>>>>> It feels like I'm going hard against the grain if I am faffing with 
>>>>>>> these callbacks.
>>>>>>> 
>>>>>>> If there's anyone out there who would care to offer me some guidance 
>>>>>>> here, I'd be most grateful!
>>>>>>> 
>>>>>>> π
>>>>>>> 
>>>>>>> PS Is it not a serious problem that VAD can't operate on post-AEC input?