Hi Gaurav,

The performance concerns are not just about librosa itself, but also about how to integrate it. As a Python library, librosa requires holding the GIL when called, which makes asynchronous data preprocessing during training difficult. Also, the API design hasn't been verified against the more full-fledged use cases that you outlined. Given that, and the lack of audio-processing expertise among those reviewing the design doc, my suggestion is to continue the work as a Gluon example until more use cases are adopted, which is what you started in https://github.com/apache/incubator-mxnet/pull/13325. Once you make more progress and become more familiar with the Gluon design, please report back to this thread and I'd be happy to help more with the review.
-sz

On 2018/11/20 19:20:18, Gaurav Gireesh <gaurav.gire...@gmail.com> wrote:
> Hi All!
> Following up on this PR:
> https://github.com/apache/incubator-mxnet/pull/13241
> I would need some comments or feedback regarding the API design:
> https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio
>
> The comments on the PR were mostly around *librosa* and its performance
> being a blocker if and when the designed API can be tested with bigger ASR
> models such as DeepSpeech 2 and DeepSpeech 3.
> I would appreciate it if the community could share their expertise on
> loading audio data and the feature extraction currently used with bigger
> ASR models.
> If there is anything in the design which may be changed/improved to gain
> performance, I'll be happy to look into it.
>
> Thanks and regards,
> Gaurav Gireesh
>
> On Thu, Nov 15, 2018 at 10:47 AM Gaurav Gireesh <gaurav.gire...@gmail.com>
> wrote:
>
> > Hi Lai!
> > Thank you for your comments!
> > Below are the answers to your comments/queries:
> > 1) That's a good suggestion. However, I have added an example in the pull
> > request related to this:
> > https://github.com/apache/incubator-mxnet/pull/13241/commits/eabb68256d8fd603a0075eafcd8947d92e7df27f
> > I would be happy to include a dataset similar to MNIST to support that. I
> > have come across an example dataset used in the TensorFlow speech
> > recognition example here:
> > <https://www.tensorflow.org/tutorials/sequences/audio_recognition>
> > This could be included.
> >
> > 2) Thank you for the suggestion, I shall look into the FFT operator that
> > you have pointed out. However, there are other kinds of features, such as
> > MFCCs and mel spectrograms, which are popular in audio feature extraction
> > and would find utility if implemented. I am not sure if we have operators
> > for these.
> >
> > 3) The references look good too. I shall look into them. Thank you for
> > bringing them to my notice.
> >
> > Regards,
> > Gaurav
> >
> > On Tue, Nov 13, 2018 at 11:22 AM Lai Wei <roywei...@gmail.com> wrote:
> >
> >> Hi Gaurav,
> >>
> >> Thanks for starting this. I see the PR is out
> >> <https://github.com/apache/incubator-mxnet/pull/13241>, left some
> >> initial reviews, good work!
> >>
> >> In addition to Sandeep's queries, I have the following:
> >> 1. Can we include some simple classic audio dataset for users to
> >> directly import and try out, like MNIST in vision? (e.g.
> >> http://pytorch.org/audio/datasets.html#yesno)
> >> 2. Librosa provides some good audio feature extractions, and we can use
> >> it for now, but it's slow because you have to convert between ndarray
> >> and numpy. In the long term, can we make the transforms use MXNet
> >> operators and change your transforms to hybrid blocks? For example, the
> >> MXNet FFT operator
> >> <https://mxnet.apache.org/api/python/ndarray/contrib.html?highlight=fft#mxnet.ndarray.contrib.fft>
> >> can be used in a hybrid block transformer, which will be a lot faster.
> >>
> >> Some additional references on users already using MXNet on audio; we
> >> should aim to make it easier and automate the file
> >> load/preprocess/transform process:
> >> 1. https://github.com/chen0040/mxnet-audio
> >> 2. https://github.com/shuokay/mxnet-wavenet
> >>
> >> Looking forward to seeing this feature out.
> >> Thanks!
> >>
> >> Best Regards
> >>
> >> Lai
> >>
> >>
> >> On Tue, Nov 13, 2018 at 9:09 AM sandeep krishnamurthy <
> >> sandeep.krishn...@gmail.com> wrote:
> >>
> >> > Thanks, Gaurav, for starting this initiative. The design document is
> >> > detailed and gives all the information.
> >> > Starting to add this in "contrib" is a good idea, while we expect a
> >> > few rough edges and cleanups to follow.
> >> >
> >> > I had the following queries:
> >> > 1. Is there any analysis comparing LibROSA with other libraries w.r.t.
> >> > features, performance, and community usage in the audio data domain?
> >> > 2. What is the recommendation for the LibROSA dependency? Part of the
> >> > MXNet PyPI package, or ask the user to install it if required? I
> >> > prefer the latter, similar to protobuf in ONNX-MXNet.
> >> > 3. I see LibROSA is a fully Python-based library. Are we getting
> >> > blocked by this dependency for future use cases, when we want to
> >> > implement the transformations as operators and allow for
> >> > cross-language support?
> >> > 4. In the performance design considerations, the difference between
> >> > lazy=True and lazy=False is too scary (8 minutes to 4 hours!). This
> >> > requires some more analysis. If we know that turning a flag on/off
> >> > causes a 24x performance degradation, should we provide that control
> >> > to the user? What is the impact of this on memory usage?
> >> > 5. I see LibROSA has an ISC license
> >> > (https://github.com/librosa/librosa/blob/master/LICENSE.md), which
> >> > says it is free to use with the same license notification. I am not
> >> > sure if this is OK. I request other committers/mentors to advise.
> >> >
> >> > Best,
> >> > Sandeep
> >> >
> >> > On Fri, Nov 9, 2018 at 5:45 PM Gaurav Gireesh
> >> > <gaurav.gire...@gmail.com> wrote:
> >> >
> >> > > Dear MXNet Community,
> >> > >
> >> > > I recently started looking into performing some simple multi-class
> >> > > sound classification tasks with audio data, and realized that as a
> >> > > user I would like MXNet to have an out-of-the-box feature which
> >> > > allows us to load audio data (at least one file format), extract
> >> > > features (or apply some common transforms/feature extraction), and
> >> > > train a model using the audio dataset.
> >> > > This could be a first step towards building and supporting APIs
> >> > > similar to what we have for "vision" related use cases in MXNet.
> >> > >
> >> > > Below is the design proposal:
> >> > >
> >> > > Gluon - Audio Design Proposal
> >> > > <https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio>
> >> > >
> >> > > I would highly appreciate your taking the time to review and
> >> > > provide feedback, comments, and suggestions on this.
> >> > > Looking forward to your support.
> >> > >
> >> > >
> >> > > Best Regards,
> >> > >
> >> > > Gaurav Gireesh
> >> > >
> >> >
> >> >
> >> > --
> >> > Sandeep Krishnamurthy
> >> >
> >>
> >
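[Editor's note] For readers following the FFT discussion in this thread: the frame-then-FFT feature extraction that librosa performs (and that Lai suggests eventually moving onto MXNet operators via hybrid blocks) can be sketched in plain numpy. This is an illustrative sketch only; `magnitude_spectrogram` is a hypothetical helper, not part of the proposed Gluon API, and it omits the windparam options, centering, and padding that `librosa.stft` applies.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=512, hop=256):
    """Naive STFT magnitude, similar in spirit to librosa.stft.

    Sketch only: slices the signal into overlapping frames,
    applies a Hann window, and takes the real FFT of each frame.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    window = np.hanning(frame_len)            # Hann window per frame
    return np.abs(np.fft.rfft(frames * window, axis=1))

# Toy usage: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
# spec has one row per frame and frame_len // 2 + 1 frequency bins.
```

In the operator-based design the thread discusses, the framing and FFT above would instead run inside a hybrid block using MXNet's contrib FFT operator, avoiding the ndarray-to-numpy round trips (and the GIL) that a librosa-backed transform incurs.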