Thank you for your reply. As for playback, I also think that singing each note the moment it is entered is impossible; the lyrics need to be set first, and even then the synthesis takes time. But getting it to play the way Cadencii does would probably be good: press play once everything is set. Cadencii takes a while to do that, though, and at some point the time spent waiting for the synthesis is probably several times the time spent actually editing (that said, I think a lot of optimization is possible in Cadencii, so it is probably not the best example).
Leaving the questions about dictionaries for later, a side note about my struggles with v.Connect-STAND, Cadencii's synthesis engine. I have finally been able to get some results out of it (by switching between Linux and Windows whenever one of them runs into a problem). The rendering is more than decent in my opinion (although it depends a lot on the settings and on the voicebank used, and it can sound worse than e-cantorix if not used properly (okay, not that bad, but still)), and I think it is an interesting tool overall (some Utau users import their Utaus into v.Connect-STAND to get a better rendering, though it is sometimes a little tricky). However, a few points hinder direct use:

- The Windows binaries won't work unless the system is as Japanese as possible. I don't know what is causing this yet (I am not used to compiling on Windows), but it needs a fix.

- Encoding auto-detection is probably needed; even my Linux-built version expects input encoded as Shift-JIS by default (the typical encoding of files created by Japanese users on Windows). It supports other encodings, but the user has to specify them.

- The software takes a meta text sequence file (its own format) and outputs audio. I think implementing a conversion from a score to a meta text sequence would be sufficient for the first part of the project (generating the audio), but optionally an optimization might be possible: as v.Connect is based on WORLD (which implements real-time singing synthesis, according to its introduction page), I wonder whether the code could be changed to intercept the parameters before the audio is generated and play them back in real time. I have not dived far enough into v.Connect's code, so if someone who has thinks I am heading down a wrong and completely impossible path, please let me know.

A very interesting point, however, is its ability to convert and use Utau voicebanks, given the great number of downloadable Utaus on the net (let's forget for now about the mass of problems that alone causes). While looking into the possibility of using English with Utau voices, I came across, among others, this page: http://utau.wiki/cv-vc (see also: utau.wiki/tutorials:cvvc-english-tutorial-by-mystsaphyr ). This seems popular enough that a lot of utauloids use this method to simulate non-Japanese pronunciation. Namine Ritsu, a free voice for v.Connect-STAND (and the most popular one), also has recordings of this kind, although the English rendering is far from perfect and accents are entirely left to the user to simulate. There are also (non-open-source) plugins that can convert lyrics (or rather sequence files) from CVVC to VCV (another style used in Utaus). Even though this lets the user get voicebanks from the internet and add their own, I can easily think of a few issues one could run into:

- Making the user input phonetic symbols instead of actual lyrics is not a solution. It may be possible to convert lyrics to espeak phonemes and implement the remaining conversion step (which would depend on the voice); see the espeak sketch after this list. That brings another set of problems: the user would need to supply both the word and its hyphenation. And even then, some problems are bound to happen, either because the word isn't in the dictionary or because the sound isn't available in the voicebank. In the first case, the user may need to provide the pronunciation themselves (for a proper noun, for example). Besides this, should we let the user modify the pronunciation (after it is generated automatically) to simulate an accent or to make something sound more natural?

- Encoding problems, always. Japanese on Windows is unpredictably tricky to deal with; see the encoding sketch after this list.

- Voicebanks are usually recorded for one specific language. I could be wrong, but for now I don't see how we could detect the language unless the user specifies it. Also, some of the Japanese ones are only compatible with either romaji or kana (we could use kakasi to convert either the lyrics or the voicebank).
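About the espeak idea in the first item above, just to make it concrete, here is a minimal sketch of the lyric-to-phonemes step, assuming the espeak binary is installed and on the PATH. The helper name espeakPhonemes() is mine, and the mapping from espeak's phoneme mnemonics to whatever symbols a given voicebank expects (the "remaining conversion step") is not covered here.

    // Minimal sketch: ask espeak for the phoneme mnemonics of one lyric word.
    // "-q" suppresses audio output, "-x" writes phoneme mnemonics to stdout,
    // "-v" selects the voice/language.
    #include <QProcess>
    #include <QString>

    static QString espeakPhonemes(const QString& word, const QString& voice = "en")
    {
        QProcess p;
        p.start("espeak", { "-q", "-x", "-v", voice, word });
        if (!p.waitForFinished(3000))
            return QString();   // espeak not installed, or it hung
        return QString::fromLocal8Bit(p.readAllStandardOutput()).trimmed();
    }

The output (something like "h@l'oU" for "hello") would then have to go through a per-voicebank table mapping espeak phonemes to CVVC/VCV aliases, and splitting the result across the note syllables (the hyphenation problem) is exactly the part this sketch does not touch.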
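And for the encoding issue, the kind of fallback I have in mind is only a heuristic, not real charset detection (a dedicated detector such as uchardet or ICU would do a better job); decodeSequenceFile() is just an illustrative name:

    // Minimal sketch: treat the file as UTF-8 if it decodes cleanly,
    // otherwise fall back to Shift-JIS (the usual encoding of files
    // created by Japanese users on Windows).
    #include <QByteArray>
    #include <QString>
    #include <QTextCodec>

    static QString decodeSequenceFile(const QByteArray& bytes)
    {
        QTextCodec::ConverterState state;
        QTextCodec* utf8 = QTextCodec::codecForName("UTF-8");
        QString text = utf8->toUnicode(bytes.constData(), bytes.size(), &state);
        if (state.invalidChars == 0)
            return text;                     // decodes cleanly as UTF-8
        QTextCodec* sjis = QTextCodec::codecForName("Shift-JIS");
        return sjis ? sjis->toUnicode(bytes) : text;
    }

Something like this would at least remove the need to tell v.Connect-STAND the encoding by hand in the common cases.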
Anyway, I don't think one summer's worth of work would be enough to even consider all these issues (everything is so much more complicated than it first seems). The question is how much would make an acceptable project. What I have in mind for now is roughly the following:

- As a first step, taking care of the usability issues of v.Connect-STAND, or ideally turning it into a usable library.

- Implementing the generation of meta text sequences (it would be interesting to see how Cadencii, the open-source C++/Qt editor, does it). This should include the processing of whatever settings we have (including phonemes), since this kind of file is supposed to provide all the information needed for synthesis.

- Making a MuseScore plugin out of the two items above. This would additionally include:
  - the front-end (collecting settings)
  - the playback function

Though I don't know whether this is relevant to the current discussion (or at all), while looking for good free voice data I found Namine Ritsu's license very unclear (the site the wiki pages link to for the terms of use no longer exists). There is a separation between the character (visual art, profile, ...) and the voice resources. I suspect from the contradictory official information that the terms have changed over time. The character itself seems to be the property of Canon, but there do not seem to be any restrictions on the use of the voices. In addition, this voicebank (http://hal-the-cat.music.coocan.jp/ritsu_e.html) says it is released under the terms of the GPLv3, so I assume at least this one is safe enough.

[Unclear official material:
- http://www.canon-voice.com/english/kiyaku.html (the English says something very unclear about the character, but the voice is free)
- http://canon-voice.com/ritsu.html ]

So the immediate questions are:

- Is this a realistic and/or acceptable project?
- I am not aware of the MuseScore plugin rules, so is such an approach all right? If not, what would be a better way?
- I am not sure where to integrate the second part, but I think the part integrated into MuseScore should be as general as possible, so that support for other tools can be added gradually.

Sorry for the long post. Please let me know your opinion, and whether I am analyzing things wrong!
