Dear Stojan,

On 25 September 2017 at 11:09, Стоян Димитров wrote:
> Greetings,
>
> I'd like to propose for adding an alternative set of hyphenation patterns
> for language that to already have one. What is the procedure I should
> follow? Is it possible?

I'm not saying that it's not possible, but it's something we've been "sweeping under the rug" for the past 9 years (since Arthur and I started with the patterns cleanup). So if someone wants to address the issue, it would help to come up with some reasonable solution as well.

Another language with a similar problem is Russian. I believe there are roughly 6 alternative pattern sets for that.

> What about the license, author permissions?

That depends on the patterns / the files you intend to use and has nothing to do with the rest of the technical problems. In the ideal case the author would agree to some permissive licence (for us the ideal seems to be MIT, which has worked for all projects involved so far).

> Where should be hosted?

That depends on the solution you come up with. To start with, the patterns should be *somewhere* where one could fetch them. What is the "upstream" source?

> Are there any restrictions I should take into account.
>
> The language in question is Bulgarian. In the wild there are two sets of
> patterns. The one officially listed here and the one that is not. As far as
> I can tell both of them are used by the Bulgarian community. Though there
> are no figures I can present to you. There are no publicly available quality
> checks or auditions for any of them so this also could not be used as
> factor.

By far the best solution would be to encourage a couple of local linguists to run both pattern sets through a long list of words, do some extensive analysis, and decide which set works best. Or perhaps come up with a third set that works better than either of the existing ones (see also https://xkcd.com/927/). It would be super beneficial if someone did a quality analysis of the existing patterns.

Now, there are numerous different possibilities to address the problem. If the first and best solution is not an option, you can make it work by also collaborating with babel and polyglossia to support those new patterns in some consistent way.
But then you also need to educate local TeX users to make use of those options. The Germans, for example, have a package that replaces the default set of patterns with alternative ones.

My biggest fear is that after doing all the work you might still end up with just 5 users or even fewer actually using those alternative patterns. (In all those 9 years this was, for example, the first question about alternative patterns for Bulgarian. If the other alternative were in high demand, I would expect the question to have popped up earlier. My guess is not that the current set is superior in any way, but simply that users don't care enough to explicitly switch to another set.)

But the problem will remain elsewhere even if you address it inside TeX. The patterns may be used to hyphenate websites, to hyphenate documents in (Open/Libre/Whatever)Office etc. I'm pretty sure that you cannot convince all the web browser developers to support multiple sets of hyphenation patterns per language (and then all the website content contributors to specify which set of patterns should be used when hyphenating Bulgarian?) unless there's in fact some fundamental difference in the grammar (rather than just different quality of the patterns). From that perspective it would make more sense to agree on a single good-quality set of patterns.

For example, there are three sets of hyphenation patterns for German: one set for traditional Swiss German, one for traditional German and one for modern German. If someone wants to explicitly follow the rules from more than 20 years ago (for example to reproduce an old book), they explicitly switch. But that "language variant" also has an officially registered tag in the standard, and I'm still pretty sure that no browser supports that (I would be glad to be proven wrong though :).

I see three fundamentally different approaches:
- patterns end up in hyph-utf8
- patterns end up in some new repository "hyph-utf8-alternatives"
- you or someone else creates a new package with alternative patterns, similar to what Germans are shipping(*)

I have some "problems" justifying going for the first option without a damn good justification, as that only introduces additional mess and handling of special cases.

We could theoretically do the second. I would need slightly less justification for that, but still at least a somewhat good reason to do it. And then we would also need sufficient support from babel & polyglossia, else this hardly makes any sense anyway.

Doing the third is always an option that any user is free to take, and we can help if needed.

Again ... any chance that the linguists would come together and provide the definitive answer about getting a single set of high quality patterns?

Mojca

(*) Germans actually have 5 sets of patterns right now, plus three additional ones loaded by an additional package. So 8 sets in total. Two sets correspond to "traditional" and "modern" German; they are super old and are only ever used in TeX, pdfTeX and other 8-bit-only engines/formats. LuaTeX, XeTeX, pTeX would take the patterns from the new effort (http://projekte.dante.de/Trennmuster) which provides three sets of patterns: "traditional", "traditional Swiss" and "modern" German. But the Germans are then afraid to break backward compatibility of older documents, which is why we have not got rid of the old patterns (yet). And because some users want to use the new patterns with pdfTeX, all these three sets are duplicated again in an external package (https://www.ctan.org/pkg/dehyph-exptl). I find this "mess" somewhat hard to justify and would much prefer to stick to just the three sets of patterns from the Trennmuster project (three pattern sets per language should be complex enough :).
Then we have some further mess with some other languages from the Balkans where people cannot even decide which language they speak and thus which hyphenation patterns to use :) :) :) I would be really grateful not to introduce additional mess with other languages.
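
P.S. To make the "run both pattern sets through a long list of words" suggestion a bit more concrete, here is a rough sketch of how such a comparison might look. It is only an illustration, not a tool we ship: it applies Liang-style patterns (the format used by TeX) to each word and lists the words on which two sets disagree. All function names are hypothetical, the two pattern sets are made-up toy examples (not real Bulgarian patterns), and the minimum fragment lengths of 2 and 2 are an assumption that would have to be adjusted per language.

```python
import re


def parse_pattern(pat):
    """Split a TeX-style pattern such as 'n1a' into ('na', [0, 1, 0])."""
    letters = re.sub(r"\d", "", pat)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            weights[pos] = int(ch)  # weight for the gap before letter `pos`
        else:
            pos += 1
    return letters, weights


def load_patterns(raw_patterns):
    """Build a lookup table from pattern letters to inter-letter weights."""
    return dict(parse_pattern(p) for p in raw_patterns)


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return the set of break positions; i means a break before word[i]."""
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = patterns.get(padded[start:end])
            if weights:
                for k, w in enumerate(weights):
                    scores[start + k] = max(scores[start + k], w)
    # An odd final score allows a break; honour the minimum fragment lengths.
    return {
        i
        for i in range(left_min, len(word) - right_min + 1)
        if scores[i + 1] % 2 == 1
    }


def show(word, breaks):
    """Render a word with '-' at the given break positions."""
    return "".join(("-" if i in breaks else "") + ch for i, ch in enumerate(word))


def compare(words, set_a, set_b):
    """List (word, result with set A, result with set B) where the sets differ."""
    diffs = []
    for w in words:
        a, b = hyphenate(w, set_a), hyphenate(w, set_b)
        if a != b:
            diffs.append((w, show(w, a), show(w, b)))
    return diffs


if __name__ == "__main__":
    # Toy pattern sets, invented purely for this illustration.
    toy_a = load_patterns(["1ba", "1na"])
    toy_b = load_patterns(["1ba", "n1a"])
    for word, a, b in compare(["banana", "ba"], toy_a, toy_b):
        print(word, a, b)
```

With these toy sets the only disagreement is "banana", which comes out as ba-na-na with the first set and ban-ana with the second. The list of disagreements is then what a linguist would have to look at; the script itself cannot say which of the two results is correct.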
