from:"Hèctor Alòs i Font"

[Apertium-stuff] apertium-srd-ita 1.2.0 ready to be released

2023-10-07 Thread Hèctor Alòs i Font

Greetings!

apertium-srd-ita, apertium-srd and apertium-ita are ready for a new
release. The major innovation is the new translator from Sardinian to
Italian (previously only the opposite direction was available). I have not
been able to quantify its quality, but it seems to me to be fully
functional, starting from the standard Sardinian (Limba Sarda Comuna,
2006). The translator is also quite able to translate from the previous
standard (Limba Sarda Unificada, 2001), but only partially from the
different regional spellings.

In the Italian-to-Sardinian direction, the translator has been greatly
improved: several thousand words have been added to the dictionaries, and
many new disambiguation, lexical selection and transfer rules have
been added. Gianfranco Fronteddu has helped a lot with the translation of
words and the resolution of dozens of doubts. The new version also owes
much to the hundreds of notes by Diegu Corràine on errors or requested
improvements in the translation of the previous version of the
Italian-Sardinian translator. Many thanks also to Daniel Swanson and Tino
Didriksen for their technical support.

Finally, from an architectural point of view, these are fairly
conventional Apertium translators, with a three-tier architecture,
lttoolbox and CG.

Best regards,
Hèctor
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff

Re: [Apertium-stuff] Changes to apertium-preprocess-transfer

2023-06-28 Thread Hèctor Alòs i Font

The improvement is huge. Compilation is now a lot faster! Many
thanks, Daniel!!
As for the warnings, in principle they are OK. The problem is that the
reported line numbers are not right, just as they weren't previously. If I
"touch *t1x" and recompile apertium-oci-fra, I get (after one or two seconds!):

$ touch *t1x
$ make
apertium-validate-transfer oci-fra.t1x
apertium-preprocess-transfer oci-fra.t1x oci-fra.t1x.bin
Warning at line 12933, column 15: Rule 97 has the same pattern as rule 4.
Skipping.
Warning at line 15492, column 9: Rule 140 has the same pattern as rule 139.
Skipping.
apertium-validate-transfer oci@gascon-fra.t1x
apertium-preprocess-transfer oci@gascon-fra.t1x o...@gascon-fra.t1x.bin
Warning at line 12983, column 32: Rule 98 has the same pattern as rule 4.
Skipping.
Warning at line 15542, column 20: Rule 141 has the same pattern as rule
140. Skipping.
apertium-validate-transfer fra-oci.t1x
apertium-preprocess-transfer fra-oci.t1x fra-oci.t1x.bin
apertium-validate-transfer fra-oci@gascon.t1x
apertium-preprocess-transfer fra-oci@gascon.t1x fra-...@gascon.t1x.bin
Warning at line 12983, column 32: Rule 98 has the same pattern as rule 4.
Skipping.
Warning at line 15542, column 20: Rule 141 has the same pattern as rule
140. Skipping.
apertium-validate-transfer fra-oci.t1x
apertium-preprocess-transfer fra-oci.t1x fra-oci.t1x.bin
apertium-validate-transfer fra-oci@gascon.t1x
apertium-preprocess-transfer fra-oci@gascon.t1x fra-...@gascon.t1x.bin

So, for instance, there isn't any rule in oci-fra.t1x which begins at line
12933 or 15492. In addition, preprocessing often removes comments from rule
headers, so that even if you have the right line, it is not easy to find
the rule in the source code. I should file these as issues, but I have
always been too lazy.
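
One way to work around the unreliable positions is to list where each rule
actually starts in the source file. What follows is only a minimal Python
sketch, not part of Apertium: list_rules is a hypothetical helper, it assumes
that rules are <rule> elements with an optional "comment" attribute, and it
numbers them from 1 in file order, which may or may not match the numbering
used in the warnings.

```python
import xml.parsers.expat


def list_rules(t1x_data):
    """Return (index, line, comment) for every <rule> element, in file order.

    The index starts at 1; whether the warnings count rules the same way
    is an assumption, not something the thread confirms.
    """
    rules = []
    parser = xml.parsers.expat.ParserCreate()

    def start_element(name, attrs):
        if name == "rule":
            # CurrentLineNumber is the line where the current element starts.
            rules.append(
                (len(rules) + 1, parser.CurrentLineNumber, attrs.get("comment", ""))
            )

    parser.StartElementHandler = start_element
    parser.Parse(t1x_data, True)
    return rules


if __name__ == "__main__":
    import sys

    # Usage: python list_rules.py oci-fra.t1x
    with open(sys.argv[1], "rb") as handle:
        for index, line, comment in list_rules(handle.read()):
            print(f"rule {index}: line {line}: {comment}")
```

With such a listing, a warning that mentions "rule 97" can be matched against
the 97th entry (or its neighbour, if the numbering turns out to start at 0)
instead of relying on the reported line.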

Hèctor


Message from Daniel Swanson on Mon, 26 June 2023 at 23:01:

> Greetings Apertiumers!
>
> I recently identified a way that apertium-preprocess-transfer was
> being rather inefficient and today I fixed it, so tomorrow you all
> should be able to update to apertium 3.9.4 and see some improved
> compile times for any pairs not using apertium-recursive, with
> speedups between 10x and 7000x faster on the files I tested.
>
> I'm writing this email to let you know that in the process
> apertium-preprocess-transfer lost the ability to report partial
> overlaps like the following:
>
> Warning at line 6867, column 4: Paths to rule 27 blocked by rule 24.
>
> And I just wanted to let you all know, in case someone was depending
> on those. To compensate, I added a check to apertium-lint which can
> report roughly the same information:
>
> Warning (overlapping-paths) on line 6852: The sequence [preadv
> vblex.pp n.*] matches both this rule and the rule on line 6628.
>
> Daniel, who is trying to get better at not doing things that
> potentially break people's workflows without telling them

Re: [Apertium-stuff] Ready to release: spa-arg 0.6.0 and arg-cat 0.3.0

2023-05-05 Thread Hèctor Alòs i Font

Thank you, Juan Pablo. Very interesting.
Hèctor

Message from Juan Pablo on Fri, 5 May 2023 at 13:24:

> Many thanks, Hèctor!
>
> Not so many changes in lemmas, although with some visual impact, as some
> changes affect morphological suffixes (on the other hand, most of the "new"
> spellings were already present in other spelling proposals, and were
> analyzed by the previous version of Apertium, so it has been more a matter
> of selecting the forms to generate).
>
> The main changes affect etymological "qu/qü", which is now "cu" when the
> "u" is pronounced (quán -> cuán, freqüent -> frecuent), and some "common
> spellings" that were valid across different dialects: plurals in -z
> (mocetz is now mocez, or mocets in benasqués); verbal endings (fetz is
> now fez, or fets in benasqués); and ell(s), bell(s), aquell(s), which are
> now él(s), bel(s), aquel(s), except in areas with palatalization. Other
> changes slightly affect the accentuation of vowels, and allow "doublets" for
> some spellings (but in those cases, Apertium continues to generate the same
> forms as before, as they are also considered the preferred ones in the new
> norm).
>
> Best,
>
> Juan Pablo
>
>
> On 05/05/2023 at 11:25, Hèctor Alòs i Font wrote:
>
> Congratulations and thanks, Juan Pablo. Out of curiosity, have there been
> many changes between the previous version and the new one?
> Best,
> Hèctor
>
> Message from Juan Pablo on Fri, 5 May 2023 at 2:24:
>
>> Dear all,
>>
>> The pairs Spanish-Aragonese and Aragonese-Catalan are ready to release
>> (can anyone tag them?)
>>
>> apertium-spa-arg 0.6.0 (commit 61048e9) depends on apertium-spa (commit
>> d2455cf, needs new tag)  and apertium-arg 0.2.0 (commit 0b9f06e).
>>
>> apertium-arg-cat 0.3.0 (commit 5255af5) depends on apertium-arg 0.2.0
>> (commit 0b9f06e) and apertium-cat (commit 201dcec, needs new tag).
>>
>> Although they include some new entries and paradigms (especially in the
>> monolingual apertium-arg), the main reason for the release is that both
>> pairs have been adapted to generate Aragonese according to the new
>> official spelling system approved by the Academia Aragonesa de la Lengua
>> (while still analyzing text with the previous spelling system).
>>
>> Best,
>>
>> Juan Pablo
>>

Re: [Apertium-stuff] Ready to release: spa-arg 0.6.0 and arg-cat 0.3.0

2023-05-05 Thread Hèctor Alòs i Font

Congratulations and thanks, Juan Pablo. Out of curiosity, have there been
many changes between the previous version and the new one?
Best,
Hèctor

Message from Juan Pablo on Fri, 5 May 2023 at 2:24:

> Dear all,
>
> The pairs Spanish-Aragonese and Aragonese-Catalan are ready to release
> (can anyone tag them?)
>
> apertium-spa-arg 0.6.0 (commit 61048e9) depends on apertium-spa (commit
> d2455cf, needs new tag)  and apertium-arg 0.2.0 (commit 0b9f06e).
>
> apertium-arg-cat 0.3.0 (commit 5255af5) depends on apertium-arg 0.2.0
> (commit 0b9f06e) and apertium-cat (commit 201dcec, needs new tag).
>
> Although they include some new entries and paradigms (especially in the
> monolingual apertium-arg), the main reason for the release is that both
> pairs have been adapted to generate Aragonese according to the new
> official spelling system approved by the Academia Aragonesa de la Lengua
> (while still analyzing text with the previous spelling system).
>
> Best,
>
> Juan Pablo
>

Re: [Apertium-stuff] gsoc proposal

2023-03-26 Thread Hèctor Alòs i Font

Thank you for your proposal. You seem to have rushed to submit it even
though there are still several days to go. The result is that there are
many formatting errors that make it difficult to read. It doesn't make a
good impression for someone who is supposed to be meticulous in creating
code so as not to leave errors. Among other things "Coding Challenge"
appears as a part of "Why Google should sponsor it?" and contains things
that are not part of the challenge; the section "What is your plan for
completing the project on time?" contains almost nothing because the
putative subsection "UI improvements" is at the same level; there is a
"week 1" (beware of the repeated typo), but no other weeks at the same
level, etc.
Please note that I am only talking about purely formal issues. I don't know
if there will be other problems as well. I am afraid that it might be easy
to find them in view of the haste in writing the proposal. So, take the
time to review the code (in this case, the proposal) before submitting it
again.
Best wishes,
Hèctor

Message from حازم شهاوى on Mon, 27 March 2023 at 1:33:

>
> I created a good proposal that has many UI and backend improvements; I
> hope it appeals to you!
> https://wiki.apertium.org/wiki/Apertium:HazemShehawy

Re: [Apertium-stuff] Apply for GSOC [English-Bodo language pair]

2023-03-24 Thread Hèctor Alòs i Font

Hi, Maharaj,

Bodo seems to me to be an excellent language for Apertium. Unfortunately,
we have no morphological parser for it, not even an embryonic one. That is
why developing one seems to me much more realistic than planning an
automatic translator project between Bodo and another language (and even
more so if that language belongs to another language family).

It would therefore be a matter of doing this type of project:
https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code#Develop_a_morphological_analyser


I've been looking a bit at the morphology of the language, and lexd [1]
seems the best choice for developing a parser (e.g. because of the
possessive prefixes).

It would now be a matter of doing the coding challenge for the project.
There are about ten days left before the deadline. That is not a lot of
time.

I would suggest joining IRC to get faster help, if needed:
https://wiki.apertium.org/wiki/IRC

Best regards,
Hèctor

[1] https://github.com/apertium/lexd

Message from Maharaj Brahma via Apertium-stuff <
apertium-stuff@lists.sourceforge.net> on Fri, 24 March 2023 at 15:54:

> Dear Apertium folks,
>
> I'm writing to express my interest in participating in the GSoC with
> Apertium. I'm a first-year Ph.D. student in the NLP domain. I'm excited to
> work with the Apertium community on developing translation technology for
> low-resource languages, particularly Bodo. Bodo is a low-resource language
> primarily spoken in the Northeastern region of India. As an NLP researcher
> and native speaker of Bodo, I'm committed to building technology to
> preserve and promote indigenous languages like Bodo.
>
> I'm interested in adding a new language pair, English-Bodo, to the Apertium
> platform through this GSoC program or otherwise. I believe this can
> potentially impact translation technology for low-resource languages,
> for the following reasons:
> (i) There is no existing publicly available translation technology for
> English-Bodo.
> (ii) Bodo is a low-resource language (potentially ample contribution space
> remains).
> (iii) It belongs to the Sino-Tibetan language family, unlike other Indian
> languages like Hindi and Assamese.
>
> I would like to know if this is a potential project. If so, I would like
> to interact with potential mentors.
>
> Any comments on the English-Bodo language pair are most welcome.
>
> With Regards,
> Maharaj Brahma
> Research Scholar
> Dept. of Computer Science & Engineering
> CS23RESCH01004
>

Re: [Apertium-stuff] Applying for GSOC 2023 projects

2023-02-25 Thread Hèctor Alòs i Font

Hi Lahari,

Have you taken a look at this page:
https://wiki.apertium.org/wiki/Starting_a_new_language_with_HFST ?

Telugu already has a minimal skeleton in Apertium:
https://github.com/apertium/apertium-tel

We would have to do something similar to what I proposed for Hindi, with
the difference that with Telugu we start practically from scratch. Another
difference is that Telugu uses twol + lexc/lexd. In short: you have to
describe the morphology of the language (nouns, adjectives, verbs, adverbs,
etc.) + introduce words by associating them with the paradigms defined in
the morphology. The aim is to achieve a high coverage.

Typically, this is done by starting with the closed morphological
categories and then moving on to the open ones. The inclusion of words
depends on the free resources available. In the worst case, it is done by
hand in decreasing order of word frequency.
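
To make the "decreasing order of word frequency" idea concrete, here is a
minimal Python sketch. It is only an illustration under assumptions: the
function name frequency_list and the "known" set (words already in the
dictionary) are hypothetical, and tokenisation is a plain whitespace split
with ASCII punctuation stripped, which is crude but avoids splitting Telugu
words at vowel signs.

```python
from collections import Counter
import string


def frequency_list(lines, known=frozenset()):
    """Count corpus tokens not in `known`, most frequent first."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Strip ASCII punctuation at the edges; lower() is a no-op
            # for scripts without case, such as Telugu.
            word = token.strip(string.punctuation).lower()
            if word and word not in known:
                counts[word] += 1
    return counts.most_common()


if __name__ == "__main__":
    import sys

    # Usage: python frequency_list.py corpus.txt
    with open(sys.argv[1], encoding="utf-8") as corpus:
        for word, count in frequency_list(corpus)[:50]:
            print(count, word)
```

The top of such a list is where manually added dictionary entries are likely
to improve coverage the most.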

At the moment there is quite a lot of work: installing Apertium (with twol
+ lexc/lexd) and basically learning how it works. The "coding challenge" (
https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Morphological_analyser
) is essential as it shows that the candidate has understood the basics. In
case you have problems (which is absolutely normal), don't hesitate to ask
for help in the IRC channel. We are here to help.

Hèctor

Message from Lahari Sreeja Tallapaka on Sat, 25 Feb 2023 at 16:59:

> Sir, please explain how I should proceed with the Telugu Morphological
> Analyser. I've read the documentation available on Apertium Wiki Pages and
> looked at the apertium-tel GitHub page, which has very little information.
> There is information on how to make a new language pair; I'm unsure if that
> is what I should be doing as only one language is involved here.
>
> Regards,
> Lahari Sreeja
>
> On Thu, Feb 23, 2023 at 10:12 PM Hèctor Alòs i Font 
> wrote:
>
>> I agree with Daniel. Moreover, experience shows that it is also
>> unrealistic to do such a complex project as a translator for distant
>> languages in a GSoC. I would look at the Telugu morphological analyser
>> currently available in Apertium and concentrate on it. 90% of coverage could
>> be a minimal goal. The more the better. This will require a good command of
>> Telugu grammar. Lexc/lexd + twol should not be a problem for someone with a
>> computer science background.
>> Hèctor
>>
>> Message from Daniel Swanson on Thu, 23 Feb 2023 at 16:37:
>>
>>> Hi Lahari,
>>>
>>> For translation pairs, Hindi-English has been tried several times
>>> without success. I would suggest considering Hindi-Telugu.
>>>
>>> For other project ideas or places to get started, you can check the
>>> wiki page for each idea and do the coding challenge. If an idea is
>>> missing a coding challenge or you want to discuss the details of it,
>>> you'll get the quickest responses by talking to us on IRC:
>>> https://wiki.apertium.org/wiki/IRC
>>>
>>> Daniel
>>>
>>> On Thu, Feb 23, 2023 at 2:24 AM Lahari Sreeja Tallapaka
>>>  wrote:
>>> >
>>> > Greetings to the community,
>>> > I am Lahari Sreeja from the Indian Institute of Technology(IIT),
>>> Bhilai. I have taken Machine Learning, Natural Language processing, and
>>> Information retrieval courses and have experience in frontend web
>>> development. I know Telugu, Hindi, and English languages. And I'm interested
>>> in adding English-Hindi/English-Telugu language pairs and there are a lot
>>> of projects that are interesting and are in my domain. It would be helpful
>>> to get guidance on where to start and some issues I can work on.
>>> > Cheers!

Re: [Apertium-stuff] Fwd: Re : GSOC 2023

2023-02-25 Thread Hèctor Alòs i Font

Hi Khushi,

As for Hindi, you should first test the coverage.

According to this page (last edited in 2019), the dictionary had some
37,000 words (which is quite good, in principle) but coverage was only
some 83.1%: https://wiki.apertium.org/wiki/Languages.
So, you should see what the current state of the package is.

You should install Apertium and the Hindi package. A corpus is needed: we
usually take Wikipedia and randomly select several million sentences from it.
With this, you can calculate the naive coverage, and see if the dictionary
has grown significantly since 2019.
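
As a rough illustration of the naive coverage calculation, here is a minimal
Python sketch. It assumes the corpus has already been run through the
morphological analyser and that the output is in the Apertium stream format,
where each token looks like ^surface/analysis$ and unknown tokens are marked
with an asterisk (^surface/*surface$). The function name naive_coverage is
hypothetical, and escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
# (escaped ^, / and $ inside tokens are not handled in this sketch).
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the fraction of tokens that received at least one analysis."""
    total = 0
    unknown = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if analyses.startswith("*"):
            unknown += 1
    return (total - unknown) / total if total else 0.0


if __name__ == "__main__":
    import sys

    # Usage: <analyser output> | python naive_coverage.py
    print(f"{naive_coverage(sys.stdin.read()):.1%}")
```

Comparing the figure obtained this way with the 83.1% reported in 2019 gives
a first idea of whether the package has moved since then.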

Once you have this, you can analyse where the problem comes from: is this
low coverage basically due to missing words, or to morphological forms that
are not recognised although the words do exist in the dictionary? With 37,000
words and 83% coverage, the latter seems likely (regardless of the fact
that it is always good to have more words in dictionaries). It is a
question of understanding what is missing: nominal morphological forms,
verbal ones?

It is also interesting to see if there are free sources from which the
dictionary could be expanded automatically or semi-automatically.

On this basis one can see if there is work for a project. Most probably
there is for a small or, at most, a medium-sized one.

Hèctor


Message from Khushi - <12khushi...@gmail.com> on Sat, 25 Feb 2023
at 10:02:

> Thanks a lot for your feedback!
> It would be great if you could tell me how I should get started with this
> and what milestones I should aim to achieve in order to improve it.
>
> Regards,
> Khushi Harsure
>
> On Fri, 24 Feb 2023 at 23:46, Hèctor Alòs i Font 
> wrote:
>
>> Hi Khushi,
>>
>> First: Hindi-Marathi is already available on Google. I think you should
>> reason out the usefulness of developing it in Apertium. A priori, it does
>> not seem like a project that is going to be especially promising.
>>
>> As for your current question, why should the pair be created again from
>> scratch? Have you seen something wrong with it? In principle, I don't see
>> at all why the work that has been done before should be wasted. I would do
>> it only if, after analysing it, it turns out that it is appalling (which
>> would be weird).
>>
>> I don't know Hindi, but from what I saw two years ago, the morphological
>> analyser seems to have a lot of room for improvement. It might make sense
>> to concentrate on it and its morphological disambiguator. This would help
>> to subsequently develop translators between low-resource Indo-Aryan
>> languages and Hindi.
>>
>> Hèctor
>>
>> Message from Khushi - <12khushi...@gmail.com> on Fri, 24 Feb 2023
>> at 19:58:
>>
>>> Respected sir,
>>> Thanks a lot for your response. I am glad that you appreciate it. I
>>> wanted to clear up some doubts before I start working on it.
>>> I would like to know whether you want me to work on the existing
>>> Marathi-Hindi translator or create a new one from scratch. In the former
>>> case, what kind of improvements or contributions would be expected?
>>> Looking forward to hearing from you soon!
>>>
>>> Regards,
>>> Khushi Harsure
>>>
>>> On Fri, 24 Feb 2023 at 20:05, Daniel Swanson 
>>> wrote:
>>>
>>>> Hi Khushi,
>>>>
>>>> Yeah, that sounds like a good project to me.
>>>>
>>>> Next steps would be opening a pull request on
>>>> https://github.com/apertium/apertium-mar-hin and requesting a wiki
>>>> account to write your workplan.
>>>>
>>>> Daniel
>>>>
>>>> On Fri, Feb 24, 2023 at 4:09 AM Khushi - <12khushi...@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> -- Forwarded message -
>>>>> From: Khushi - <12khushi...@gmail.com>
>>>>> Date: Fri, 24 Feb 2023 at 14:24
>>>>> Subject: Re : [Apertium-stuff] GSOC 2023
>>>

Re: [Apertium-stuff] Fwd: Re : GSOC 2023

2023-02-24 Thread Hèctor Alòs i Font

Hi Khushi,

First: Hindi-Marathi is already available on Google. I think you should
reason out the usefulness of developing it in Apertium. A priori, it does
not seem like a project that is going to be especially promising.

As for your current question, why should the pair be created again from
scratch? Have you seen something wrong with it? In principle, I don't see
at all why the work that has been done before should be wasted. I would do
it only if, after analysing it, it turns out that it is appalling (which
would be weird).

I don't know Hindi, but from what I saw two years ago, the morphological
analyser seems to have a lot of room for improvement. It might make sense
to concentrate on it and its morphological disambiguator. This would help
to subsequently develop translators between low-resource Indo-Aryan
languages and Hindi.

Hèctor

Message from Khushi - <12khushi...@gmail.com> on Fri, 24 Feb 2023
at 19:58:

> Respected sir,
> Thanks a lot for your response. I am glad that you appreciate it. I wanted
> to clear up some doubts before I start working on it.
> I would like to know whether you want me to work on the existing
> Marathi-Hindi translator or create a new one from scratch. In the former
> case, what kind of improvements or contributions would be expected?
> Looking forward to hearing from you soon!
>
> Regards,
> Khushi Harsure
>
> On Fri, 24 Feb 2023 at 20:05, Daniel Swanson 
> wrote:
>
>> Hi Khushi,
>>
>> Yeah, that sounds like a good project to me.
>>
>> Next steps would be opening a pull request on
>> https://github.com/apertium/apertium-mar-hin and requesting a wiki
>> account to write your workplan.
>>
>> Daniel
>>
>> On Fri, Feb 24, 2023 at 4:09 AM Khushi - <12khushi...@gmail.com> wrote:
>>
>>>
>>>
>>> -- Forwarded message -
>>> From: Khushi - <12khushi...@gmail.com>
>>> Date: Fri, 24 Feb 2023 at 14:24
>>> Subject: Re : [Apertium-stuff] GSOC 2023
>>> To: 
>>>
>>>
>>> Hello !
>>>
>>> This is Khushi Harsure, an undergraduate student from India pursuing
>>> Computer Science. I'd like to participate in Google Summer of Code 2023 at
>>> Apertium. The project involving the addition of a new language pair has
>>> caught my interest and, being a native speaker, I was planning to work on
>>> adding a Hindi-Marathi pair. Hindi-English and English-Marathi pairs
>>> have been added by past GSoC students; however, the Hindi-Marathi pair
>>> remains unworked upon. Before starting, I wanted to get confirmation of
>>> whether this would be a potential GSoC project.
>>> I would also like to know the steps that should be followed after
>>> installing Apertium, other than doing the coding challenge. Looking
>>> forward to hearing from you.
>>>
>>> Regards,
>>> Khushi Harsure
>>>

Re: [Apertium-stuff] Error in translation English-Italian

2023-02-24 Thread Hèctor Alòs i Font

Thank you, Silvia, for reporting the problem, and forgive my outburst.
At Apertium we basically work on developing machine translators for
low-resource languages (typically minority/minoritised languages). In the
case of Italian, we have released Sardinian-Italian, and I know that work is
also under way on Italian-Sicilian. For translation between Italian,
English, French, Russian, etc., there are other systems that do it much
better. If there is any development between English and Italian in
Apertium, it is most probably only because someone was learning how the
translation engine works.
Kind regards,
Hèctor

Message from s_lombard...@yahoo.it on Fri, 24 Feb 2023 at 14:42:

> Thank you for answering and explaining some problems. I am a simple user,
> who tries to help!
> Silvia Lombardini
>
> On Fri, 24 Feb 2023 at 9:45, Hèctor Alòs i Font wrote:
> I'm a bit surprised that Apertium's eng-ita is used anywhere since it is
> not released. But I am even more surprised that Wikimedia uses a
> non-released pair while it is not using even some of the released ones. In
> fact, I have been asking Wikimedia without any success for several years to
> incorporate the new language pairs and the new versions we have released.
> It is very disappointing, for example, for the Arpitans, who cannot
> generate articles on their Wikipedia much more easily because Wikimedia has
> not made available the translator that we have had available for three
> years. The same goes for Sardinian and, I think, for Silesian. If someone
> could fix this once and for all, several linguistic minorities would be very
> grateful.
>
> As for eng-ita, I'm afraid that correcting the bug in Apertium might not
> solve the problem, as Wikimedia does not seem to be updating its Apertium
> translators. In fact, it is even possible that the error was corrected in
> Apertium some time ago and the fix has simply not reached Wikimedia.
>
> Hèctor
>
> Message from s_lombardini--- via Apertium-stuff <
> apertium-stuff@lists.sourceforge.net> on Fri, 24 Feb 2023 at 10:23:
>
> I use Metawiki as a translator. Often I find that the English word 'for'
> is translated by Apertium with 'partorisca'. This Italian word is a verb
> meaning 'to give birth'. The correct translation is 'per'. Is it possible
> to fix it?
>
> Silvia Lombardini

Re: [Apertium-stuff] Error in translation English-Italian

2023-02-24 Thread Hèctor Alòs i Font

I'm a bit surprised that Apertium's eng-ita is used anywhere since it is
not released. But I am even more surprised that Wikimedia uses a
non-released pair while it is not using even some of the released ones. In
fact, I have been asking Wikimedia without any success for several years to
incorporate the new language pairs and the new versions we have released.
It is very disappointing, for example, for the Arpitans, who cannot
generate articles on their Wikipedia much more easily because Wikimedia has
not made available the translator that we have had available for three
years. The same goes for Sardinian and, I think, for Silesian. If someone
could fix this once and for all, several linguistic minorities would be very
grateful.

As for eng-ita, I'm afraid that correcting the bug in Apertium might not
solve the problem, as Wikimedia does not seem to be updating its Apertium
translators. In fact, it is even possible that the error was corrected in
Apertium some time ago and the fix has simply not reached Wikimedia.

Hèctor

Message from s_lombardini--- via Apertium-stuff <
apertium-stuff@lists.sourceforge.net> on Fri, 24 Feb 2023 at 10:23:

> I use Metawiki as a translator. Often I find that the English word 'for'
> is translated by Apertium with 'partorisca'. This Italian word is a verb
> meaning 'to give birth'. The correct translation is 'per'. Is it possible
> to fix it?
>
> Silvia Lombardini

Re: [Apertium-stuff] Applying for GSOC 2023 projects

2023-02-23 Thread Hèctor Alòs i Font

I agree with Daniel. Moreover, experience shows that it is also unrealistic
to do such a complex project as a translator for distant languages in a
GSoC. I would look at the Telugu morphological analyser currently available
in Apertium and concentrate on it. 90% of coverage could be a minimal goal.
The more the better. This will require a good command of Telugu grammar.
Lexc/lexd + twol should not be a problem for someone with a computer
science background.
Hèctor

On Thu, 23 Feb 2023 at 16:37, Daniel Swanson wrote:

> Hi Lahari,
>
> For translation pairs, Hindi-English has been tried several times
> without success. I would suggest considering Hindi-Telugu.
>
> For other project ideas or places to get started, you can check the
> wiki page for each idea and do the coding challenge. If an idea is
> missing a coding challenge or you want to discuss the details of it,
> you'll get the quickest responses by talking to us on IRC:
> https://wiki.apertium.org/wiki/IRC
>
> Daniel
>
> On Thu, Feb 23, 2023 at 2:24 AM Lahari Sreeja Tallapaka
>  wrote:
> >
> > Greetings to the community,
> > I am Lahari Sreeja from the Indian Institute of Technology (IIT), Bhilai.
> > I have taken Machine Learning, Natural Language Processing, and Information
> > Retrieval courses and have experience in frontend web development. I know
> > Telugu, Hindi, and English. I'm interested in adding
> > English-Hindi/English-Telugu language pairs, and there are a lot of projects
> > that are interesting and are in my domain. It would be helpful to get
> > guidance on where to start and some issues I can work on.
> > Cheers!

Re: [Apertium-stuff] GSoC 2023 Mentors & Ideas?

2023-01-27 Thread Hèctor Alòs i Font

On Fri, 27 Jan 2023 at 23:41, Kevin Brubeck Unhammer wrote:

> > As far as rewriting the
> > transfer rules using apertium-recursive is concerned, a co-mentor with
> > experience in the module would be highly desirable.
>
> I can try to assist :)


Great, thanks, Kevin! It now remains to be seen whether all the conditions
are in place for there to be a solid proposal in this sense (starting with
Apertium being chosen by Google this year).

Hèctor

Re: [Apertium-stuff] GSoC 2023 Mentors & Ideas?

2023-01-27 Thread Hèctor Alòs i Font

On Fri, 27 Jan 2023 at 16:41, Aure Séguier wrote:

> Hi
>
> I'm desperately trying to join Apertium IRC to talk about this but I
> can't, so I will use this mailing list instead.
>
> We would like to propose subjects for GSoC related to the Occitan language. We
> have two ideas:
> - Convert the Occitan-French pair to recursive transfer. I think Hèctor
> Alòs would agree to be the mentor on this one.
> - Add two Occitan varieties (Provençal and Limousin) to the Occitan-French
> pair. I could mentor this one, and Lo Congrès would provide lexicons (for
> conjugations and translations) to pour into the Apertium monodix and bidix.
>
I will be happy to co-mentor either or even both (if Apertium receives
grants this year and suitable students appear). As far as rewriting the
transfer rules using apertium-recursive is concerned, a co-mentor with
experience in the module would be highly desirable. I have written most of
the rules using the traditional system of transfer rules, but I have no
experience with apertium-recursive.

 Hèctor

Re: [Apertium-stuff] Capitalization Handling

2022-12-24 Thread Hèctor Alòs i Font

Looks very good, Daniel. Thanks in advance. I'll try to test it over the
next few days on the pairs I maintain.
Merry Christmas/Hanukkah/New Year/*.
Hèctor

On Fri, 23 Dec 2022 at 0:41, Daniel Swanson wrote:

> Greetings Apertiumers!
>
> I have two updates to report:
>
> First, I have rewritten the postgenerator (again), this time as part
> of apertium-separable (and so not breaking the old one, unlike last
> time), and in such a way that postgenerator rules can both match on
> lemma and tags in addition to surface forms and iteratively apply to
> their own output.
>
> This is available as part of apertium-separable 0.7.0 and is
> documented at https://wiki.apertium.org/wiki/Postgenerator
>
> Second, I just added a pair of modules which move capitalization
> information into word-bound blanks at the beginning of the pipeline
> and then reapply them according to LRX-like rules at the end of the
> pipeline, allowing all intermediate modules to operate solely on
> dictionary case.
>
> This should be available after the next nightly build (i.e. tomorrow)
> in apertium 3.9.0, and is documented at
> https://wiki.apertium.org/wiki/Capitalization_restoration
>
> If anyone has questions or would like help trying this out for a
> language pair or if I missed something in the documentation, let me
> know.
>
> Thanks to Kevin Unhammer and Marc Riera for helping me figure out what
> the design of the capitalization module should be.
>
> Merry Christmas,
> Daniel
>
> P.S. To anyone not interested in either of these developments: your
> Christmas gift is that I accidentally made lexical selection quite a
> bit faster while I was working on these.

Re: [Apertium-stuff] Released: swe-nor 0.4.0, dan-nor 1.5.0

2022-12-24 Thread Hèctor Alòs i Font

The move from traditional to recursive transfer is inspiring. I'm too lazy
to redo tons of code, but I'm glad to confirm that it's more compact and
powerful. Congrats!
Hèctor

On Mon, 19 Dec 2022 at 15:22, Kevin Brubeck Unhammer wrote:

> Goddag,
>
> I've just tagged new releases of swe-nor and dan-nor.
>
> The work on swe-nor is partially funded by the Norwegian News Agency,
> and dan-nor by Store norske leksikon.
>
> For both pairs, all directions now use apertium-separable (lsx) and
> recursive transfer (rtx), with testing by apertium-regtest.
>
> Most of the work has been focused on the nob→{swe,dan} direction, but
> all directions have of course improved vocabulary and seem to have
> improved quality. The directions into Nynorsk are also usable with style
> preferences (though it hasn't been added to the UI yet in this release).
>
> Some stats:
>
> dan-nor:
> - Over 22.000 new non-name bidix entries
> - Over 300 new lexical selection rules
> - ~60 separable/mwe entries, including comma insertion rules for
>   generating Danish
>
> swe-nor:
> - Over 20.000 new non-name bidix entries
> - Over 300 new lexical selection rules manually added
> - Nearly 7000 new lexical selection rules based on corpus frequencies
> - ~30 separable/mwe entries
>
> and the newer monolingual dependencies mean much better bokmål
> disambiguation (and some improvements there for the other languages as
> well) as well as much better compound epenthetic choices and tweaks all
> round.
>
> Moving from chunking transfer to recursive for these pairs was a joy. I
> have spent very little time on the rules, but they already cover more
> than the old rules did, in much fewer lines of code (including comments
> and everything, dan-nor has ~1011 lines of rtx in one file per
> direction, and 8347 of t?x with three files per direction). Each
> direction has about 20 rtx rules (where a rule is NP→n|ncmp n|…), 50 if
> you count alternatives. There's a lot less redundancy than before, and
> the recursion means we can have e.g. compounds of arbitrary length.
>
> -Kevin

Re: [Apertium-stuff] New Occitan-French release

2022-11-05 Thread Hèctor Alòs i Font

On Fri, 4 Nov 2022 at 11:31, Kevin Brubeck Unhammer wrote:
>
> What if you do
>
> lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin | cg-proc oci.rlx.bin | 
> …
>
> The first CG step would output a stream variable, so that what the next
> step sees is
>
> []
> ^que/que/que$
> [more text here]
>
> If the next step is CG, it's just
>
>  REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;
>
> ie. remove enunciatives whenever the var is set.

I see. Yes, this is much easier than I thought. Thanks, Kevin (and Tino
for the second mail on the matter).

@Aure Séguier, I think this is the solution, and Tino has added
the syntax explanations. When you have time, you can write the rules
for enondetect.rlx, as you proposed. You'll do it better than me.
Adding this step in modes.xml for Gascon is trivial with Kevin's
system (for Languedocien it is not necessary). I don't think you need
many rules. It's more a matter of having a slightly wide window. Bearing in
mind that in Gascon texts where enunciatives are used, they must be
found in every sentence, I don't think that a very wide window is
necessary to find cases that allow us to conclude without any doubt
whether or not we are dealing with a text with enunciatives.

Hèctor



Re: [Apertium-stuff] New Occitan-French release

2022-11-04 Thread Hèctor Alòs i Font

On Fri, 4 Nov 2022 at 15:00, Aure Séguier wrote:

> Hi,
>
> I can help to write rules to detect whether or not a text contains enunciatives.
>
> About recognising which variety of Occitan we are translating, we are
> currently developing a tool that can differentiate every dialect of
> Occitan, but it isn't very effective. Between Gascon and other dialects,
> it's OK, because Gascon is so different (except for Aranese, which is a
> subdialect of Gascon). But between Languedocien and other varieties
> (Provençal, Limousin...) there is a lot of confusion.
>
> At first, I was thinking about adding an "all dialects" dialect for the
> oc->fr direction. It would be useful when people don't know the dialect of
> a text, or for texts with many dialects (e.g. a website with articles in
> many dialects, like newspapers). Has something like that already been done
> for another language? Could it be done easily?
>

This is the way it works for Catalan and Portuguese. They use the v tag
instead of the alt tag in the dictionaries. The people who initially
developed Occitan in Apertium preferred not to do so. Occitan is too
diverse. Each variety already has a lot of very frequent homographs because
the spelling rules have nothing to distinguish them (unlike French,
Spanish, Italian, Catalan...). But when several varieties are added, the
problem is much bigger. Think of the Provençal article. If we know that the
text is Provençal, disambiguation is much easier. Or if we know that it is
Gascon with enunciatives, we also know what we can find, etc. I myself
immediately switch to a "Gascon" mode when I read it because its syntax is
quite different from the rest (+ enclitics, + concordance of verb
tenses...). This information is essential for correct disambiguation.


> Thanks
> Aura Séguier, responsabla de projèctes e desvolopaira
> Lo Congrès permanent de la lenga occitana
> Ciutat - Creem !, 5-7 rue de la Fontaine, 64000 Pau
> T. +33 (0)5 32 00 00 64
> a.segu...@locongres.org
> www.locongres.org
> On 4 Nov 2022 at 09:30, Kevin Brubeck Unhammer wrote:
>
> What if you do
>
> lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin | cg-proc oci.rlx.bin | 
> …
>
> The first CG step would output a stream variable, so that what the next
> step sees is
>
> []
> ^que/que/que$
> [more text here]
>
> If the next step is CG, it's just
>
>  REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;
>
> ie. remove enunciatives whenever the var is set.
>
> One can also unset it in the middle of the stream (if doing corpus
> runs), so output of the enon-detector is
>
> []
> ^que/que/que$
> [more text here]
> []
> ^que/que/que$
> [more text here]
>
> and the REMOVE:var-is-set rule will remove enunciatives in the first
> part, not after seeing the REMVARIABLE.
>
>
> Then the problem of looking several windows ahead is restricted to that
> first enon-detector step.
>
>
> 
>
> Alternatively, if we assume all the input is of the same language, we
> just don't know what language it is ahead of time, then you could
> do several passes, where one is a detector pipeline like
>
> lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin
>
> that outputs the STREAMCMD and then Apy would grep for that, and insert
> the STREAMCMD at the start of the call to the regular pipeline
>
> lt-proc oci.automorf.bin | cg-proc oci.rlx.bin | …
>
> That won't automatically work in modes files, and won't work for corpus
> tests if the corpus has a mix, but OTOH you could use 'export
> AP_SETVAR=non-enon' to force the regular pipeline to insert the
> STREAMCMD at the start.
>
>
>
>

Re: [Apertium-stuff] New Occitan-French release

2022-11-04 Thread Hèctor Alòs i Font

On Thu, 3 Nov 2022 at 15:58, Tino Didriksen wrote:
>
> On Tue, 1 Nov 2022 at 11:45, Kevin Brubeck Unhammer  wrote:
>>
>> Hèctor Alòs i Font wrote:
>>
>> > As for your proposal, I do not yet have sufficient knowledge of CG to fully
>> > understand it. My idea would be to make a first pass through a whole text
>> > to understand if enunciatives are used in it (for example, recognising
>> > other, more infrequent, but more easily recognisable enunciatives). In the
>> > solution you propose, it seems that this knowledge is acquired
>> > progressively, as sentences are translated. I fear that "que" is so messy
>> > that at least the first sentences of a text would have the same problems as
>> > we have now when we translate a Gascon text without enunciatives.
>>
>> That should be possible too, though I'm not sure how feasible it is to
>> get CG to go that far into a text. By default, CG keeps a context of two
>> windows, but that's configurable. It should be possible (perhaps with
>> minor modifications to cg-proc) to read a bunch of sentences and use
>> Window Spanning tests https://visl.sdu.dk/cg3/single/#test-spanning
>>
>> Tino, have you tried looking ahead several paragraphs, are there any
>> downsides? This should be a fairly simple rule file.
>
>
> The max I've seen in production is 9 windows, but there is no hard limit. 
> Just have to be careful of spanning tests, as they are going to look ahead 
> for every active window. A multi-pass system will perform better, and for 
> this particular task I'd say multi-pass is the correct approach.
>

So I thought, but then:

1) We need a first CG process that finds out whether the text has
enunciatives. Presumably it should somehow return 0 or 1. How?
2) Depending on this, we will have two slightly different pipelines, but
how? Should the syntax of modes.xml be extended to include a kind
of "if-else"?

More generally, it would be desirable to have a first step that
recognises from which variety of Occitan we are translating.
Currently, we force the user to say whether he is translating from
Languedocien (called "Occitan" in Apertium and "Occitan Languedocien"
in the translator of the Congrès Permanent de la Lenga Occitana). A
user does not necessarily know it. When there are two possibilities,
there is not too much of a problem: try one and, if it doesn't work
too well, try the other. But when we have four or more variants, it
will be less obvious. But, for now, the question is to differentiate
between two Gascon "flavours".
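
To make the two questions above concrete, here is a minimal Python sketch of
the two-pass idea. It is not an existing Apertium facility, only an
illustration under stated assumptions: the marker words handed to the
detector, the 0.5 threshold and both mode names are hypothetical. The first
pass returns a boolean (the "0 or 1"), and the "if-else" lives in a small
wrapper rather than in modes.xml.

```python
import re

# Hypothetical mode names; neither is claimed to exist in apertium-oci-fra.
MODE_WITH_ENON = "oci_gascon-fra"
MODE_WITHOUT_ENON = "oci_gascon_noenon-fra"


def split_sentences(text):
    """Very rough sentence splitter, good enough for a sketch."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def uses_enunciatives(text, markers, min_share=0.5):
    """First pass: True if enough sentences contain one of the marker words.

    `markers` is a set of lowercase word forms assumed to be reliable signs
    of enunciative usage; choosing them is the real linguistic work.
    """
    sentences = split_sentences(text)
    if not sentences:
        return False
    hits = 0
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        if words & markers:
            hits += 1
    return hits / len(sentences) >= min_share


def choose_mode(text, markers):
    """Second step: the 'if-else' that picks one of the two pipelines."""
    if uses_enunciatives(text, markers):
        return MODE_WITH_ENON
    return MODE_WITHOUT_ENON
```

The name returned by choose_mode could then be handed to whatever launches
the translation. Working on raw word forms is a simplification; a real
detector would more likely look at the analysed stream.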

Hèctor


Re: [Apertium-stuff] New Occitan-French release

2022-11-01 Thread Hèctor Alòs i Font

On Tue, 1 Nov 2022 at 13:46, Kevin Brubeck Unhammer wrote:

> Hèctor Alòs i Font wrote:
>
> > Enunciatives are a kind of adverb placed just before the verb in main
> > clauses (although they can be found in subordinate clauses too). In
> > affirmative clauses, they work like the English reinforcing "do" in "I do
> > like", but they are syntactically compulsory for enunciative users, so they
> > are not felt as a reinforcement. The problem is that for affirmative clauses
> > the enunciative is "que", which can be cnjsub (=that), rel (=that, which),
> > prn.itg (=what, which) and a comparative (=than). Note that cnjsub, rel and
> > prn.itg are often right in front of the verb in Occitan too. For negative,
> > interrogative and exclamatory clauses other words can be used, but also
> > "que"... which makes the whole thing a big mess. (And there are more with
> > dubitative, emphatic, etc. meanings.)
> >
> > As for your proposal, I do not yet have sufficient knowledge of CG to fully
> > understand it. My idea would be to make a first pass through a whole text
> > to understand if enunciatives are used in it (for example, recognising
> > other, more infrequent, but more easily recognisable enunciatives). In the
> > solution you propose, it seems that this knowledge is acquired
> > progressively, as sentences are translated. I fear that "que" is so messy
> > that at least the first sentences of a text would have the same problems as
> > we have now when we translate a Gascon text without enunciatives.
>
> That should be possible too, though I'm not sure how feasible it is to
> get CG to go that far into a text. By default, CG keeps a context of two
> windows, but that's configurable. It should be possible (perhaps with
> minor modifications to cg-proc) to read a bunch of sentences and use
> Window Spanning tests https://visl.sdu.dk/cg3/single/#test-spanning
>
> Tino, have you tried looking ahead several paragraphs, are there any
> downsides? This should be a fairly simple rule file.
>
> > This sounds perfect for Occitan. Is there a documentation in the wiki?
>
> There is! See:
>
> https://wiki.apertium.org/wiki/Dialectal_or_standard_variation#Overlapping_variants


Thanks a lot, Kevin, especially for the new updates!

Best,

Hèctor

Re: [Apertium-stuff] New Occitan-French release

2022-10-31 Thread Hèctor Alòs i Font

Thanks a lot for the feedback, Kevin. Some comments added in-line.

On Mon, 31 Oct 2022 at 23:31, Kevin Brubeck Unhammer wrote:

> Congrats on the release!
>
> And that documentation is impressive :)
>
> > 1) We have a serious problem in the translation from Gascon into French.
> > The basic issue is that some Gascon speakers use something called
> > enunciatives and others do not. These enunciatives, when they are used, are
> > found in every sentence and, what is worse, they are homographs with other
> > words of very high frequency. At present, we take it for granted that
> > Gascon sentences have an enunciative. The problem is that if they do not,
> > the disambiguator tends to assign the enunciative function to homographs
> > because, by definition, there must be at least one enunciative in every
> > sentence.
>
> (With the caveat that I have no idea what enunciatives are), one option
> might be to set a variable in CG if you find evidence that the text
> doesn't use enunciatives, and then for the remainder of the text remove
> enunciative readings if the variable is set. If every sentence of an
> enon speaker must have one enon, then finding a sentence without one
> would be evidence they don't speak enon:
>
>   SETVARIABLE (non-enon) (1) (*) IF (NEGATE 0* (enon)) ;
>
> If you know that "que" can't be enon before "xyzzy", you could prepend
> that rule with
>
>   "<que>" REMOVE (enon) IF (1 ("xyzzy")) ;
>
> and so on, so that the rule is more likely to hit.
>
> Then just
>
>   REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;
>
> which will keep removing for all sentences of the translation.
>
> That will have to be reset at some point, especially if using in server
> (I can't remember if cg-proc already resets all variables on null
> flush?) or for corpus runs. At the very least
>
>   REMVARIABLE (non-enon) IF (0C (enon)) ;
>
> Testing it sounds challenging.
>


Enunciatives are a kind of adverb placed just before the verb in main
clauses (although they can be found in subordinate clauses too). In
affirmative clauses, they work like the English reinforcing "do" in "I do
like", but they are syntactically compulsory for enunciative users, so they
are not felt as a reinforcement. The problem is that for affirmative clauses
the enunciative is "que", which can be cnjsub (=that), rel (=that, which),
prn.itg (=what, which) and a comparative (=than). Note that cnjsub, rel and
prn.itg are often right in front of the verb in Occitan too. For negative,
interrogative and exclamatory clauses other words can be used, but also
"que"... which makes the whole thing a big mess. (And there are more with
dubitative, emphatic, etc. meanings.)

As for your proposal, I do not yet have sufficient knowledge of CG to fully
understand it. My idea would be to make a first pass through a whole text
to understand if enunciatives are used in it (for example, recognising
other, more infrequent, but more easily recognisable enunciatives). In the
solution you propose, it seems that this knowledge is acquired
progressively, as sentences are translated. I fear that "que" is so messy
that at least the first sentences of a text would have the same problems as
we have now when we translate a Gascon text without enunciatives.


>
> > 2) Occitan is very diverse: not only because of its six major dialects (+
> > transition areas + regions outside the borders of France with other contact
> > languages), but also because of the internal variation within each of them.
> > The Gascon enunciative is just one of the things that could
> > be mentioned from Gascon alone. It would be interesting to use the system
> > implemented for Nynorsk to produce sub-varieties.
>
> Highly recommended. We have 52 preference choices now (that's 2^52
> possible combinations? which I believe may be higher than the number of
> Nynorsk users), but with
>
> * only one generator fst
> * only one bidix fst
>
> ie. no compilation slowdown, and a cleaner Nynorsk dix – because we had
> to clean up stuff in order to do this (previously variants "løk" and
> "lauk" were separate lemmas, now they're one lemma with a spelling
> pardef applied).
>

This sounds perfect for Occitan. Is there documentation on the wiki?

Best,
Hèctor




[Apertium-stuff] New Occitan-French release

2022-10-31 Thread Hèctor Alòs i Font

Hi,

A new version of the French-Occitan translator is ready to be packaged and
will hopefully soon be available on the Apertium site.

The previous version was the result of Claudi Balaguer's 2018 GSoC project. A
one-direction translator from French into Languedocien Occitan was
released. The Occitan dictionary was based on the bidirectional
Occitan-Catalan and Occitan-Spanish translators, which to this day still
work as self-contained packages of their own, without using shared
dictionaries.

The current version is bidirectional and bidialectal: Languedocien and
Gascon. It has been done with the Congrès permanent de la lenga occitana,
the organisation in charge of the standardisation of the Occitan language.
The Congrès has made available its dictionaries and collaborated in the
development. Mention must also be made of Daniel Swanson, who has been
developing numerous new utilities that we have used. A version using
additional copyrighted dictionaries is available on the Congrès website:
https://revirada.locongres.com

The architecture of the translator is explained here:
https://wiki.apertium.org/wiki/Paire_Occitan-Fran%C3%A7ais (in French). In
short: it uses a multi-level transfer (8-10 transfer steps), lexical
selection and the separable module (bidix: c. 45,000 entries per dialect,
excluding proper nouns; c. 2,000 word selection rules; c. 1,200 multi-word
rules).

There has been no systematic evaluation of the quality of the translator.
Usability tests show that translations into the two variants of Occitan are
frankly good. In the opposite direction, quality is good, but lower. The great
internal variation within each of the Occitan variants is a challenge.

The future of development is unclear, but there are three likely directions.

1) We have a serious problem in the translation from Gascon into French.
The basic issue is that some Gascon speakers use something called
enunciatives and others do not. These enunciatives, when they are used, are
found in every sentence and, what is worse, they are homographs with other
words of very high frequency. At present, we take it for granted that
Gascon sentences have an enunciative. The problem is that if they do not,
the disambiguator tends to assign the enunciative function to homographs
because, by definition, there must be at least one enunciative in every
sentence. The way to solve this could be:

a) automatically recognise whether the input text uses enunciatives, and

b) automatically select the translation with a
Gascon_with_enunciative-French or Gascon_without_enunciative-French mode.

Frankly, I don't have much idea how to do either one or the other. Ideas
welcome.

2) Occitan is very diverse: not only because of its six major dialects (+
transition areas + regions outside the borders of France with other contact
languages), but also because of the internal variation within each of them.
The Gascon enunciative is just one of the things that could
be mentioned from Gascon alone. It would be interesting to use the system
implemented for Nynorsk to produce sub-varieties.

3) There is a desire to introduce two more varieties of Occitan, including
Provençal. But this is likely to involve a major overhaul of the system
used so far to manage the varieties.

The cause is that the current system makes massive use of the alt tag in
dictionaries to mark varieties. This is inherited from the first Occitan
translators developed some 15 years ago. This tag is similar to the v tag
used to manage the Catalan and Portuguese varieties, but is more
restrictive. The alt tag makes a dictionary entry visible only for the
variety under consideration, while the v tag makes the entry readable, but
not generable, for the other varieties as well. Alt is useful because the
diversity of Occitan is very large and so is its homography (which poses
very serious problems for morphological disambiguation). But alt is not very
well suited to dealing with transitional varieties. Moreover, it causes a lot of
duplication or near-duplication in dictionaries, which makes them less
readable and manageable. And this is with only two varieties: with four or
more it's going to be terrible. And let's not talk about the compilation
time, which is already too long, since all four current translators are
rebuilt every time we type "make".
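
The difference between the two tags, as described above, can be written down
as a toy model in Python. This is only a sketch of the behaviour stated in
this message, not of how lttoolbox implements it; the Entry class, the
function names and the variety labels "gascon" and "lengadocian" are
invented for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Toy dictionary entry; at most one of `alt` / `v` is expected to be set."""
    lemma: str
    alt: Optional[str] = None  # variety restriction in the 'alt' style
    v: Optional[str] = None    # variety restriction in the 'v' style


def can_analyse(entry, variety):
    """Is the entry readable when translating FROM `variety`?"""
    if entry.alt is not None:
        return entry.alt == variety  # alt: invisible to other varieties
    return True                      # v (or unmarked): readable everywhere


def can_generate(entry, variety):
    """Is the entry produced when translating INTO `variety`?"""
    if entry.alt is not None:
        return entry.alt == variety
    if entry.v is not None:
        return entry.v == variety    # v: generated only for its own variety
    return True


if __name__ == "__main__":
    alt_entry = Entry("form1", alt="gascon")
    v_entry = Entry("form2", v="gascon")
    for entry in (alt_entry, v_entry):
        print(entry.lemma,
              "analysed in lengadocian:", can_analyse(entry, "lengadocian"),
              "| generated in lengadocian:", can_generate(entry, "lengadocian"))
```

In this model the two tags differ only on the analysis side: alt keeps the
forms of other varieties out of the analyser, which means fewer homographs to
disambiguate, while v lets them in.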

Kind regards,
Hèctor

Re: [Apertium-stuff] Create new modes

2022-07-11 Thread Hèctor Alòs i Font

Hi Helena,
You should also change the Makefile.am file. You can look at spa-cat
(Catalan side) or por-cat (Portuguese side), for instance.
Best,
Hèctor

On Mon, 11 Jul 2022 at 18:46, Helena Egea Piñeiro wrote:

> Hi! I wanted to ask how I could create a new mode for a pair. I've tried
> changing modes.xml and adding a new mode there. But I know there is
> something else that I need to do in the "apertium" file for execution or in the
> makefile. I just couldn't do it. I need to add another step to the pipeline and
> use the -b option, and that's why I'm trying it.
> Do you know how this could be done?
>
> Thank you very much

Re: [Apertium-stuff] Pair releases

2022-04-29 Thread Hèctor Alòs i Font

Tino, it would be better not to release oci-fra yet. It is not ready, and
when it is, the developers will make a formal public announcement.
Hèctor

On Thu, 28 Apr 2022 at 18:05, Hèctor Alòs i Font wrote:

> 1) I think apertium-ca-it is deprecated. apertium-cat-ita is newer and
> works much better.
> 2) Will the new versions be available on apertium.org?
> 3) apertium-oci-fra is almost ready for a release, but I'd need to confirm
> this with Lo Congrès. I think it'd be better to wait for their own release.
> This would add three new translators to Apertium: oci > fra, fra >
> oci_gascon and oci_gascon > fra (besides the current fra > oci).
> Hèctor
>
> On Tue, 26 Apr 2022 at 8:38, Tino Didriksen wrote:
>
>> G'dair pair developers,
>>
>> Unless someone has good reason to block a given pair, this weekend I will
>> make releases from current master/main state and push to Debian for all
>> repos listed in
>> https://qa.debian.org/developer.php?login=tino%40didriksen.cc
>>
>> -- Tino Didriksen
>>

Re: [Apertium-stuff] Pair releases

2022-04-28 Thread Hèctor Alòs i Font

1) I think apertium-ca-it is deprecated. apertium-cat-ita is newer and
works much better.
2) Will the new versions be available on apertium.org?
3) apertium-oci-fra is almost ready for a release, but I'd need to confirm
this with Lo Congrès. I think it'd be better to wait for their own release.
This would add three new translators to Apertium: oci > fra, fra >
oci_gascon and oci_gascon > fra (besides the current fra > oci).
Hèctor

On Tue, 26 Apr 2022 at 8:38, Tino Didriksen wrote:

> G'dair pair developers,
>
> Unless someone has good reason to block a given pair, this weekend I will
> make releases from current master/main state and push to Debian for all
> repos listed in
> https://qa.debian.org/developer.php?login=tino%40didriksen.cc
>
> -- Tino Didriksen
>

Re: [Apertium-stuff] rus-ukr new words

2022-02-05 Thread Hèctor Alòs i Font

Data from Wiktionaries is freely reusable under a Creative Commons
Attribution-ShareAlike licence.
Several Apertium pairs have used them extensively to complete their own
dictionaries. Sure, you can use the Ukrainian, Russian or any other
Wiktionary.
As for the words you have sent so far, it would be better to
differentiate between nouns and proper nouns. I am not familiar with the
Russian and Ukrainian Apertium dictionaries, but Apertium almost always
distinguishes between different categories of proper names. I would not
deal with proper nouns at first. It is obvious that frequent common words
are missing, and these are more important.
Hèctor

Message from Nazar Kotsur on Sat, 5 Feb 2022 at 16:48:

> Thank you for the answer! Do you use data from the Ukrainian Wiktionary? I am
> mostly contributing to it
>
> On Sat, 5 Feb 2022 at 15:31, Hèctor Alòs i Font wrote:
>
>> Wow! мужчина and черный are not set in this released pair! I wonder what
>> its coverage will be. Instead of adding some words by hand, it would
>> probably be better to use the Russian Wiktionary, which has many
>> translations.
>> Hèctor
>>
>> Message from Nazar Kotsur on Sat, 5 Feb 2022 at 15:54:
>>

Re: [Apertium-stuff] rus-ukr new words

2022-02-05 Thread Hèctor Alòs i Font

Wow! мужчина and черный are not set in this released pair! I wonder what
its coverage will be. Instead of adding some words by hand, it would
probably be better to use the Russian Wiktionary, which has many
translations.
Hèctor

Message from Nazar Kotsur on Sat, 5 Feb 2022 at 15:54:


Re: [Apertium-stuff] English-Santali output is not working in beta.apertium.com website

2022-01-11 Thread Hèctor Alòs i Font

Hi Prasanta,

Do you have install="yes" in modes.xml for the relevant modes you want to
see in beta.apertium?
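
A quick way to check this is to list which modes in modes.xml carry install="yes". The following is only a minimal Python sketch: the sample XML and the function name `installed_modes` are made up for illustration, and it assumes the usual layout of a `modes` root element containing `mode` elements with `name` and `install` attributes.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical modes.xml content, reduced to what matters here.
SAMPLE_MODES_XML = """<modes>
  <mode name="eng-sat" install="yes"><pipeline/></mode>
  <mode name="eng-sat-morph" install="no"><pipeline/></mode>
  <mode name="sat-eng"><pipeline/></mode>
</modes>"""


def installed_modes(modes_xml: str) -> List[str]:
    """Return the names of the modes marked with install="yes"."""
    root = ET.fromstring(modes_xml)
    return [
        mode.get("name")
        for mode in root.iter("mode")
        # A missing install attribute is treated like install="no".
        if mode.get("install") == "yes"
    ]


print(installed_modes(SAMPLE_MODES_XML))
```

If the mode you expect to see in the web interface is missing from that list, that is the first thing to fix.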

For this kind of question, I recommend going to the IRC channel. It's
usually quicker.

Regards,
Hèctor

Message from Prasanta Hembram on Tue, 11 Jan 2022 at 10:23:

> Hi,
>
> I was testing translation with the recently added eng-sat pair. It
> works well for English to Santali offline on my system and tested well in
> Apertium Viewer, but it is not working in the online version of Apertium
> (beta.apertium.org). When I set the source language to English and the
> target language to Santali and enter an input such as "House" or any
> other word, the output is "Translation not yet available!".
> *Interestingly, when I use "Translate a Document", upload a list of
> English sentences in a txt file and click on translate with output in
> Santali, everything seems to be fine.*
> I'm not able to find where the error lies.
>
> Steps to reproduce:
> 1. Go to beta.apertium.org
> 2. Set source language to English and target language to Santali
> 3. Give any input in English like "House" and click on translate.
> 4. Instead of ᱚᱲᱟᱜ as an output the output gives "Translation not yet
> available!" and hangs.
>
> --
> Thanks
> with best regards
> Prasanta Hembram

Re: [Apertium-stuff] Left elisions: recognition and blanks

2022-01-02 Thread Hèctor Alòs i Font

Thanks a lot, Daniel!!! It is working now. I had never seen a "preblank"
section, nor had I noticed its existence in the documentation (although it is
mentioned alongside "postblank").
Hèctor

Message from Daniel Swanson on Sun, 2 Jan 2022 at 17:25:

> Would putting the first element in a "postblank" section or the
> second element in a "preblank" section, rather than in a
> type="standard" section, work? Then one of the elements is distinct in
> combination from what it is on its own, and making it postblank or
> preblank will insert an extra space after or before, respectively.
>
> Daniel
>
> On Sun, Jan 2, 2022 at 5:01 AM Hèctor Alòs i Font 
> wrote:
> >
> > In Occitan, in many cases the spelling rules require the elision of the
> beginning of pronouns and determiners, e.g. "que o > que'u". There are also
> numerous cases of fusions, e.g. `de lo > del` or `de lo > deu` or `de lo >
> deth` depending on the variety of Occitan. If we add to this the great
> (sub)dialectal variety of Occitan, the result is almost a combinatorial
> explosion. At present, we have hundreds of lines in the Occitan monodix to
> try to deal with them, but it is not enough.
> >
> > One of the embarrassing problems with this is the issue I have had this
> morning: `çò que’u`. `çò que` is one of the many forms of a given relative
> pronoun (but it can be also analysed as the pronoun `çò` followed by the
> word `que` that may be here at least a kind of adverb). The issue is that
> we don't have a definition in the Occitan monodix for `çò que’u` as `çò
> que` + `u` (nor as `çò` + `que` + `u`), using  (it is not in the
> hundreds we have). The result is that the translation has been done almost
> correctly, but the translations of `çò que` and `u` have been put together
> without a blank, since there is not a blank in the input. That's why we
> have to define so many combinations using ``:
> >
> > ```
> > $ echo "00192. Lo privilègi de l’editorialista qu’es de poder escríver
> **çò que’u** passa peu cap." | apertium -d . oci_gascon-fra
> > 00192. Le privilège de l'éditorialiste  est de pouvoir écrire **ce
> quela**  passe pour la tête.
> > ```
> >
> > Does anyone have any ideas on how not to solve this "the hard way" (as
> we have done so far)?
> >
> > Hèctor

[Apertium-stuff] Left elisions: post-generation

2022-01-02 Thread Hèctor Alòs i Font

The question of elisions on the left involves a problem that mirrors that of
recognition, which I discussed in my last message: post-generation.

Currently we have:

déjà tu me dis > ja me dises

But it should be:

déjà tu me dis > ja'm dises

After any word ending in a vowel, the pronoun *me* has to be converted into
*'m* and attached directly to the previous word, without any blank.

I don't know of any way to deal with this in postgeneration. Does anyone
have an idea?
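
To pin down the behaviour I am after, here is a minimal Python sketch of the string transformation alone. It is not an Apertium post-generation rule, and the vowel list and the function name `elide_me` are my own assumptions for illustration.

```python
import re

# Assumed set of vowel letters that can end the preceding word.
VOWELS = "aeiouàáèéìíòóùúïü"

# " me" as a whole word, directly after a word-final vowel.
_ME_AFTER_VOWEL = re.compile(rf"(?<=[{VOWELS}]) me(?=\s|$)", re.IGNORECASE)


def elide_me(text: str) -> str:
    """Turn 'X me' into "X'm" when the word X ends in a vowel."""
    return _ME_AFTER_VOWEL.sub("'m", text)


print(elide_me("ja me dises"))    # vowel before "me": elided and attached
print(elide_me("quan me dises"))  # consonant before "me": left unchanged
```

The point is that the change depends on the end of the previous word and also removes the blank between the two words.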

Hèctor

[Apertium-stuff] Left elisions: recognition and blanks

2022-01-02 Thread Hèctor Alòs i Font

In Occitan, in many cases the spelling rules require the elision of the
beginning of pronouns and determiners, e.g. "que o > que'u". There are also
numerous cases of fusions, e.g. `de lo > del` or `de lo > deu` or `de lo >
deth` depending on the variety of Occitan. If we add to this the great
(sub)dialectal variety of Occitan, the result is almost a combinatorial
explosion. At present, we have hundreds of lines in the Occitan monodix to
try to deal with them, but it is not enough.

One of the embarrassing problems with this is the issue I have had this
morning: `çò que’u`. `çò que` is one of the many forms of a given relative
pronoun (but it can be also analysed as the pronoun `çò` followed by the
word `que` that may be here at least a kind of adverb). The issue is that
we don't have a definition in the Occitan monodix for `çò que’u` as `çò
que` + `u` (nor as `çò` + `que` + `u`), using  (it is not in the
hundreds we have). The result is that the translation has been done almost
correctly, but the translations of `çò que` and `u` have been put together
without a blank, since there is not a blank in the input. That's why we
have to define so many combinations using ``:

```
$ echo "00192. Lo privilègi de l’editorialista qu’es de poder escríver **çò
que’u** passa peu cap." | apertium -d . oci_gascon-fra
00192. Le privilège de l'éditorialiste  est de pouvoir écrire **ce quela**
 passe pour la tête.
```

Does anyone have any ideas on how not to solve this "the hard way" (as we
have done so far)?

Hèctor

Re: [Apertium-stuff] Ôdp: Ôdp: Ôdp: Questions about lexical selection

2021-12-22 Thread Hèctor Alòs i Font

Message from Grzegorz Kulik on Wed, 22 Dec 2021 at 16:35:

> I have another question. In Polish "ich" can mean both "their" and "them".
> The tagger always chooses the "their" meaning, so I want to create a
> negative match, so that if the word is not followed by a noun, it would
> choose "them". I'm trying to put it together based on the documentation but
> I must be missing something.
>
> REMOVE ICH IF (0 ICH LINK NOT 1 NOUN);
>
> If this worked, it would tag the wordform as an inflected personal
> pronoun. What am I doing wrong?
>

Not necessarily. I don't know what exactly "ICH" is supposed to be. If it
is the lemma and both ICH ("their" and "them") have the same lemma "ich",
you can choose e.g.

REMOVE prn IF (0 ICH LINK NOT 1 NOUN);

Hèctor

>
> Best
> Greg
>
> On Tuesday, 21 Dec 2021 at 17:59, Grzegorz Kulik
> (gregorykku...@gmail.com) wrote:
>
> Now I get it. Thank you very much!
>
> Best
> Greg
>
> On Tuesday, 21 Dec 2021 at 10:27, Daniel Swanson
> (awesomeevildu...@gmail.com) wrote:
>
> (1C NOUN) would match ^a/b/c$ or ^a/b$ but it would not match
> ^a/b/c$ - you can read it as "if the next word can only be a noun".
>
> On Tue, Dec 21, 2021 at 10:10 AM Grzegorz Kulik
> <gregorykku...@gmail.com> wrote:
>
> Thank you both for the suggestions. I never considered CG because it
> looked complicated but I actually got a grip of it right away. I went with:
>
> REMOVE NOUN IF (0 DET) (0 NOUN) (1 (n mp));
>
> and it works perfectly. It did not work with 1C there. I looked up the C
> symbol in the documentation and it says "Every reading this position must
> match the pattern (normally only 1 has to)". I don't know what this
> sentence means. Every time this position is read, it must match the
> pattern? Can I find any elaboration on this anywhere? I checked
> http://beta.visl.sdu.dk/cg3/single/
> but can't seem to find anything about it there.
>
> Thank you!
> Greg
>
> On Tuesday, 21 Dec 2021 at 09:25, Hèctor Alòs i Font
> (hectora...@gmail.com) wrote:
>
> Message from Daniel Swanson <awesomeevildu...@gmail.com> on Tue,
> 21 Dec 2021 at 7:57:
>
> Hi Greg,
>
> The file where you want to write rules for this is
> https://github.com/apertium/apertium-pol/blob/master/apertium-pol.pol.rlx
>
> If you want something like "tacy is a determiner before a noun", you
> could get that with
>
> SELECT DET IF (0 DET) (0 NOUN) (1 NOUN) ;
>
> The problem with this rule is that (1 NOUN) is not necessarily a noun, but
> something that can be analysed as a noun at the moment this rule is
> executed. Similarly, the 0 word may be correctly analysed as something
> else, like an adjective. So, a more cautious rule can be, for instance:
>
> REMOVE NOUN IF (0 DET) (0 NOUN) (1C NOUN) ;
>
> The problem with this alternative variant of the rule is that it matches
> less often than the first one. It may not solve cases that Daniel's version
> solves, although it probably makes fewer wrong decisions. Your knowledge of
> the language, and testing on a corpus, should help you decide which is
> better, or maybe you will choose something in between. Tuning can be done
> by adding a few rules for frequent words/cases before the general one.
>
> Hèctor
>
> Daniel
>
> On Mon, Dec 20, 2021 at 1:40 PM Grzegorz Kulik
> <gregorykku...@gmail.com> wrote:
>
> Hello all,
>
> I haven't contacted you for some time, I hope you are all well. I
> developed the pol-szl pair and although the translation is quite
> reasonable, I decided to make it better by improving the lexical
> selection. I've been reading the documentation and managed to write
> several rules for forms that need disambiguation and are the same parts of
> speech. However, I cannot find any information anywhere about what to do
> if there is a form that can mean two completely different things. Example
> in Polish:
>
> tacy (such) = taki
> tacy (of a tablet) = taca/taca/taca
>
> The first meaning is obviously much more frequent but the translator
> chooses the second one, which is less than desirable.
>
> What can I do to remedy this? Can I write rules for that manually? Should
> I train the tagger? If so, what method would be the best? There's multiple
> training methods and I don't know which one to choose for my pair. Could
> you recomme

Re: [Apertium-stuff] Questions about lexical selection

2021-12-20 Thread Hèctor Alòs i Font

Message from Daniel Swanson on Tue, 21 Dec 2021 at 7:57:

> Hi Greg,
>
> The file where you want to write rules for this is
> https://github.com/apertium/apertium-pol/blob/master/apertium-pol.pol.rlx
>
> If you want something like "tacy is a determiner before a noun", you
> could get that with
>
> SELECT DET IF (0 DET) (0 NOUN) (1 NOUN) ;
>

The problem with this rule is that (1 NOUN) is not necessarily a noun, but
something that can be analysed as a noun at the moment this rule is
executed. Similarly, the 0 word may be correctly analysed as something
else, like an adjective. So, a more cautious rule can be, for instance:

REMOVE NOUN IF (0 DET) (0 NOUN) (1C NOUN) ;

The problem with this alternative variant of the rule is that it matches
less often than the first one. It may not solve cases that Daniel's version
solves, although it probably makes fewer wrong decisions. Your knowledge of
the language, and testing on a corpus, should help you decide which is better,
or maybe you will choose something in between. Tuning can be done by
adding a few rules for frequent words/cases before the general one.
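
The difference between the plain context (1 NOUN) and the careful context (1C NOUN) can be illustrated outside CG. The following is only a toy Python model, not how CG-3 is implemented: a word is represented as the list of readings still attached to it, each reading being a set of tags, and the tag names and helper names are made up for the example.

```python
from typing import List, Set

Reading = Set[str]      # tags of one analysis, e.g. {"n", "mp"}
Cohort = List[Reading]  # all analyses still attached to one word form


def matches_plain(cohort: Cohort, tag: str) -> bool:
    """Plain test such as (1 NOUN): one matching reading is enough."""
    return any(tag in reading for reading in cohort)


def matches_careful(cohort: Cohort, tag: str) -> bool:
    """Careful test such as (1C NOUN): every remaining reading must match."""
    return bool(cohort) and all(tag in reading for reading in cohort)


# Next word is still ambiguous between noun and adjective.
ambiguous = [{"n", "mp"}, {"adj", "mp"}]
# Next word has two readings, but both are nouns.
only_nouns = [{"n", "mp"}, {"n", "f"}]

print(matches_plain(ambiguous, "n"), matches_careful(ambiguous, "n"))
print(matches_plain(only_nouns, "n"), matches_careful(only_nouns, "n"))
```

With the ambiguous word, only the plain test succeeds, which is why the careful variant of the rule fires less often but is safer.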

Hèctor


>
> Daniel
>
> On Mon, Dec 20, 2021 at 1:40 PM Grzegorz Kulik 
> wrote:
> >
> > Hello all,
> >
> > I haven't contacted you for some time, I hope you are all well. I
> developed the pol-szl pair and although the translation is quite
> reasonable, I decided to make it better by improving the lexical selection.
> I've been reading the documentation and managed to write several rules for
> forms that need disambiguation and are the same parts of speech. However, I
> cannot find any information anywhere about what to do if there is a form
> that can mean two completely different things. Example in Polish:
> >
> > tacy (such) = taki
> > tacy (of a tablet) =
> taca/taca/taca
> >
> > The first meaning is obviously much more frequent but the translator
> chooses the second one, which is less than desirable.
> >
> > What can I do to remedy this? Can I write rules for that manually?
> Should I train the tagger? If so, what method would be the best? There's
> multiple training methods and I don't know which one to choose for my pair.
> Could you recommend me the best approach?
> >
> > Thank you in advance
> > Greg

Re: [Apertium-stuff] English-Santali Plural form not working

2021-12-19 Thread Hèctor Alòs i Font

Hi Prasanta,

I'm delighted you are working on Santali.

I have looked at your code on GitHub and found nothing that could
generate this kind of error. So I downloaded your code, fixed the
eng-sat.t1x file because it has a syntax error, and ran the
translation of "cows"... but I can't get anything, seemingly because of a
problem in the postchunk module (although it is standard):

$ echo "cow" | apertium -d . eng-sat-interchunk
^ᱰᱟᱹᱝᱜᱽᱨᱤ$^sent{^᱾$}$
$ echo "cow" | apertium -d . eng-sat-postchunk
^᱾$

In any case, if you somehow get something of the type "X/Y" (like "ᱰᱟᱹᱝᱜᱽᱨᱤ
ᱠᱤᱱ/ ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱚ") this is because in the target dictionary there are two
generations for  ᱰᱟᱹᱝᱜᱽᱨᱤ. Probably you have:


  
 ᱠᱤᱱ 
 ᱠᱚ 


instead of:


  
 ᱠᱤᱱ 
 ᱠᱚ 


Regards,
Hèctor


Message from Prasanta Hembram on Sun, 19 Dec 2021 at 19:09:

> Hi, I'm working on a new language pair English -Santali pair and trying to
> learn everyday how can i improve this pair. Today I had a few doubts.
>
> Doubt 1: Plural rules are not working when translating from English to
> Santali, The English plural form "Cows" returns output " ᱰᱟᱹᱝᱜᱽᱨᱤ " instead
> of "ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱤᱱ/ ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱚ" . I have set up paradef but no luck. How to
> get the correct output??
>
> The correct forms are as follows
> Cow = ᱰᱟᱹᱝᱜᱽᱨᱤ
> Cows = ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱤᱱ  or Two Cows = ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱤᱱ
> Cows = ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱚ
>
> My Santali Monolingual Dictionary link:
> https://github.com/Prasanta-Hembram/apertium-sat
>
> English-Santali Bilingual Dictionary link :
> https://github.com/Prasanta-Hembram/apertium-eng-sat
>
> echo "cows" | apertium -d . eng-sat-transfer
>
> returns wrong output: ^ᱰᱟᱹᱝᱜᱽᱨᱤ$^sent{^᱾$}$
>
>
>
>
>
>
> 
>  
>ᱠᱤᱱ  
>ᱠᱚ 
> 
>
> --
> Thanks
> with best regards
> Prasanta Hembram

Re: [Apertium-stuff] Github Actions

2021-12-10 Thread Hèctor Alòs i Font

Message from Xavi Ivars on Fri, 10 Dec 2021 at 21:05:

> Actually, this has helped me already to identify two issues:
> apertium-spa-cat and apertium-fra tests (using apertium-regtest) are broken.
>
> I'm pretty sure the contributors to those packages don't really know how
> regtest works, and what needs to be done to fix those tests.
>

I am working on apertium-fra. Hopefully it will be solved tomorrow.
Hèctor


>
> Message from Xavi Ivars on Fri, 10 Dec 2021 at 18:53:
>
>> While playing this morning with some fixes on Apertium Apy, I added a
>> Github Action to make sure tests were running and code coverage stats were
>> pushed to Coveralls.io, and I really liked how Github Actions work.
>>
>> To better test the possibilities Github Actions gives us, I set up the
>> same scripts we already had in TravisCI (not working now) on Github
>> Actions, for apertium-spa, apertium-cat and apertium-spa-cat.
>>
>> And I found it's way more powerful, and we can do pretty cool things.
>>
>> This is what I've set up:
>>
>> *Apertium Github Actions repository [1]*
>>
>> This repository contains reusable workflows and actions, that can be
>> referenced by other repository specific workflows.
>>
>> There are currently 2 workflows already implemented, one for monolingual
>> [2] and another one for bilingual [3] modules.
>>
>> *Apertium Github repository [4] *
>>
>> This repository already existed, and was used to define Apertium's
>> profile in Github. I've added "workflow templates" there, which makes it
>> extremely easy to set up a reusable workflow in an existing module.
>>
>> As an example, this is what would be required to set up the monolingual
>> workflow in apertium-fra:
>>
>>1. Go to the Actions tab
>>2. (In case workflows already exist) click on create new workflow
>>3. Scroll down, and select the appropriate workflow. In the case of
>>apertium-fra, it would be the monolingual one
>>4. In the case of a bilingual module, replace xxx-yyy, xxx and yyy with
>>the language codes of your bilingual module and of the monolingual ones.
>>5. DONE! Tests are integrated!
>>
>>
>>
>> References:
>> [1] https://github.com/apertium/github-actions/tree/master
>> [2]
>> https://github.com/apertium/github-actions/tree/master#monolingual-buildyml
>> [3]
>> https://github.com/apertium/github-actions/tree/master#bilingual-buildyml
>> [4] https://github.com/apertium/.github
>> --
>> < Xavi Ivars >
>> < http://xavi.ivars.me >
>>
>
>
> --
> < Xavi Ivars >
> < http://xavi.ivars.me >

Re: [Apertium-stuff] Problem with letter case in interchunk (or after)

2021-10-10 Thread Hèctor Alòs i Font

Thanks a lot, Daniel.
It is more or less working. It seems I have a bug related to blanks in
the postchunk, but this is another question.
Hèctor

Message from Daniel Swanson on Sat, 9 Oct 2021 at 23:24:

>  or similar should work just fine in
> postchunk. If not, that's a bug I need to look into.
>
> You might not be able to use tags in , but I'd have to
> double check that.
>
> On Sat, Oct 9, 2021 at 4:20 PM Hèctor Alòs i Font 
> wrote:
> >
> > As far as I know, in postchunk only the name of the chunk can be seen.
> We cannot read any tag of the header (I don't know why). So, I don't see
> how to put a condition in interchunk for the postchunk, except, maybe,
> changing the name of the chunk.
> >
> > Message from Daniel Swanson on Sat, 9 Oct 2021 at 22:32:
> >>
> >> You could set the case of the chunk pseudolemma or else append a tag
> >> to it and then condition on that in postchunk.
> >>
> >> On Sat, Oct 9, 2021 at 3:28 PM Hèctor Alòs i Font 
> wrote:
> >> >
> >> > In the interchunk stage I am adding a word, which happens to be very
> often at the beginning of a sentence. So I face two issues. On the one
> hand, to put an initial capital letter to the word I add, if needed, and,
> on the other hand, to remove the initial capital letter from the following
> word, if needed too. For the former, I can more or less figure out how to
> do it, although I have my doubts. For the following word, I don't see how:
> in interchunk I don't have access to the word itself, so I don't know
> whether it is capitalised or not; in postchunk I only have access to the
> individual words, so I lack context; and in postgeneration it doesn't seem
> to be possible to change words from uppercase to lowercase.
> >> >
> >> > Does anyone have any suggestions?
> >> >
> >> > $ echo "C'est une maison." | apertium -d . fra-oci_gascon
> >> > qu'Ei un ostau.
> >> >
> >> > Hèctor

Re: [Apertium-stuff] Problem with letter case in interchunk (or after)

2021-10-09 Thread Hèctor Alòs i Font

As far as I know, in postchunk only the name of the chunk can be seen. We
cannot read any tag of the header (I don't know why). So, I don't see how
to put a condition in interchunk for the postchunk, except, maybe, changing
the name of the chunk.

Message from Daniel Swanson on Sat, 9 Oct 2021 at 22:32:

> You could set the case of the chunk pseudolemma or else append a tag
> to it and then condition on that in postchunk.
>
> On Sat, Oct 9, 2021 at 3:28 PM Hèctor Alòs i Font 
> wrote:
> >
> > In the interchunk stage I am adding a word, which happens to be very
> often at the beginning of a sentence. So I face two issues. On the one
> hand, to put an initial capital letter to the word I add, if needed, and,
> on the other hand, to remove the initial capital letter from the following
> word, if needed too. For the former, I can more or less figure out how to
> do it, although I have my doubts. For the following word, I don't see how:
> in interchunk I don't have access to the word itself, so I don't know
> whether it is capitalised or not; in postchunk I only have access to the
> individual words, so I lack context; and in postgeneration it doesn't seem
> to be possible to change words from uppercase to lowercase.
> >
> > Does anyone have any suggestions?
> >
> > $ echo "C'est une maison." | apertium -d . fra-oci_gascon
> > qu'Ei un ostau.
> >
> > Hèctor

[Apertium-stuff] Problem with letter case in interchunk (or after)

2021-10-09 Thread Hèctor Alòs i Font

In the interchunk stage I am adding a word, which happens to be very often
at the beginning of a sentence. So I face two issues. On the one hand, to
put an initial capital letter to the word I add, if needed, and, on the
other hand, to remove the initial capital letter from the following word,
if needed too. For the former, I can more or less figure out how to do it,
although I have my doubts. For the following word, I don't see how: in
interchunk I don't have access to the word itself, so I don't know whether
it is capitalised or not; in postchunk I only have access to the individual
words, so I lack context; and in postgeneration it doesn't seem to be
possible to change words from uppercase to lowercase.

Does anyone have any suggestions?

$ echo "C'est une maison." | apertium -d . fra-oci_gascon
qu'Ei un ostau.
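
Just to make the expected result explicit, here is a minimal Python sketch of the case handling I would like to get. It says nothing about how to do it in interchunk or postchunk, the function name `move_initial_capital` is made up, and it deliberately ignores proper nouns, which should of course keep their capital.

```python
from typing import Tuple


def move_initial_capital(inserted: str, following: str) -> Tuple[str, str]:
    """Move a sentence-initial capital from `following` to `inserted`."""
    starts_capitalised = following[:1].isupper()
    all_caps = following[1:2].isupper()  # e.g. "EI": leave such words alone
    if starts_capitalised and not all_caps:
        return (
            inserted[:1].upper() + inserted[1:],
            following[:1].lower() + following[1:],
        )
    return inserted, following


print("".join(move_initial_capital("qu'", "Ei")) + " un ostau.")
```

For the example above this gives the capital on the added word and a lowercase verb, instead of "qu'Ei".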

Hèctor

[Apertium-stuff] EMNLP 2021

2021-05-06 Thread Hèctor Alòs i Font

EMNLP 2021
SIXTH CONFERENCE ON MACHINE TRANSLATION (WMT21)
November 10-11, 2021
Punta Cana, Dominican Republic

Shared Task: Unsupervised MT and Very Low Resource Supervised MT

There is no machine translation available for most of the ~7000 languages
spoken on the planet Earth. In the unsupervised and very low resource
translation task, we collaborate with the local communities in providing
resources and developing MT systems for local minority languages. Like last
year, the task will include translation of Upper Sorbian, a minority Slavic
language spoken in Germany. This year, we added translation of the closely
related Lower Sorbian, and of Chuvash, a minority Turkic language spoken in
southern Russia.

This year, the tasks are:

   - Unsupervised Machine Translation: German to Lower Sorbian. Lower
   Sorbian to German.
   - Very Low Resource Supervised Machine Translation: German to Upper
   Sorbian. Upper Sorbian to German.
   - Low Resource Supervised Machine Translation: Russian to Chuvash.
   Chuvash to Russian.


https://www.statmt.org/wmt21/unsup_and_very_low_res.html

Re: [Apertium-stuff] Usage of mfn tag in the Hindi dictionary

2021-05-05 Thread Hèctor Alòs i Font

Message from Anuradha Pandey on Wed, 5 May 2021 at 15:51:

> Hello everyone,
> I have been working on a new language pair, and I was having a look at the
> word forms defined in the Hindi paradigms. The "mfn" tag seems suspicious
> for Hindi. It stands for gender-neutral by definition, like "it" in
> English.  Hindi nouns have two grammatical genders: masculine and feminine.
> There is no neutral gender for nouns in Hindi. The mfn tag has been used at
> 3 places -
>
>1.  "गलत__adj"
>2. "स/ा__adj"
>3. "एक__det"
>
> The last paradigm makes sense since a determiner can be
> gender-neutral. However, I was curious about their usage in the case of
> adjectives. The definitions of these use the "mfn" tag along with the
> "sp" tag (which, I suppose, is for cases where singular and plural are
> equivalent). I couldn't come up with an example where the adjective is
> gender-neutral and singular and plural are equivalent.
>

Even if the determiner has the same form for both genders, masculine and
feminine, I would expect an "mf" tag, not an "mfn" one.
In fact the whole paradigm is quite strange:


  


So there is only a single form, just for the singular and the oblique
case, and the order of the tags is not the expected one: gender, number and
case (as the adjectives and the nouns have).

Other determiner paradigms have other unexpected forms, with only one
form and without any gender and/or case tags.

This kind of thing is unexpected in a released language. If these
paradigms are changed in the Hindi dictionary and the released Hindi-Urdu
pair relies on them, the pair could stop working.

Hèctor


>
> If someone who has worked with the Hindi dictionary can clarify the logic
> behind using this tag, and give an example for better clarity, it would be
> really helpful.
>
> Regards,
> Anuradha Pandey
> IRC: Anuradha_Pandey

Re: [Apertium-stuff] GSoC proposal draft - A morphological analyzer and generator for Romeyka

2021-04-14 Thread Hèctor Alòs i Font

Hi Utku,

Your proposal seems interesting.

1. Did you take a look at apertium-ell? How much could it help?
2. In your proposal, you speak about a corpus and intend to reach 80%
coverage. What kind of corpus are you talking about? How much is written
in Romeyka?
3. Could you explain what you understand by "modelling allomorphy"? Is
that Apertium's morphological disambiguation?
4. Could you also explain how you intend to tag "content phenomenon"?
5. I couldn't find anything about your coding challenge. The coding
challenge is a must. It shows that you know how to install Apertium and
have a basic understanding of it.

Hèctor

Message from Utku Turk on Wed, 14 Apr 2021 at 15:58:

> Hi,
>
> My name is Utku Türk. I am a linguistics student at Boğaziçi University,
> Turkey. I want to attend GSoC with a Romeyka morphological analyzer
> project.
>
> Romeyka is one of the many Modern Greek dialects spoken in Asia Minor. It
> has no NLP footprint, and I believe it is an important first step for
> Quantitative Language Contact and Dialectology studies. Its morphology and
> lexicon are heavily influenced by Ancient Greek, Turkish, and Laz.
>
> The following link[1] is my draft for the GSoC proposal. Any feedback is
> very much appreciated!
>
> [1]:
> https://docs.google.com/document/d/1CJrD7TRJvFKKD5qsW_fnLdNbk1iQ2t4_dim3MA3g4_E/edit?usp=sharing

Re: [Apertium-stuff] GSOC proposal draft - Create a usable version of these language pair: English--Igbo

2021-04-13 Thread Hèctor Alòs i Font

Hi Okonkwo,

The problem with this second version of your proposal is that it is almost
the same as the first. Igbo morphology seems too complex to develop a
translator in a single GSoC, so forget about working on a bilingual
dictionary. You should focus on the analyser: the phases of your project
should state which parts of the morphological analyser you will implement
and how many words you will add. Your proposal should prove that you
understand the difficulties of building such an analyser. Could you reach a
naive coverage of 85% or 90%? At the end of the project, you could also
work on a morphological disambiguator (maybe for a couple of weeks). At
least, this is my opinion. Others may see it differently.
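
Naive coverage here means the share of corpus tokens that receive at least
one analysis. A minimal Python sketch of the calculation, assuming analyser
output in the Apertium stream format, where an unknown word comes back as
^word/*word$. The sample string is invented for illustration, and escaped
characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped ^, / and $ inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the fraction of tokens that received at least one analysis."""
    known = 0
    total = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # The analyser marks unknown words with a leading asterisk.
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Invented sample: two analysed tokens and one unknown token.
SAMPLE = "^the/the<det><def><sp>$ ^cat/cat<n><sg>$ ^zzz/*zzz$"

if __name__ == "__main__":
    print(f"naive coverage: {naive_coverage(SAMPLE):.1%}")
```

Run over a large corpus, the same count shows how far the analyser is from
the 85% or 90% target.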

Hèctor

On Tue, 13 Apr 2021 at 11:34, Okonkwo Ifeanyichukwu wrote:

> Thanks, Jonathan, Daniel has converted the existing work to lexd as he
> offered to do it. It is really great this way, I will wait for it to
> be merged in order to pull it.
>
> Creating a morphological analyser is one of my goals, I don't understand
> when you say I did not mention morphology. Please I need more detail.
>
> --
> Okonkwo
>
>
> On Mon, Apr 12, 2021 at 3:13 PM Jonathan Washington <
> jonathan.n.washing...@gmail.com> wrote:
>
>> Hi Okonkwo,
>>
>> Thank you for your continued interest in Apertium!
>>
>> My main comment on this latest version of your proposal is that you
>> don't mention morphology.  This should be a main focus of your
>> work—not just expanding the lexicon, but making it productive.
>>
>> Also, given the range of non-suffixational morphology in Igbo, I think
>> it might be a good idea to implement the dictionary in lexd instead of
>> lexc.  Daniel has offered to help convert your existing work.  What do
>> you think?
>>
>> --
>> Jonathan
>>
>> On Mon, 12 Apr 2021 at 07:33, Okonkwo Ifeanyichukwu wrote:
>> >
>> > Thanks, Sevilay, Hèctor and Ngadou Yopa for reviewing my project
>> > proposal. I have taken all the suggestions here into consideration and
>> > made some changes to my proposal. Below is the link to the recent
>> > changes I made.
>> >
>> > link to the proposal:
>> >
>> > https://docs.google.com/document/d/1iK_9VTqb5ZHH1bEjl5UAqBP77ijaNKm6p4HIT2JO5qk/edit?usp=sharing
>> >
>> > Okonkwo
>> >
>> > On Mon, Apr 12, 2021 at 9:47 AM Ngadou Yopa wrote:
>> >>
>> >> Hello Okonkwo,
>> >>
>> >> I agree with @hectora...@gmail.com. You should probably consider
>> >> rescoping your project to produce a monodix of good quality.
>> >> One week is definitely not enough to work on transfer rules.
>> >>
>> >> Best,
>> >> Ngadou Yopa
>> >>
>> >> On Sat, 10 Apr 2021 at 14:52, Hèctor Alòs i Font wrote:
>> >>>
>> >>> Hi Okonkwo,
>> >>>
>> >>> My remark is slightly different from Sevilay's. Igbo seems to be a
>> >>> language with quite a complex morphology. Wouldn't it make sense to
>> >>> work just on the morphological analyser (
>> >>> https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Morphological_analyser
>> >>> )? Currently, the apertium-ibo lexc file has some 200 words and no
>> >>> morphotactics, so the analyser would have to be written from scratch
>> >>> (maybe previous work in Apertium on another Niger-Congo language can
>> >>> help a bit).
>> >>> Otherwise, as Sevilay points out, you will not have enough time to
>> >>> work on transfer rules. And transfer between two very distant
>> >>> languages, like Igbo and English, is a major challenge. At least one
>> >>> whole GSoC should be devoted to it.
>> >>>
>> >>> Hèctor
>> >>>
>> >>> On Sat, 10 Apr 2021 at 11:14, Sevilay Bayatlı wrote:
>> >>>>
>> >>>>
>> >>>> Hi Okonkwo,
>> >>>>
>> >>>> How many words will you be able to add to the monodix (and how many
>> >>>> to the bidix?), and what is your WER goal?
>> >>>>
>> >>>> Do you think 1 week is enough to work on transfer rules?
>> >>>>
>> >>>> Another thing: as I understand from your proposal, your main focus is
>> >>>> on a bilingual dictionary. The monodix needs more focus, otherwise you
>> >>>> can't get good results, or you have to work on both simultaneously.
>> >>>>
>> >>>>
>> >>>> Sevilay

Re: [Apertium-stuff] GSOC proposal draft - building a prototype MT system

2021-04-09 Thread Hèctor Alòs i Font

Yes, in my experience updating the dictionaries more or less
simultaneously is the quickest option.
I usually work in a spreadsheet with the words in decreasing order of
frequency, and I write a script that reads it and generates the XML code
to insert into the dictionaries. It's quick and it avoids lots of silly
errors.
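
A minimal sketch of what such a script can look like, in Python. It assumes
the spreadsheet is exported as CSV with hypothetical columns lemma, stem,
paradigm, translation and tags; the sample rows and paradigm names are
illustrative, and this is not the actual script described above. The entry
layout follows the usual lttoolbox dictionary XML.

```python
import csv
import io
from xml.sax.saxutils import escape, quoteattr


def dix_text(text):
    """Escape XML characters and mark spaces with <b/>, as dix files expect."""
    return escape(text).replace(" ", "<b/>")


def monodix_entry(lemma, stem, paradigm):
    """Build one monolingual entry: <e lm=...><i>stem</i><par n=.../></e>."""
    return (f"<e lm={quoteattr(lemma)}><i>{dix_text(stem)}</i>"
            f"<par n={quoteattr(paradigm)}/></e>")


def bidix_entry(left, right, tags):
    """Build one bilingual entry with the same tags on both sides."""
    symbols = "".join(f"<s n={quoteattr(tag)}/>" for tag in tags)
    return (f"<e><p><l>{dix_text(left)}{symbols}</l>"
            f"<r>{dix_text(right)}{symbols}</r></p></e>")


def entries_from_csv(csv_text):
    """Return (monodix lines, bidix lines), one of each per spreadsheet row."""
    monodix, bidix = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        monodix.append(
            monodix_entry(row["lemma"], row["stem"], row["paradigm"]))
        bidix.append(
            bidix_entry(row["lemma"], row["translation"],
                        row["tags"].split(".")))
    return monodix, bidix


# Hypothetical spreadsheet export, most frequent words first.
SAMPLE = """lemma,stem,paradigm,translation,tags
casa,cas,cas/a__n,casa,n.f
gos,gos,gos__n,perro,n.m
"""

if __name__ == "__main__":
    mono, bi = entries_from_csv(SAMPLE)
    print("\n".join(mono))
    print("\n".join(bi))
```

Generating both files from the same rows keeps the monodix and the bidix in
step, which is where hand-typed entries tend to drift apart.
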
Hèctor

On Fri, 9 Apr 2021 at 11:58, Sevilay Bayatlı wrote:

> Hi Anuradha,
> You need to update your proposal based on what Hèctor suggested. Yes, it
> is better to work on both monodix and bidix simultaneously, but for a good
> lexicon you need to take a small corpus, analyse the sentences and add
> words.
>
> Sevilay
>
> On Thu, Apr 8, 2021 at 9:24 AM Anuradha Pandey wrote:
>
>> Thank you for your response, Hèctor. I read the proposal for the
>> Hindi-Bengali translator. There aren't open-source dictionaries for the
>> Bhojpuri language (though there are resources for getting a Bhojpuri
>> corpus), so I was using a hardcopy of a BHO-HIN dictionary for manually
>> adding the pairs. I did some rough calculations, and I shall be able to add
>> at least 8,000 words to the monodix. And, based on my experience with
>> Apertium, I think simultaneously adding words in the bidix makes the work
>> easier, so I think roughly the same number of words in the bidix too. But,
>> I don't think I will be able to achieve a WER below 20% with 8000 words.
>> Should I aim for a WER of nearly 30% then?
>>
>> Since the time for GSoC has been reduced, I am planning to modify my
>> proposal and the inputs from mentors would be extremely helpful.
>>
>> On Wed, 7 Apr 2021 at 20:24, Hèctor Alòs i Font wrote:
>>
>>> Hi, Anuradha.
>>>
>>> Thanks for your proposal draft. First, I would like to tell you that if
>>> Apertium is a rule-based translation system, it is because this paradigm
>>> still makes sense for many languages (indeed, for the vast majority of
>>> them). If Bhojpuri has extensive electronic language resources and,
>>> particularly, bilingual linguistic corpora, then Apertium is probably not
>>> the best approach. But this is probably not the case. If it were, it would
>>> probably already be on Google Translate.
>>>
>>> As for the project, I would advise you to look at Gourab Chakraborty's
>>> proposal for a Hindi-Bengali translator and the comments on it. Most of the
>>> comments apply to your proposal as well. The following message would be
>>> useful to you, for instance:
>>> https://sourceforge.net/p/apertium/mailman/message/37251899/
>>>
>>> Your proposal seems unrealistic to me. 10,000 words in the monodix (and
>>> how many in the bidix?) are not enough for a WER below 20%, I think
>>> (except perhaps for two very closely related languages).
>>>
>>> To evaluate your proposal better, I'd like answers to some basic
>>> questions:
>>>
>>> * What is the current state of the Bhojpuri language and of the
>>> Bhojpuri-Hindi language pair, if any, in Apertium?
>>> * Would you have to write a whole Bhojpuri morphological analyser from
>>> scratch and then add some 10,000 words, manually assigning each of them
>>> to a paradigm? How much time will you need for this?
>>> * Where would you get the bilingual dictionary from? Would you have to
>>> create it yourself? Are there freely available bilingual electronic
>>> dictionaries (e.g. Wiktionary)?
>>> * Would you work on a Bhojpuri-to-Hindi translator or on a
>>> Hindi-to-Bhojpuri one? In either case there will be quite a lot of work
>>> on morphological disambiguation, but for each source language it only
>>> has to be done once. If both Hindi-to-Bhojpuri and Hindi-to-Bengali are
>>> chosen (which is entirely possible), this work can be shared between the
>>> two projects.
>>>
>>> There is nothing wrong with doing all this work by hand, if needed. It
>>> depends on the state of the language resources for the given language. But
>>> it is necessary to know to what extent you will have to do this
>>> time-consuming work.
>>>
>>> Even when we had twice the time, most projects did not manage to create a
>>> working translator for a new language pair. In the current conditions, it
>>> is even more difficult.
>>>
>>> Hèctor
>>>
>>>
>>>
>>>
>>> On Wed, 7 Apr 2021 at 16:28, Anuradha Pandey wrote:
>>>
>>>> Hello everyone,
>>>> I am Anuradha Pandey, a sophomore stud

Re: [Apertium-stuff] GSoC proposal draft: Developing a Morphological Analyzer for Torwali Language

2021-04-07 Thread Hèctor Alòs i Font

Hi Naeem,

Thanks a lot for your very good and interesting draft application. Torwali
is an excellent language for Apertium. You know the challenges it presents
and the existing work on it, and you show that you are committed to the
language and the project. I am not a specialist in lexc-twol, but I see a
few general things that would improve your application:

* The coding challenge is very important. It proves you understand how
Apertium works (not only theoretically) and that you can do the job. So, do
it as well as you can now. Don't leave it until after the application
period.

* Your commitment of 30 hours per week is welcome, but bear in mind that
it is much more than what Google is asking for this year.

* You want to enter 50,000+ words in the morphological analyser. That's a
huge amount. But in your work plan you don't say when you are going to do
it. You should show how many words and which grammatical categories you
would add in each time slot (two weeks in your case). Usually we start
with the closed categories. Once you detail these numbers in your
proposal, we will see how many words you can realistically reach.

* I have no idea how things are for Dardic languages, but assigning words
to categories is usually not trivial in Indo-European languages. Do
existing works already have lists of words assigned to paradigms? For
example: lists of verbs following one model or another. If not, the time
needed for assignment increases. It is necessary to know this in order to
judge the feasibility of introducing 50,000, 30,000 or 20,000 words.

* Are there extensive lists of words available in electronic format, with
their grammatical category, which you could use for your work? They should
be free. If they were copyrighted they could not be (semi-)automatically
uploaded to Apertium.

* It is very likely that, with the very limited time we have this year for
GSoC projects, building a complete morphological analyser from scratch is
a perfectly reasonable goal. Still, before putting so many words into it
(especially if you have to add them manually), I think it would be
reasonable to spend a couple of weeks training a morphological
disambiguator.

Hèctor

On Thu, 8 Apr 2021 at 1:46, Naeemuddin Hadi wrote:

> Hello everyone,
>
> I am Naeem, a student of UET Peshawar. I want to participate in GSoC
> 2021.  I am working to create a morphological analyzer for an endangered
> language of northern Pakistan called Torwali.
> I have prepared a draft proposal and would appreciate feedback before
> final submission. Links related to the coding challenge are included in
> the draft.
>
> link (Draft) :
> https://drive.google.com/file/d/1hnu6gRWVN3LjjxOj0BvimvJ56AIKfe6q/view?usp=sharing
>
>
> Regards,
> Naeem

Re: [Apertium-stuff] GSOC proposal draft - building a prototype MT system

2021-04-07 Thread Hèctor Alòs i Font

Hi, Anuradha.

Thanks for your proposal draft. First, I would like to tell you that if
Apertium is a rule-based translation system, it is because this paradigm
still makes sense for many languages (indeed, for the vast majority of
them). If Bhojpuri has extensive electronic language resources and,
particularly, bilingual linguistic corpora, then Apertium is probably not
the best approach. But this is probably not the case. If it were, it would
probably already be on Google Translate.

As for the project, I would advise you to look at Gourab Chakraborty's
proposal for a Hindi-Bengali translator and the comments on it. Most of the
comments apply to your proposal as well. The following message would be
useful to you, for instance:
https://sourceforge.net/p/apertium/mailman/message/37251899/

Your proposal seems unrealistic to me. 10,000 words in the monodix (and how
many in the bidix?) are not enough for a WER below 20%, I think (except
perhaps for two very closely related languages).
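
For reference, WER (word error rate) is the word-level edit distance between
the translator's output and a reference translation, divided by the number of
words in the reference. A minimal Python sketch of that definition follows;
the example sentences are invented, and this is not the evaluation tooling
used in Apertium.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Invented example: one substituted word and one missing word.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Every unknown or wrongly translated word counts as at least one error, which
is why dictionary size weighs so heavily on the WER a new pair can reach.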

To evaluate your proposal better, I'd like answers to some basic
questions:

* What is the current state of the Bhojpuri language and of the
Bhojpuri-Hindi language pair, if any, in Apertium?
* Would you have to write a whole Bhojpuri morphological analyser from
scratch and then add some 10,000 words, manually assigning each of them
to a paradigm? How much time will you need for this?
* Where would you get the bilingual dictionary from? Would you have to
create it yourself? Are there freely available bilingual electronic
dictionaries (e.g. Wiktionary)?
* Would you work on a Bhojpuri-to-Hindi translator or on a
Hindi-to-Bhojpuri one? In either case there will be quite a lot of work on
morphological disambiguation, but for each source language it only has to
be done once. If both Hindi-to-Bhojpuri and Hindi-to-Bengali are chosen
(which is entirely possible), this work can be shared between the two
projects.

There is nothing wrong with doing all this work by hand, if needed. It
depends on the state of the language resources for the given language. But
it is necessary to know to what extent you will have to do this
time-consuming work.

Even when we had twice the time, most projects did not manage to create a
working translator for a new language pair. In the current conditions, it
is even more difficult.

Hèctor




On Wed, 7 Apr 2021 at 16:28, Anuradha Pandey wrote:

> Hello everyone,
> I am Anuradha Pandey, a sophomore student at BITS Pilani. I am interested
> in participating in GSoC 2021, on the project - "*Develop a prototype MT
> system for a strategic language pair*".
>
> I have prepared a rough draft for the same and I am planning to build
> Bhojpuri(BHO)-Hindi(HIN) MT pair. I am improving my translation system for
> the coding challenge and I will update my work on the GitHub repository
> mentioned in the draft. It would be really helpful if I could get some
> feedback before I make the final submission.
>
> Link to the draft -
>
> https://docs.google.com/document/d/1U19gJ3TMKYkYsp-FRthrvXkCRJUnNYSYKi46XhvZGOE/edit?usp=sharing
>
> Thanks & Regards,
> Anuradha Pandey
> IRC: Anuradha_Pandey

Re: [Apertium-stuff] Review for Apertium Hin-Ben

2021-03-30 Thread Hèctor Alòs i Font

Hi, Gourab.

I don't know if you already got other reviews in the IRC channel. Here are
my five cents:

1) Did you do the coding challenge? This is a must.

2) It would be good to know more about the current state of the hin-ben
pair. Because there isn't any information on this in your proposal, I've
taken a look at the repositories on GitHub. I was surprised that no hin-ben
pair has been created yet in the Apertium repository (although there is
https://github.com/srj31/apertium-ben-hin). The hin monodix has 30,000+
entries and the ben monodix some 8,000. Furthermore, as I imagined, the
morphological disambiguator for Hindi has very few rules (I guess they are
not very necessary for translating to Urdu).

So there is quite a lot of work to do. It will be very hard to create a
translator with a WER below 25% (unless srj31's project already contains
quite a lot of work that can be reused).

3) Are there any free sources that can be used to fill the bidix (e.g.
Wiktionary)? Or do you plan to translate at least 10,000 Hindi words by
hand? (12,000-14,000 words would be much better for getting a WER below
30%.) How many words will you be able to translate per day? This alone
would take most of your time. And, since there are only 8,000 words in the
Bengali monodix, you'll need to add many of them to the Bengali monodix,
which also takes quite a lot of time. Again the same question: will you
need to create these words (and maybe the paradigms) in the monodix
yourself, or will you be able to get many new words (and their association
to Apertium paradigms) from free electronic sources?

4) In fact, your targets seem to be more a wish than something achievable.
I recommend that you draw up a week-by-week calendar, in order to better
understand how much time you'll have to add words and to write transfer
rules, morphological disambiguation rules and lexical selection rules. I
don't know anything about Indo-Iranian languages, but all the Indo-European
languages I know need quite a lot of work on morphological disambiguation
and, despite this, it remains one of the main sources of errors in Apertium
translators.

You can take a look at these work plans:
https://wiki.apertium.org/wiki/Grfro3d/proposal_apertium_cat-srd_and_ita-srd#Workplan

https://wiki.apertium.org/wiki/User:Hectoralos/GSOC_2020_proposal:_French-Arpitan#Workplan
(but take into account that in previous years the number of hours
devoted to a GSoC project was twice as high as this year's)

5) Why do you have to improve the Bengali morphological analyser? Why add
inflections for both Bangladeshi Bengali and Indian Bengali? The project is
already too complex and overloaded to add the possibility of generating two
flavours of Bengali (because it would be a matter of generating Bengali,
not of parsing it for translating into Hindi). I would generate the Bengali
that is currently in the Bengali monodix (the Indian one, I guess).
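
The question in point 3 about words per day can be turned into a rough
estimate before writing the calendar suggested in point 4. A minimal Python
sketch follows; the 10,000-word target comes from the message above, while
the translation speed and the weekly hours are purely hypothetical
placeholders to be replaced with measured values.

```python
import math


def weeks_needed(words, words_per_hour, hours_per_week):
    """Return the whole number of weeks needed to enter `words` entries."""
    if words_per_hour <= 0 or hours_per_week <= 0:
        raise ValueError("speed and weekly hours must be positive")
    return math.ceil(words / (words_per_hour * hours_per_week))


if __name__ == "__main__":
    target_words = 10_000   # bidix target mentioned in point 3
    words_per_hour = 40     # hypothetical speed, measure your own
    hours_per_week = 18     # hypothetical weekly availability
    weeks = weeks_needed(target_words, words_per_hour, hours_per_week)
    print(f"{weeks} weeks for the bidix alone")
```

If the result already exceeds the length of the coding period, the word
target or the scope of the project has to change.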

Best,
Hèctor

On Mon, 29 Mar 2021 at 20:20, Gourab Chakraborty IIIT Dharwad
<19bcs...@iiitdwd.ac.in> wrote:

> Hi all,
> I am planning to create the Apertium Hindi-Bengali language pair as per
> the suggestions I was given by the developers. The GSoC application window
> would begin soon, so I request the mentors to kindly give a review of my
> final proposal, for any last minute changes that are required.
>
> Thanks a lot!
> --
> Gourab Chakraborty
> IRC: gourab337

[Apertium-stuff] Begin of sentence

2021-02-17 Thread Hèctor Alòs i Font

Is there any way to match the beginning of a sentence in lexical selection
or in transfer? In transfer, the full stop of the previous sentence is
usually used, but I want to match even the beginning of the first sentence
of the text.

Hèctor

Re: [Apertium-stuff] Proper noun classification considered harmful

2021-02-03 Thread Hèctor Alòs i Font

On Wed, 3 Feb 2021 at 1:10, Xavi Ivars wrote:

> Hèctor, please correct me if I am wrong.
>
> In Catalan, for example, we have gender annotated for proper nouns,
> because, as Hèctor explained, it's useful in some cases when translating
> to French. So Catalan monolingual generates rich tags for np.
>
> However, when translating to Spanish, that information (from Catalan) is
> not that useful, so we didn't bother adding genders there. And the way we
> managed it was adding RL rules both in spa and cat that consume
> "genderless" nps, regardless of how they are generated.
>
> So I can think that could be an approach: annotate only when output is
> useful, but account for simpler input when generating.
>
>
Well, I would say that what happens when translating between Spanish and
Catalan (as when translating between French and Occitan or Arpitan, or
between Italian and Sardinian) is that the minority language calques the
majority one a lot. So genders are less necessary when translating between
Spanish and Catalan: everything works OK in 99% of cases if we keep the
source gender of proper names, and no article has to be added or removed,
because the construction is the same in Catalan and Spanish.
This is not the case when working e.g. between Catalan and French or
Italian, although they are also very close languages (and in some respects
Catalan is arguably even closer to French than to Spanish). For instance,
something as trivial as "a Citroën" is not a problem for the
Spanish-Catalan pair, but it is for the French-Catalan or Italian-Catalan
pairs (as "Citroën" currently lacks gender in the Catalan monodix,
precisely because it does not matter when translating into Spanish).

Hèctor


> --
> Xavi Ivars
> < http://xavi.ivars.me >
>
> On Tue, 2 Feb 2021 at 22:40, Kevin Brubeck Unhammer wrote:
>
>> Hèctor Alòs i Font wrote:
>>
>> > I am more sceptical about the need to distinguish between toponyms and
>> > hydronyms. In some languages one will have an article and the other will
>> > not, but these are rare cases. On the other hand, we do not distinguish
>> > between countries (or regions) and cities, which in French is quite
>> > important for generating both the article and the preposition preceding
>> > it when translating from Catalan or Spanish: for instance, "New-York"
>> > is the city, but "le New-York" is the state, so we will have "à
>> > New-York" or "au New-York" for "in New-York" (or "à Paris" but "en
>> > France"). The generation of articles may also differ depending on
>> > whether "Barcelona" stands for the city or the (football or whatever)
>> > team, and the gender is often not the same either.
>> > So, are we then going to create more and more subtypes ad nauseam?
>> > Better not!
>> >
>> > In short, we can find special cases in certain pairs that may make us
>> > think that some distinctions are appropriate, but adding them to the
>> > monolingual dictionaries and forcing them to be maintained for all
>> > languages seems doubtful to me.
>>
>> So the city-vs-region distinction is only useful for target (structural)
>> generation, not source analysis/disambiguation/anaphora. I think that
>> can be a good guide to when something should be in monodixen or not.
>>
>> One solution here would be to add it in bidix (with a pardef so you
>> don't need it when going the other way) and strip it in transfer, or
>> even just use a def-list in the transfer files.
>>

Re: [Apertium-stuff] Proper noun classification considered harmful

2021-02-03 Thread Hèctor Alòs i Font

On Wed, 3 Feb 2021 at 0:40, Kevin Brubeck Unhammer wrote:

> Hèctor Alòs i Font wrote:
>
> > I am more sceptical about the need to distinguish between toponyms and
> > hydronyms. In some languages one will have an article and the other will
> > not, but these are rare cases. On the other hand, we do not distinguish
> > between countries (or regions) and cities, which in French is quite
> > important for generating both the article and the preposition preceding
> > it when translating from Catalan or Spanish: for instance, "New-York" is
> > the city, but "le New-York" is the state, so we will have "à New-York"
> > or "au New-York" for "in New-York" (or "à Paris" but "en France"). The
> > generation of articles may also differ depending on whether "Barcelona"
> > stands for the city or the (football or whatever) team, and the gender
> > is often not the same either.
> > So, are we then going to create more and more subtypes ad nauseam?
> > Better not!
> >
> > In short, we can find special cases in certain pairs that may make us
> > think that some distinctions are appropriate, but adding them to the
> > monolingual dictionaries and forcing them to be maintained for all
> > languages seems doubtful to me.
>
> So the city-vs-region distinction is only useful for target (structural)
> generation, not source analysis/disambiguation/anaphora. I think that
> can be a good guide to when something should be in monodixen or not.
>

I am not sure I see your point. Let's take the example of New-York in
French. The city is "New-York", without any article, but the state is "le
New-York". The prepositions used differ in some contexts (which come up
often in Wikipedia texts). So the two behave differently in French. In
principle, it makes sense to differentiate them in the monodix... although
I have preferred not to innovate too much and, as you suggest, I've used
long def-lists in the transfer files.
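
The def-list approach amounts to a lookup: the rule checks whether the place
name is on a hand-maintained list and picks the preposition accordingly. A
rough Python sketch of that logic follows (it is not the transfer XML
itself). The list contents are a tiny illustrative assumption, and the
sketch ignores plural names and vowel-initial masculine names, which behave
differently in French.

```python
# Hand-maintained lists, playing the role of def-lists in a transfer file.
# The contents are illustrative; real lists would be much longer.
FEMININE_REGIONS = {"France", "Catalogne", "Sardaigne"}
MASCULINE_REGIONS = {"Canada", "Portugal", "Japon"}


def locative_phrase(place):
    """Return the French for 'in <place>' based on list membership only."""
    if place in FEMININE_REGIONS:
        return f"en {place}"
    if place in MASCULINE_REGIONS:
        return f"au {place}"
    # Anything not listed is treated as a city.
    return f"à {place}"


if __name__ == "__main__":
    for name in ("Paris", "France", "Canada"):
        print(locative_phrase(name))
```

A lookup by name alone cannot tell apart a name that is both a city and a
state, which is exactly the New-York case: that needs either a tag from the
dictionaries or extra context in the rule.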


>
> One solution here would be to add it in bidix (with a pardef so you
> don't need it when going the other way) and strip it in transfer, or
> even just use a def-list in the transfer files.
>

Re: [Apertium-stuff] Proper noun classification considered harmful

2021-02-02 Thread Hèctor Alòs i Font

On Tue, 2 Feb 2021 at 13:35, Kevin Brubeck Unhammer wrote:

> Flammie A Pirinen wrote:
>
> > Hi all,
> >
> > I've written a handful of apertium-fin-* prototypes and I usually end up
> > spending way too much time with all the useless subclasses of proper
> > nouns we have (cogs, ants, als, tops, orgs, and to top all that,
> > sometimes ms and fs for some extra (mis)gendering). Could we just get
> > rid of those, or does someone have a good use for them? Most of the time
> > it's very random anyways and we aren't really doing NERing or anything.
> > I think if these are used in e.g. cg or whatever we should probably have
> > different way of introducing them that doesn't intervene with
> > analysis-generation stuffs, like we talked passing by in the last
> > apertium zoom meeting? Or is there some smart way to bypass them I
> > haven't thought of (probably)
>
> Genders are useful when anaphora resolving / in transfer, though only on
> person names. There are some place/org names from swe that have genders
> (originally from SALDO) which bled into other scandipairs – I'd be happy
> to remove those since they seem quite useless for us.
>
> The ,  and  tags are used quite a bit in the nob
> disambiguator, but not in transfer.
>
> I tend to underspecify np's in bidix:
>
>  IranIran
>  ThielThiel
>  SarumanSaruman
>  ContrasContras
>
> so just the monodixen need to be synced. If there is an actual
> bidix-relevant difference, e.g. some place name gets translated but not
> if it's a person name, then one can specify the tags for just that
> entry.
>
> The remaining problem is when the analyser gives ^Saruman$ and
> you try to send that into a generator that expects ^Saruman$.
>
> We could perhaps use the Giellatekno solution for that, where dixen have
> RL entries that just contain  (ie., no cog/ant/al), and some
> transfer step cleans off the tags. Should be a fairly simple change, and
> it's tried and tested in giella-pairs. Since lttoolbox is used mostly
> for languages where np pardefs are small, adding the RL's is like max
> 10 extra lines; for languages requiring hfst it's probably a fairly
> simple twol or xfregex rule?
>
>
The question of np is complex, and it certainly needs to be thought
through. The problem is that for some pairs some differences are relevant,
but probably for most they are not.

As Kevin says, gender may be useful for anaphora resolution, but the
truth is that I have not dared to mark Navratilová or Kurnikova as feminine
surnames in the dictionaries of the Romance languages I have worked on.

On the other hand, the difference between given names and surnames is
important, as the former are sometimes translated, while the latter rarely
are (it's more of a transliteration problem, since Russian surnames are
spelt differently in almost every Romance language: it's deadly!)

The difference between people and places is important in the languages I
deal with: prepositions, for example, can have different translations.

I am more sceptical about the need to distinguish between toponyms and
hydronyms. In some languages one will have an article and the other will
not, but these are rare cases. On the other hand, we do not distinguish
between countries (or regions) and cities, which in French is quite
important for generating both the article and the preposition preceding it
when translating from Catalan or Spanish: for instance, "New-York" is the
city, but "le New-York" is the state, so we will have "à New-York" or "au
New-York" for "in New-York" (or "à Paris" but "en France"). The generation
of articles may also differ depending on whether "Barcelona" stands for
the city or the (football or whatever) team, and the gender is often not
the same either.
So, are we then going to create more and more subtypes ad nauseam? Better
not!

In short, we can find special cases in certain pairs that may make us think
that some distinctions are appropriate, but adding them to the monolingual
dictionaries and forcing them to be maintained for all languages seems
doubtful to me. I would remove the distinctions made in some pairs between
np.org and np.al, and between np.hyd and np.top, for example. I agree that
the gender of place names in Romance languages is often a burden and not
clear at all, but this could be solved if we define them as "mf" instead of
"m" or "f".
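
Dropping those distinctions can be pictured as a tag-stripping step on the
stream between analysis and generation. A minimal Python sketch follows,
assuming the subtype tags named in this thread (cog, ant, al, top, org, hyd)
and made-up sample tokens. In a real pair this would be done in the
dictionaries or in a transfer step, not in a Python script.

```python
import re

# Proper-noun subtypes mentioned in this thread.
NP_SUBTYPES = ("cog", "ant", "al", "top", "org", "hyd")

# Matches a subtype tag only when it directly follows <np>.
SUBTYPE_AFTER_NP = re.compile(r"(<np>)<(?:%s)>" % "|".join(NP_SUBTYPES))


def strip_np_subtypes(stream):
    """Remove the subtype tag after <np>, keeping every other tag."""
    return SUBTYPE_AFTER_NP.sub(r"\1", stream)


if __name__ == "__main__":
    # Invented sample tokens in Apertium stream format.
    print(strip_np_subtypes("^Saruman<np><ant><m><sg>$"))
    print(strip_np_subtypes("^Iran<np><top><sg>$"))
```

With the subtype removed, an analyser that emits a subtype and a generator
that does not expect it no longer have to agree on the proper-noun classes.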

Hèctor





Re: [Apertium-stuff] Need help with language/pair releases

2020-12-31 Thread Hèctor Alòs i Font

Hi, Tino.
We are working on a new release of apertium-srd-ita that should be ready in
about ten days. There have been quite a lot of changes in apertium-srd, so
apertium-cat-srd will have to wait until the end of January. Is this
schedule acceptable?
Hèctor

On Thu, 31 Dec 2020 at 18:42, Tino Didriksen wrote:

> Tracking progress in https://github.com/apertium/organisation/issues/23
>
> I need someone to sign off on or fix these. By "sign off on" I simply mean
> that someone can check that the most recent commit of a pair and its
> dependency languages is in a state that works as well as or better than
> the previous formal release.
>
> - apertium-arg needs new release
> - apertium-crh needs new release
> - apertium-cym needs new release
> - apertium-kaz needs new release
> - apertium-srd needs new release
> - apertium-tat needs new release
> - apertium-tur needs new release
> - apertium-ukr needs new release
> - apertium-eo-fr see https://github.com/apertium/apertium-eo-fr/issues/4
> - apertium-es-gl needs new release, and to depend on a specific release of
> apertium-spa
> - apertium-arg-cat needs new release, and to depend on a specific release
> of apertium-arg and apertium-cat
> - apertium-cat-srd needs new release, and to depend on a specific release
> of apertium-cat and apertium-srd
> - apertium-crh-tur needs new release, and to depend on a specific release
> of apertium-crh and apertium-tur
> - apertium-cym-eng needs new release, and to depend on a specific release
> of apertium-cym and apertium-eng
> - apertium-eng-spa needs new release
> - apertium-kaz-tat needs new release, and to depend on a specific release
> of apertium-kaz and apertium-tat
>
> At the bottom of the issue there's also languages and pairs I haven't
> checked yet.
>
> -- Tino Didriksen
>

Re: [Apertium-stuff] The French-Arpitan translator is ready to be packed

2020-12-15 Thread Hèctor Alòs i Font

Thanks a ton, Tino.
Hèctor

Message from Tino Didriksen on Tue, 15 Dec 2020 at 19:38:

> apertium-fra, apertium-frp, and apertium-fra-frp are now finally also
> tagged on Github and in the release repo. It's been on apertium.org for a
> while, but was otherwise held up by the core tool packages.
>
> And for Debian:
> - https://salsa.debian.org/science-team/apertium-fra-frp v1.0.0, bundling
> apertium-fra v1.10.0 and apertium-frp v1.0.0. Requires latest cg3,
> lttoolbox, apertium, -lex-tools, and -separable.
>
> -- Tino Didriksen
>
>
> On Tue, 1 Sept 2020 at 16:29, Hèctor Alòs i Font 
> wrote:
>
>> As a result of this year's GSoC, I've prepared a French-Arpitan
>> bidirectional translator. In principle, it is ready to be packed. It uses
>> apertium-separable and this summer's improvements of apertium-lex-tools
>> done by Daniel Swanson.
>>
>> A fairly detailed explanation of the pair can be found here:
>> https://wiki.apertium.org/wiki/Hectoralos/GSOC_2020_rapport_final (in
>> French). The WER from French to Arpitan is 5.7% and from Arpitan to French
>> is 15.5% (these final results are consistent with the first results I got
>> in a first test at the end of July). This unexpectedly low WER on the
>> French-to-Arpitan side is the result of the great involvement of two
>> language specialists, Dominique Stich and Alan Favro, with whom I've been
>> continuously in touch throughout the whole project. I would also like to
>> thank Tino Didriksen, Daniel Swanson, Marc Riera and my supervisors Xavi
>> Ivars and Gianfranco Fronteddu for their support during the development.
>>
>> Hèctor
>>

Re: [Apertium-stuff] From "German" to "English" not possible?

2020-12-13 Thread Hèctor Alòs i Font

beta.apertium.org is translating web pages correctly (e.g.
https://beta.apertium.org/index.frp.html?dir=ita-srd=https%3A%2F%2Fwww.repubblica.it%2F#webpageTranslation
). I guess the problem is that the English-German pair in Apertium is
not very developed.
Hèctor

Message from Benedikt Freisen on Sun, 13 Dec 2020 at 17:51:

> I'm not quite sure whether beta.apertium.org even supports webpage
> translation. The translator itself works. Well, sort of.
>
> On 13.12.20 at 14:06, Evelyn Pereira Souza via Apertium-stuff wrote:
> > On 13.12.20 11:08, Benedikt Freisen wrote:
> >> There is no released language pair for translation between English and
> >> German.
> >> There is an unreleased pair, though.
> >> It is available on beta.apertium.org.
> >>
> >> Greetings,
> >> Benedikt
> >
> > Hi Benedikt
> >
> > thank you
> >
> > The language is listed, but I get the error "Translation not yet
> > available!". I waited a few minutes and still the same state.
> >
> > the site I wish to translate: https://github.com/hoppscotch/hoppscotch
> >
> > Is this known?
> >
> > best regards
> > Evelyn
> >
> > (re-send the mail without attachment, because "Message body is too big:
> > 120550 bytes with a limit of 110 KB")
> >
> >

[Apertium-stuff] An easy tool to report bad translations and propose alternatives

2020-12-05 Thread Hèctor Alòs i Font

A Sardinian collaborator commented to me: "Wouldn't it be possible that
every time there are more possible translations these come out in a little
window where the user chooses the right solution, as in spell checkers"?

This could be an idea for a GSoC tool project. Nevertheless, I don't think
that, as he puts it, this is the best option because, in general, we have
few multiple options in the bilingual dictionaries. Probably, another type
of interface would be more appropriate. Is there anything done in the GSoC
projects that could be used?

With him, we use a simple spreadsheet in a Google Documents-like system. He
enters a word or phrase, the current translation, the suitable translation
and the context (sentence). This is not at all intuitive, nor easy, for a
conventional user, but it is very useful. We have already dealt with
several hundred errors in the Italian-Sardinian translator.

Hèctor

Re: [Apertium-stuff] A talk evaluating Apertium

2020-10-19 Thread Hèctor Alòs i Font

On Mon, 19 Oct 2020 at 19:58, Xavi Ivars wrote:

> Well, that's only "part" of the corpus... and for the Europarl, that part
> of corpus was not left "as is" after Apertium, but also postedited.
>

Wow! Did you post-edit the whole Europarl corpus?! No matter whether you
used Apertium or not, it's clear that you did tons of work. If it is
explained somewhere how Softcatalà did the work, and with how many resources
(time, volunteers, money), please let us know. It has to be an excellent
test case to show whether a (really) under-resourced language can or cannot
gather what is needed for neural translation.
And again, congrats!
Hèctor


> The talk was specifically about eng-cat, and in that case, for the NMT
> model, Apertium was not involved.
> --
> < Xavi Ivars >
> < http://xavi.ivars.me >

Re: [Apertium-stuff] A talk evaluating Apertium

2020-10-17 Thread Hèctor Alòs i Font

Xavi, I am impressed that you at Softcatalà could get enough bilingual
texts to create an English-Catalan neural translator. Congratulations on
the results! I am curious to know how big the corpus you collected is, as
well as which sources you used to ensure the quality of the translations.
I'd add that it would probably not be possible to collect such a corpus for
Valencian Catalan, so I guess that with this neural translator we face a
typical problem of lesser-used languages/varieties. If it is ever
considered necessary to generate Valencian, this will have to be done by
translating into "reference" Catalan and then automatically adapting the
result. In fact, the same happens for the many flavours of Catalan we
currently have in Apertium, both Valencian and "Catalonian".
By the way, is Softcatalà trying to create a neural translator for the
Spanish-Catalan pair?
Hèctor

Message from Xavi Ivars on Sun, 18 Oct 2020 at 0:04:

> I actually think apertium eng-cat may be one of the English-Romance pairs
> that has received the most care. Marc has worked quite a lot on it since he
> did it during a GSoC, and has continued working on it, adding newly
> developed modules (perceptron tagger, separable, lexical selection,
> anaphora resolution,...)
>
> I agree with Mikel that the evaluation is well made. And after having
> tried the neural a few times, it will probably be hard to get there, in
> terms of fluency, with Apertium. Translations with Apertium are, usually,
> more "robotic".
>
> --
> Xavi Ivars
> < http://xavi.ivars.me >
>
> On Fri, 16 Oct 2020 at 19:35, Tanmai Khanna wrote:
>
>> Not to get too defensive haha, but the talk makes certain statements,
>> such as "Neural machine translation is more fluent and adequate compared to
>> RBMT" (not verbatim), and later on a comparison of post-editability, but
>> doesn't really comment on the status of these tools, i.e. how much data
>> the NMT systems were trained on, and how much the English-Catalan system
>> of Apertium has been worked on (which we know is not a lot), so I was just
>> surprised to see general statements against RBMT. I'm not saying this as a
>> member of Apertium, but generally evaluations like this should be done
>> carefully and all the factors should be considered before comparing
>> multiple systems; without that, it's easy to arrive at false conclusions.
>>
>> My two cents :)
>> *तन्मय खन्ना *
>> *Tanmai Khanna*
>>
>>
>> On Thu, Oct 15, 2020 at 9:27 PM Mikel L. Forcada  wrote:
>>
>>> Dear Apertiumers:
>>>
>>> here's a 20-minute talk from Vicent Briva where he evaluates Apertium
>>> English–Catalan in comparison with the SoftCatalà neural engine and
>>> Google Translate.
>>>
>>> I think the evaluation is quite well made.
>>>
>>> https://www.youtube.com/watch?v=IiRVhAYpecw
>>>
>>> We do not fare too well but, hey, we know this language pair needs love.
>>>
>>> All the best,
>>>
>>>
>>> Mikel
>>>
>>>
>>>

Re: [Apertium-stuff] We now have markup handling and reordering in Apertium!

2020-09-04 Thread Hèctor Alòs i Font

Message from Tanmai Khanna on Fri, 4 Sep 2020 at 9:22:

> Hèctor,
> Yes, the new improvements aren't backwards compatible but that's because
> they're better than the system we had earlier. Here's the changes:
>
>> So, you are saying that the new stuff is not backwards compatible, aren't
>> you? There aren't any <b/> in the rule, but <b pos="1"/>, which is not
>> the same. Until now, <b/> means explicitly putting a blank, while <b
>> pos="1,2..."/> means copying to the output whatever is in the input in a
>> given point.
>>
>
> <b pos="1"/> and <b/> now do exactly the same thing. You don't need
> to replace all of the former with the latter but even if you do or don't it
> won't change anything. Until now it meant what you said but now it means
> that if you see a <b/> or a <b pos="1"/> then print one blank from
> the blank queue in the output.
>
>> Superblanks most of the time are blanks, but, as you now probably know
>> better than anyone else, they can be lots of things; they can even contain
>> no blanks at all. Even in some cases, like in Romance-language enclitics,
>> we know there shouldn't be any blank at all before them, but we had to
>> add <b pos="1"/> for not losing information on italics, bold letters,
>> etc.
>>
>
> You're right, except now we have a completely different system to deal
> with italics, bold letters, and all markup, i.e. wordbound blanks, which
> aren't considered blanks. Now that there is no information to lose, we
> didn't want to burden the people who write transfer rules to explicitly
> define positions of blanks. In cases where you don't want a space in the
> output, you just don't put a <b/> in the output rule.
>
>
>> I'm not really ready to change all <b pos="..."/> in the hundreds of
>> rules I've been writing in several language pairs. Specifically for
>> apertium-fra-frp, I hope it will be possible to publish it before the new
>> version of the Apertium core you are preparing, so they are needed right
>> now.
>>
>
> You won't have to change all of them. Most of them will work as it is. The
> new system prints blanks in the same order as they were input, so it won't
> harm most of the rules. The *only thing* you'll have to change is rules
> where you don't want a space in the output between LUs: you remove the
> <b pos="1"/> from those rules. This is because now, an empty blank isn't
> considered a blank anymore. This was because we want the users to have
> control about whether they want a blank or not between their output LUs,
> regardless of the input blanks. If we consider an empty blank, your problem
> will be solved, but other problems will come up, where empty blanks will
> appear in the output regardless of <b/>s in the output.
>
> So to conclude, the only thing you need to remove is the <b pos="1"/> from
> rules where you know you don't want a space in the output, like num_n, and
> maybe some enclitics. Apart from that, everything will work as it is. To
> improve the system, at some point we'll have to add a change that isn't
> strictly backwards compatible, and several people agree that after
> wordbound blanks, we should stop handling blank positions in transfer rules.
>

The problem is that in 99% of the cases I do want a blank in num_n, that is,
between the numeral and the noun. In most cases we have "two cows",
"3 dogs", etc. In Romance languages, the rule is needed mostly for gender
agreement. The problem is that sometimes, as we have seen, we get something
else. So the question is not whether I want a blank there or not: I want
whatever was there. Let me try to formulate it another way. If I want to
preserve what was written between two words, I shouldn't write
<b pos="1"/>, but if I want to add a blank, I have to add <b/>. Am I
right? If this is correct, it comes down to removing all <b pos="1"/>. It
seems it would be easier if they were simply not taken into account, thus
avoiding any change in the language pairs. Am I missing something?

Hèctor


>
> If this isn't acceptable, we can discuss other possible solutions :)
>
> *तन्मय खन्ना *
> *Tanmai Khanna*
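
For readers following the thread, the blank handling discussed above
(blanks queued in input order, one consumed for each blank slot in the rule
output, empty blanks not queued, leftovers flushed after the rule output)
can be sketched roughly as below. This is only an illustration in Python,
not the apertium-transfer code: the names apply_rule_output and BLANK_SLOT
are invented here, and falling back to a plain space when the queue is
empty is an assumption based on the proposal quoted elsewhere in the thread.

```python
from collections import deque

# Hypothetical marker standing for a blank element in the rule output.
BLANK_SLOT = object()


def apply_rule_output(input_blanks, output_items):
    """Fill the blank slots of a rule output from a queue of input blanks."""
    # Empty blanks are not queued (assumption taken from the thread).
    queue = deque(blank for blank in input_blanks if blank != "")
    pieces = []
    for item in output_items:
        if item is BLANK_SLOT:
            # Use the next stored blank, or a plain space if none is left.
            pieces.append(queue.popleft() if queue else " ")
        else:
            pieces.append(item)
    # Blanks the rule did not consume are flushed after the rule output.
    pieces.extend(queue)
    return "".join(pieces)


if __name__ == "__main__":
    # "two cows": the input blank is reused in the single slot.
    print(apply_rule_output([" "], ["two", BLANK_SLOT, "cows"]))
    # "052F" split into two units with an empty blank between them:
    # a rule that keeps the slot now yields "052 F".
    print(apply_rule_output([""], ["052", BLANK_SLOT, "F"]))
    # The same input with the slot removed from the rule yields "052F".
    print(apply_rule_output([""], ["052", "F"]))
```

Under these assumptions, a single num_n rule cannot produce both "two cows"
and "052F": keeping the slot adds a space to the second, and removing it
moves the space of the first to after the rule output. That is the
trade-off being argued about above.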

Re: [Apertium-stuff] We now have markup handling and reordering in Apertium!

2020-09-03 Thread Hèctor Alòs i Font

Message from Tanmai Khanna on Thu, 3 Sep 2020 at 23:10:

> Hèctor,
> The extra blank there is because there's a blank in your rule output. See:
>
> $ echo "^052/052$^F/F$" | apertium-transfer
> -z -b 'apertium-fra-frp.frp-fra.t1x' 'frp-fra.t1x.bin'
>
> ^num_n{^052$ ^F$}$
>
>
> The rule for num_n has a <b/> in the rule output and hence there's a
> space. The reason earlier there wasn't space was because an empty string
> was considered a blank. Now, if you don't want a space between the LUs in
> the rule output, you just don't put a <b/>. So if you remove the <b/> from
> the num_n rule it will start working properly. Earlier you used to add a
> <b/> every time the rule had multiple LUs in the output but now *you only
> add a <b/> if you want a space/blank between the output words.*
>
>
> Try removing the <b/> and it should work.
>


So, you are saying that the new stuff is not backwards compatible, aren't
you? There aren't any <b/> in the rule, but <b pos="1"/>, which is not
the same. Until now, <b/> has meant explicitly putting a blank, while
<b pos="1,2..."/> has meant copying to the output whatever is in the input
at a given point. Superblanks are blanks most of the time but, as you now
probably know better than anyone else, they can be lots of things; they
can even contain no blanks at all. In some cases, like Romance-language
enclitics, we even know there shouldn't be any blank at all before them,
but we had to add <b pos="1"/> so as not to lose information on italics,
bold letters, etc.

I'm not really ready to change all <b pos="..."/> in the hundreds of
rules I've been writing in several language pairs. Specifically for
apertium-fra-frp, I hope it will be possible to publish it before the new
version of the Apertium core you are preparing, so they are needed right
now.

Hèctor


>
> As for the discussion about I<sup>ér</sup> or 5<sup>e</sup>, we all agreed
> that we don't want them in the dictionaries and hence you can analyse them
> as individual LUs and then using apertium-separable you can combine them
> into one LU. Finally, the space between I and ér shouldn't appear in the
> rule output and it is because of an issue that's still being fixed. But
> it'll be fine soon :)
>
>
>
> *तन्मय खन्ना *
> *Tanmai Khanna*
>
>
> On Thu, Sep 3, 2020 at 11:46 PM Hèctor Alòs i Font 
> wrote:
>
>> Hi Tanmai,
>>
>> Yes, hyphens and quotes (") seem to be solved. But the system keeps adding
>> blanks where there were none. For instance, we now get strange Unicode
>> code points:
>>
>> 05076. Table des caractères Unicode U+0500 à U+052F.
>> < 05076. Tâbla des caractèros Unicode *U+0500 a *U+052F.
>> ---
>> > 05076. Tâbla des caractèros Unicode *U+0500 a *U+052 F.
>>
>> The same goes for names of standards (e.g. 802.3j), road names, car (Fiat
>> 621RN) or plane (EA-18G Growler) models, etc.
>>
>> On the ... I wouldn't say that it is very beautiful. It could be
>> misleading if there is just one character, as often happens, like in
>> 5<sup>e</sup>. In any case, what interests me most is how to deal with
>> these things in the dictionaries. That's not a problem of the new blank
>> treatment or of Transfuse. That's a problem we already had, but I had never
>> thought about it. I wouldn't like to have I<sup>ér</sup> or 5<sup>e</sup>
>> in the dictionaries. It may cause problems, among other things because ér
>> and e can be words of their own, so we would get a wrong morphological
>> analysis.
>>
>> Hèctor
>>
>>
>>
>>
>> Message from Tanmai Khanna on Thu, 3 Sep 2020 at 18:57:
>>
>>> Hèctor can you check the page on beta now? The hyphen and the
>>> superscript issues are solved. Of course, there's now a space between I and
>>> ér. If that's a big problem we can discuss other solutions.
>>>
>>> *तन्मय खन्ना *
>>> *Tanmai Khanna*
>>>
>>>
>>> On Thu, Sep 3, 2020 at 8:09 PM Tino Didriksen 
>>> wrote:
>>>
>>>> I have adjusted Transfuse with how spaces are treated for Apertium, and
>>>> implemented adding temporary spaces around <sup> and <sub>. Changes are
>>>> deployed on beta.
>>>>
>>>> I repeat my plea that all symbols should have an analysis. It breaks
>>>> markup that things like - and : are not tokens.
>>>>
>>>> -- Tino Didriksen
>>>>
>>>>
>>>> On Wed, 2 Sep 2020 at 13:23, Tino Didriksen 
>>>> wrote:
>>>>
>>>>> That's not something the pipe ever sees - you can't fix it on your
>>>>> end. It's something I have to adjust in Transfuse.
>>>>>
>>>>> https://github.com/TinoDidriksen/Transfuse/blob/master/src/dom.cpp#L604
>>>>> and L629 e

Re: [Apertium-stuff] We now have markup handling and reordering in Apertium!

2020-09-03 Thread Hèctor Alòs i Font

Hi Tanmai,

Yes, hyphens and quotes (") seem to be solved. But the system keeps adding
blanks where there were none. For instance, we now get strange Unicode
code points:

05076. Table des caractères Unicode U+0500 à U+052F.
< 05076. Tâbla des caractèros Unicode *U+0500 a *U+052F.
---
> 05076. Tâbla des caractèros Unicode *U+0500 a *U+052 F.

The same goes for names of standards (e.g. 802.3j), road names, car (Fiat
621RN) or plane (EA-18G Growler) models, etc.

On the ... I wouldn't say that it is very beautiful. It could be
misleading if there is just one character, as often happens, like in
5<sup>e</sup>. In any case, what interests me most is how to deal with these
things in the dictionaries. That's not a problem of the new blank treatment
or of Transfuse. That's a problem we already had, but I had never thought
about it. I wouldn't like to have I<sup>ér</sup> or 5<sup>e</sup> in the
dictionaries. It may cause problems, among other things because ér and e can
be words of their own, so we would get a wrong morphological analysis.

Hèctor




Message from Tanmai Khanna on Thu, 3 Sep 2020 at 18:57:

> Hèctor can you check the page on beta now? The hyphen and the superscript
> issues are solved. Of course, there's now a space between I and ér. If
> that's a big problem we can discuss other solutions.
>
> *तन्मय खन्ना *
> *Tanmai Khanna*
>
>
> On Thu, Sep 3, 2020 at 8:09 PM Tino Didriksen 
> wrote:
>
>> I have adjusted Transfuse with how spaces are treated for Apertium, and
>> implemented adding temporary spaces around <sup> and <sub>. Changes are
>> deployed on beta.
>>
>> I repeat my plea that all symbols should have an analysis. It breaks
>> markup that things like - and : are not tokens.
>>
>> -- Tino Didriksen
>>
>>
>> On Wed, 2 Sep 2020 at 13:23, Tino Didriksen 
>> wrote:
>>
>>> That's not something the pipe ever sees - you can't fix it on your end.
>>> It's something I have to adjust in Transfuse.
>>>
>>> https://github.com/TinoDidriksen/Transfuse/blob/master/src/dom.cpp#L604
>>> and L629 expands inline tags to encompass surrounding plain text, because
>>> it is unfortunately common for formatting to be partially on a word while
>>> you really want the whole word translated as a unit.
>>>
>>> However, for HTML I should add spaces around <sup> and <sub> so that
>>> they can't gobble up their surroundings. Tracked as
>>> https://github.com/TinoDidriksen/Transfuse/issues/7
>>>
>>> -- Tino Didriksen
>>>
>>>
>>> On Wed, 2 Sep 2020 at 12:58, Hèctor Alòs i Font 
>>> wrote:
>>>
>>>> I'm taking a look on how this list of names on Wikipedia:
>>>> https://frp.wikipedia.org/wiki/Lista_des_comtos_et_ducs_de_Savou%C3%A8
>>>> and how it is translated in beta.apertium:
>>>> https://beta.apertium.org/index.fra.html?dir=frp-fra=https%3A%2F%2Ffrp.wikipedia.org%2Fwiki%2FLista_des_comtos_et_ducs_de_Savou%25C3%25A8#webpageTranslation
>>>>
>>>> There are still quite a few problems with HTML tags: for instance, the
>>>> whole Iér becomes a superscript, and there are also problems with
>>>> italics. The space after the hyphen is an already known problem.
>>>>
>>>> By the way, I wonder whether it is possible to match I<sup>ér</sup> in
>>>> our dictionaries. I have Iér in the dictionary, but when the ending ér
>>>> is written as a superscript, as is usually done in texts, it is not
>>>> matched. Should I add I<sup>ér</sup> to the dictionary?
>>>>
>>>> Hèctor
>>>>

Re: [Apertium-stuff] We now have markup handling and reordering in Apertium!

2020-09-02 Thread Hèctor Alòs i Font

I'm looking at this list of names on Wikipedia:
https://frp.wikipedia.org/wiki/Lista_des_comtos_et_ducs_de_Savou%C3%A8
and how it is translated in beta.apertium:
https://beta.apertium.org/index.fra.html?dir=frp-fra=https%3A%2F%2Ffrp.wikipedia.org%2Fwiki%2FLista_des_comtos_et_ducs_de_Savou%25C3%25A8#webpageTranslation

There are still quite a few problems with HTML tags: for instance, the
whole Iér becomes a superscript, and there are also problems with italics.
The space after the hyphen is an already known problem.

By the way, I wonder whether it is possible to match I<sup>ér</sup> in our
dictionaries. I have Iér in the dictionary, but when the ending ér is
written as a superscript, as is usually done in texts, it is not matched.
Should I add I<sup>ér</sup> to the dictionary?

Hèctor

Message from Tanmai Khanna on Thu, 27 Aug 2020 at 22:45:

> Unhammer I think I've implemented this in:
> https://github.com/apertium/apertium/pull/102 . If it looks good I can
> implement in interchunk and postchunk as well.
>
> The blanks are stored as a queue and output in available <b/>s in the rule
> output. If any are remaining they're output after the rule output.
>
> *तन्मय खन्ना *
> *Tanmai Khanna*
>
>
> On Thu, Aug 27, 2020 at 3:05 PM Kevin Brubeck Unhammer 
> wrote:
>
>> Tanmai Khanna wrote:
>>
>> > So what I'll try to do, is after the blanks are collected, lets say X is
>> > the number of source LUs in the pattern and Y is the number of output
>> LUs.
>> > If X = Y then we can keep them in the same place, if X < Y, then we can
>> > keep them in the first X gaps the rest can be spaces or whatever the
>> user
>> > denotes. If X > Y, then we can print the first Y blanks and then flush
>> the
>> > remaining. After this the  option will become useless. Does that
>> > sound good?
>>
>> By "gaps" do you mean where the rule is outputting a ? So if input
>> is "ab c" and a rule matching that has two 's in its , the
>>  gets output on the first  and then on the second  we get
>> a regular space. If the rule has three 's, the third one is also
>> a regular space. If the rule has no 's, the  gets output after
>> the rule output. That would be nice (though I could also live with the
>>  always ending up after the rule as long as I never have to think
>> about pos="…")

[Apertium-stuff] The French-Arpitan translator is ready to be packed

2020-09-01 Thread Hèctor Alòs i Font

As a result of this year's GSoC, I've prepared a French-Arpitan
bidirectional translator. In principle, it is ready to be packed. It uses
apertium-separable and this summer's improvements of apertium-lex-tools
done by Daniel Swanson.

A fairly detailed explanation of the pair can be found here:
https://wiki.apertium.org/wiki/Hectoralos/GSOC_2020_rapport_final (in
French). The WER from French to Arpitan is 5.7% and from Arpitan to French
is 15.5% (these final results are consistent with the first results I got
in a first test at the end of July). This unexpectedly low WER on the
French-to-Arpitan side is the result of the great involvement of two
language specialists, Dominique Stich and Alan Favro, with whom I've been
continuously in touch throughout the whole project. I would also like to
thank Tino Didriksen, Daniel Swanson, Marc Riera and my supervisors Xavi
Ivars and Gianfranco Fronteddu for their support during the development.

Hèctor
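
For readers unfamiliar with the metric, the WER figures quoted above follow
the usual definition: the word-level edit distance between the machine
output and a reference (typically a post-edited translation), divided by
the number of words in the reference. A minimal sketch in Python, assuming
whitespace tokenisation; the function name and the example sentences are
made up here and are not taken from the evaluation described in the report,
which may have used a different tool and tokenisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Invented sentences: one substitution and one missing word
    # against a six-word reference.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sits on mat"
    print(f"WER = {word_error_rate(reference, hypothesis):.1%}")
```

With this definition, a WER of 5.7% means that roughly one word in
eighteen had to be changed, inserted or deleted to reach the reference.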

Re: [Apertium-stuff] Fixing Phonological Processes

2020-08-31 Thread Hèctor Alòs i Font

I'm glad to see that someone is working on an endangered Austronesian
language. Who is he/she? I guess lexd/twol is used for Wamesa mostly
because of its phonotactics, isn't it? Morphology seems not to be as big an
issue as in the other three languages.
Hèctor

Message from Jonathan Washington on Sat, 29 Aug 2020 at 17:27:

> Hi Zanga,
>
> Given the highly agglutinative nature of Yao morphology, using dix to
> model it is probably not a great option.  Also, as you and Hèctor have
> concluded, the morphophonology will be much easier to model using twol.
>
> Given the extent to which the morphology involves prefixes, lexc (what we
> traditionally use with twol) is probably also a poor choice for modeling
> the morphology.  However, lexd was designed as a replacement for lexc for
> languages like Yao (and works well with twol).  I think this is the route
> you should take.
>
> Documentation is available here:
>
> https://github.com/apertium/lexd/blob/master/Usage.md
>
> Some languages in Apertium whose morphologies are already implemented in
> lexd (none are entirely complete yet, but some are pretty far along):
>
> Swahili: https://github.com/apertium/apertium-swa
> Lingala: https://github.com/apertium/apertium-lin
> Nivkh: https://github.com/apertium/apertium-niv
> Wamesa: https://github.com/apertium/apertium-wad
>
> I probably forgot a few, but these should provide good models (and two are
> related to Yao).  There are also a couple other languages being developed
> using lexd that aren't public (yet).
>
> And of course you can message this list if you have trouble, or ask in
> real time in the IRC channel.
>
> --
> Jonathan
>
> On Sat, Aug 29, 2020, 02:30 Zanga Chimombo  wrote:
>
>> Yes. I think I should be using twol
>>
>> On Fri, Aug 28, 2020 at 3:56 PM Hèctor Alòs i Font 
>> wrote:
>> >
>> > I don't think you have to do anything with the modes or the compilation
>> > file. The problem is in the post-yao.dix file.
>> > If you add <a/>, it works:
>> >
>> > 
>> >   
>> > nk
>> > ng
>> >   
>> >   
>> > 
>> >
>> > $ echo "~nka" | lt-proc -p yao.autopgen.bin
>> > nga
>> > $ echo "~nkb" | lt-proc -p yao.autopgen.bin
>> > nkb
>> >
>> > I don't know why without <a/> there is no match, but in any case you
>> > need to add <a/> to the relevant places (words, affixes, etc.) where you
>> > want to trigger this rule. If you want nk + vowel always to become ng,
>> > you should do this in twol, not here.
>> >
>> > Hèctor
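
The effect of the rule in the two lt-proc calls above can be mimicked in a
few lines of Python. This is only a sketch of the intended behaviour, not
of how lt-proc works internally: the function name is invented, "~" is the
activation mark as it appears in the echo commands, and the vowel set is an
assumption, since the thread does not list the vowels used in the
dictionary.

```python
import re

# Assumption: the thread does not say which vowels the rule covers.
VOWELS = "aeiou"

# "~" marks the places where post-generation may apply.
_NK_BEFORE_VOWEL = re.compile(rf"~nk(?=[{VOWELS}])")


def postgenerate(text: str) -> str:
    """Rewrite marked "nk" + vowel as "ng" + vowel, then drop leftover marks."""
    text = _NK_BEFORE_VOWEL.sub("ng", text)
    return text.replace("~", "")


if __name__ == "__main__":
    print(postgenerate("~nka"))  # nga
    print(postgenerate("~nkb"))  # nkb
    print(postgenerate("nka"))   # nka (no mark, so the rule is not triggered)
```

The last call shows why the mark has to be added to every entry that should
trigger the rule; an unconditional nk + vowel change belongs in twol.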
>> >
>> > Message from Zanga Chimombo on Fri, 28 Aug 2020 at 15:41:
>> >>
>> >> I am still not getting anywhere and both modes.xml and the Makefile
>> >> seem ok. My code is here:
>> >> https://gitlab.com/zangaphee/CiBantu/-/tree/master/twoc/apertium-yao
>> >>
>> >> On Fri, Aug 28, 2020 at 7:36 AM Hèctor Alòs i Font <
>> hectora...@gmail.com> wrote:
>> >> >
>> >> > The relevant files are modes.xml and Makefile.am. I recommend taking
>> >> > a look at them in e.g. apertium-fra and apertium-fra-cat (or any other
>> >> > released pair using post-generation). In the first one you define the
>> >> > pipeline, so copy and adapt the call to autopgen at the end. In the
>> >> > second one you have the actual compilation of the programme.
>> >> >
>> >> > Message from Zanga Chimombo on Fri, 28 Aug 2020 at 7:52:
>> >> >>
>> >> >> Hi again, I actually have:
>> >> >>
>> >> >> 
>> >> >>   
>> >> >> nk
>> >> >> ng
>> >> >>   
>> >> >>   
>> >> >> 
>> >> >>
>> >> >> But it doesn't seem to get executed. Is there a missing flag/ switch
>> >> >> that I was supposed to initialise/ build with? I am not seeing
>> >> >> anything relating to building autopgen in the modes.xml file in the
>> >> >> monolingual directory...?
>> >> >>
>> >> >> On Thu, Aug 27, 2020 at 2:57 PM Hèctor Alòs i Font <
>> hectora...@gmail.com> wrote:
>> >> >> >
>> >> >> > Yes, it is in the monodix. It is just a mark put on the right
>> side, e.g.
>> >> >> >
>> >> >> >   que
>> >> >> >   que   que> n="itg"/&g

Re: [Apertium-stuff] Update about superblanks in transfer

2020-08-30 Thread Hèctor Alòs i Font

Message from Tanmai Khanna on Sun, 30 Aug 2020 at 12:06:

> Hi Hèctor,
> I'm dealing with the issues I see one by one.
> 1. I was flushing the remaining blanks after processOut because I thought
> usually we only have one <out>..</out> block in the rule, but in some of
> your rules there's multiple, so in the latest commit to apertium/apertium,
> I made them flush after the rule is finished outputting entirely. This
> solves some of the issues such as:
>
> $ echo "au lycée Louis-le-Grand" | apertium -d .. fra-frp
>
> u licê Louis-lo-Grant.
>
>
>
It's too difficult to have a single <out> when dealing with complex
structures. For instance, in French there is "not + verb + secondary-not",
but in Arpitan I have "verb + not". Furthermore, the verb can be in a past
tense in the source language but need "aux + participle" in the target
language (and I have to deal with which of the auxiliaries to use). Moreover,
the verb can be pronominal in the target language, but not in the source.
So I use macros that deal with each of these issues and add or remove
stuff. The result is a kind of multi-step output (and I'm not the only one
who does it).


> 2. The spaces between numbers in your output are probably coming because
> you have <b/> in the rules. If you remove those, the spaces will go away.
>

I can't remove the <b/> in the rules. They are added when a new word is added,
so I must add a blank too, at its beginning or its end.


>
> I'm still evaluating some other issues.
>
>
> *तन्मय खन्ना *
> *Tanmai Khanna*
>
>
> On Sun, Aug 30, 2020 at 1:21 PM Hèctor Alòs i Font 
> wrote:
>
>>
>>
>> Message from Tanmai Khanna on Sun, 30 Aug 2020 at 9:49:
>>
>>> My guess is, the transfer rule for Franco-Japanese has a two word input,
>>> so the stored blank is "-". Now the output has 3 words "una
>>> Franco-Japonêsa", since the blanks are printed in order, they're printed in
>>> the first available <b/> spot in the output rules.
>>>
>>
>> Yes, that is. "Franco" is a prefix and it is analysed as such. I have
>> some tens of prefixes for avoiding having hundreds of words in the
>> dictionaries and, more important, to be able to deal to unknown pairs like
>> "franco-tibétain" or "franco-silésien".
>>
>>
>>>
>>> There's a few possible solutions for this. One idea is to have two kinds
>>> of blank markers - one that will print a space always, and one that will
>>> print available input blanks. This can also be implemented by having a
>>> <lit v=" "/> in the output rule and then <b/> in the next spot. If this seems
>>> too hacky a solution we can discuss other options.
>>>
>>> *तन्मय खन्ना *
>>> *Tanmai Khanna*
>>>
>>>
>>> On Sun, Aug 30, 2020 at 12:09 PM Tanmai Khanna 
>>> wrote:
>>>
>>>> Hèctor,
>>>> No worries I'll look into this. Can you send the input sentences? I
>>>> want to see the transfer rules that are applying to the erroneous parts.
>>>> They might need some changing.
>>>>
>>>> तन्मय खन्ना
>>>> Tanmai Khanna
>>>>
>>>> --
>>>> *From:* Hèctor Alòs i Font 
>>>> *Sent:* Sunday, August 30, 2020 11:57:16 AM
>>>> *To:* [apertium-stuff] 
>>>> *Subject:* Re: [Apertium-stuff] Update about superblanks in transfer
>>>>
>>>> Unfortunately, I found a lot of problems cased by superblanks,
>>>> especially with the handling of hyphens. See a couple of differences in
>>>> translations of my French test corpus into Arpitan before and after the
>>>> update:
>>>>
>>>> < 00607. Tandis que les Tétes Broulâyes sont en *permission sur
>>>> *Espritos Marcos, tomba amouerox de Yvonne, una Franco-Japonêsa.
>>>> ---
>>>> > 00607. Tandis que les Tétes Broulâyes sont en *permission sur
>>>> *Espritos Marcos, tomba amouerox de Yvonne, una- Franco Japonêsa.
>>>>
>>>> < 00748. On povêt per ègzemplo parlar, sot Charlo-lo-Pelâ, de la
>>>> "*foresta" des pêrches de la Sêna.
>>>> ---
>>>> > 00748. On povêt per ègzemplo parlar, sot Charlo-lo- Pelâ, de la
>>>> "*foresta" des pêrches de la Sêna.
>>>>
>>>> Hèctor
>>>>
>>>> Missatge de Tanmai Khanna  del dia ds., 29
>>>> d’ag. 2020 a les 16:50:
>>>>
>>>> Hey guys!
>>>> The wordbound blanks project handles

Re: [Apertium-stuff] Update about superblanks in transfer

2020-08-30 Thread Hèctor Alòs i Font

Message from Tanmai Khanna on Sun, 30 Aug 2020 at 9:49:

> My guess is, the transfer rule for Franco-Japanese has a two word input,
> so the stored blank is "-". Now the output has 3 words "una
> Franco-Japonêsa", since the blanks are printed in order, they're printed in
> the first available <b/> spot in the output rules.
>

Yes, that's it. "Franco" is a prefix and it is analysed as such. I have a
few dozen prefixes to avoid having hundreds of words in the
dictionaries and, more importantly, to be able to deal with unknown pairs like
"franco-tibétain" or "franco-silésien".


>
> There's a few possible solutions for this. One idea is to have two kinds
> of blank markers - one that will print a space always, and one that will
> print available input blanks. This can also be implemented by having a
> <lit v=" "/> in the output rule and then <b/> in the next spot. If this seems
> too hacky a solution we can discuss other options.
>
> *तन्मय खन्ना *
> *Tanmai Khanna*
>
>
> On Sun, Aug 30, 2020 at 12:09 PM Tanmai Khanna 
> wrote:
>
>> Hèctor,
>> No worries I'll look into this. Can you send the input sentences? I want
>> to see the transfer rules that are applying to the erroneous parts. They
>> might need some changing.
>>
>> तन्मय खन्ना
>> Tanmai Khanna
>>
>> --
>> *From:* Hèctor Alòs i Font 
>> *Sent:* Sunday, August 30, 2020 11:57:16 AM
>> *To:* [apertium-stuff] 
>> *Subject:* Re: [Apertium-stuff] Update about superblanks in transfer
>>
>> Unfortunately, I found a lot of problems caused by superblanks, especially
>> with the handling of hyphens. See a couple of differences in translations
>> of my French test corpus into Arpitan before and after the update:
>>
>> < 00607. Tandis que les Tétes Broulâyes sont en *permission sur *Espritos
>> Marcos, tomba amouerox de Yvonne, una Franco-Japonêsa.
>> ---
>> > 00607. Tandis que les Tétes Broulâyes sont en *permission sur *Espritos
>> Marcos, tomba amouerox de Yvonne, una- Franco Japonêsa.
>>
>> < 00748. On povêt per ègzemplo parlar, sot Charlo-lo-Pelâ, de la
>> "*foresta" des pêrches de la Sêna.
>> ---
>> > 00748. On povêt per ègzemplo parlar, sot Charlo-lo- Pelâ, de la
>> "*foresta" des pêrches de la Sêna.
>>
>> Hèctor
>>
>> Message from Tanmai Khanna on Sat, 29 Aug 2020 at 16:50:
>>
>> Hey guys!
>> The wordbound blanks project handles blanks that are supposed to be
>> reordered. Therefore, we no longer need the user to be worried about blank
>> positions in transfer rules. The latest update to the apertium code makes
>> it such that <b pos="X"/> is now the same as <b/>. You can change the
>> <b pos="X"/> in your transfer rules to just <b/> and it'll work.
>>
>> Now, the only thing you need to worry about when writing transfer rules
>> is whether you want a blank between the two LUs or not. *Input blanks
>> will be stored as a queue and will be printed in order in all
>> available <b/> spots in the rule output.*
>>
>> *Note:*
>> - If the output rule has more blank spots than input blanks, then the
>> remaining blank spots will be spaces.
>> - If the output rule has less blank spots than input blanks, then the
>> remaining input blanks will be output after the rule output.
>> - If the input blank is an empty string, it is stored as a space.
>>
>> In some transfer rules, there are input patterns which don't have a space
>> between them. In the output section of these transfer rules, <b/>
>> used to give an empty string, but it will now give a space. To remove
>> the blank from the output, you will need to remove the <b/> from
>> the transfer rule and it will be fine.
>>
>> Here are some examples from the tests.
>>
>> EXAMPLE 1:
>> Input:
>>
>> [blank1] ^worda/wordta$ ;[blank2]; ^wordb/wordtb$ 
>> [blank3];  ^hun/ho$ [blank4]
>>
>> There's no <b/> in rule output, so all blanks are flushed after
>> rule output.
>>
>> Output:
>>
>> [blank1] ^test1{^wordta$^wordtb$^ho$}$ ;[blank2];  
>> [blank3];   [blank4]
>>
>> EXAMPLE 2:
>> Input:
>>
>> [blank1] ^wordb/wordtb$ ;[blank2]; ^worda/wordta$ 
>> [blank3];  ^hun/ho$ [blank4]
>>
>> There's one <b/> in rule output, so it prints one and flushes the rest.
>>
>> Output:
>>
>> [blank1] ^test1{^wordta$ ;[blank2]; ^ho$}$ [blank3];   
>> [blank4]
>>
>> This has been implemented for the chunker, interchunk, and postchunk.
>>
>> If you have any questions, suggestions, comments, etc., I'll be happy to
>> respond to them.
>>
>> Thanks and Regards,
>> *तन्मय खन्ना *
>> *Tanmai Khanna*

Re: [Apertium-stuff] Update about superblanks in transfer

2020-08-30 Thread Hèctor Alòs i Font

Unfortunately, I found a lot of problems caused by superblanks, especially
with the handling of hyphens. See a couple of differences in translations
of my French test corpus into Arpitan before and after the update:

< 00607. Tandis que les Tétes Broulâyes sont en *permission sur *Espritos
Marcos, tomba amouerox de Yvonne, una Franco-Japonêsa.
---
> 00607. Tandis que les Tétes Broulâyes sont en *permission sur *Espritos
Marcos, tomba amouerox de Yvonne, una- Franco Japonêsa.

< 00748. On povêt per ègzemplo parlar, sot Charlo-lo-Pelâ, de la "*foresta"
des pêrches de la Sêna.
---
> 00748. On povêt per ègzemplo parlar, sot Charlo-lo- Pelâ, de la
"*foresta" des pêrches de la Sêna.

Hèctor

Message from Tanmai Khanna on Sat, 29 Aug 2020 at 16:50:

> Hey guys!
> The wordbound blanks project handles blanks that are supposed to be
> reordered. Therefore, we no longer need the user to be worried about blank
> positions in transfer rules. The latest update to the apertium code makes
> it such that <b pos="X"/> is now the same as <b/>. You can change the
> <b pos="X"/> in your transfer rules to just <b/> and it'll work.
>
> Now, the only thing you need to worry about when writing transfer rules is
> whether you want a blank between the two LUs or not. *Input blanks will
> be stored as a queue and will be printed in order in all
> available <b/> spots in the rule output.*
>
> *Note:*
> - If the output rule has more blank spots than input blanks, then the
> remaining blank spots will be spaces.
> - If the output rule has less blank spots than input blanks, then the
> remaining input blanks will be output after the rule output.
> - If the input blank is an empty string, it is stored as a space.
>
> In some transfer rules, there are input patterns which don't have a space
> between them. In the output section of these transfer rules, <b/> used
> to give an empty string, but it will now give a space. To remove the blank
> from the output, you will need to remove the <b/> from the
> transfer rule and it will be fine.
>
> Here are some examples from the tests.
>
> EXAMPLE 1:
> Input:
>
> [blank1] ^worda/wordta$ ;[blank2]; ^wordb/wordtb$ 
> [blank3];  ^hun/ho$ [blank4]
>
> There's no <b/> in rule output, so all blanks are flushed after
> rule output.
>
> Output:
>
> [blank1] ^test1{^wordta$^wordtb$^ho$}$ ;[blank2];  
> [blank3];   [blank4]
>
> EXAMPLE 2:
> Input:
>
> [blank1] ^wordb/wordtb$ ;[blank2]; ^worda/wordta$ 
> [blank3];  ^hun/ho$ [blank4]
>
> There's one <b/> in rule output, so it prints one and flushes the rest.
>
> Output:
>
> [blank1] ^test1{^wordta$ ;[blank2]; ^ho$}$ [blank3];   
> [blank4]
>
> This has been implemented for the chunker, interchunk, and postchunk.
>
> If you have any questions, suggestions, comments, etc., I'll be happy to
> respond to them.
>
> Thanks and Regards,
> *तन्मय खन्ना *
> *Tanmai Khanna*
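
The blank-queue rules in the announcement quoted above can be sketched in a few lines of Python. This is only a toy model, not the Apertium implementation: the function name emit_rule_output, the BLANK sentinel (standing in for a <b/> slot) and the list representation of a rule's output are all hypothetical, invented here for illustration.

```python
from collections import deque

# Hypothetical sentinel standing in for a <b/> slot in a rule's output.
BLANK = object()


def emit_rule_output(output_items, input_blanks):
    """Fill each blank slot, in order, from a FIFO queue of input blanks."""
    # An empty input blank is stored as a single space.
    queue = deque(b if b != "" else " " for b in input_blanks)
    parts = []
    for item in output_items:
        if item is BLANK:
            # More slots than input blanks: fall back to a plain space.
            parts.append(queue.popleft() if queue else " ")
        else:
            parts.append(item)
    # Fewer slots than input blanks: leftovers go after the rule output.
    return "".join(parts) + "".join(queue)


# Two input words joined by "-", three output words with two slots.
print(emit_rule_output(["una", BLANK, "Franco", BLANK, "Japonêsa"], ["-"]))
# una-Franco Japonêsa
```

Under this model the hyphen lands in the first slot rather than between the two words it originally joined, which is the kind of misplacement reported in this thread.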

Re: [Apertium-stuff] Fixing Phonological Processes

2020-08-28 Thread Hèctor Alòs i Font

I don't think you have to do anything with the modes or the compilation
file. The problem is in the post-yao.dix file.
If you add , it works:


  
nk
ng
  
  


$ echo "~nka" | lt-proc -p yao.autopgen.bin
nga
$ echo "~nkb" | lt-proc -p yao.autopgen.bin
nkb

I don't know why without  there is no match, but in any case you need
to add <a/> to the relevant places (words, affixes, etc.) where you want to
trigger this rule. If you want nk + vowel always to become ng, you
should do this in twol, not here.

Hèctor
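
As a rough illustration of the behaviour shown by the two lt-proc commands above, here is a toy Python emulation. It is not how lt-proc works internally, and the rule it encodes (a marked nk becomes ng only before "a") is an assumption chosen to match those two outputs, since the actual dictionary markup did not survive in this archive. The name toy_postgen is hypothetical.

```python
import re


def toy_postgen(text: str) -> str:
    """Toy stand-in for post-generation; not lt-proc itself."""
    # Assumed rule: a marked "~nk" becomes "ng" only when followed by "a".
    text = re.sub(r"~nk(?=a)", "ng", text)
    # Any remaining mark is dropped, as in the "~nkb" -> "nkb" output above.
    return text.replace("~", "")


print(toy_postgen("~nka"))  # nga
print(toy_postgen("~nkb"))  # nkb
print(toy_postgen("nka"))   # nka (no mark, so the toy rule does not fire)
```

The last call shows why the mark matters in this sketch: an unmarked nk is never rewritten, so only the dictionary entries that carry the mark can trigger the change.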

Message from Zanga Chimombo on Fri, 28 Aug 2020 at 15:41:

> I am still not getting anywhere and both modes.xml and the Makefile
> seem ok. My code is here:
> https://gitlab.com/zangaphee/CiBantu/-/tree/master/twoc/apertium-yao
>
> On Fri, Aug 28, 2020 at 7:36 AM Hèctor Alòs i Font 
> wrote:
> >
> > The relevant files are modes.xml and Makefile.am I recommend taking a
> look to them in e.g. apertium-fra and apertium-fra-cat (or any other
> released pair using post-generation). In the first one you define the
> pipeline, so copy and adapt the call to autopgen in the end. In the second
> one you have the actual compilation of the programme.
> >
> > Missatge de Zanga Chimombo  del dia dv., 28 d’ag.
> 2020 a les 7:52:
> >>
> >> Hi again, I actually have:
> >>
> >> 
> >>   
> >> nk
> >> ng
> >>   
> >>   
> >> 
> >>
> >> But it doesn't seem to get executed. Is there a missing flag/ switch
> >> that I was supposed to initialise/ build with? I am not seeing
> >> anything relating to building autopgen in the modes.xml file in the
> >> monolingual directory...?
> >>
> >> On Thu, Aug 27, 2020 at 2:57 PM Hèctor Alòs i Font <
> hectora...@gmail.com> wrote:
> >> >
> >> > Yes, it is in the monodix. It is just a mark put on the right side,
> e.g.
> >> >
> >> >   que
> >> >   que   que n="itg"/>
> >> >
> >> > If you want, you may not put it, but if you have in the post-dix file
> something like:
> >> >
> >> > 
> >> >   
> >> > nk
> >> > ng
> >> >   
> >> > 
> >> >
> >> > ... then every nk will be substituted by ng. That is not what you
> want, for sure. So better to put a mark in the dictionary to know which
> "nk" may be changed (in some contexts) to ng.
> >> >
> >> > Missatge de Zanga Chimombo  del dia dj., 27
> d’ag. 2020 a les 15:18:
> >> >>
> >> >> Looking at the examples in apertium-fra.post-fra.dix it is clear that
> >> >> the tilde/ ~/  is inserted as some sort of marker earlier in the
> >> >> pipeline so that the PG recognises it and actions on it.
> >> >>
> >> >> Where in the pipeline is it inserted? Could you give me a line number
> >> >> of the insertion within the monodix perhaps?
> >> >>
> >> >> On Thu, Aug 27, 2020 at 12:12 PM Hèctor Alòs i Font
> >> >>  wrote:
> >> >> >
> >> >> > You can take a look, for instance to
> https://github.com/apertium/apertium-fra/blob/master/apertium-fra.post-fra.dix
> >> >> >
> >> >> > For example (at line 633) :
> >> >> > nen'
> >> >> >
> >> >> > Missatge de Hèctor Alòs i Font  del dia
> dj., 27 d’ag. 2020 a les 13:07:
> >> >> >>
> >> >> >> There two things in:
> >> >> >>
> >> >> >> 
> >> >> >>   
> >> >> >> nk
> >> >> >> ng
> >> >> >>   
> >> >> >> 
> >> >> >>
> >> >> >> First is the  that must precede (that's the ~ Kevin said
> because it is shown as a tilde in the output). If you don't have it, there
> won't be any matching.
> >> >> >>
> >> >> >> Second, is the , i.e. a space. So nk- will not match, but
> only nk followed by a blank (a preceded by an ). If matched, it will be
> replaced by ng followed by a blank to.
> >> >> >>
> >> >> >> Hèctor
> >> >> >>
> >> >> >>
> >> >> >> Missatge de Zanga Chimombo  del dia dj.,
> 27 d’ag. 2020 a les 12:31:
> >> >> >>&g

Re: [Apertium-stuff] Fixing Phonological Processes

2020-08-27 Thread Hèctor Alòs i Font

The relevant files are modes.xml and Makefile.am. I recommend taking a look
at them in e.g. apertium-fra and apertium-fra-cat (or any other released
pair using post-generation). In the first one you define the pipeline, so
copy and adapt the call to autopgen at the end. In the second one you have
the actual compilation of the programme.

Message from Zanga Chimombo on Fri, 28 Aug 2020 at 7:52:

> Hi again, I actually have:
>
> 
>   
> nk
> ng
>   
>   
> 
>
> But it doesn't seem to get executed. Is there a missing flag/ switch
> that I was supposed to initialise/ build with? I am not seeing
> anything relating to building autopgen in the modes.xml file in the
> monolingual directory...?
>
> On Thu, Aug 27, 2020 at 2:57 PM Hèctor Alòs i Font 
> wrote:
> >
> > Yes, it is in the monodix. It is just a mark put on the right side, e.g.
> >
> >   que
> >   que   que n="itg"/>
> >
> > If you want, you may not put it, but if you have in the post-dix file
> something like:
> >
> > 
> >   
> > nk
> > ng
> >   
> > 
> >
> > ... then every nk will be substituted by ng. That is not what you want,
> for sure. So better to put a mark in the dictionary to know which "nk" may
> be changed (in some contexts) to ng.
> >
> > Missatge de Zanga Chimombo  del dia dj., 27 d’ag.
> 2020 a les 15:18:
> >>
> >> Looking at the examples in apertium-fra.post-fra.dix it is clear that
> >> the tilde/ ~/  is inserted as some sort of marker earlier in the
> >> pipeline so that the PG recognises it and actions on it.
> >>
> >> Where in the pipeline is it inserted? Could you give me a line number
> >> of the insertion within the monodix perhaps?
> >>
> >> On Thu, Aug 27, 2020 at 12:12 PM Hèctor Alòs i Font
> >>  wrote:
> >> >
> >> > You can take a look, for instance to
> https://github.com/apertium/apertium-fra/blob/master/apertium-fra.post-fra.dix
> >> >
> >> > For example (at line 633) :
> >> > nen'
> >> >
> >> > Missatge de Hèctor Alòs i Font  del dia dj.,
> 27 d’ag. 2020 a les 13:07:
> >> >>
> >> >> There two things in:
> >> >>
> >> >> 
> >> >>   
> >> >> nk
> >> >> ng
> >> >>   
> >> >> 
> >> >>
> >> >> First is the  that must precede (that's the ~ Kevin said because
> it is shown as a tilde in the output). If you don't have it, there won't be
> any matching.
> >> >>
> >> >> Second, is the , i.e. a space. So nk- will not match, but only
> nk followed by a blank (a preceded by an ). If matched, it will be
> replaced by ng followed by a blank to.
> >> >>
> >> >> Hèctor
> >> >>
> >> >>
> >> >> Missatge de Zanga Chimombo  del dia dj., 27
> d’ag. 2020 a les 12:31:
> >> >>>
> >> >>> Not sure I know what you mean by "~"...? Sorry. I'm new to this
> >> >>>
> >> >>> The input is "nkutenda". Expected output: "ngutenda".
> >> >>>
> >> >>> On Thu, Aug 27, 2020 at 11:26 AM Kevin Brubeck Unhammer
> >> >>>  wrote:
> >> >>> >
> >> >>> > Zanga Chimombo 
> >> >>> > čálii:
> >> >>> >
> >> >>> > > One of the processes that occurs in one of the languages I am
> dealing
> >> >>> > > with is "nk-" becoming "ng-"
> >> >>> > >
> >> >>> > > I thought I would be able to fix this using the post generator
> here:
> >> >>> > >
> https://gitlab.com/zangaphee/CiBantu/-/blob/master/twoc/apertium-yao/apertium-yao.post-yao.dix
> >> >>> > >
> >> >>> > > However, that doesn't fix it. Have I done it incorrectly?
> Should I
> >> >>> > > even be using PG to do this?
> >> >>> >
> >> >>> > If there's a ~ before every nk, then I think that should
> >> >>> > work. What's the exact input to pgen?
> >> >>> >
> >> >>> > (There's an open issue on not requiring the
> >> >>> > `~` https://github.com/apertium/lttoo

Re: [Apertium-stuff] GSoC 2020 Code Collections - need info

2020-08-27 Thread Hèctor Alòs i Font

Awesome, thanks, Tino!

Message from Tino Didriksen on Thu, 27 Aug 2020 at 15:41:

> First run is now online at https://apertium.projectjj.com/gsoc2020/
>
> Collected for elmurod1202, hectoralos, khannatanmai, priyankmodiPM.
> Collection period is 2020-05-04 through 2020-08-31.
>
> -- Tino Didriksen
>
>
> On Sat, 22 Aug 2020 at 14:21, Tino Didriksen 
> wrote:
>
>> As for previous years, I will run a code collection for GSoC changes.
>>
>> I just need to know who and what. Usernames and repos.
>>
>> -- Tino Didriksen
>>

Re: [Apertium-stuff] GSoC 2020 Code Collections - need info

2020-08-27 Thread Hèctor Alòs i Font

Hi Tino,

My commits can already be collected. As I said, that means all commits in:

https://github.com/apertium/apertium-frp
https://github.com/apertium/apertium-fra
https://github.com/apertium/apertium-fra-frp

username: hectoralos

I'd collect everything since June 1st (the first official day of GSoC) but
May 1st or even earlier would be OK, too.

Thank you very much in advance!
Hèctor

Message from Hèctor Alòs i Font on Sat, 22 Aug 2020 at 17:53:

> Thank you very much, Tino!
>
> My stuff are all commits of the user hectoralos in:
> https://github.com/apertium/apertium-frp
> https://github.com/apertium/apertium-fra
> https://github.com/apertium/apertium-fra-frp
> ... but it is not ready yet. I have to work a few more days.
>
> Hèctor
>
>
> Missatge de Tino Didriksen  del dia ds., 22 d’ag.
> 2020 a les 15:22:
>
>> As for previous years, I will run a code collection for GSoC changes.
>>
>> I just need to know who and what. Usernames and repos.
>>
>> -- Tino Didriksen
>>

Re: [Apertium-stuff] Fixing Phonological Processes

2020-08-27 Thread Hèctor Alòs i Font

You can take a look, for instance, at
https://github.com/apertium/apertium-fra/blob/master/apertium-fra.post-fra.dix

For example (at line 633):
nen'

Message from Hèctor Alòs i Font on Thu, 27 Aug 2020 at 13:07:

> There two things in:
>
>   nkng  
> 
>
> First is the  that must precede (that's the ~ Kevin said because it is
> shown as a tilde in the output). If you don't have it, there won't be any
> matching.
>
> Second, is the , i.e. a space. So nk- will not match, but only nk
> followed by a blank (a preceded by an ). If matched, it will be
> replaced by ng followed by a blank to.
>
> Hèctor
>
>
> Missatge de Zanga Chimombo  del dia dj., 27 d’ag.
> 2020 a les 12:31:
>
>> Not sure I know what you mean by "~"...? Sorry. I'm new to this
>>
>> The input is "nkutenda". Expected output: "ngutenda".
>>
>> On Thu, Aug 27, 2020 at 11:26 AM Kevin Brubeck Unhammer
>>  wrote:
>> >
>> > Zanga Chimombo wrote:
>> >
>> > > One of the processes that occurs in one of the languages I am dealing
>> > > with is "nk-" becoming "ng-"
>> > >
>> > > I thought I would be able to fix this using the post generator here:
>> > >
>> https://gitlab.com/zangaphee/CiBantu/-/blob/master/twoc/apertium-yao/apertium-yao.post-yao.dix
>> > >
>> > > However, that doesn't fix it. Have I done it incorrectly? Should I
>> > > even be using PG to do this?
>> >
>> > If there's a ~ before every nk, then I think that should
>> > work. What's the exact input to pgen?
>> >
>> > (There's an open issue on not requiring the
>> > `~` https://github.com/apertium/lttoolbox/issues/42 )

Re: [Apertium-stuff] Fixing Phonological Processes

2020-08-27 Thread Hèctor Alòs i Font

There are two things in:

  <e><p><l><a/>nk<b/></l><r>ng<b/></r></p></e>

First is the <a/> that must precede (that's the ~ Kevin mentioned, because
it is shown as a tilde in the output). If you don't have it, there won't be
any matching.

Second is the <b/>, i.e. a space. So nk- will not match, but only nk
followed by a blank (and preceded by an <a/>). If matched, it will be
replaced by ng followed by a blank too.

Hèctor


Message from Zanga Chimombo on Thu, 27 Aug 2020 at 12:31:

> Not sure I know what you mean by "~"...? Sorry. I'm new to this
>
> The input is "nkutenda". Expected output: "ngutenda".
>
> On Thu, Aug 27, 2020 at 11:26 AM Kevin Brubeck Unhammer
>  wrote:
> >
> > Zanga Chimombo wrote:
> >
> > > One of the processes that occurs in one of the languages I am dealing
> > > with is "nk-" becoming "ng-"
> > >
> > > I thought I would be able to fix this using the post generator here:
> > >
> https://gitlab.com/zangaphee/CiBantu/-/blob/master/twoc/apertium-yao/apertium-yao.post-yao.dix
> > >
> > > However, that doesn't fix it. Have I done it incorrectly? Should I
> > > even be using PG to do this?
> >
> > If there's a ~ before every nk, then I think that should
> > work. What's the exact input to pgen?
> >
> > (There's an open issue on not requiring the
> > `~` https://github.com/apertium/lttoolbox/issues/42 )

Re: [Apertium-stuff] GSoC 2020 Code Collections - need info

2020-08-22 Thread Hèctor Alòs i Font

Thank you very much, Tino!

My contributions are all commits by the user hectoralos in:
https://github.com/apertium/apertium-frp
https://github.com/apertium/apertium-fra
https://github.com/apertium/apertium-fra-frp
... but it is not ready yet. I have to work a few more days.

Hèctor


Message from Tino Didriksen on Sat, 22 Aug 2020 at 15:22:

> As for previous years, I will run a code collection for GSoC changes.
>
> I just need to know who and what. Usernames and repos.
>
> -- Tino Didriksen
>

Re: [Apertium-stuff] We need more evaluators

2020-08-21 Thread Hèctor Alòs i Font

In my experience, no, it is impossible. The quality of the test
would be awful.

Message from Samuel Sloniker on Sat, 22 Aug 2020 at 7:29:

> Could this be done using a dictionary without knowing one of the languages?
>
> On Fri, Aug 21, 2020, 02:24 Mikel L. Forcada  wrote:
>
>> Dear Daniel:
>>
>> Thanks a million for your message.
>>
>> Please contact Shashwat Goel so that he can send you the
>> evaluation files.
>>
>> As to contributing to the eng-fra language pair, you would have to
>>
>> (a) learn how to install Apertium as needed for language pair development
>> (
>> https://wiki.apertium.org/wiki/Installation#For_those_who_want_to_install_Apertium_locally.2C_and_developers
>> ).
>>
>> (b) learn what dictionary entries look like (a lot can be learned by
>> imitating existing entries), make them, and try them locally.
>>
>> (c) get access to GitHub (and learn to use git, etc.)
>>
>> (d) submit pull requests with your changes
>>
>> You can get a lot of guidance from fellow developers at our IRC channel
>> #apertium at irc.freenode.net: https://wiki.apertium.org/wiki/IRC
>>
>> Hope this helped!
>>
>> Mikel
>> On 20/8/20 at 19:09, Daniel Lamontagne wrote:
>>
>> Hi,
>> I know very well French and English-but I am a French native-so
>> French-to-English would be great.
>> I'd like to help get the French-English going a little better by getting
>> it out of the 'incubator' area.
>> Something not too complicated and good instructions on how to help.
>> Thanks.
>> P.S. I already am doing one for myself-database in csv file and Python
>>
>> Keep on the good work.
>> Daniel Lamontagne-KesKiDit
>>
>>
>> On Thu, Aug 20, 2020 at 7:54 AM Xavi Ivars  wrote:
>>
>>> Hey Mikel,
>>>
>>> You can add me to en-ca. I don't know too much french, but I guess I
>>> could also help there
>>>
>>> Message from Mikel L. Forcada on Thu, 20 Aug 2020 at 9:39:
>>>
>>>> Dear Apertiumers:
>>>>
>>>> About 10 days ago I wrote a message asking for apertiumers to volunteer
>>>> in evaluating sets of 150 bilingual dictionary entries. We still need more
>>>> evaluators.
>>>>
>>>> We need two more evaluators for Esperanto–English, two more for
>>>> French–Catalan and one for Occitan–French.
>>>>
>>>> Please contact our GSoC student Shashwat Goel
>>>> if you can help.
>>>>
>>>> Thanks a million!
>>>>
>>>> All the best
>>>>
>>>> Mikel Forcada
>>>>
>>>> [New] En - Es - Assigned to Hector, Mikel, Jorge. More people not
>>>> required
>>>> [New] En - Ca - Assigned to Hector, Mikel. One more person would be
>>>> helpful.
>>>> Eo - En - Done by Hector. Two more people would be helpful.
>>>> Fr - Ca - Done by Hector. Two more people would be helpful.
>>>> Oc - Fr - Done by Serge. Assigned to Gisele. One more person would be
>>>> helpful.
>>>>
>>>> --
>>>> Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
>>>> Departament de Llenguatges i Sistemes Informàtics
>>>> Universitat d'Alacant
>>>> E-03690 Sant Vicent del Raspeig
>>>> Spain
>>>> Office: +34 96 590 9776
>>>>
>>>
>>>
>>> --
>>> < Xavi Ivars >
>>> < http://xavi.ivars.me >
>>> ___
>>> Apertium-stuff mailing list
>>> Apertium-stuff@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>
>>
>>
>>
>> --
>> Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
>> Departament de Llenguatges i Sistemes Informàtics
>> Universitat d'Alacant
>> E-03690 Sant Vicent del Raspeig
>> Spain
>> Office: +34 96 590 9776
>>

Re: [Apertium-stuff] Infinite loop in testvoc???

2020-08-16 Thread Hèctor Alòs i Font

Hi Gabriel,
You may be interested in beta.apertium.org
Hèctor

Message from Medina, Gabriel on Sun, 16 Aug 2020 at 8:52:

> When I pick a language, it only highlights specific ones I can only
> choose, which depends on the language you pick, which makes translating
> languages very limited.
>
> On Sat, Aug 15, 2020 at 3:25 PM Jonathan Washington <
> jonathan.n.washing...@gmail.com> wrote:
>
>> Gabriel,
>>
>> Could you clarify what you mean by "language barriers"?
>>
>> --
>> Jonathan
>>
>> On Sat, Aug 15, 2020, 13:40 Medina, Gabriel  wrote:
>>
>>> Does this have something to do with the language barriers for all
>>> languages on the online translator?
>>>
>>> On Sat, Aug 15, 2020 at 1:03 PM Marc Riera Irigoyen <
>>> marc.riera.irigo...@gmail.com> wrote:
>>>
>>>> I've been able to reproduce the loop and fix it. It was mainly due to
>>>> an unexpected pattern in the testvoc script, but there was also a typo in
>>>> the bidix that contributed to the problem.
>>>>
>>>> 1. The testvoc script did not account for bidix entries with empty
>>>> translations and would add extra slashes in many cases. These are used to
>>>> test multiple translations for a single entry, which is done by an awk
>>>> script in a while loop that could not be escaped. I have fixed the issue
>>>> with the extra slashes and changed the while loop to a for limited to 50
>>>> iterations. This should be enough for any pair and the loop includes a
>>>> condition to escape it before the 50 iterations, so there is no extra
>>>> unnecessary processing. I'll post a pull request directly to the repo with
>>>> the fixes shortly.
>>>> 2. There is an entry in the bidix (and probably Arpitan monodix as
>>>> well, because it generates properly), "Salinas de Gotari", with a line
>>>> break after the last tag. It looks like a typo. This typo appears to be
>>>> valid in Apertium format but the testvoc script assumes an entry per line
>>>> and the double slashes occurred here too. Thanks to the loop limit, testvoc
>>>> doesn't get blocked anymore by this entry (and it doesn't appear in the
>>>> list of errors, because it generates properly), but it should be fixed.
>>>>
>>>> Regards,
>>>>
>>>> *Marc Riera*
>>>>
>>>>
>>>> On Sat, 15 Aug 2020 at 11:53, Marc Riera Irigoyen wrote:
>>>>
>>>>> Hello Hèctor,
>>>>>
>>>>> I see that the testvoc script you're using is the one I developed
>>>>> based on previous scripts used in several pairs. It shouldn't be
>>>>> producing a loop, and I have never seen one before. Given that it's
>>>>> happening only when translating from Arpitan to French, I guess there
>>>>> may be something that I didn't account for when developing the script.
>>>>> I'll take a look and try to recreate it.
>>>>>
>>>>> Regards,
>>>>>
>>>>> *Marc Riera*
>>>>>
>>>>>
>>>>> On Sat, 15 Aug 2020 at 10:46, Hèctor Alòs i Font wrote:
>>>>>
>>>>>> I am experiencing very strange behaviour in the fra-frp testvoc.
>>>>>> While there is no problem in the frp2fra direction (the test finishes
>>>>>> in less than 30 minutes on my computer), in the fra2frp direction
>>>>>> there is a kind of infinite loop. The same file is created and deleted
>>>>>> again and again, and the testvoc does not end even after waiting more
>>>>>> than 24 hours. The file that is repeatedly deleted and recreated
>>>>>> (always with the same name) has exactly the same content each time.
>>>>>> The first lines are:
>>>>>>
>>>>>>
>>>>>> [\^frère$]^frère/~/frâre$+^./~/.$
>>>>>>
>>>>>> [\^frère$]^frère/~/frâre$+^./~/.$
>>>>>>
>>>>>> [\^frère$]^frère/~/frâre$+^./~/.$
>>>>>>
>>>>>> [\^frère$]^frère/~/frâre$+^./~/.$
>>>>>>
>>>>>> [\^frère$]^frère/~/frâre$+^./~/.$
>>>>>>
>>>>>> [\^frère$]^frère/~/frâre$+^./~/.$
>>>>>>
>>>>>> [\^frère$]^frère/~/frâre$+^./~/.$
>>>>>>
>>>>>
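
The fix Marc describes above (a loop capped at 50 iterations with an early exit, plus skipping the empty alternatives that produced the extra slashes) can be sketched roughly as follows. This is only a Python illustration of the idea, not the actual testvoc shell/awk script; the function name `expand_translations` and the sample input are assumptions made for the example.

```python
from typing import List

MAX_ROUNDS = 50  # iteration cap mentioned in the thread


def expand_translations(line: str, max_rounds: int = MAX_ROUNDS) -> List[str]:
    """Turn 'source/target1/target2' into one 'source/target' line per target.

    Hypothetical stand-in for the awk step: empty targets (the "extra
    slashes" case) are skipped, and the loop is bounded, so a malformed
    entry can no longer keep it running forever.
    """
    source, _, rest = line.partition("/")
    out: List[str] = []
    for _ in range(max_rounds):
        if not rest:  # nothing left to split: leave before reaching the cap
            break
        target, _, rest = rest.partition("/")
        if target:  # ignore empty alternatives such as 'a//b'
            out.append(f"{source}/{target}")
    return out
```

For instance, `expand_translations("abattu/abatu//dèfêt/")` yields one line for "abatu" and one for "dèfêt", ignoring the empty slots; an entry with more than 50 alternatives would simply be truncated at the cap.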

[Apertium-stuff] Infinite loop in testvoc???

2020-08-15 Thread Hèctor Alòs i Font

I am experiencing very strange behaviour in the fra-frp testvoc. While
there is no problem in the frp2fra direction (the test finishes in less
than 30 minutes on my computer), in the fra2frp direction there is a kind
of infinite loop. The same file is created and deleted again and again, and
the testvoc does not end even after waiting more than 24 hours. The file
that is repeatedly deleted and recreated (always with the same name) has
exactly the same content each time. The first lines are:

[\^frère$]^frère/~/frâre$+^./~/.$
[\^frère$]^frère/~/frâre$+^./~/.$
[\^frère$]^frère/~/frâre$+^./~/.$
[\^frère$]^frère/~/frâre$+^./~/.$
[\^frère$]^frère/~/frâre$+^./~/.$
[\^frère$]^frère/~/frâre$+^./~/.$
[\^frère$]^frère/~/frâre$+^./~/.$
[\^frère$]^frère/~/frâre$+^./~/.$
[\^frère$]^frère/~/frâre$+^./~/.$
[\^frère$]^frère/~/frâre$+^./~/.$
[\^1er$]^1er/~/1ér$+^./~/.$
[\^1er$]^1er/~/1ér$+^./~/.$
[\^1er$]^1er/~/1ér$+^./~/.$
[\^1er$]^1er/~/1ér$+^./~/.$
[\^abattu$]^abattu/~/abatu$+^./~/.$
[\^abattu$]^abattu/~/dèfêt$+^./~/.$
[\^abattu$]^abattu/~/dèchesu$+^./~/.$
[\^abattu$]^abattu/~/abatu$+^./~/.$

I have never seen such a thing before and I cannot imagine what can cause
this behaviour. Any ideas?

Hèctor

Re: [Apertium-stuff] Get a glossary from a TMX memory

2020-06-28 Thread Hèctor Alòs i Font

I think nobody answered this post. Gianfranco is having trouble extracting
bilingual dictionaries from OmegaT and loading them into the Apertium
dictionaries.

As far as I understood, he tried:
https://wiki.apertium.org/wiki/Getting_bilingual_dictionaries_from_OmegaWiki

I don't know anything about OmegaT, but it would be pretty important for the
Italian-Sardinian pair to get data from the translation memories used to
translate administrative texts in several town councils in Sardinia.

Could someone help him?

Hèctor

On Tue, 16 Jun 2020 at 23:13, Gianfranco Fronteddu wrote:

> Hello everyone
>
> I am writing to you to solve a small problem. Hèctor and I are updating
> apertium ita-srd and we have a perfectly aligned TMX translation memory.
> We tried to automatically extract a bilingual glossary with OmegaT but
> failed. Do any of you know if it is possible to get a glossary from a TMX
> so that the pairs can be loaded directly into the dictionaries?
>
> We also noticed that Sardinian is not present on OmegaWiki. Would it be
> possible to add Sardinian so that from now on a cross dix can be made
> directly from OmegaWiki?
>
> Thanks so much
> Best Regards
> Gianfranco
>

Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-18 Thread Hèctor Alòs i Font

On Thu, 18 Jun 2020 at 1:59, Francis Tyers wrote:

> On 2020-06-17 21:46, Hèctor Alòs i Font wrote:
> > On Wed, 17 Jun 2020 at 23:36, Hèctor Alòs i Font wrote:
> >
> >> On Wed, 17 Jun 2020 at 21:12, Francis Tyers wrote:
> >>
> >>> On 2020-06-15 17:38, Hèctor Alòs i Font wrote:
> >>>> Here come several practical examples. I tried to select them for
> >>>> their variety. The result is more a wish list than something
> >>>> structured.
> >>>
> >>> These really are great! Thanks :) Sorry the reply has taken so long.
> >>>
> >>>> Let's begin with "je la baise". Depending on the context this may
> >>>> be "I kiss her" or "I fuck her". The context can tell us if we are
> >>>> in a formal or colloquial type of language. Another issue is that in
> >>>> this case the anaphora resolution can also help us: if the pronoun
> >>>> reference is "hand", it can only be "kiss"; if it is a person, the
> >>>> doubt persists.
> >>>
> >>> For this, I would like to look at a concordance* of a large number of
> >>> examples to see what kind of information can be used to disambiguate.
> >>>
> >>> Intuitively it seems like knowing the genre (e.g. formal/informal)
> >>> would help. But probably also statistics about subjects, objects and
> >>> adjuncts, and what they (co-)refer with.
> >>>
> >>> * I tried to search on DuckDuckGo, but in the "internet" domain it
> >>> is very hard to find examples with "kiss", even with "moderated
> >>> search" turned on.
> >>>
> >>> In fact, perhaps that could be a genre "safe translation"... :D
> >>>
> >>> Incidentally Google gives "I fuck her" as the translation. I'm able
> >>> to get "kiss" by adding "bouche" or "main".
> >>>
> >>> I think if we want to go by frequency we should have "fuck" if we go
> >>> by safety we should have "kiss".
> >>>
> >>> Probably "humblement" or "vous" are also good indicators of the
> >>> "kiss" meaning.
> >>>
> >>> Any better than that would require further investigation with a
> >>> concordance.
> >>>
> >>> In terms of the module, if we want to do informal/formal then my
> >>> previous suggestion would work fine.
> >>>
> >>>> Another kind of problem is the Arpitan words "chamô" ("camel";
> >>>> plural "camels") and "chamôs ("chamois"; unchanged in plural). So,
> >>>> translating into French, I got yesterday chamois in a Bible text of
> >>>> Exodus xD  I solved it deciding in a CG rule that all "chamôs"
> >>>> (without nothing around in singular) are camels.
> >>>
> >>> As this is a different morphological paradigm, I would go with the
> >>> superscript notation ¹²³...
> >>>
> >>>> (Similar cases in
> >>>> French: fil/fils, foi/fois, cour/cours)
> >>>
> >>> These have different lemmas, e.g.
> >>>
> >>> ^fils/fil/fils$ threads / son*
> >>> ^fois/foi/fois$ faiths / time*
> >>> ^cours/cour/cours$  courts / course*
> >>>
> >>> The 'cour/cours' example can potentially be disambiguated by the
> >>> gender.
> >>>
> >>> The others I suppose rules could be written, but I suspect they
> >>> would be quite brittle. My guess is that the  ones are more frequent.
> >>> So those should be default, then the question is finding specific
> >>> contexts where it should be the others. A concordance would help, but
> >>> I'm not sure how they would be split by genre or semantic field. This
> >>> is really a problem
> >>>

Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-17 Thread Hèctor Alòs i Font

On Wed, 17 Jun 2020 at 23:36, Hèctor Alòs i Font wrote:

> On Wed, 17 Jun 2020 at 21:12, Francis Tyers wrote:
>
>> On 2020-06-15 17:38, Hèctor Alòs i Font wrote:
>> > Here come several practical examples. I tried to select them for their
>> > variety. The result is more a wish list than something structured.
>>
>> These really are great! Thanks :) Sorry the reply has taken so long.
>>
>> > Let's begin with "je la baise". Depending on the context this may be
>> > "I kiss her" or "I fuck her". The context can tell us if we are in a
>> > formal or colloquial type of language. Another issue is that in this
>> > case the anaphora resolution can also help us: if the pronoun
>> > reference is "hand", it can only be "kiss"; if it is a person, the
>> > doubt persists.
>>
>> For this, I would like to look at a concordance* of a large number of
>> examples to see what kind of information can be used to disambiguate.
>>
>> Intuitively it seems like knowing the genre (e.g. formal/informal) would
>> help. But probably also statistics about subjects, objects and adjuncts,
>> and what they (co-)refer with.
>>
>> * I tried to search on DuckDuckGo, but in the "internet" domain it
>> is very hard to find examples with "kiss", even with "moderated search"
>> turned on.
>>
>> In fact, perhaps that could be a genre "safe translation"... :D
>>
>> Incidentally Google gives "I fuck her" as the translation. I'm able to
>> get
>> "kiss" by adding "bouche" or "main".
>>
>> I think if we want to go by frequency we should have "fuck" if we go
>> by safety we should have "kiss".
>>
>> Probably "humblement" or "vous" are also good indicators of the "kiss"
>> meaning.
>>
>> Any better than that would require further investigation with a
>> concordance.
>>
>> In terms of the module, if we want to do informal/formal then my
>> previous
>> suggestion would work fine.
>>
>> > Another kind of problem is the Arpitan words "chamô" ("camel"; plural
>> > "camels") and "chamôs ("chamois"; unchanged in plural). So,
>> > translating into French, I got yesterday chamois in a Bible text of
>> > Exodus xD  I solved it deciding in a CG rule that all "chamôs"
>> > (without nothing around in singular) are camels.
>>
>> As this is a different morphological paradigm, I would go with the
>> superscript
>> notation ¹²³...
>>
>> > (Similar cases in
>> > French: fil/fils, foi/fois, cour/cours)
>>
>> These have different lemmas, e.g.
>>
>> ^fils/fil/fils$ threads / son*
>> ^fois/foi/fois$ faiths / time*
>> ^cours/cour/cours$  courts / course*
>>
>> The 'cour/cours' example can potentially be disambiguated by the gender.
>>
>> The others I suppose rules could be written, but I suspect they would be
>> quite brittle. My guess is that the  ones are more frequent. So
>> those
>> should be default, then the question is finding specific contexts where
>> it should be the others. A concordance would help, but I'm not sure how
>> they would be split by genre or semantic field. This is really a problem
>> with how world-knowledge is encoded.
>>
>> I wonder if something could be done with word embeddings here. For
>> example
>> my guess is that in the target language the two variants should not
>> be close in the vector space. And they should be closer to words in the
>> same semantic field. This could then be something like a
>> reweighting of the translations according to target language semantic
>> coherence.
>>
>> Note that it would require information to be "backpropagated" from the
>> target
>> language to the source language. Perhaps you could have something like
>> per-reading embeddings that are trained using target language
>> information,
>>
>> so e.g. (fils, fil) [0.323, 0.423, 0.11, 0.595]
>>  (fils, fils) [0.53, 0.605, 0.54, 0.639]
>>
>> Felipe did something like this in his thesis, but he only looked at
>> sequences of part of speech tags. Here we need to know information about
>> the actual analyses.
>>
>> > In French there are plenty of words with different meanings, depending
>> > on the genre: livre, page, tour, etc. The problem is that

Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-17 Thread Hèctor Alòs i Font

On Wed, 17 Jun 2020 at 21:12, Francis Tyers wrote:

> On 2020-06-15 17:38, Hèctor Alòs i Font wrote:
> > Here come several practical examples. I tried to select them for their
> > variety. The result is more a wish list than something structured.
>
> These really are great! Thanks :) Sorry the reply has taken so long.
>
> > Let's begin with "je la baise". Depending on the context this may be
> > "I kiss her" or "I fuck her". The context can tell us if we are in a
> > formal or colloquial type of language. Another issue is that in this
> > case the anaphora resolution can also help us: if the pronoun
> > reference is "hand", it can only be "kiss"; if it is a person, the
> > doubt persists.
>
> For this, I would like to look at a concordance* of a large number of
> examples to see what kind of information can be used to disambiguate.
>
> Intuitively it seems like knowing the genre (e.g. formal/informal) would
> help. But probably also statistics about subjects, objects and adjuncts,
> and what they (co-)refer with.
>
> * I tried to search on DuckDuckGo, but in the "internet" domain it
> is very hard to find examples with "kiss", even with "moderated search"
> turned on.
>
> In fact, perhaps that could be a genre "safe translation"... :D
>
> Incidentally Google gives "I fuck her" as the translation. I'm able to
> get
> "kiss" by adding "bouche" or "main".
>
> I think if we want to go by frequency we should have "fuck" if we go
> by safety we should have "kiss".
>
> Probably "humblement" or "vous" are also good indicators of the "kiss"
> meaning.
>
> Any better than that would require further investigation with a
> concordance.
>
> In terms of the module, if we want to do informal/formal then my
> previous
> suggestion would work fine.
>
> > Another kind of problem is the Arpitan words "chamô" ("camel"; plural
> > "camels") and "chamôs ("chamois"; unchanged in plural). So,
> > translating into French, I got yesterday chamois in a Bible text of
> > Exodus xD  I solved it deciding in a CG rule that all "chamôs"
> > (without nothing around in singular) are camels.
>
> As this is a different morphological paradigm, I would go with the
> superscript
> notation ¹²³...
>
> > (Similar cases in
> > French: fil/fils, foi/fois, cour/cours)
>
> These have different lemmas, e.g.
>
> ^fils/fil/fils$ threads / son*
> ^fois/foi/fois$ faiths / time*
> ^cours/cour/cours$  courts / course*
>
> The 'cour/cours' example can potentially be disambiguated by the gender.
>
> The others I suppose rules could be written, but I suspect they would be
> quite brittle. My guess is that the  ones are more frequent. So
> those
> should be default, then the question is finding specific contexts where
> it should be the others. A concordance would help, but I'm not sure how
> they would be split by genre or semantic field. This is really a problem
> with how world-knowledge is encoded.
>
> I wonder if something could be done with word embeddings here. For
> example
> my guess is that in the target language the two variants should not
> be close in the vector space. And they should be closer to words in the
> same semantic field. This could then be something like a
> reweighting of the translations according to target language semantic
> coherence.
>
> Note that it would require information to be "backpropagated" from the
> target
> language to the source language. Perhaps you could have something like
> per-reading embeddings that are trained using target language
> information,
>
> so e.g. (fils, fil) [0.323, 0.423, 0.11, 0.595]
>  (fils, fils) [0.53, 0.605, 0.54, 0.639]
>
> Felipe did something like this in his thesis, but he only looked at
> sequences of part of speech tags. Here we need to know information about
> the actual analyses.
>
> > In French there are plenty of words with different meanings, depending
> > on the genre: livre, page, tour, etc. The problem is that often the
> > immediate surrounding context does not disambiguate: des livres, les
> > pages, de tour, etc.
>
> This sounds like it would work with some kind of longer distance,
> bag-of-wordsy context module.
>
> > A similar but slightly different case is the word
> > pairs homicide mf/homicide m, féminicide mf/féminicide m, parricide
> > mf/parricide, etc.: the one with the genre "mf" is a person and the
>

Re: [Apertium-stuff] New release for apertium-fra-cat

2020-06-17 Thread Hèctor Alòs i Font

Thanks, Tino!

On Wed, 17 Jun 2020 at 9:28, Tino Didriksen wrote:

> Fixed.
>
> No clue why the previous didn't take effect - the commands were in history
> and some of the installed files had the right timestamp, but not all of
> them.
>
> -- Tino Didriksen
>
>
> On Tue, 16 Jun 2020 at 23:17, Hèctor Alòs i Font 
> wrote:
>
>> apertium.org still has the old apertium-fra-cat version. Could someone
>> update it?
>> Thanks in advance.
>> Hèctor
>>

Re: [Apertium-stuff] New release for apertium-fra-cat

2020-06-16 Thread Hèctor Alòs i Font

apertium.org still has the old apertium-fra-cat version. Could someone
update it?
Thanks in advance.
Hèctor

On Wed, 3 Jun 2020 at 19:07, Hèctor Alòs i Font wrote:

> Thank you very much, Tino!
> Hèctor
>
> On Wed, 3 Jun 2020 at 18:35, Tino Didriksen wrote:
>
>> Done.
>>
>> Tarballs uploaded to Github, release live on apertium.org, and pushed to
>> https://salsa.debian.org/science-team/apertium-fra-cat
>>
>> -- Tino Didriksen
>>
>>
>> On Fri, 22 May 2020 at 06:14, Hèctor Alòs i Font 
>> wrote:
>>
>>> A new release of apertium-fra-cat is ready to be packaged.
>>>
>>> It mostly contains almost 3,000 new translations in the bidix, many of
>>> them on the basis of translations of current newspaper news and social
>>> network chats, especially from French to Catalan. Colloquial (and Covid-19)
>>> language is now much better grasped.
>>>
>>> Please, @Tino Didriksen , could you package the
>>> release?
>>>

Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-15 Thread Hèctor Alòs i Font

Here come several practical examples. I tried to select them for their
variety. The result is more a wish list than something structured.

Let's begin with "je la baise". Depending on the context this may be "I
kiss her" or "I fuck her". The context can tell us if we are in a formal or
colloquial type of language. Another issue is that in this case the
anaphora resolution can also help us: if the pronoun reference is "hand",
it can only be "kiss"; if it is a person, the doubt persists.

Another kind of problem is the Arpitan words "chamô" ("camel"; plural
"camels") and "chamôs" ("chamois"; unchanged in the plural). So, translating
into French, yesterday I got chamois in a Bible text from Exodus xD  I
solved it by deciding in a CG rule that all "chamôs" (with nothing around
them in the singular) are camels. (Similar cases in French: fil/fils,
foi/fois, cour/cours)

In French there are plenty of words with different meanings depending on
their gender: livre, page, tour, etc. The problem is that the immediate
surrounding context often does not disambiguate: des livres, les pages, de
tour, etc. A similar but slightly different case is word pairs such as
homicide mf/homicide m, féminicide mf/féminicide m, parricide mf/parricide
m, etc.: the one with gender "mf" is a person and the other is the action.

Other problems come up in lexical selection. For instance, as a rule, the
Catalan preposition "de" is translated as "de" in French, but if the
following word is a material, "en" must be selected (de fusta > en bois).
So in the Catalan2French lrx file we have a list of materials, just as we
have a list of countries, a list of musical instruments, a list of animals,
etc. I dream of a monolingual dictionary from which we could get this kind
of information. It makes little sense to maintain these lists separately in
every language pair that uses Catalan: this information should be in
apertium-cat and not in every apertium-cat-xxx lrx file.
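
To make the idea concrete, here is a minimal sketch of what a shared monolingual semantic lexicon could look like from the point of view of a language pair. Everything in it is hypothetical: the `SEMANTIC_CLASSES` table, the class names and the helper functions are assumptions for illustration only, not an existing Apertium file format or API.

```python
from typing import Dict, Set

# Hypothetical semantic lexicon that would live once in apertium-cat
# instead of being repeated in every apertium-cat-xxx lrx file.
SEMANTIC_CLASSES: Dict[str, Set[str]] = {
    "fusta": {"material"},
    "ferro": {"material"},
    "guitarra": {"instrument"},
    "França": {"country"},
}


def has_class(lemma: str, sem_class: str) -> bool:
    """True if the Catalan lemma is listed under the given semantic class."""
    return sem_class in SEMANTIC_CLASSES.get(lemma, set())


def translate_de(next_lemma: str) -> str:
    """Choose the French translation of Catalan 'de' from the next word.

    Default is 'de'; 'en' is selected when the following word is a material
    (de fusta > en bois).
    """
    return "en" if has_class(next_lemma, "material") else "de"
```

With such a table, `translate_de("fusta")` selects "en", while `translate_de("guitarra")` keeps the default "de"; any other pair using Catalan could query the same classes without copying the lists.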

Moreover, if words were not only given different kinds of semantic labels
but also marked as synonyms, it might be possible to translate an unknown
word using a word labeled as its synonym (if that one has a translation)
instead of returning "unknown".

Hèctor

On Mon, 15 Jun 2020 at 18:26, Francis Tyers wrote:

> On 2020-06-15 15:02, Xavi Ivars wrote:
> > Hello,
> >
> > To decouple conversations on how to store secondary information from
> > the use case I had in mind (that can be achieved regardless or how we
> > store and propagate that data), let me explain how I see this
> > functionality working, but using some sort of "apertium pipeline
> > trace" (simplified, many tags missing)
> >
> > This is how we currently handle this "mango" issue in spa-cat:
> > changing the "lemma".
> >
> > This is how I envision it. The key points here are: monolingual module
> > that adds the data to the pipeline. Bilingual module (probably
> > lex-tools?) that makes use of that information to decide the best
> > translation.
> >
> > Please don't look into the exact implementation: there are pieces I
> > don't exactly which module would be the one doing the things. Also,
> > please don't look at the "secondary tags" form to define the
> > semantics: i'm using it just for readability in this example but,
> > again, that data could be persisted anywhere.
> >
> > This is why I thought Tanmai's work could be useful for this: if a
> > module can add this data to the stream, a module later in the pipeline
> > (probably apertium-lex-tools, or biltrans itself?) could use it to
> > decide what the right translation is.
> >
> > Does it make sense?
>
> Thanks Xavi for the ideas...
>
> What I've been thinking about is a module that would go after
> biltrans and before lexical selection. It would essentially reweight
> the possible translations based on a bag of words over a fixed
> window of words or "sentences" (delimited with '.').
>
> You could have source and target components, so e.g. you might
> say that "fruit" is a semantic field or domain which includes,
>
> "mango", "manzana", "plátano", "naranja", ...
>
> and
>
> "mango", "taronja", "poma"
>
> In Catalan. These would be in the monolingual pairs. The
> module would take both lists and the input
>
> ^querer/voler$
> ^mango/mànec/mango$
> ^y/i$
> ^manzana/poma$
>
> And try and maximise semantic coherence, then it could reweight,
> so e.g.
>
> ^querer/voler$
> ^mango/mango<2.0>/mànec<0.0>$
> ^y/i$
> ^manzana/poma$
>
> And pass it to the lexical selection module which will choose the
> one with the highest weight.
>
> This would mean a new module, but it would require only minor
> changes to the bilingual dictionary and lexical selection, and
> wouldn't have any effect on transfer.
>
> Given a few more examples I'm sure I could come up with a mockup of
> how it would work and we could go from there.
>
> Fran
>
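
The reweighting module Fran outlines above could be mocked up along the following lines. This is a rough sketch under stated assumptions, not an existing Apertium module: the data structures, the function names `reweight` and `select`, and the flat bonus of 2.0 are all invented for illustration; only the word lists and the mango/mànec example come from the message.

```python
from typing import List, Set, Tuple

# Semantic field "fruit": one word list per language, as in the example above.
FRUIT_SPA: Set[str] = {"mango", "manzana", "plátano", "naranja"}
FRUIT_CAT: Set[str] = {"mango", "taronja", "poma"}

Unit = Tuple[str, List[str]]                    # (source lemma, candidate targets)
Weighted = Tuple[str, List[Tuple[str, float]]]  # same, with a weight per target


def reweight(window: List[Unit], src_field: Set[str], tgt_field: Set[str],
             bonus: float = 2.0) -> List[Weighted]:
    """Boost target candidates that keep the window semantically coherent.

    A candidate gets the bonus when its source word belongs to the field,
    the candidate itself belongs to the target-side field, and at least one
    *other* source word in the window belongs to the field too.
    """
    result: List[Weighted] = []
    for i, (src, targets) in enumerate(window):
        supported = any(other in src_field
                        for j, (other, _) in enumerate(window) if j != i)
        weighted = [
            (tgt, bonus if supported and src in src_field and tgt in tgt_field
             else 0.0)
            for tgt in targets
        ]
        result.append((src, weighted))
    return result


def select(candidates: List[Tuple[str, float]]) -> str:
    """Pick the highest-weighted candidate; ties keep the first (default) one."""
    return max(candidates, key=lambda pair: pair[1])[0]
```

Fed the window `[("querer", ["voler"]), ("mango", ["mànec", "mango"]), ("y", ["i"]), ("manzana", ["poma"])]`, the sketch gives "mango" the bonus because "manzana" is also in the fruit field, so `select` prefers "mango" over "mànec"; without any fruit context the default "mànec" is kept.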

Re: [Apertium-stuff] Apertium's Wider Use & Secondary Tags

2020-06-14 Thread Hèctor Alòs i Font

On Sun, 14 Jun 2020 at 10:32, Francis Tyers wrote:

> On 2020-06-13 23:18, Jonathan Washington wrote:
> > On Sat, Jun 13, 2020, 16:05 Francis Tyers wrote:
> >
> >> On 2020-06-13 19:31, Xavi Ivars wrote:
> >>> Before anything, let me say that I like the proposal to enhance the
> >>> pipeline with more data (including, but not limited to, the surface
> >>> forms), to be able to properly do things that we're currently doing
> >>> in very hacky (to me) and definitely non-linguistic ways.
> >>>
> >>> xavi@dell:~/src/apertium-spa$ echo "El mango" | apertium -d . spa-morph
> >>> ^El/el$
> >>> ^mango/mango/mangar/MANGO_FRUTA$^./.$
> >>>
> >>> In this example, we "add" semantic information to the pipeline (and
> >>> disambiguate via CG3) by creating a "fake lemma" needed for SPA-CAT,
> >>> because "mango" (pan handle) and "mango_fruta" are translated
> >>> differently in Catalan. But this, in turn, forces every other language
> >>> pair using Spanish to know about "mango_fruta" even if the
> >>> translation was the same as "mango".
> >>>
> >>
> >> What is the problem here? That "mango" has two possible lemmas and
> >> paradigms in Spanish?
> >>
> >> The way that I've treated that is to have mango¹ and mango², like in a
> >> traditional dictionary. I don't think that this requires any further
> >> information.
> >
> > I think Xavi's point is that there are a number of ways to approach
> > this, and having the option of another stream to put this extra
> > information could be one of them.  Imho, it is nicer in many ways than
> > even having (very arbitrary) superscripts (that aren't really any
> > better to have in a morphological analysis than _fruta).
> >
>
> It's following what the lexicographers do:
>
> https://dle.rae.es/?w=mango
>
> So it's following a fairly established practice.
>
> Fran
>

As far as I understand the mango issue, Xavi is contemplating a semantic
module that would add extra information about "mango" for other modules
(especially lexical selection) to use. This could serve to distinguish
between a handle and a fruit, but not only that: "mango" can be both the
fruit and the plant. One could eventually add what kind of handle it is;
e.g. the RAE dictionary entry that Fran linked specifically distinguishes
the handle of a knife from other handles. As Xavi shows, this extra
information could be added in a way that lets pairs that don't need it
ignore it. It seems clear that a solution based on attaching arbitrary
secondary information is more versatile than "_fruta", "_2" and the like.

Moreover, in lexical selection we have lots of lists like "fruit",
"building", "person", "device", etc. (and where we don't, it is only for
lack of time to write them). It would be easier if a module like the one
Xavi imagines could add this kind of information and pass it along the
pipeline.

I am not a technician, nor am I a computational linguist. I neither know
nor understand the implications of Tanmai's and Tino's proposals in terms
of system performance. But, as someone with some experience in developing
Apertium language pairs, I would love a tool that allowed adding semantic
information to the pipeline.

Another kind of contextual information that would also be useful to me is
the type of publication (a chat between friends or a medical
encyclopedia?), the dialect, the year of publication, etc. It would help a
lot with lexical selection and, sometimes, with transfer rules.

I don't know if this has helped the discussion at all or... si he pixat
completament fora de test (if I have completely missed the mark).

Hèctor

Re: [Apertium-stuff] New release for apertium-fra-cat

2020-06-03 Thread Hèctor Alòs i Font

Thank you very much, Tino!
Hèctor

On Wed, 3 Jun 2020 at 18:35, Tino Didriksen wrote:

> Done.
>
> Tarballs uploaded to Github, release live on apertium.org, and pushed to
> https://salsa.debian.org/science-team/apertium-fra-cat
>
> -- Tino Didriksen
>
>
> On Fri, 22 May 2020 at 06:14, Hèctor Alòs i Font 
> wrote:
>
>> A new release of apertium-fra-cat is ready to be packaged.
>>
>> It mostly contains almost 3,000 new translations in the bidix, many of
>> them on the basis of translations of current newspaper news and social
>> network chats, especially from French to Catalan. Colloquial (and Covid-19)
>> language is now much better grasped.
>>
>> Please, @Tino Didriksen , could you package the
>> release?
>>

Re: [Apertium-stuff] Apertium Beta Portal

2020-05-30 Thread Hèctor Alòs i Font

Sorry, I forgot to copy it:
http://apertiumtrad.tuxfamily.org/index.php?lang=eng

On Sat, 30 May 2020 at 17:10, mansur <6688...@gmail.com> wrote:

> Hi, Hèctor!
>
> Sounds interesting! Could you, please, provide a link to @Bernard
> Chardonneau 's portal?
>
> Best,
> Mansur
>
> On Sat, 30 May 2020 at 17:06, Hèctor Alòs i Font <
> hectora...@gmail.com> wrote:
>
>> Hi, Mansur!
>> As an alternative, @Bernard Chardonneau's site seems much more reliable,
>> although it doesn't have all the features that Beta Apertium supposedly
>> provides. I was very enthusiastic about Beta Apertium, but I stopped
>> using it because it doesn't work.
>> Best,
>> Hèctor
>>
>> On Sat, 30 May 2020 at 16:51, mansur <6688...@gmail.com> wrote:
>>
>>> Hey!
>>>
>>> It turned out, the Apertium Beta portal stopped working for some reason:
>>> http://beta.apertium.org/
>>> If I use httpS it redirects to the wiki page.
>>>
>>> Will it be fixed sometime soon? If not, what should we use that includes
>>> beta features?
>>>
>>> With best regards,
>>> Mansur

Re: [Apertium-stuff] Apertium Beta Portal

2020-05-30 Thread Hèctor Alòs i Font

Hi, Mansur!
As an alternative, @Bernard Chardonneau's site seems much more reliable,
although it doesn't have all the features that Beta Apertium supposedly
provides. I was very enthusiastic about Beta Apertium, but I stopped using
it because it doesn't work.
Best,
Hèctor

On Sat, 30 May 2020 at 16:51, mansur <6688...@gmail.com> wrote:

> Hey!
>
> It turned out, the Apertium Beta portal stopped working for some reason:
> http://beta.apertium.org/
> If I use httpS it redirects to the wiki page.
>
> Will it be fixed sometime soon? If not, what should we use that includes
> beta features?
>
> With best regards,
> Mansur

[Apertium-stuff] New release for apertium-fra-cat

2020-05-21 Thread Hèctor Alòs i Font

A new release of apertium-fra-cat is ready to be packaged.

It mainly adds almost 3,000 new translations to the bidix, many of them
drawn from translating current newspaper articles and social network
chats, especially from French into Catalan. Colloquial (and Covid-19)
language is now handled much better.

@Tino Didriksen, could you please package the release?

Re: [Apertium-stuff] Secondary Tag Prefixes

2020-05-08 Thread Hèctor Alòs i Font

Message from Francis Tyers on Fri., 8 May 2020
at 18:05:

> On 2020-05-08 15:50, Tino Didriksen wrote:
> > For khannatanmai's GSoC project, secondary tags will be implemented in
> > a backwards compatible manner. That in itself is indisputable. But,
> > there is a question of how the initial batch of secondary tags should
> > look.
> >
> > I feel they should be in the form of , as in a very short
> > textual lower-case prefix, followed by :, followed by whatever value
> > there is. Or even an upper-case prefix, as in  or .
> >
> > spectie wants symbol prefixes in the form of <%:cdefg>.
> >
>
> [snip]
>
> > From a technical and scientific basis, textual prefixes are just
> > better. And yet, spectie wants symbol prefixes because he likes them.
> > I disagree. Hence, this mail asking for opinions.
> >
> > Do you language developers actually prefer symbol prefixes?
> >
>
> Tino misrepresented me slightly. I never proposed using the pound sign.
>
> My proposal was for:
>
>
> отец<@subj><§agent><%:отца><:human><:kin>
>
> If we have to have these "secondary tags"... which I have yet to be
> completely convinced of,
> I would like to have them be readable and not clutter the stream with
> unnecessary
> verbosity. There are a lot of rule-based formalisms out there that are
> impossible to read,
> having been dreamt up by people who don't actually spend a lot of time
> writing language
> data, and I would like to avoid that happening with Apertium.
>

Well, from a developer's point of view, I would very much like to have
information like "human", "construction", "demonym", "material", "musical
instrument", etc., which I need for lexical selection and sometimes also
for transfer. It seems logical to me that this data would some day be
placed in the dictionary or in a kind of secondary dictionary. In fact,
the trend is already to add more semantic information to words: for example,
in proper names we now often distinguish between first names, surnames,
place names, hydronyms, etc.

Personally, I don't have any preference in the syntax. I'm fine with any
method that is short, easy to type on any keyboard and that identifies a
tag as secondary.
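
To make the idea concrete, here is a minimal Python sketch of how a program could separate ordinary tags from secondary ones, using the symbol prefixes of the example quoted above. It is only an illustration: the function name, the prefix list and the <n> tag in the usage line are assumptions, not part of any Apertium tool.

```python
import re

# Prefixes read off the example quoted above; treating them as "secondary"
# markers is an assumption made for this sketch. Longest prefix comes first.
SECONDARY_PREFIXES = ("%:", "@", "§", ":")

def split_tags(lexical_unit):
    """Return (lemma, primary_tags, secondary_tags_by_prefix)."""
    lemma = lexical_unit.split("<", 1)[0]
    primary, secondary = [], {}
    for tag in re.findall(r"<([^<>]+)>", lexical_unit):
        for prefix in SECONDARY_PREFIXES:
            if tag.startswith(prefix):
                # Group the value under the prefix that introduced it.
                secondary.setdefault(prefix, []).append(tag[len(prefix):])
                break
        else:
            # No known prefix: treat it as an ordinary morphological tag.
            primary.append(tag)
    return lemma, primary, secondary

lemma, primary, secondary = split_tags("отец<n><@subj><§agent><%:отца><:human><:kin>")
print(lemma, primary, secondary)
```

A lexical selection rule could then test something like "human" in secondary.get(":", []) without caring which syntax is finally chosen.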

Hèctor



> Again, and again I want to see a translation and a linguistic
> motivation. In an _actual_
> language pair, not in someone's imagination.
>
> We have a lot of modules that have been made but not reached use in a
> released pair,
> so I don't see how this should be different.
>
> Fran
>

Re: [Apertium-stuff] Help in writing transfer rule

2020-04-05 Thread Hèctor Alòs i Font

Message from Rajarshi Roychoudhury on Sun., 5
Apr. 2020 at 17:33:

> Hi,
> I want to write a transfer rule which has the following structure
>
>   IF CONDITION 1
>  OUTPUT LEXICAL UNIT 1( )
>   ELSE IF CONDITION 2
>  OUTPUT LEXICAL UNIT 2()
>   ELSE
>   OUTPUT LEXICAL UNIT 3()
> I am able to write the condition part, but where should the <out> tag
> be placed? Can I have a separate <out> for each condition? I need the
> detailed structure of the pipeline of such a transfer rule.
>
>
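In fact, you have to have a separate <out> for each condition. Basically,
the structure is:

<choose>
  <when>
    <test> condition </test>
    <out> ... </out>
  </when>

  <when>
    <test> condition </test>
    <out> ... </out>
  </when>

  <otherwise>
    <out> ... </out>
  </otherwise>
</choose>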
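
As a rough illustration of the if / else-if / else skeleton described above, the following Python sketch builds the nesting with the standard library and prints it. The element names choose, when, test, otherwise and out are the ones used in Apertium transfer files; the helper build_choose and the placeholder texts are invented for this example. In a real rule, the test would contain a comparison element and each out would contain lexical units.

```python
import xml.etree.ElementTree as ET

def build_choose(branches, fallback):
    """Build a <choose> element.

    branches: list of (condition_placeholder, output_placeholder) pairs,
              one <when> per pair, tried in order.
    fallback: placeholder for the <out> inside <otherwise>.
    """
    choose = ET.Element("choose")
    for condition, output in branches:
        when = ET.SubElement(choose, "when")
        ET.SubElement(when, "test").text = condition
        # Each branch carries its own <out>.
        ET.SubElement(when, "out").text = output
    otherwise = ET.SubElement(choose, "otherwise")
    ET.SubElement(otherwise, "out").text = fallback
    return choose

skeleton = build_choose(
    [("condition 1", "lexical unit 1"), ("condition 2", "lexical unit 2")],
    "lexical unit 3",
)
print(ET.tostring(skeleton, encoding="unicode"))
```

The point of the sketch is only the shape: one out per when, plus one in otherwise, all inside a single choose.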

Hèctor

Re: [Apertium-stuff] Election Results

2020-04-04 Thread Hèctor Alòs i Font

Message from Samuel Sloniker on Sat., 4 Apr.
2020 at 22:11:

> Just a quick question... If I was not eligible to run due to not being a
> Committer, why was my vote counted?
>

Maybe I'm missing something. You were a candidate, weren't you? You were in the
census and you got a voter code, didn't you? Of course, your vote was
counted, and the Election Board even accepted the amendment of your vote.

Hèctor


>
> On Sat, Apr 4, 2020 at 11:55 AM Daniel Swanson 
> wrote:
>
>> Hi Apertiumers!
>>
>> The election proceedings are now complete and the votes have been tallied
>> as follows:
>>
>> Votes: 41
>> For president :
>> - Tino Didriksen 9
>> - Francis Tyers 30
>> For members :
>> - Sushain K. Cherivirala 18
>> - Tino Didriksen 28
>> - Mikel L. Forcada 29
>> - Scoop Gracie (pseudonym) 4
>> - Xavi Ivars 20
>> - Tanmai Khanna 4
>> - Francis Tyers 23
>> - Jonathan Washington 24
>> There is a tie between Scoop Gracie and Tanmai Khanna. In consultation
>> with the current PMC it was decided that under a strict reading of the
>> bylaws Tanmai Khanna would be eligible to run and Scoop Gracie would not.
>> Thus we announce the election results as follows:
>>
>> President: Francis Tyers
>> Members:
>> - Sushain K. Cherivirala
>> - Tino Didriksen
>> - Mikel L. Forcada
>> - Xavi Ivars
>> - Tanmai Khanna*
>> - Jonathan Washington
>>
>> * Due to participation in GSoC, Tanmai Khanna's appointment will be
>> delayed.
>>
>> Thank you to the candidates and to everyone who voted.
>>
>> The election committee,
>> Sevilay, Hèctor, and Daniel

Re: [Apertium-stuff] Lexical Selection

2020-04-01 Thread Hèctor Alòs i Font

Hi Helena,

There are lexical selection rules for "estació" in the apertium-fra-cat
pair. Have a look at the file apertium-fra-cat.cat-fra.metalrx. If you
refine the rules for English, you can do the same for the French ones :)
And if you need any further clarification, don't hesitate to ask.
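
On the side question in the message quoted below (showing every candidate translation instead of picking one), here is a minimal Python sketch. It is not an Apertium tool. It assumes that the bilingual lookup output has the form ^source<tags>/target1<tags>/target2<tags>$, and the tags in the usage line are invented, since the archived message lost them. Escaped characters in the stream are not handled.

```python
import re

TAG = re.compile(r"<[^<>]*>")

def show_alternatives(stream):
    """Join the target-side lemmas of each ^...$ lexical unit with '/'."""
    words = []
    for unit in re.findall(r"\^(.*?)\$", stream):
        source, *targets = unit.split("/")
        # Drop the tags and keep each distinct lemma once, in order.
        lemmas = list(dict.fromkeys(TAG.sub("", t) for t in targets))
        # Fall back to the source lemma when there is no translation at all.
        words.append("/".join(lemmas) if lemmas else TAG.sub("", source))
    return " ".join(words)

print(show_alternatives("^estació<n>/season<n>/station<n>$ ^més<adv>/more<adv>$"))
```

This only displays the ambiguity; choosing between the options is still the job of the lexical selection rules.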

Kind regards,
Hèctor

Message from egea piñeiro helena on Wed., 1
Apr. 2020 at 11:48:

> Hello!
>
>
> I'm working on eng-cat and spa-cat dictionaries trying to find a way to
> avoid lexical selection. I mean, in such an example as this one:
>
> ^El/The$ ^*estació/season*
> */station*$ ^més/more$
> ^plujós/rainy$
> ^ser/be$
> ^el/the$ ^estiu/summer$
>
> How can I show the translated text with all the options that arise from polysemy?
>
> "The *season/station* more rainy is"
>
> Also, to study this, I'm trying to find out how to change the lexical
> selection rules (.lrx files) and see an actual change in the output, but
> the commands in the pipeline don't seem to refer to that file. All
> that I can change to test different polysemy cases are the 'srl' and 'lrs'
> on the dictionary  along with the 'D' option (default). I assumed that
> autolex.bin somehow called the lrx file, but I don't know for sure and I
> couldn't find any information so far.
>
>
> echo "L'estació més plujosa és l'estiu" \
>   | lt-proc -w '/usr/share/apertium/apertium-eng-cat/cat-eng.automorf.bin' \
>   | cg-proc -w '/usr/share/apertium/apertium-eng-cat/cat-eng.rlx.bin' \
>   | apertium-tagger -g $2 '/usr/share/apertium/apertium-eng-cat/cat-eng.prob' \
>   | apertium-pretransfer \
>   | lsx-proc '/usr/share/apertium/apertium-eng-cat/cat-eng.autosep.bin' \
>   | lt-proc -b '/usr/share/apertium/apertium-eng-cat/cat-eng.autobil.bin' \
>   | lrx-proc -m -t '/usr/share/apertium/apertium-eng-cat/cat-eng.autolex.bin'
>
>
> Thanks!
>
>

Re: [Apertium-stuff] GSoC

2020-03-30 Thread Hèctor Alòs i Font

  n="sp"/>
>  n="sg"/>
>  n="pl"/>
> 
>
> For the first paradigm, the more simple syntax in fra-por pair is even a
> little
> better :
>
> 
> n="sg"/>
>   s n="pl"/>
>  n="sg"/>
>   s  n="pl"/>
>  n="sg"/>
>   s  n="pl"/>
> 
>
> So this kind of change will have to be made everywhere in
> apertium-fra.fra.metadix where a word is analysed as mf (gender unknown:
> masculine or feminine) or sp (number unknown: singular or plural).
>
> That is already done everywhere in apertium-epo-fra.epo.dix (as in
> apertium-fra-por.fra.metadix), so you will just have to carry these
> changes over.
>
> Add the words present in apertium-epo-fra.fra.dix that are not yet
> in apertium-fra.fra.metadix.
>
> Note: for at least one paradigm there is a difference between the two
> files. Masculine nouns that take an "s" in the plural use the paradigm
> livre__n in apertium-fra.fra.metadix but accessoire__n in
> apertium-epo-fra.fra.dix.
>
> I prefer accessoire__n: that way, for the two most common noun paradigms,
> the reference name of the paradigm appears very early in the
> alphabetically sorted list of words.
>
> So let's change livre__n to accessoire__n everywhere.
>
> I don't know whether other paradigms do the same thing under different
> names in the two files, but if you find any, take as the reference word
> the first of these names in alphabetical order.
>
> That way, the most frequently used paradigms will be the ones that appear
> early in the alphabetically sorted list of words. That could help in
> choosing a paradigm without generally having to read through a large
> number of them.
>
> Now for the language epo.
>
> I found a horrible file of more than 200 000 lines of paradigms, and no
> words that use them! Completely useless. Only the comments in the sdef
> section could be useful.
>
> So this file will have to be rebuilt from apertium-epo-fra.epo.dix and
> apertium-eo-en.eo.dix.xml (plus possibly other files of that kind) to get
> all the Esperanto words used in these pairs. The paradigms used seem to
> work the same way in both pairs.
>
> After that, you will have to test whether the transfer rules still work
> and correct them.
>
> As he said, for eo-en pair, ask to Jacob Nordfalk
>
> For fra => epo translation direction, ask to Hèctor Alòs
>
> For epo => fra translation direction, ask to me.
>
> For this translation direction, step 0 (apertium-epo-fra.epo-fra.t0x)
> adds "unu" to nouns without the determiner "la". That makes it possible to
> use the same transfer rules for nouns with the determiner "le", "la", "les"
> (or sometimes "l'" after post-generation) and for nouns with the
> determiner "un", "une", "des".
>
> After that, only one transfer stage is used. At present, there are no
> rules for adverbs or for pronouns in the accusative form. Adding them would
> greatly reduce the number of # in a translation.
>
> I also did a lot of tranfer rules for sentences like
> ? *  "de" ? *  estas 
> ? *  "de" ? *  
>
> Example :
> la kato de la najbarino estas blanka
> la malgranda katino de la najbaro estas blanka
> la kanino de la dika najbarino ne estas nigra
> katoj de la dika granda najbaro estas blankaj
> ..
>
> With the possibility of having 0, 1 or 2 adjectives for each noun, that
> makes plenty of similar transfer rules, even if in that case step 0
> divides their number by 4.
>
> A good change would be to rewrite the transfer rules for this kind of
> sentence using a 3-stage transfer. That allows processing shorter
> lists of words and sending the gender and number (or other information)
> of one group to another.
>
>
>
> > Date: Wed, 25 Mar 2020 13:48:17 +0300
> > From: Hèctor Alòs i Font 
> > To: "[apertium-stuff]" ,
> >  Bernard Chardonneau ,
> >  Jacob Nordfalk 
> > Reply-To: apertium-stuff@lists.sourceforge.net
> > Subject: Re: [Apertium-stuff] GSoC
> > Hello, Andrew!
> > I am glad to read about a proposal related to Esperanto. I will continue
> > in English so that everyone can understand.
> >
> > It probably doesn't make any sense to work on the English-French pair in
> > Apertium, since these are two of the languages with the most resources in
> > the world (linguistic and

Re: [Apertium-stuff] New release for apertium-fra-cat & por-cat

2020-03-29 Thread Hèctor Alòs i Font

Message from Sushain Cherivirala on Sun., 29 March
2020 at 22:24:

> Hèctor,
>
> > cat-por_PTpre1990
>
> I think that https://github.com/apertium/apertium-apy/issues/141 is
> relevant here. That particular variant isn't working since it has numbers
> in it. I hot-patched the regex and it seem to work now:
> https://www.apertium.org/apy/listPairs. Committed to master:
> https://github.com/apertium/apertium-apy/commit/2f391281532273cc2d229f645f0e5fdf30cfbe7f
> .
>

As far as I knew, the only issue was the "_" that was used initially. I dropped
it, but I didn't know that numbers were not allowed either. Thanks for patching it!
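
To illustrate the kind of problem described here, this is a small Python sketch with two hypothetical patterns for mode names such as cat-por_BR and cat-por_PTpre1990. They are NOT the expressions used in APy (see the commit linked above for the real change); they only show how a variant pattern restricted to letters rejects a name containing digits.

```python
import re

# Hypothetical patterns for illustration only, not the actual APy regexes.
LETTERS_ONLY = re.compile(r"^([a-z]{2,3})-([a-z]{2,3})(?:_([A-Za-z]+))?$")
WITH_DIGITS = re.compile(r"^([a-z]{2,3})-([a-z]{2,3})(?:_([A-Za-z0-9]+))?$")

def parse_mode(name, pattern=WITH_DIGITS):
    """Return (source, target, variant) or None if the name does not match."""
    match = pattern.match(name)
    if match is None:
        return None
    # The variant group is None when the name has no "_variant" suffix.
    return match.groups()

for mode in ("fra-cat", "cat-por_BR", "cat-por_PTpre1990"):
    print(mode, parse_mode(mode, LETTERS_ONLY), parse_mode(mode))
```

With the letters-only pattern, the pre-1990 variant is silently dropped, which matches the symptom of a mode that never shows up in the list of pairs.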

> This also needed a tweak to the html-tools config to allow the variant. It
> looks like apertium.org picks it up now.
>

Yes, it is working! Thanks!


> > By the way, "PTpre1990" stands for "European Portuguese (traditional
> orthography)". It should probably be added to the interface tags, but I
> don't know where is the file in github where the meaning should be added in
> several languages.
>
> You're looking for this file:
> https://github.com/apertium/apertium-apy/blob/master/language_names/variants.tsv
>
> Feel free to send a pull request my way. We'll have to upgrade APy to pick
> it up on the other end. I can cut a quick release for that after you've
> updated it.
>
> On Sat, Mar 28, 2020 at 12:06 PM Hèctor Alòs i Font 
> wrote:
>
>> The new release of apertium-fra-cat is already available on apertium.org.
>> Thanks!
>> But I'm not sure the new version of apertium-por-cat is. At least the modes
>> cat-por_BR and cat-por_PTpre1990 are not there yet. Could someone take a
>> look, please?
>> By the way, "PTpre1990" stands for "European Portuguese (traditional
>> orthography)". It should probably be added to the interface tags, but I
>> don't know which file on GitHub is the place to add its meaning in
>> several languages.
>>
>> Hèctor
>>
>> Message from Hèctor Alòs i Font on Sat., 21
>> March 2020 at 0:07:
>>
>>> Thanks a lot, Tino!
>>>
>>> Message from Tino Didriksen on Fri., 20
>>> March 2020 at 23:13:
>>>
>>>> Finally got around to packaging fra-cat and por-cat.
>>>> https://github.com/apertium/apertium-packaging/issues/26 has links to
>>>> exact commits and release tags.
>>>>
>>>> Pushed to Debian:
>>>> - https://salsa.debian.org/science-team/apertium-fra-cat v1.8.0
>>>> - https://salsa.debian.org/science-team/apertium-pt-ca v0.10.0
>>>>
>>>> And upgraded public APy instance.
>>>>
>>>> -- Tino Didriksen
>>>>
>>>>
>>>> On Sat, 8 Feb 2020 at 10:23, Hèctor Alòs i Font 
>>>> wrote:
>>>>
>>>>> A new release of apertium-fra-cat is ready to be packaged.
>>>>>
>>>>> It mostly contains many new translations in the bidix (more than
>>>>> 15,000). Besides:
>>>>> - disambiguation has been improved, especially for French
>>>>> - dozens of new lexical selection rules have been added
>>>>> - dozens of new transfer rules have been added, especially for the
>>>>> cat-fra side
>>>>>
>>>>> In any case, the main problem of this language pair is that it is
>>>>> still using just one step in the transfer. This makes it impossible to
>>>>> reach a WER of 20%, especially on the cat-fra side, where quite a lot of
>>>>> words have to be added or reordered. Unfortunately, I can't find the
>>>>> time for such a change, which requires a lot of work.
>>>>>
>>>>> @Tino Didriksen, could you please package the release?
>>>>>
>>>>> Hèctor
>>>>>

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-28 Thread Hèctor Alòs i Font

Hi Tanmai,

I am surprised by this proposal. It involves some very important changes
that should be better justified. I don't quite understand when one should
define the "optional secondary information" in addition to the current
morphological fields. Will it be in the language module (apertium-xxx) or
in each of the translation modules (apertium-xxx-yyy)? Part of the problem
may be the example. I can't imagine why information on case should be
added to every English word (any more than, say, information about
possession, which is common in Turkic languages). Will this kind of
information, unnecessary for everybody or almost everybody, end up in
every language pair using, say, English, just because someone wants to add
it for his or her specific purposes? As far as I understand, the project in
question needs to add the surface form of the word. This seems quite
logical. Moreover, this information may be useful for, e.g., lexical
selection and structural transfer. But anything beyond that seems too
obscure to me.
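
For the narrow case that seems justified (carrying the surface form), a minimal Python sketch of the backwards-compatible reading could look like this. It assumes, as in the proposal quoted below, that a secondary tag is any tag containing ":". The tag name sf and all the tags in the usage example are hypothetical, since the archived example lost its tags.

```python
import re

TAG = re.compile(r"<([^<>]+)>")

def drop_secondary(lexical_unit):
    """Remove every tag containing ':' and leave the rest untouched.

    A program that does not care about secondary information could apply
    this first and then parse the unit exactly as it does today.
    """
    return TAG.sub(lambda m: "" if ":" in m.group(1) else m.group(0), lexical_unit)

def secondary_info(lexical_unit):
    """Collect the 'key:value' tags of a lexical unit into a dict."""
    pairs = (m.group(1).split(":", 1) for m in TAG.finditer(lexical_unit))
    return {key: value for key, value in (p for p in pairs if len(p) == 2)}

unit = "^potato<n><pl><sf:potatoes>/patata<n><f><pl>$"
print(drop_secondary(unit))
print(secondary_info(unit))
```

Programs that need the surface form read it from the dict; all the others see the same stream as before.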

Best,
Hèctor

Message from Tanmai Khanna on Sat., 28 March
2020 at 23:51:

> Hey guys,
> As part of the project to eliminate trimming, I had to come up with a way
> to include the surface form in the lexical unit and hence modifying the
> apertium stream format. To do this I would have to modify the parsers of
> every program in the pipeline, and if that has to happen, we discussed on
> the IRC that *it might be a good idea to modify the stream in such a way
> that we can include an arbitrary amount of information in a lexical unit,
> and each program can use whatever information they need.*
>
> The current information in the lexical unit would be primary information,
> and then we would have optional secondary information which could contain
> the surface form, but also literally anything you can think of (case,
> sentiment, pragmatic info, etc.). This would open up a lot of possibilities
> for each program, and it would strengthen the apertium stream format
> considerably.
>
> We discussed several possible syntax for this new stream format, and the
> one that seems the best is something like this:
>
> ^potato/patata$
>
> This doesn't mess with the current stream format too much. The number of
> tags is already arbitrary so that helps. The secondary tags contain a ":"
> that would help distinguish them from primary tags.
>
> To implement this a modification would still be needed to all the parsers
> but the benefits far outweigh the amount of work needed to pull this off.
>
> Since this would be a major fundamental change to Apertium, I request you
> all to contribute with your views, any pros, cons, suggestions - to the
> idea, to the syntax, anything.
>
> Thanks and Regards,
> Tanmai Khanna
>
> --
> *Khanna, Tanmai*

Re: [Apertium-stuff] New release for apertium-fra-cat & por-cat

2020-03-28 Thread Hèctor Alòs i Font

The new release of apertium-fra-cat is already available on apertium.org.
Thanks!
But I'm not sure the new version of apertium-por-cat is. At least the modes
cat-por_BR and cat-por_PTpre1990 are not there yet. Could someone take a
look, please?
By the way, "PTpre1990" stands for "European Portuguese (traditional
orthography)". It should probably be added to the interface tags, but I
don't know which file on GitHub is the place to add its meaning in
several languages.

Hèctor

Message from Hèctor Alòs i Font on Sat., 21
March 2020 at 0:07:

> Thanks a lot, Tino!
>
> Message from Tino Didriksen on Fri., 20
> March 2020 at 23:13:
>
>> Finally got around to packaging fra-cat and por-cat.
>> https://github.com/apertium/apertium-packaging/issues/26 has links to
>> exact commits and release tags.
>>
>> Pushed to Debian:
>> - https://salsa.debian.org/science-team/apertium-fra-cat v1.8.0
>> - https://salsa.debian.org/science-team/apertium-pt-ca v0.10.0
>>
>> And upgraded public APy instance.
>>
>> -- Tino Didriksen
>>
>>
>> On Sat, 8 Feb 2020 at 10:23, Hèctor Alòs i Font 
>> wrote:
>>
>>> A new release of apertium-fra-cat is ready to be packaged.
>>>
>>> It mostly contains many new translations in the bidix (more than
>>> 15,000). Besides:
>>> - disambiguation has been improved, especially for French
>>> - dozens of new lexical selection rules have been added
>>> - dozens of new transfer rules have been added, especially for the
>>> cat-fra side
>>>
>>> In any case, the main problem of this language pair is that it is still
>>> using just one step in the transfer. This makes it impossible to reach a
>>> WER of 20%, especially on the cat-fra side, where quite a lot of words
>>> have to be added or reordered. Unfortunately, I can't find the time for
>>> such a change, which requires a lot of work.
>>>
>>> @Tino Didriksen, could you please package the release?
>>>
>>> Hèctor
>>>

Re: [Apertium-stuff] GSoC

2020-03-26 Thread Hèctor Alòs i Font

Andrew,
Yes, you are in a hybrid situation. I think "Make a language pair
state-of-the-art" fits your situation better. In the end, what matters is to
show convincingly that you understand the tool and that you will be able to
do the task without losing several weeks of a short project on learning it.

Message from Andrew Briand on Fri., 27 March
2020 at 7:10:

> Hello,
>
> Which "coding challenge" would suit my situation better? I can either
> start creating a new esp->fra translator (the challenge for "Adopt a
> language pair") or improve a pair for the challenge of the idea "Make a
> language pair state-of-the-art."
>
>
>
> Andrew
>
>
>
> *From: *Hèctor Alòs i Font 
> *Sent: *Thursday, March 26, 2020 8:37 PM
> *To: *[apertium-stuff] ; Jacob
> Nordfalk 
> *Subject: *Re: [Apertium-stuff] GSoC
>
>
>
> Message from Andrew Briand on Fri., 27 March
> 2020 at 1:43:
>
> Hello Jacob and Hèctor!
>
> Many thanks for your replies. I am now trying to evaluate the current
> English-Esperanto and French-Esperanto pairs to compare them with Google
> Translate. I will soon write a draft proposal.
>
> Very good, Andrew, but note that it is essential that you do the so-called
> "coding challenge". With it you will prove that you understand the basic
> workings of Apertium and know how to add words to the dictionaries and
> create simple rules. Not many days are left.
>
> @Jacob Nordfalk, this year I am myself applying as a student for the
> second (and last) year. So, if I am selected, I will not be able to
> co-mentor.
>
>
>
> Hèctor
>
>
>
>
>
> Could I request a password for the wiki?
>
> Username: andrewbriand
>
> Email: atb8...@comcast.net
>
>
>
> Thank you,
>
>
>
> Andrew
>
>
>
> *From: *Jacob Nordfalk 
> *Sent: *Thursday, March 26, 2020 6:47 AM
> *To: *apertium-stuff 
> *Subject: *Re: [Apertium-stuff] GSoC
>
>
>
> Hello Andrew!
>
> It was I who made English-Esperanto 10 years ago; I am a programmer
> who learned the necessary linguistics while working on the pair.
>
> I would be very glad if you modernised the pairs, and I would gladly be a
> co-mentor (together with Hector) for work related to Esperanto.
>
> Kind regards,
>
> Jacob
>
> On Wed., 25 March 2020 at 16:11, Francis Tyers wrote:
>
> On 2020-03-25 08:46, Andrew Briand wrote:
> > Hello,
> >
> > I am an undergrad interested in adopting an Apertium language pair for
> > Google Summer of Code 2020. I am most interested in French<->English,
> > Esperanto->English, and Esperanto<->French. What might a project for
> > those language pairs look like?
> >
> > Thank you,
> >
>
> Dear Andrew,
>
> Have you read:
>
>
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Adopt_a_language_pair#Frequently_asked_questions
>
> Particularly:
>
>   Can I do a pair with language x and language y ?
>
>  — Yes, there are no restrictions. But you should take the following
> into consideration: (a) Are there existing machine translation (MT)
> systems for this pair? (b) If there are existing systems, how good are
> they? -- Could you do better in three months? (c) How closely related is
> the pair? (d) How many resources already exist for the pair? (e) Are
> there any mentors who can evaluate your work?
>
> Best regards,
>
> Francis M. Tyers
>
>
> --
>
> Jacob Nordfalk <http://profiles.google.com/jacob.nordfalk>
>
> Androidudvikler og -underviser på DTU
> <http://www.dtu.dk/service/telefonbog/person?id=78778=7#tabs>
>
> Tlf 26206512 - javabog.dk
>

Re: [Apertium-stuff] GSoC

2020-03-26 Thread Hèctor Alòs i Font

Message from Andrew Briand on Fri., 27 March
2020 at 1:43:

> Hello Jacob and Hèctor!
>
> Many thanks for your replies. I am now trying to evaluate the current
> English-Esperanto and French-Esperanto pairs to compare them with Google
> Translate. I will soon write a draft proposal.
>

Very good, Andrew, but note that it is essential that you do the so-called
"coding challenge". With it you will prove that you understand the basic
workings of Apertium and know how to add words to the dictionaries and create
simple rules. Not many days are left.

@Jacob Nordfalk, this year I am myself applying as a student for the second
(and last) year. So, if I am selected, I will not be able to co-mentor.

Hèctor


>
> Could I request a password for the wiki?
>
> Username: andrewbriand
>
> Email: atb8...@comcast.net
>
>
>
> Thank you,
>
>
>
> Andrew
>
>
>
> *From: *Jacob Nordfalk 
> *Sent: *Thursday, March 26, 2020 6:47 AM
> *To: *apertium-stuff 
> *Subject: *Re: [Apertium-stuff] GSoC
>
>
>
> Hello Andrew!
>
> It was I who made English-Esperanto 10 years ago; I am a programmer
> who learned the necessary linguistics while working on the pair.
>
> I would be very glad if you modernised the pairs, and I would gladly be a
> co-mentor (together with Hector) for work related to Esperanto.
>
> Kind regards,
>
> Jacob
>
> On Wed., 25 March 2020 at 16:11, Francis Tyers wrote:
>
> On 2020-03-25 08:46, Andrew Briand wrote:
> > Hello,
> >
> > I am an undergrad interested in adopting an Apertium language pair for
> > Google Summer of Code 2020. I am most interested in French<->English,
> > Esperanto->English, and Esperanto<->French. What might a project for
> > those language pairs look like?
> >
> > Thank you,
> >
>
> Dear Andrew,
>
> Have you read:
>
>
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Adopt_a_language_pair#Frequently_asked_questions
>
> Particularly:
>
>   Can I do a pair with language x and language y ?
>
>  — Yes, there are no restrictions. But you should take the following
> into consideration: (a) Are there existing machine translation (MT)
> systems for this pair? (b) If there are existing systems, how good are
> they? -- Could you do better in three months? (c) How closely related is
> the pair? (d) How many resources already exist for the pair? (e) Are
> there any mentors who can evaluate your work?
>
> Best regards,
>
> Francis M. Tyers
>
>
> --
>
> Jacob Nordfalk 
>
> Androidudvikler og -underviser på DTU
> 
>
> Tlf 26206512 - javabog.dk
>

Re: [Apertium-stuff] PMC election: Proclamation of the candidates.

2020-03-25 Thread Hèctor Alòs i Font

There's nothing to the contrary in the by-laws, so they are allowed. In
fact, in a normal election I suspect this must be the usual practice for
candidates.

On Wed., 25 March 2020 at 19:29, Scoop Gracie
wrote:

> Are candidates allowed to vote for themselves?
>
> On Wed, Mar 25, 2020, 05:06 Hèctor Alòs i Font 
> wrote:
>
>> Hi Tanmai,
>>
>> The by-laws <http://wiki.apertium.org/wiki/By-laws> are quite specific
>> on this point (article 24):
>>
>> - "j. Once candidates are proclaimed, Committers with the right to vote
>> will send a ballot with:
>>
>>1. The name of a candidate to president of the Project Management
>>Committee, and
>>2. The name of up to four candidates to members of the Project
>>Management Committee
>>
>> - k. There will be 7 days to send the ballots"
>>
>> Hèctor
>>
>> Message from Tanmai Khanna on Wed., 25
>> March 2020 at 14:50:
>>
>>> Hey Hèctor,
>>> What's the voting system that the election board will follow? Will it be
>>> first past the post or a ranked system like STV?
>>> Also, will I as a voter get 7 votes for the PMC or will I get 1 vote?
>>> Clarifying these and the thought-process behind the choice will make the
>>> process more transparent.
>>>
>>> Thanks and Regards,
>>> Tanmai
>>>
>>> On Wed, Mar 25, 2020 at 3:06 PM Hèctor Alòs i Font 
>>> wrote:
>>>
>>>> The final census of voters for the Apertium Project Management
>>>> Committee election is here
>>>> <https://docs.google.com/spreadsheets/d/1ECL_8Lkfx4A66xpHhbOTn7ljKoDcLa0w7MdFZC7DOpA/edit#gid=0>.
>>>> Nobody has found anything to amend.
>>>>
>>>> The following people have indicated they want to run for PMC members:
>>>>
>>>> - Sushain K. Cherivirala
>>>> - Tino Didriksen
>>>> - Mikel L. Forcada
>>>> - Scoop Gracie (pseudonym)
>>>> - Xavi Ivars
>>>> - Tanmai Khanna
>>>> - Francis Tyers
>>>> - Jonathan Washington
>>>>
>>>> Two candidates are standing for PMC President:
>>>>
>>>> - Tino Didriksen
>>>> - Francis Tyers
>>>>
>>>> The ballots will be sent to the voters shortly.
>>>>
>>>> On behalf of the Election Board,
>>>> Hèctor Alòs i Font
>>>
>>>
>>> --
>>> *Khanna, Tanmai*

Re: [Apertium-stuff] PMC election: Proclamation of the candidates.

2020-03-25 Thread Hèctor Alòs i Font

Hi Tanmai,

The by-laws <http://wiki.apertium.org/wiki/By-laws> are quite specific on
this point (article 24):

- "j. Once candidates are proclaimed, Committers with the right to vote
will send a ballot with:

   1. The name of a candidate to president of the Project Management
   Committee, and
   2. The name of up to four candidates to members of the Project
   Management Committee

- k. There will be 7 days to send the ballots"

Hèctor

On Wed, 25 Mar 2020 at 14:50, Tanmai Khanna wrote:

> Hey Hèctor,
> What's the voting system that the election board will follow? Will it be
> first past the post or a ranked system like STV?
> Also, will I as a voter get 7 votes for the PMC or will I get 1 vote?
> Clarifying these and the thought-process behind the choice will make the
> process more transparent.
>
> Thanks and Regards,
> Tanmai
>
> On Wed, Mar 25, 2020 at 3:06 PM Hèctor Alòs i Font 
> wrote:
>
>> The final census of voters for the Apertium Project Management Committee
>> election is here
>> <https://docs.google.com/spreadsheets/d/1ECL_8Lkfx4A66xpHhbOTn7ljKoDcLa0w7MdFZC7DOpA/edit#gid=0>.
>> Nobody has found anything to amend.
>>
>> The following people have indicated they want to run for PMC members:
>>
>> - Sushain K. Cherivirala
>> - Tino Didriksen
>> - Mikel L. Forcada
>> - Scoop Gracie (pseudonym)
>> - Xavi Ivars
>> - Tanmai Khanna
>> - Francis Tyers
>> - Jonathan Washington
>>
>> Two candidates are standing for PMC President:
>>
>> - Tino Didriksen
>> - Francis Tyers
>>
>> The ballots will be sent to the voters shortly.
>>
>> On behalf of the Election Board,
>> Hèctor Alòs i Font
>>
>
>
> --
> *Khanna, Tanmai*

Re: [Apertium-stuff] GSoC

2020-03-25 Thread Hèctor Alòs i Font

Hello, Andrew!
I'm glad to read about a proposal related to Esperanto. I'll continue in
English so that everyone can follow.

It probably doesn't make much sense to work on the English-French pair in
Apertium, since these are two of the languages with the most resources in
the world (linguistic and non-linguistic). As a result, there are quite a
lot of good translators between them, although most of them are commercial.

Esperanto is also included in Google Translate, but I think
Esperanto-French translation can be done at a similar level in Apertium.
Moreover, the translation from Esperanto to French could be used to test
the new apertium-recursive module.

In fact, the current versions of Apertium's four Esperanto pairs (English ⇆
Esperanto, French → Esperanto, Spanish → Esperanto and Catalan → Esperanto)
were released ten years ago. They all use the old all-in-one-repository
structure. Porting these four pairs into the new structure which shares
language resources (using apertium-eng, apertium-fra, apertium-spa and
apertium-cat) would result in a big improvement, because a lot of work has
been done in Apertium on these languages in the last ten years. But porting
is not automatic, since there are differences between the monodixes of the
current pairs and the ones in the four repositories mentioned. There are
even differences between the Esperanto monodixes in these four pairs.
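
To give an idea of what checking those monodix differences could look like,
here is a minimal sketch in Python (standard library only). It assumes that
entries are <e> elements carrying an "lm" (lemma) attribute and referring to
paradigms through <par n="...">. The two sample dictionaries and their
paradigm names are invented for illustration and are not taken from any
Apertium repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature monodixes, standing in for an old pair-internal
# dictionary and a newer shared one.
OLD_DIX = """<dictionary>
  <section id="main" type="standard">
    <e lm="maison"><i>maison</i><par n="maison__n"/></e>
    <e lm="chat"><i>chat</i><par n="chat__n"/></e>
    <e lm="beau"><i>b</i><par n="beau__adj"/></e>
  </section>
</dictionary>"""

NEW_DIX = """<dictionary>
  <section id="main" type="standard">
    <e lm="maison"><i>maison</i><par n="maison__n"/></e>
    <e lm="chien"><i>chien</i><par n="chat__n"/></e>
    <e lm="beau"><i>b</i><par n="bon__adj"/></e>
  </section>
</dictionary>"""


def entries(dix_xml):
    """Map each lemma to the set of paradigm names its entries refer to."""
    root = ET.fromstring(dix_xml)
    result = {}
    for e in root.iter("e"):
        lemma = e.get("lm")
        if lemma is None:
            continue  # entries inside paradigm definitions have no lemma
        pars = {p.get("n") for p in e.iter("par") if p.get("n")}
        result.setdefault(lemma, set()).update(pars)
    return result


def compare(old_xml, new_xml):
    """Report lemmas missing on either side and lemmas whose paradigms differ."""
    old, new = entries(old_xml), entries(new_xml)
    shared = old.keys() & new.keys()
    return {
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
        "changed": sorted(lm for lm in shared if old[lm] != new[lm]),
    }


if __name__ == "__main__":
    for kind, lemmas in compare(OLD_DIX, NEW_DIX).items():
        print(kind, lemmas)
```

For real dictionaries one would read the files with ET.parse(path).getroot()
instead of the inline strings. A report like this only shows where the
dictionaries diverge; deciding which side is right still has to be done by
hand.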

Another question is that Bernard Chardonneau has been working in his own
branch of apertium-fra-esp. So, it'd be interesting to read what he thinks
about a GSoC project that would include this pair. Maybe he could find time
to mentor the project (and maybe Jacob Nordfalk, who created the
Esperanto-English pair, too).

So, in short, in my opinion:

- You should evaluate the quality of the current Google translation,
especially for the French-Esperanto pair, in which Google translates in two
steps (this is what I did in a similar case last year).

- A part of the project could be a kind of elementary "make a language pair
state-of-the-art" task. At a minimum this would include the French-Esperanto
pair.

- Maybe half of the project could be developing a translator from Esperanto
into French.

Hèctor

On Wed, 25 Mar 2020 at 11:46, Andrew Briand wrote:

> Hello,
>
> I am an undergrad interested in adopting an Apertium language pair for
> Google Summer of Code 2020. I am most interested in French<->English,
> Esperanto->English, and Esperanto<->French. What might a project for those
> language pairs look like?
>
> Thank you,
>
> Andrew Briand
