Re: [Apertium-stuff] We now have markup handling and reordering in Apertium!

2020-09-03 Thread Kevin Brubeck Unhammer
Tanmai Khanna wrote:

> the analyser sees wordbound blanks as normal blanks,

So currently if I have the multiword "i dag", it'll recognize
"idag" but it won't recognize "i dag"? (And I suppose if
I have the non-multiword "today" it won't recognize "today".)

One possibility might be to have wordbound blanks match "space or
epsilon" in lt-proc – then it would recognize all of the above.

It seems OK that the wblank then applies to the whole LU, so what comes
out from translating "idag" into English is "today".

Though maybe it would be hard to implement – does this happen a lot?


___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] We now have markup handling and reordering in Apertium!

2020-09-03 Thread Kevin Brubeck Unhammer
Tanmai Khanna wrote:

>> So currently if I have the multiword "i dag", it'll recognize
> "idag" but it won't recognize "i dag"? (And I suppose if
> I have the non-multiword "today" it won't recognize "today".)
>
> Exactly, but even when it recognises "idag", the  will probably
> be lost because it's being seen as a normal blank.
>
>> One possibility might be to have wordbound blanks match "space or
> epsilon" in lt-proc – then it would recognize all of the above.
>
> I had to do this for postgeneration and it wasn't trivial, so it's not like
> I can't do it for the analyser as well, but we decided that all multiword
> matches will be offloaded to apertium-separable, so the individual parts
> can be analysed as LUs and then apertium-separable can combine them into
> one LU. I have already modified apertium-separable such that it applies the
> individual markups on the final MWE. If this is done then
> both "idag" and "i dag" will be recognised and the italics
> will apply on the entire word.
>
> If this isn't acceptable or too much of an inconvenience, then I can modify
> the analyser.

Using separable for those cases seems like a good solution to me :)




Re: [Apertium-stuff] The French-Arpitan translator is ready to be packed

2020-09-01 Thread Kevin Brubeck Unhammer
Congrats 

5.7% WER is pretty nice =D




Re: [Apertium-stuff] A talk evaluating Apertium

2020-10-19 Thread Kevin Brubeck Unhammer
Or just call the NMT system "NMT+apertium" =P 

Zanga Chimombo wrote:

> I do not mean to be unduly polemic by questioning the methodology in
> choosing what to compare, neither do I want to overlook the shortfalls
> of Apertium/ RBMT, however, if Apertium was "good enough" to create
> corpora for use in ENG-CAT NMT via English-Spanish Europarl corpus and
> Spanish-Catalan Apertium surely, a "fairer" comparison would have been
> the English-Spanish pair?
>
> On Sun, Oct 18, 2020 at 9:06 AM Jaume Ortolà i Font
>  wrote:
>>
>> Message from Hèctor Alòs i Font on Sun, 18 Oct 2020 at 7:50:
>>>
>>> Xavi, I am impressed that you could in Softcatalà get enough
>>> bilingual texts to create an English-Catalan neural
>>> translator. Congratulations on the results! I am curious to know
>>> how big the corpus you collected has been, as well as from which
>>> sources to ensure the quality of the translations.
>>
>>
>> The corpora used can be found here:
>> https://github.com/Softcatala/en-ca-corpus
>>
>> One of the corpora is an automatic translation of the
>> English-Spanish Europarl corpus using Spanish-Catalan Apertium. It
>> has proved good enough to train the neural translator.
>>
>> The neural translator could be improved with better corpora and
>> using more powerful hardware in the training. The vocabulary size is
>> limited because of hardware constraints.
>>
>>>
>>> I'd maybe add that probably it would not be possible to collect
>>> such a corpus for Valencian Catalan, so I guess we face in this
>>> neural translator a typical problem with lesser-used
>>> languages/varieties. If it is ever considered necessary to generate
>>> Valencian, this will have to be done by translating it into
>>> "reference" Catalan and then automatically adapting it. In fact the
>>> same happens for the many flavours we currently have in Apertium
>>> for Catalan, both Valencian and "Catalonian".
>>
>>
>> It is easy to make a Catalan>Valencian adapter (a few lines of code
>> using LanguageTool). Not so easy the other way around because some
>> Valencian verbal forms are ambiguous.
>>
>>>
>>> By the way, is Softcatalà trying to create a neural translator for
>>> the Spanish-Catalan pair?
>>
>>
>> Not yet. Neural translators require a lot of hardware resources, in
>> training and in production. We could not support the current volume
>> of Spanish-Catalan translations with neural translation.
>>
>> Jaume Ortolà
>>





Re: [Apertium-stuff] We now have markup handling and reordering in Apertium!

2020-08-26 Thread Kevin Brubeck Unhammer
Woohoo congrats and thanks for all the hard work Tanmai and Tino =D
The superblank issues have been a pain for quite some time.

How does it work with transfer now, what are the semantics of things
like  or just  ?




Re: [Apertium-stuff] Update about superblanks in transfer

2020-08-29 Thread Kevin Brubeck Unhammer
Tanmai Khanna wrote:

> we no longer need the user to be worried about blank
> positions in transfer rules. The latest update to the apertium code makes
> it such that <b pos="X"/> is now the same as <b/>. You can change the
> <b pos="X"/> in your transfer rules to just <b/> and it'll work.
>
> Now, the only thing you need to worry about when writing transfer rules is
> whether you want a blank between the two LUs or not. 

  




Re: [Apertium-stuff] We now have markup handling and reordering in Apertium!

2020-08-27 Thread Kevin Brubeck Unhammer
Tanmai Khanna wrote:

> So what I'll try to do, is after the blanks are collected, lets say X is
> the number of source LUs in the pattern and Y is the number of output LUs.
> If X = Y then we can keep them in the same place, if X < Y, then we can
> keep them in the first X gaps the rest can be spaces or whatever the user
> denotes. If X > Y, then we can print the first Y blanks and then flush the
> remaining. After this the  option will become useless. Does that
> sound good?

By "gaps" do you mean where the rule is outputting a ? So if input
is "ab c" and a rule matching that has two 's in its , the
 gets output on the first  and then on the second  we get
a regular space. If the rule has three 's, the third one is also
a regular space. If the rule has no 's, the  gets output after
the rule output. That would be nice (though I could also live with the
 always ending up after the rule as long as I never have to think
about pos="…")
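
To make the placement discussed above concrete, here is a minimal Python sketch. It only illustrates the proposal in this thread and is not the code in the transfer module; the function name place_blanks and the "[X]" placeholder superblank are made up for the example.

```python
def place_blanks(blanks, n_gaps):
    """Distribute superblanks collected from the matched input over the
    blank positions ("gaps") that the rule writes in its output.

    blanks -- superblanks found between the matched source LUs, in order
    n_gaps -- number of blank positions in the rule output
    Returns (gap_contents, flushed): what goes into each gap, plus any
    leftover superblanks to print after the rule output.
    """
    placed = blanks[:n_gaps]
    # Gaps that have no superblank left for them get an ordinary space.
    gap_contents = placed + [" "] * (n_gaps - len(placed))
    # Superblanks that did not fit are flushed after the rule output.
    flushed = blanks[n_gaps:]
    return gap_contents, flushed


if __name__ == "__main__":
    print(place_blanks(["[X]"], 2))  # one superblank, two gaps
    print(place_blanks(["[X]"], 0))  # rule writes no blanks at all
```

With one collected superblank and two gaps, the first gap carries the superblank and the second is a plain space; with no gaps, the superblank ends up in the flushed list.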




Re: [Apertium-stuff] Fixing Phonological Processes

2020-08-27 Thread Kevin Brubeck Unhammer
Zanga Chimombo wrote:

> One of the processes that occurs in one of the languages I am dealing
> with is "nk-" becoming "ng-"
>
> I thought I would be able to fix this using the post generator here:
> https://gitlab.com/zangaphee/CiBantu/-/blob/master/twoc/apertium-yao/apertium-yao.post-yao.dix
>
> However, that doesn't fix it. Have I done it incorrectly? Should I
> even be using PG to do this?

If there's a ~ before every nk, then I think that should
work. What's the exact input to pgen?

(There's an open issue on not requiring the
`~` https://github.com/apertium/lttoolbox/issues/42 )
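
A toy Python model of why the ~ matters, assuming the rule is meant to rewrite "nk" as "ng". This is not how lt-proc is implemented, and the word "nkati" is only a made-up test string; the point is just that the rewrite fires where the mark is present and nowhere else.

```python
import re


def toy_postgen(text):
    """Rewrite "nk" as "ng", but only where a "~" mark precedes it."""
    return re.sub(r"~nk", "ng", text)


if __name__ == "__main__":
    print(toy_postgen("~nkati"))  # marked: the rewrite applies
    print(toy_postgen("nkati"))   # unmarked: left untouched
```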




Re: [Apertium-stuff] We now have markup handling and reordering in Apertium!

2020-08-26 Thread Kevin Brubeck Unhammer
Tanmai Khanna wrote:

> Thanks Unhammer!
> So now we have three kinds of units: block tags, superblanks, and wordbound
> blanks. Block tags are hard breaks in the text, wordbound blanks move
> around with words, and superblanks are tags that aren't hard breaks but not
> attached to words (such as ). Tino can give you a list of tags and
> their classifications.
>
> As per your question about transfer, the  refer to the
> superblanks if they exist in the input. Wordbound blanks will reorder
> automatically and block tags won't move at all.
>
> We can also decide to make the remaining superblanks immovable and just
> output them when a rule is matched, but that is a decision that can be
> taken in the future. For now, blanks in transfer rules work for any
> superblanks that still exist in the stream. Hope that answered your
> question :))

Almost :) If input has  somewhere, and my transfer rules don't have
any  (only maybe ), will it still be output?




Re: [Apertium-stuff] We now have markup handling and reordering in Apertium!

2020-08-27 Thread Kevin Brubeck Unhammer
Tanmai Khanna wrote:

> I always thought that's the default behaviour. That if some blanks aren't
> explicitly printed in the transfer rules then they're flushed. I'll check
> it out, but it should be that.

The old behaviour has been to just throw away anything that's eaten by
a rule but not explicitly printed. So if you had a rule matching two
patterns, for example the words "ph'nglui mglw'nafh", and your input was
"ph'nglui mglw'nafh", but you just used  and not
, then transfer would eat the first blink giving "ph'nglui
mglw'nafh" and you would not be eaten first.




[Apertium-stuff] Fwd: tmx tools

2020-09-28 Thread Kevin Brubeck Unhammer
(Antonio: I forwarded your message to apertium-stuff since you're more
likely to get help there)

Are the TMX tools still used by anyone?


 Start of forwarded message 
From: Antonio Giovanni Contarino 
Date: Sat, 19 Sep 2020 14:46:39 +0200
To: apertium-cont...@lists.sourceforge.net
Subject: [Apertium-contact] tmx tools

Hello!

I'm interested in using some of these
 TMX tools. What are the
steps to do so?

Thank you in advance.

A.G. Contarino
___
Apertium-contact mailing list
apertium-cont...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-contact
 End of forwarded message 


Re: [Apertium-stuff] How useful is eliminating trimming for language developers?

2020-05-25 Thread Kevin Brubeck Unhammer
Tanmai Khanna wrote:

> *making trimming the norm and having the option of
> eliminating it, or making eliminating trimming the norm and having the
> option of activating it, or to have partial trimming, as discussed later.*

I'd vote for keeping trimming the norm, implementing the project
(keeping everything 100 % backwards compatible), letting people who wish
switch to it, then asking this question again :)




Re: [Apertium-stuff] How useful is eliminating trimming for language developers?

2020-05-25 Thread Kevin Brubeck Unhammer
Flammie A Pirinen wrote:

>> 4. Weighting the monodix will take more compile time than just trimming it.
>
> Some numbers would be interesting, I think both are quite heavy and we
> don't do much further processing in finite-state algebra (/hfst space)
> so the weighted models won't blow up. In any case, people seem to be
> happy in 2020 to wait 70 hours for some neural stuff, few minutes for
> weighted automata won't be too bad ;-)

One of the main advantages of RBMT is we can quickly fix little things
and see the results nearly right away. When working on nno-nob I often
compile every few minutes. OTOH, nno-nob has good coverage, so I rarely
notice things like missing words causing disambiguation errors. So for
me, a noticeable increase in compile time would be enough to make me not
use this.

But yes, we need numbers. It needs to be implemented and tried before
we can decide if it should be default.




Re: [Apertium-stuff] How useful is eliminating trimming for language developers?

2020-05-26 Thread Kevin Brubeck Unhammer
Tanmai Khanna wrote:

> Here's a timing test for weighted dictionaries.
> On apertium-eng-kaz:
>
> 1. 
> real 0m4.257s
> 2. 
> real 0m7.990s

With nob→nno plain lt-trim 1. takes 33s
whereas the long script in 2. takes 45s.

File size increases from 1.2M to 5.5M, seems acceptable (the unweighted,
untrimmed nob analyser is 2.1M).

However, I think it should be possible to do 2. nearly as quickly as
1. in pure lttoolbox (at least if tags are used instead of weights).
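
For a rough sense of scale, the numbers quoted above work out as follows. This is just arithmetic on the figures in this thread; the helper name ratio is made up.

```python
def ratio(after, before):
    """How many times larger `after` is than `before`, to two decimals."""
    return round(after / before, 2)


eng_kaz_time = ratio(7.990, 4.257)  # weighting script vs. plain trimming
nob_nno_time = ratio(45, 33)        # same comparison for nob->nno
nob_nno_size = ratio(5.5, 1.2)      # file size in MB, weighted vs. trimmed

if __name__ == "__main__":
    print(eng_kaz_time, nob_nno_time, nob_nno_size)
```

So the weighted build is a bit under twice as slow on eng-kaz, roughly a third slower on nob→nno, and the file is about four and a half times bigger.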




Re: [Apertium-stuff] let's move the mailing lists to sourcehut

2020-09-21 Thread Kevin Brubeck Unhammer
Francis Tyers wrote:

> Sourcehut is a free/open-source "forge" type thing run by Drew
> DeVault. They have
> mailing lists.
>
> Our current mailing lists are with SourceForge and all of the terrible
> stuff that
> goes with that.
>
> Here is a link:
>
> https://lists.sr.ht/
>
> What do people think?

+1. I have a very good impression of Sourcehut. The code for their own
site is available, they publish their own financial reports, they work
without javascript, there's no tracking, the guy running it seems
trustworthy.

Considering the trouble people have just setting up their own e-mail
server without getting constantly spam-listed by other people's Gmail
accounts – and the fact that the mailing lists are supposed to be public
anyway – it'd be nice to have a third party host our mailing lists.




Re: [Apertium-stuff] Starring Apertium repositories

2020-07-18 Thread Kevin Brubeck Unhammer
⭐





Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-17 Thread Kevin Brubeck Unhammer
Francis Tyers wrote:

> fust
> ferro
>
> ->
>
> 
>
>
> 
>
> These could then be included into the .lrx file by the Makefile, or
> a separate, monolingual, file could be another argument to lrx-comp.

That's a great idea =D




[Apertium-stuff] Apertium used by investigative journalists studying leaks

2020-06-11 Thread Kevin Brubeck Unhammer
I came across this article about various FOSS tools used by journalists
digging through a data leak:
https://www.icij.org/investigations/luanda-leaks/how-we-mined-more-than-715000-luanda-leaks-records
where they mention using Apertium to avoid sharing sensitive information:

With more than half of the documents written in Portuguese, digging
into the leak was even more challenging – the majority of
journalists working on this project were not fluent in
Portuguese. For security and source-protection reasons, we wanted to
avoid common online machine-translation tools (such as Google
Translate or DeepL) and, instead, have the translation directly
available on Datashare. So we decided to use an open-source piece of
software called Apertium. Our team wrapped Apertium within a
command-line tool that was able to translate any language pair
directly into Datashare. We published the code of this tool on
Github.

Great that we have such free software translation tools available :)





Re: [Apertium-stuff] Releases for everything

2020-12-10 Thread Kevin Brubeck Unhammer






Re: [Apertium-stuff] Proper noun classification considered harmful

2021-02-02 Thread Kevin Brubeck Unhammer
Flammie A Pirinen wrote:

> Hi all,
>
> I've written a handful of apertium-fin-* prototypes and I usually end up
> spending way too much time with all the useless subclasses of proper
> nouns we have (cogs, ants, als, tops, orgs, and to top all that,
> sometimes ms and fs for some extra (mis)gendering). Could we just get
> rid of those or does someone have a good use for them? Most of the time
> it's very random anyways and we aren't really doing NERing or anything.
> I think if these are used in e.g. cg or whatever we should probably have
> different way of introducing them that doesn't intervene with
> analysis-generation stuff, like we talked about in passing in the last
> apertium zoom meeting? Or is there some smart way to bypass them I
> haven't thought of (probably)

Genders are useful when anaphora resolving / in transfer, though only on
person names. There are some place/org names from swe that have genders
(originally from SALDO) which bled into other scandipairs – I'd be happy
to remove those since they seem quite useless for us.

The ,  and  tags are used quite a bit in the nob
disambiguator, but not in transfer.

I tend to underspecify np's in bidix:

 IranIran
 ThielThiel
 SarumanSaruman
 ContrasContras

so just the monodixen need to be synced. If there is an actual
bidix-relevant difference, e.g. some place name gets translated but not
if it's a person name, then one can specify the tags for just that
entry.

The remaining problem is when the analyser gives ^Saruman$ and
you try to send that into a generator that expects ^Saruman$.

We could perhaps use the Giellatekno solution for that, where dixen have
RL entries that just contain  (ie., no cog/ant/al), and some
transfer step cleans off the tags. Should be a fairly simple change, and
it's tried and tested in giella-pairs. Since lttoolbox is used mostly
for languages where np pardefs are small, adding the RL's is like max
10 extra lines; for languages requiring hfst it's probably a fairly
simple twol or xfregex rule?
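
As a rough illustration of the "clean off the tags" step, here is a Python sketch. It is not the Giellatekno implementation; the tag list is taken from the subclasses named in this thread (cog, ant, al, top, org), the function name underspecify_np is made up, and the example analyses are invented.

```python
import re

# Proper-noun subclass tags named in this thread; the exact inventory
# differs per language pair, so treat this list as an assumption.
NP_SUBCLASS = re.compile(r"<(?:cog|ant|al|top|org)>")


def underspecify_np(lu):
    """Drop proper-noun subclass tags from a lexical unit containing <np>.

    Other tags (gender, number, ...) are left alone, as are lexical units
    that are not proper nouns.
    """
    if "<np>" not in lu:
        return lu
    return NP_SUBCLASS.sub("", lu)


if __name__ == "__main__":
    print(underspecify_np("^Saruman<np><ant><m><sg>$"))
    print(underspecify_np("^Iran<np><top>$"))
```

After this step the generator only needs to accept the bare np analysis, whatever subclass the analyser assigned.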





Re: [Apertium-stuff] Ubuntu 16.04 (Xenial) going EOL

2021-05-10 Thread Kevin Brubeck Unhammer
"Bernard Chardonneau" wrote:

>  a solution could be to give both source code of Apertium tools
> and source code of system libraries it uses. These libraries would be
> compiled with Apertium tools using them and object files stored outside
> /usr/lib . So, there would not be compatibility problems with other version
> of the same library in the distribution.

You can already do this – just like you could before Tino kindly
packaged things for us.

git clone https://github.com/apertium/lttoolbox
cd lttoolbox
./autogen.sh --prefix=$HOME/ap
make
make install

etc. for apertium-lex-tools, apertium, apertium-separable. For cg3, you
use 
./cmake.sh -DCMAKE_INSTALL_PREFIX=$HOME/ap
instead of autogen.

I don't think apertium depends on a very new version of libxml, but
it's probably the impossiblest to compile the same way, just put it into
the same prefix. You'll have to put something like this in your bashrc:

PREFIX=$HOME/ap
PATH="${PREFIX}/bin:${PATH}"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${PREFIX}/lib"
PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:${PREFIX}/share/pkgconfig:${PREFIX}/lib/pkgconfig"
ACLOCAL_PATH="${ACLOCAL_PATH}:${PREFIX}/share/aclocal"
export ACLOCAL_PATH PATH PKG_CONFIG_PATH LD_LIBRARY_PATH

If you ever need a newer C++ compiler, there are apt repos that do backports.


But I wouldn't recommend it – it'll probably take you more time than
just upgrading your OS once every three years.

(Another alternative would be something like Docker – you can run
a newer debian inside your old debian and apt install things.)




Re: [Apertium-stuff] Ubuntu 16.04 (Xenial) going EOL

2021-05-10 Thread Kevin Brubeck Unhammer
Kevin Brubeck Unhammer wrote:

> it's probably the impossiblest to compile the same way, just put it into

that line was missing a "not" there :-)





Re: [Apertium-stuff] Cleaning Parallel Corpus

2021-04-29 Thread Kevin Brubeck Unhammer
VIVEK VICKY wrote:

> Hello everyone,
> The eng-spa parallel corpora I am using (http://www.statmt.org/europarl/,
> http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) have empty lines
> in either language due to splitting of a sentence into two or merging of
> two sentences after the translation, which is causing errors during
> lexical-training. Is it common in parallel corpora? Or is there any clean
> parallel corpus out there?
> Right now, I am translating the sentences around (above and below) the empty
> lines and manually merging/splitting them. Is there any better way to do
> this?

Can you give an example? I took a look at that corpus and haven't found
any unmatched lines yet. Make sure you use the es-en.en file when
pairing es with en (that is, don't use cs-en.en with es-en.es).

(It *is* common to find semi-parallel corpora out there, but I suppose
we can leave sentence alignment out of the GSoC task unless
there's extra time, and assume corpora will be fairly clean.)
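
For finding such lines without eyeballing the files, something like the following Python sketch could help. It assumes the two sides have already been read into lists with one sentence per line; the function name unmatched_lines is made up for the example.

```python
def unmatched_lines(src_lines, trg_lines):
    """Return 1-based line numbers where exactly one side is empty.

    Lines that are empty on both sides are not reported. zip() stops at
    the shorter input, so compare the two lengths separately as well.
    """
    bad = []
    for n, (src, trg) in enumerate(zip(src_lines, trg_lines), start=1):
        if bool(src.strip()) != bool(trg.strip()):
            bad.append(n)
    return bad


if __name__ == "__main__":
    spa = ["Hola .", "", "Gracias ."]
    eng = ["Hello .", "Extra sentence .", ""]
    print(unmatched_lines(spa, eng))
```

The reported line numbers can then be checked by hand or simply dropped from both files before training.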




[Apertium-stuff] Released: nno-nob 1.4.0

2021-04-29 Thread Kevin Brubeck Unhammer
Hi,

I've tagged some new releases of nno, nob and apertium-nno-nob.

Like before[0], the work has been funded by the Norwegian Ministry of
Culture via Nynorsk pressekontor (NPK) and the Norwegian News Agency,
now with direct commits from contributors Anja, Victoria and Hallvard of
NPK :-)

One major visible change is that we now let the user select a number of
spelling variants using a new preferences system. Instead of compiling
one FST per set of style choices, we just generate all choices on the
fly and disambiguate.[1] You can try it already on the Beta
site[2] (though currently it may fail if you use it with
Transfuse[3]). Before, the user could only select if infinitives ended
in -e or -a; now they can also pick if the first person plural should be
"me" or "vi", if words like "byggje" should have the optional j there,
etc. They can combine such options as they choose, and we'll be adding
more options in the future.

Since last time, we've also updated the monolingual dictionaries with
new entries from the updated Norsk ordbank[4] and gotten lots of new
bidix entries as well through that.

Other changes:
- 41 new transfer rules
- 614 new lrx rules
- about 800 new names and 26,800 new non-names added to bidix
  (many added by script via new Norsk ordbank entries)
- many transfer tweaks, e.g. adverbs can move past noun phrases, new
  constructions recognised 
- lots of work on nob disambiguation, especially on noun vs verb and
  participles (which gain a distinction in nno which they don't have in
  nob)
- much more consistent default nno spelling choices
- rules for name guessing using CG
- number compounding + more left-hand-only compound parts

WER on news text continues to stay around 4% – we're on the one hand
reaching deep into the long tail of unknown words, and on the other
hand spending more time making things more idiomatic with multi-word
rules. The next steps include better support for correcting
capitalisation[5] and starting to use apertium-separable for MWE's.[6]
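
For readers unfamiliar with the metric, WER is usually the word-level edit distance between the MT output and a reference (typically the post-edited text), divided by the number of reference words. A minimal Python sketch of that usual definition follows; it is not the evaluation script used for the figure above, and the function name wer is made up.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length.

    Splits on whitespace only and raises ZeroDivisionError for an
    empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # word missing in hyp
                           cur[j - 1] + 1,            # extra word in hyp
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("a b c d", "a x c d"))  # one substitution in four words
```

A WER of 4% then means roughly one word in 25 has to be changed, inserted or deleted to reach the reference.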


-Kevin

[0] https://sourceforge.net/p/apertium/mailman/apertium-stuff/thread/CABnmVq5J5Acc7r4XwtMgVR2eyd5dF2ab4gUsUv2ZWPzMWE5J7A%40mail.gmail.com/
[1] https://wiki.apertium.org/wiki/Dialectal_or_standard_variation#Overlapping_variants
[2] https://beta.apertium.org/index.eng.html?dir=nob-nno=Vi%20liker%20enten%20%C3%A5%20fortsette%20%C3%A5%20bygge%20n%C3%A5r%20vi%20blant%20annet%20s%C3%B8ker%20forskjellen%20mens%20dere%20er%20uenige.#translation
[3] https://github.com/TinoDidriksen/cg3/pull/75
[4] https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-41/
[5] We'd like to be able to state in bidix that "Xyz" should turn into "xyz",
    which is currently not possible. See also https://github.com/apertium/apertium/issues/75
[6] currently blocked by https://github.com/apertium/apertium-separable/issues/36





Re: [Apertium-stuff] Proper noun classification considered harmful

2021-02-02 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> I am more sceptical about the need to distinguish between toponyms and
> hydronyms. In some languages one will have an article and the other will
> not, but these are rare cases. On the other hand, we do not distinguish
> between countries (or regions) and cities, which in French is quite
> important both for generating the article and the preposition preceding it,
> if you translate from Catalan or Spanish: for instance, "New-York" is the
> city, but "le New-York" is the state, so will have "à New-York" or "au
> New-York" for "in New-York" (or "à Paris" but "en France").  The generation
> of articles may also not be the same whether "Barcelona" stands for the
> city or the (football or whatever) team, nor is the gender often the same.
> So, are we then going to create more and more subtypes ad nauseam? Better
> not!
>
> In short, we can find casuistries in certain pairs that may make us think
> that some distinctions are appropriate, but adding them in monolingual
> dictionaries and forcing them to be maintained for all languages seems
> doubtful to me.

So the city-vs-region distinction is only useful for target (structural)
generation, not source analysis/disambiguation/anaphora. I think that
can be a good guide to when something should be in monodixen or not.

One solution here would be to add it in bidix (with a pardef so you
don't need it when going the other way) and strip it in transfer, or
even just use a def-list in the transfer files.





[Apertium-stuff] My god, It's Full of Styles

2021-03-16 Thread Kevin Brubeck Unhammer
Hi all,

I made a thing:
https://apertium.trigram.no/?dir=nob-nno=Vi%20liker%20enten%20%C3%A5%20fortsette%20%C3%A5%20bygge%20n%C3%A5r%20vi%20blant%20annet%20s%C3%B8ker%20forskjellen%20mens%20dere%20er%20uenige.#translation
(try toggling the various "Style preferences")

Norway in general has a very positive view on dialects, and there are
quite a lot of accepted spelling/word alternatives in both the Bokmål
and Nynorsk variants of Norwegian. And people have style preferences
that don't map cleanly into disjoint sets – some people want
{me,a,kj}, others want {me,e,kj} or {vi,a,k} etc. The current system
of alt-attributes and multiple binaries doesn't scale here. With just
two options, you have to have four alt-values (and lots of duplication
in .dix) and four generators – we'd like dozens of options, but without
compiling 2^dozens of generators.
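
To see how quickly this blows up, here is a small Python sketch that enumerates the combinations. The option names are the ones used later in this mail; the function name style_combinations is made up.

```python
from itertools import product


def style_combinations(options):
    """Every on/off combination of the given style options."""
    return [dict(zip(options, choice))
            for choice in product([False, True], repeat=len(options))]


if __name__ == "__main__":
    for combo in style_combinations(["kj_k", "infa_infe"]):
        print(combo)
    # One generator per combination would mean 2**n binaries.
    print(len(style_combinations([f"opt{i}" for i in range(12)])))
```

Two options already give four combinations, and a single dozen gives 4096, so one precompiled generator per combination is not realistic.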

So instead, why not just generate everything and disambiguate? CG can
read variables inserted as blanks in the stream, and then use those to
decide which style preferences to keep or throw away. Apparently others
already do this.[1]

I've got some branches of apertium-nno-nob/apertium-nno where
it's implemented. We put a cg-proc command after both bidix and the
generator, and change the generator to use the bilingual format
(lt-proc -b).[2]


For bidix-specified preferences, when translating from right to left, we
remove the r="LR" on the entries in question, and match on lemmas in the
biprefs.rlx CG file:[3]

SELECT ("skilnad"i) IF (0 ("forskjell"i) + (VAR:forskjell_skilnad));
REMOVE ("skilnad"i) IF (0 ("forskjell"i));

The preference variable here is named "forskjell_skilnad" since the
default is "forskjell", but if that option is ticked / variable is set,
we choose "skilnad".
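
In case the CG is hard to read, here is a loose Python model of what the two rules above do together. It is only a sketch of the intended behaviour, not of how vislcg3 evaluates rules; it ignores the case-insensitive "i" flag, and the function name choose is made up.

```python
def choose(candidates, variables):
    """Loose model of the SELECT/REMOVE pair for forskjell vs. skilnad.

    candidates -- target lemmas offered for one word
    variables  -- names of the preference variables that are set
    """
    if "forskjell" not in candidates:
        return candidates  # neither rule's context matches
    if "forskjell_skilnad" in variables:
        # SELECT: keep only "skilnad" when the preference variable is set.
        selected = [c for c in candidates if c == "skilnad"]
        return selected or candidates
    # REMOVE: otherwise drop "skilnad", leaving the default "forskjell".
    return [c for c in candidates if c != "skilnad"]


if __name__ == "__main__":
    both = ["forskjell", "skilnad"]
    print(choose(both, set()))
    print(choose(both, {"forskjell_skilnad"}))
```

The order matters: once the SELECT has fired there is no "forskjell" reading left, so the REMOVE no longer applies.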


For generator-specified preferences, we remove the LR's and instead add
a tag when generating, typically through a pardef, e.g. this one is used
for the set of words where 'kj' can be written as 'k':


   
  


I gave cg-proc a new switch -g/--generation that outputs lexical units
without tags (unless --trace is also given) and without
surrounding ^$. The generator (running with lt-proc -b) now gives

$ echo spøkelse | apertium -d . nob-nno-dgen
^spøkelse/spøkelse/spøkjelse$^./.$

and yeah maybe the tag is at a weird spot but cg-proc doesn't mind:

$ echo '^spøkelse/spøkelse/spøkjelse$' | cg-proc -n -g -w nob-nno.genprefs.rlx.bin
spøkjelse

$ echo '[]^spøkelse/spøkelse/spøkjelse$' | cg-proc -n -g -w nob-nno.genprefs.rlx.bin
[]spøkelse

Those STREAMCMD's are hard to do manually, so /usr/bin/apertium can now
insert and strip them by reading AP_SETVAR:

$ export AP_SETVAR
$ for AP_SETVAR in "" "kj_k" "kj_k,infa_infe" "infa_infe"; do echo spøke | apertium -d . nob-nno; done
spøkja
spøka
spøke
spøkje

The changes to apertium and vislcg3 are merged, but the html-tools
changes may need deuglifying and testing.

I also haven't merged the changes to apy[6] yet, since we need to
bikeshed how to get the list of possible options from the language pair
(which html-tools shows). Currently the apy branch just hardcodes the
list for nob→nno[7] :)

Some preferences are in the monolingual packages and some in the pair,
so the preferences for a pair need to include both. Perhaps each package
includes one or more preferences.xml files, and then modes.xml can do


…






where nno.preferences.xml is copied by make from
$(LANG1)/preferences.xml and contains something like


søkje → søke
søkje → søke


(then apy would have to parse modes.xml and the files listed there)

Or is there a better way?




[1] https://github.com/TinoDidriksen/cg3/issues/68#issuecomment-736571504
[2] 
https://github.com/apertium/apertium-nno-nob/compare/biprefs#diff-94c6b34f4d7517dc0915b07677ec9a8656b559e0bbdb683a66d4270790b88812R21-L39
[3] 
https://github.com/apertium/apertium-nno-nob/compare/biprefs#diff-4e8f9b0972e6a59cd53d18a476d8d01bbe772f29a5c481659b24185d6839cd1dR14-R15
[4] 
https://github.com/apertium/apertium-nno/compare/biprefs#diff-4e49c13e8aa1b44221621ad844edbd9be60169e8baa78ed7bbb80b4237865573R638-R641
[5] https://github.com/apertium/apertium-html-tools/tree/biprefs
[6] https://github.com/apertium/apertium-apy/tree/biprefs
[7] 
https://github.com/apertium/apertium-apy/blob/c16902972097d4e9dd9af1d14413562f96d32604/apertium_apy/handlers/translate.py#L219





[Apertium-stuff] Seminar on opportunities/challenges of MT in education

2021-02-28 Thread Kevin Brubeck Unhammer
Volda University College recently had a seminar on MT in education –
half of it dedicated to Apertium:
https://nynorsksenteret.no/blogg/program-for-digital-fagdag-om-omsetjingsteknologi

They also link to an article
https://nynorsksenteret.no/uploads/images/Artikkel-Aasbrenn-okt2020.pdf
titled "Apertium as a teammate in Nynorsk education" (in response to
someone considering it an opponent), which concludes that "intentional
use of Apertium can make lessons more fun for the students as well as
improve their linguistic sensibilities".





Re: [Apertium-stuff] GSOC proposal draft - building a prototype MT system

2021-04-07 Thread Kevin Brubeck Unhammer
Rajarshi Roychoudhury
 čálii:

> Bhojpuri and Hindi are very closely related language pairs
> As far as I know (correct me if I am wrong), apart from some minor
> phonetic changes they can be considered identical pairs.

Seems like a good fit for Apertium then :) considering one of the most
popular pairs in Apertium is Nynorsk–Bokmål. Here's a sentence in
Nynorsk:

- Dette språkparet er kjempepopulært, veldig rart når det er så likt.

And here's the same sentence translated into Bokmål:

- Dette språkparet er kjempepopulært, veldig rart når det er så likt.

I could give a tree structure but I think you get the point.

If people write or want to write things in Bhojpuri then it would be
useful to have an MT system and if it doesn't differ much from Hindi
then it's more likely to succeed in a (short) Apertium GSoC project.




Re: [Apertium-stuff] A question about Apertium Kazakh and Tatar packages

2021-09-02 Thread Kevin Brubeck Unhammer
Tino Didriksen 
čálii:

> The monolingual packages install many more modes, because they are used for
> further development. So you can get morph from those. But biltrans is not
> normal to want if you aren't a developer, and thus building from source.

The reasoning is that

- people who want to use Apertium for translation only care about
  installing pairs
- `apertium -l` lists all available modes of installed apertium packages 
- we shouldn't clutter that list for people who want to use Apertium for
  translation 

So kaz-rus shouldn't be installing biltrans, because an end-user would
find it confusing/annoying and have more trouble finding what's useful
to them in that list.


However – there are people who want to use debug modes but would rather
not compile a pair and manually run
`git pull && make && make test || revert-to-last-working-revision`.

Would it make sense to install debug-modes to a debug-modes folder? Put
stuff like -biltrans etc. in /usr/share/apertium/debug-modes, and then
`apertium -l` only shows translation /modes while `apertium -L` shows
both /modes and /debug-modes? (And `apertium kaz-rus-biltrans` works
without any special switches because why not, while `apertium
nonexistent` runs `apertium -l` and gives a hint to use `-L` to show the
rest.)
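
A rough sketch of the listing behaviour proposed above, written in Python rather than in the apertium shell script itself. The function name `list_modes` and the demo directory tree are made up for illustration; only the split between modes/ and debug-modes/ comes from the proposal.

```python
import tempfile
from pathlib import Path


def list_modes(share_dir, include_debug=False):
    """Return sorted mode names found under share_dir.

    The plain listing (like `apertium -l`) only looks in modes/;
    include_debug=True (like the proposed `apertium -L`) also looks
    in debug-modes/.
    """
    subdirs = ["modes"] + (["debug-modes"] if include_debug else [])
    names = []
    for sub in subdirs:
        folder = Path(share_dir) / sub
        if folder.is_dir():
            names.extend(p.stem for p in folder.glob("*.mode"))
    return sorted(names)


def make_demo_tree():
    """Build a throwaway directory tree to try list_modes on (hypothetical layout)."""
    root = Path(tempfile.mkdtemp())
    (root / "modes").mkdir()
    (root / "debug-modes").mkdir()
    (root / "modes" / "kaz-rus.mode").write_text("")
    (root / "debug-modes" / "kaz-rus-biltrans.mode").write_text("")
    return root
```

With the demo tree, `list_modes(root)` would show only kaz-rus, while `list_modes(root, include_debug=True)` would also show kaz-rus-biltrans.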




[Apertium-stuff] skipping quoted text

2021-09-06 Thread Kevin Brubeck Unhammer
Hi,

Does anyone have any smart methods for protecting text in quotes? 
It's certainly possible with a little pre-/postprocessing like
https://github.com/apertium/apertium/issues/32
(a bit less safe if your language uses "" instead of «», I suppose
you could restrict it to only match fairly short quotes), but are there
better ways to do this? Should it be an option to the apertium script
itself / apy, or would it be too language-pair specific (and in that
case, would it make sense to have it *in* the pipeline somehow?)
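
For illustration, one possible shape of such pre-/postprocessing (not necessarily what the linked issue does): swap short «…» spans for placeholders before translating and put them back afterwards. The placeholder format `@@Q0@@`, the 80-character limit and the function names are all assumptions, and the sketch assumes the placeholders pass through the pipeline untouched.

```python
import re

# Only match fairly short quotes, so an unbalanced quote mark cannot
# swallow half the document. The limit of 80 is an arbitrary choice.
QUOTE_RE = re.compile(r"«[^«»]{1,80}»")


def protect_quotes(text):
    """Swap each short «…» span for a numbered placeholder."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"@@Q{len(saved) - 1}@@"

    return QUOTE_RE.sub(stash, text), saved


def restore_quotes(text, saved):
    """Put the original quoted spans back after translation."""
    return re.sub(r"@@Q(\d+)@@", lambda m: saved[int(m.group(1))], text)
```

The translation call would go between `protect_quotes` and `restore_quotes`; whether this belongs in a wrapper script or inside the pipeline is exactly the open question above.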





Re: [Apertium-stuff] A question about Apertium Kazakh and Tatar packages

2021-09-05 Thread Kevin Brubeck Unhammer
Jonathan Washington
 čálii:

> As to Andrey's question concerning kaz-rus not working because of a
> missing .t4x file, that sounds like a legit packaging error, which I'm
> not sure how to fix (I really should learn...)

That was fixed in
https://github.com/apertium/apertium-kaz-rus/commit/7bd16ebbd005838988fd9c0d47a31e2564921b07
but if I understand correctly there needs to be a "data change" for
a rebuild to happen:
https://apertium.projectjj.com/apt/logs/apertium-kaz-rus/rebuild.log
… so commit a new word and check again tomorrow?

> In the meantime, Andrey, you should be able to just clone the pair and
> compile from source (`apertium-get kaz-rus`), which will fix the
> missing file issue.¹  This gets around the missing modes as well,² and
> will be more future-proof given what Kevin and Tino are discussing.

Modes files don't change very often, so if the goal is to use the newest
packages without having to recompile all the time you could (as
a temporary workaround) manually make a file
/usr/share/apertium/modes/kaz-rus-biltrans.mode containing the same as
/usr/share/apertium/modes/kaz-rus.mode but without the bits after
autobil.bin
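
The manual workaround above could also be scripted. A minimal sketch, assuming the mode file is a single shell pipeline with stages separated by `|`; the function name and the shortened example pipeline are hypothetical (a real kaz-rus.mode has more stages and full paths).

```python
def truncate_after_autobil(mode_text):
    """Keep pipeline stages up to and including the one that loads autobil.bin.

    Assumes one pipeline with '|' between stages and no '|' inside
    quoted arguments.
    """
    stages = [s.strip() for s in mode_text.strip().split("|")]
    for i, stage in enumerate(stages):
        if "autobil.bin" in stage:
            return " | ".join(stages[: i + 1]) + "\n"
    raise ValueError("no stage mentions autobil.bin")


# Hypothetical, shortened pipeline for illustration only.
EXAMPLE_MODE = (
    "lt-proc 'kaz-rus.automorf.bin' | apertium-pretransfer"
    " | lt-proc -b 'kaz-rus.autobil.bin'"
    " | apertium-transfer -b 'kaz-rus.t1x' 'kaz-rus.t1x.bin'"
    " | lt-proc -g 'kaz-rus.autogen.bin'\n"
)
```

The result of `truncate_after_autobil` on the contents of kaz-rus.mode is what would be saved as kaz-rus-biltrans.mode.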

(I made https://github.com/apertium/apertium/issues/132 to track/discuss
installing debug modes, subscribe to that if you want notification when
that's possible.)




Re: [Apertium-stuff] Regression Testing Now Operational

2021-08-01 Thread Kevin Brubeck Unhammer
Flammie A Pirinen  čálii:

>  chunk names had an effect to the
> translation

Postchunk by default applies the chunk case onto the lemma.




Re: [Apertium-stuff] English-Santali Plural form not working

2021-12-20 Thread Kevin Brubeck Unhammer
Make sure you compile (`make -j langs`) before testing, and preferably
before each commit as well. The version on github doesn't compile right
now:

apertium-eng-sat.eng-sat.t1x:69: element rule: validity error : Element 
rule content does not follow the DTD, expecting (pattern , action), got 
(pattern action rule )

You could check the various stages of the pipeline to see where the
problem happens. Do `ls modes/eng-sat*` to see the eng-sat-related
debug modes, you can do e.g.

echo Gregor woke up | apertium -d . eng-sat-lex

to see up until lexical selection (typically the step right before
transfer).

Prasanta Hembram
 čálii:

> Thanks for the clarification sir, but I tried changing my rules and paradef
> but no luck, no forms including paradef are working for bilingual
> English-Santali Dictionary. Only working in a monolingual Dictionary. It is
> detected as Plural for Bilingual Dictionary, but with that nouns should be
> inflicted as ᱰᱟᱹᱝᱜᱽᱨᱤ  ᱠᱚ for Cows. I'm not able to find where the problem
> lies. But for the monolingual Santali Dictionary it is working fine. After
> deleting the dual form there is no luck. I'm following this book and trying
> to make rules :-). Luckily I got this from the Internet.
>  OCRA-Glimpes-of-Santali-Grammar.pdf
> 
> with best regards
> Prasanta Hembram
>
>
>
> On Sun, Dec 19, 2021 at 11:59 PM Hèctor Alòs i Font 
> wrote:
>
>> Hi Prasanta,
>>
>> I'm delighted you are working on Santali.
>>
>> I have seen your code in github, and I have seen nothing that could
>> generate this kind of error. So, I downloaded your code, I fixed the
>> eng-sat.t1x file because it has a syntactic error, and I ran the
>> translation of "cows"... but I can't get anything, seemingly because of a
>> problem in the postchunk module (although it is standard):
>>
>> $ echo "cow" | apertium -d . eng-sat-interchunk
>> ^ᱰᱟᱹᱝᱜᱽᱨᱤ$^sent{^᱾$}$
>> $ echo "cow" | apertium -d . eng-sat-postchunk
>> ^᱾$
>>
>> In any case, if you somehow get something of the type "X/Y" (like
>> "ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱤᱱ/ ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱚ") this is because in the target dictionary there
>> are two generations for  ᱰᱟᱹᱝᱜᱽᱨᱤ. Probably you have:
>>
>> 
>>   
>>  ᱠᱤᱱ 
>>  ᱠᱚ 
>> 
>>
>> instead of:
>>
>> 
>>   
>>  ᱠᱤᱱ 
>>  ᱠᱚ 
>> 
>>
>> Regards,
>> Hèctor
>>
>>
>> Missatge de Prasanta Hembram 
>>  del dia dg.,
>> 19 de des. 2021 a les 19:09:
>>
>>> Hi, I'm working on a new language pair English -Santali pair and trying
>>> to learn everyday how can i improve this pair. Today I had a few doubts.
>>>
>>> Doubt 1: Plural rules are not working when translating from English to
>>> Santali, The English plural form "Cows" returns output " ᱰᱟᱹᱝᱜᱽᱨᱤ " instead
>>> of "ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱤᱱ/ ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱚ" . I have set up paradef but no luck. How to
>>> get the correct output??
>>>
>>> The correct forms are as follows
>>> Cow = ᱰᱟᱹᱝᱜᱽᱨᱤ
>>> Cows = ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱤᱱ  or Two Cows = ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱤᱱ
>>> Cows = ᱰᱟᱹᱝᱜᱽᱨᱤ ᱠᱚ
>>>
>>> My Santali Monolingual Dictionary link:
>>> https://github.com/Prasanta-Hembram/apertium-sat
>>>
>>> English-Santali Bilingual Dictionary link :
>>> https://github.com/Prasanta-Hembram/apertium-eng-sat
>>>
>>> echo "cows" | apertium -d . eng-sat-transfer
>>>
>>> returns wrong output: ^ᱰᱟᱹᱝᱜᱽᱨᱤ$^sent{^᱾$}$
>>>
>>>
>>>
>>>
>>>
>>>
>>> 
>>>  
>>>ᱠᱤᱱ  
>>>ᱠᱚ 
>>> 
>>>
>>> --
>>> Thanks
>>> with best regards
>>> Prasanta Hembram
>>




Re: [Apertium-stuff] English-Santali Plural form not working

2021-12-21 Thread Kevin Brubeck Unhammer
>  please unsubscribe me

try https://sourceforge.net/projects/apertium/lists/apertium-stuff/unsubscribe





Re: [Apertium-stuff] English-Santali Plural form not working

2021-12-21 Thread Kevin Brubeck Unhammer
>  please unsubscribe me

https://sourceforge.net/projects/apertium/lists/apertium-stuff/unsubscribe





Re: [Apertium-stuff] Changes on apertium-apy

2021-12-15 Thread Kevin Brubeck Unhammer
> Please unsubscribe me from this list. Or please tell me how to.

https://sourceforge.net/projects/apertium/lists/apertium-stuff/unsubscribe





Re: [Apertium-stuff] Automatically change first-person to third-person

2022-02-14 Thread Kevin Brubeck Unhammer

> The link in that earlier email is dead, so I can't see what the original
> script was doing, but based on the name it might have just been replacing
>  with , in which case, if you still have that script, you could
> just edit it to replace  with .

Wops, I should've attached it …

These days I think I'd use rtx for this, probably would be an even
shorter file =D




[Apertium-stuff] New CG editor features for Emacs

2022-03-23 Thread Kevin Brubeck Unhammer
Hi all,

If anyone uses Emacs out there, the CG package now does error/warning
highlighting out-of-the-box:
https://wiki.apertium.org/w/images/0/03/Cg-flymake.gif

There are also new Toolbar Buttons for those who like clicking things,
letting you open the input editor, run the grammar and filter the
output.

Documentation is at https://wiki.apertium.org/wiki/Emacs#CG
You can install from MELPA, or follow
https://wiki.apertium.org/wiki/Emacs#Quickstart which also enables
dix and hfst modes.

-

If you don't use Emacs, there is a list of editor support here:
https://wiki.apertium.org/wiki/Constraint_Grammar#Editor_support
(please add to it if you know of more!)





[Apertium-stuff] Get Faster Compilation With This One Simple Trick!

2022-03-30 Thread Kevin Brubeck Unhammer
TL;DR: Put `export LT_JOBS=yes` in your bash profile for faster dix
compilation. You'll need the newest packages for this. My most commonly
run compilation now takes 60s instead of 90s.

Please try it out and report back if you find any bugs.

In more depth: The newest package of lttoolbox in nightlies
(3.6.3+g538~e7418efe-1~focal1 on my Ubuntu) has a feature for minimising
sections in parallel. Minimisation is the bottleneck of lt-comp, and
each section is independent, so this was trivial to parallelise.
However, most dix files have one big section and one tiny one so there's
not much to win. So by default lt-comp also creates a new section on the
fly after reading 50.000 entries. This can speed things up even without
adding CPU cores (minimisation gets non-linearly slower the larger the
transducer). You can tweak this with the environment variable
`LT_MAX_SECTION_ENTRIES`.
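
To make the idea concrete, here is a toy Python sketch of "split into sections, then minimise each section in parallel". It is not how lt-comp (which is C++) implements it; the function names are invented, and `toy_minimise` is only a stand-in for real transducer minimisation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_sections(entries, max_entries=50_000):
    """Cut one long entry list into consecutive sections of bounded size."""
    return [entries[i:i + max_entries]
            for i in range(0, len(entries), max_entries)]


def minimise_all(sections, minimise):
    """Apply `minimise` to every section in parallel, keeping section order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(minimise, sections))


def toy_minimise(section):
    """Stand-in for real FST minimisation: just drop duplicate entries."""
    return sorted(set(section))
```

Since each section is independent, the sections can be handed to separate workers; and since minimisation cost grows faster than linearly with size, several small sections can be cheaper than one big one even on a single core.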

More info at
https://github.com/apertium/lttoolbox/pull/133
https://github.com/apertium/lttoolbox/pull/135





Re: [Apertium-stuff] apertium-srd-ita 1.2.0 ready to be released

2023-10-09 Thread Kevin Brubeck Unhammer
Congrats!





Re: [Apertium-stuff] newbie

2023-08-23 Thread Kevin Brubeck Unhammer
> Theoretical, if I am able to get thru Sysdamins, this is “all” we need for 
> Suse?
> # Nightly, unstable, new, almost always use this:
> curl -sS https://apertium.projectjj.com/rpm/install-nightly.sh | sudo bash
>
> —OR—
>
> # Release, stable, old:
> curl -sS https://apertium.projectjj.com/rpm/install-release.sh | sudo bash
> and finally:
> # OpenSUSE:
> sudo zypper install apertium-all-devel
> What I do not see how ENG-DEU module will be part of it?

If you run install-nightly.sh, you should be able to install
apertium-eng-deu (and apertium-apy) afterwards.




Re: [Apertium-stuff] newbie

2023-08-22 Thread Kevin Brubeck Unhammer
Hi,

lt-proc is just one part of the pipeline, you also need transfer, tagger
etc. – the full pipeline is at
https://github.com/apertium/apertium-eng-deu/blob/master/modes.xml#L19..L62
It also depends on cg-proc, lsx-proc and lrx-proc which don't have Java
ports. So as of now, you can't run it with just lttoolbox.jar. The
easiest option is to just shell out to the regular Apertium pipeline.

Is this meant to run on Android or something since you're looking at
Java?

best regards,
Kevin Brubeck Unhammer 





Re: [Apertium-stuff] newbie

2023-08-22 Thread Kevin Brubeck Unhammer
> Thank you Kevin
>
> I was asked to find offline translator for a tomcat based webapp.
> Tomcat itself is running on linux (development on win and mac). Our
> goal was with pure java to get less integration issues with system
> administration and to keep all logic in the same (WAR) file. But know
> my understanding, that lttoolbox-java is part of the game.
> No way to get a java only version? Even how pure android version would go?

There was a pure Java Android app long ago, but lately pretty much all
language pairs require modules that don't have Java ports.

But if you're on a Linux server, you can probably just run
https://apertium.projectjj.com/apt/install-nightly.sh and then `apt
install apertium-apy apertium-eng-deu` and you'll have a localhost api
you can call, see https://wiki.apertium.org/wiki/Apertium-apy

Or use the apy Docker image if you're on a non-Debian-based system; that
might be easiest if you're developing on non-Linux anyway.
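
Once apy is running, calling it is a plain HTTP request. A minimal sketch in Python (the same call can be made from Java), assuming apy listens on its usual port 2737 on localhost and returns the translation under responseData.translatedText; check the wiki page above for the exact response format of your version.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def translate_url(text, pair="eng|deu", base="http://localhost:2737"):
    """Build the request URL for apy's /translate endpoint."""
    return f"{base}/translate?" + urlencode({"langpair": pair, "q": text})


def translate(text, pair="eng|deu", base="http://localhost:2737"):
    """Call a running apy instance and return the translated text."""
    with urlopen(translate_url(text, pair, base)) as response:
        payload = json.load(response)
    # Assumed response layout: {"responseData": {"translatedText": "..."}}
    return payload["responseData"]["translatedText"]
```

Keeping the URL construction in its own function makes it easy to check the escaping (for example of the `|` in the language pair) without a server running.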








Re: [Apertium-stuff] Apertium PMC Election: Census & Candidates

2022-04-26 Thread Kevin Brubeck Unhammer
> === Candidates:
> Do you want to be a PMC member? Speak up!

I do.


-Kevin




[Apertium-stuff] partners for Initiative to require support for third party language components

2022-06-29 Thread Kevin Brubeck Unhammer
Hi,

Forwarding
https://giella.zulipchat.com/#narrow/stream/124588-all_langs/topic/IDIL.20langtech.20network

> The Norwegian Ministry of Local Government and Regional Development together 
> with the Norwegian Sámi Parliament are going to use the International Decade of
> Indigenous Languages (IDIL) to try to change how big tech is treating - or 
> not - languages outside their comfort zone. The main target is systems and 
> tools penetrating major parts of the society (like mobile or desktop 
> operating systems, or office packages used in school), or system & tools 
> critical to the society (like patient-facing health care services). The basic 
> idea is that every part of such systems that touches human language must be 
> made so that third parties can offer their own 
> plugin/localisation/replacement for the language component, at least for the 
> languages the vendors do not provide support for. That is, most of the 
> languages of the world.

> The coverage should be as wide as possible, from menu texts and button labels 
> to interactive speech dialog systems.

> The goal is to establish some sort of framework that would define system 
> developer responsibilities such that any single language component developer 
> would not have to ask permission for each language and component. Instead, 
> everything should be open by default. Underlying technologies would not have 
> to be open, only the human interface parts, or technological components 
> driving language technology, and language specific parts thereof.

> All in all, system providers should not be allowed to decide what language 
> technology and UI language should be available to a language community - that 
> should be up to the language community themselves, or a third party that the 
> language community cooperates with.

> To give such an initiative momentum, there needs to be as large a network of 
> partners behind it as possible. I therefore ask you all to give me names of 
> relevant institutions, preferably with a contact person, and all relevant 
> contact information. I will collect all the information I get, and forward it 
> to the ministry.

> The deadline is pretty short, Friday this week before lunch, Norwegian time.

> Some examples:

> South Africa: some (academic or governmental) group working with language 
> technology tools for SA languages, preferably the head of that group.
> The (head of the) Greenlandic language board, as well as the head of the 
> Danish language board
> EU institutions
> Welsh language authorities/board/language tool developers
> Indian ditto

> The ministry will contact all suggested network partners, and organise the 
> collaboration, we only need to give them names and emails.

> Please answer/discuss/suggest in this thread, to keep everything in one place

So basically if you know someone who could be relevant, suggest them in
that Zulip thread (it's possible to log in with github to join) and the
ministry will ask them if they'd like to get on board.
 






Re: [Apertium-stuff] Apertium PMC Election: Bypass election?

2022-04-27 Thread Kevin Brubeck Unhammer
Xavi Ivars  čálii:

> But also, voting for just to confirm (or also push back?) the only group of
> people that volunteered seems a bit useless.
>
> Maybe if someone outside the PMC gave their opinion, voting would make more
> sense. But so far, it's been only the ones in the PMC (+ Sushain + Daniel),
> everyone agreeing.

As someone currently outside the PMC, I too vote for no vote if it means
avoiding unnecessary bureaucracy :)






[Apertium-stuff] Released: nno-nob 1.5.0

2022-10-26 Thread Kevin Brubeck Unhammer
Hi,

I've tagged new releases of nno, nob and apertium-nno-nob.

As before, work has been funded by the Norwegian Ministry of Culture via
Nynorsk pressekontor (NPK) and the Norwegian News Agency, with commits
from contributors Mari, Anja, Maria, Victoria and Hallvard of NPK.

One major change is that we now use apertium-separable lsx rules, which
has helped a lot with getting more idiomatic translations without
complicated transfer rules, e.g.

hemmeligholde X                -> holde X hemmelig
stifte bekjentskap med X       -> lære X å kjenne
X.det Y.adj.sg aftenantrekk.sg -> Y.adj.pl selskapsklær.pl

(where X/Y are any lemma). The nob→nno direction currently has 1333 such
lsx rules.

Another pipeline change is that we now use the syntax CG to get mapping
tags; these are currently only used for finding subjects of passives,
but the plan is to also use it for other things like participle
subjects. 

Other changes:
- over 20.000 new bidix entries (around 18.000 new entries in each of
  nno/nob monodixes), of which 4.370 were proper names
- 48 new transfer rules
- 40 new spelling/variant preferences
- 1019 new lrx rules
- lots more disambiguation tweaks
- many more rules for name guessing using CG

best regards,
Kevin




[Apertium-stuff] Released: swe-nor 0.4.0, dan-nor 1.5.0

2022-12-19 Thread Kevin Brubeck Unhammer
Goddag,

I've just tagged new releases of swe-nor and dan-nor.

The work on swe-nor is partially funded by the Norwegian News Agency,
and dan-nor by Store norske leksikon.

For both pairs, all directions now use apertium-separable (lsx) and
recursive transfer (rtx), with testing by apertium-regtest.

Most of the work has been focused on the nob→{swe,dan} direction, but
all directions have of course improved vocabulary and seem to have
improved quality. The directions into Nynorsk are also usable with style
preferences (though it hasn't been added to the UI yet in this release).

Some stats:

dan-nor:
- Over 22.000 new non-name bidix entries
- Over 300 new lexical selection rules
- ~60 separable/mwe entries, including comma insertion rules for
  generating Danish

swe-nor:
- Over 20.000 new non-name bidix entries
- Over 300 new lexical selection rules manually added
- Nearly 7000 new lexical selection rules based on corpus frequencies
- ~30 separable/mwe entries

and the newer monolingual dependencies mean much better bokmål
disambiguation (and some improvements there for the other languages as
well) as well as much better compound epenthetic choices and tweaks all
round.

Moving from chunking transfer to recursive for these pairs was a joy. I
have spent very little time on the rules, but they already cover more
than the old rules did, in far fewer lines of code (including comments
and everything, dan-nor has ~1011 lines of rtx in one file per
direction, and 8347 of t?x with three files per direction). Each
direction has about 20 rtx rules (where a rule is NP→n|ncmp n|…), 50 if
you count alternatives. There's a lot less redundancy than before, and
the recursion means we can have e.g. compounds of arbitrary length.

-Kevin





Re: [Apertium-stuff] GSoC 2023 Mentors & Ideas?

2023-01-24 Thread Kevin Brubeck Unhammer
I'll mentor :) 

--
Kevin Brubeck Unhammer 

> GSoC 2023 org application is open, but do we have mentors for this year?
> Please report in if you want to mentor.
>
> And as every year, please review
> https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code -
> add/remove/amend ideas.
>
> -- Tino Didriksen
>
>




Re: [Apertium-stuff] GSoC 2023 Mentors & Ideas?

2023-01-27 Thread Kevin Brubeck Unhammer
Hi,

Those sound like great projects. There's no limit to how many projects
can be related to one language; proposals are ranked based on things like
student application/experience/involvement, mentor availability, and how
well the proposal fits with the Apertium project's overall goals and how
feasible it seems.

If Apertium is accepted for GSoC, we will try to get the word out
wherever we can to possible students. Sometimes they hear it from us,
other times they find Apertium from looking over the general list of
GSoC projects.

(For first-time IRC use, people seem to like using Matrix, there's a
guide at https://wiki.apertium.org/wiki/IRC/Matrix but it should be
enough to make an account at https://riot.im/app/#/register and search
for "Apertium" in public rooms.)

regards,
Kevin

> Hi
>
> I'm desperately trying to join Apertium IRC to talk about this but I
> can't, so I will use this mailing list instead.
>
> We would like to propose subjects for GSoC related to occitan
> language. We have 2 ideas :
> - Convert the occitan-french pair to recursive transfert. I think
>   Hector Alos would agree to be the mentor on this one.
> - Add two occitan varieties (provençal and limousin) to the
>   occitan-french pair. I could mentor this one, and Lo Congrès would
>   provide lexicons (for conjugations and translations) to pour in
>   Apertium monodix and bidix.
>
> I don't know how GSoC works. Can a language submit two subjects ? If
> not, could we apply this year for the recursive transfert and next
> year for the new varieties ?
>
> How does it work to search for students with the required abilities ?
>
> Thanks
>
> Aura Séguier, responsabla de projèctes e desvolopaira
> Lo Congrès permanent de la lenga occitana
> Ciutat - Creem !, 5-7 rue de la Fontaine, 64000 Pau
> T. +33 (0)5 32 00 00 64
> a.segu...@locongres.org
> 
> www.locongres.org 





Re: [Apertium-stuff] GSoC 2023 Mentors & Ideas?

2023-01-27 Thread Kevin Brubeck Unhammer
> As far as rewriting the
> transfer rules using apertium-recursive is concerned, a co-mentor with
> experience in the module would be highly desirable.

I can try to assist :) 




Re: [Apertium-stuff] New Occitan-French release

2022-10-31 Thread Kevin Brubeck Unhammer
Congrats on the release!

And that documentation is impressive :) 

> 1) We have a serious problem in the translation from Gascon into French.
> The basic issue is that some Gascon speakers use something called
> enunciatives and others do not. These enunciatives, when they are used, are
> found in every sentence and, what is worse, they are homographs with other
> words of very high frequency. At present, we take it for granted that
> Gascon sentences have an enunciative. The problem is that if they do not,
> the disambiguator tends to assign the enunciative function to homographs
> because, by definition, there must be at least one enunciative in every
> sentence.

(With the caveat that I have no idea what enunciatives are), one option
might be to set a variable in CG if you find evidence that the text
doesn't use enunciatives, and then for the remainder of the text remove
enunciative readings if the variable is set. If every sentence of an
enon speaker must have one enon, then finding a sentence without one
would be evidence they don't speak enon:

  SETVARIABLE (non-enon) (1) (*) IF (NEGATE 0* (enon)) ;

If you know that "que" can't be enon before "xyzzy", you could prepend
that rule with

  "<que>" REMOVE (enon) IF (1 ("xyzzy")) ;

and so on, so that the rule is more likely to hit.

Then just

  REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;

which will keep removing for all sentences of the translation.

That will have to be reset at some point, especially when running in a
server (I can't remember if cg-proc already resets all variables on null
flush?) or for corpus runs. At the very least

  REMVARIABLE (non-enon) IF (0C (enon)) ;

Testing it sounds challenging.

> 2) Occitan is very diverse: not only because of its six major dialects (+
> transition areas + regions outside the borders of France with other contact
> languages), but also because of the internal variation within each of them.
> The example of the Gascon enunciative is just one of the stuff that could
> be mentioned from Gascon alone. It would be interesting to use the system
> implemented for Nynorsk to produce sub-varieties.

Highly recommended. We have 52 preference choices now (that's 2^52
possible combinations? which I believe may be higher than the number of
Nynorsk users), but with

* only one generator fst
* only one bidix fst

ie. no compilation slowdown, and a cleaner Nynorsk dix – because we had
to clean up stuff in order to do this (previously variants "løk" and
"lauk" were separate lemmas, now they're one lemma with a spelling
pardef applied).
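
On the 2^52 figure above: assuming the 52 preferences are independent on/off switches, each one doubles the number of possible configurations. A quick check of the arithmetic (the function name is just for illustration):

```python
PREFERENCES = 52  # number of preference choices mentioned above


def combinations(n_switches):
    """Each independent on/off switch doubles the number of configurations."""
    return 2 ** n_switches
```

`combinations(PREFERENCES)` is 4503599627370496, i.e. about 4.5 quadrillion, so compiling one generator per combination would clearly not be an option.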




Re: [Apertium-stuff] New Occitan-French release

2022-11-04 Thread Kevin Brubeck Unhammer
What if you do

lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin | cg-proc oci.rlx.bin | …

The first CG step would output a stream variable, so that what the next
step sees is

[<STREAMCMD:SETVAR:non-enon>]
^que/que/que$ 
[more text here]

If the next step is CG, it's just

 REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;

ie. remove enunciatives whenever the var is set.

One can also unset it in the middle of the stream (if doing corpus
runs), so output of the enon-detector is

[<STREAMCMD:SETVAR:non-enon>]
^que/que/que$ 
[more text here]
[<STREAMCMD:REMVAR:non-enon>]
^que/que/que$
[more text here]

and the REMOVE:var-is-set rule will remove enunciatives in the first
part, not after seeing the REMVARIABLE.


Then the problem of looking several windows ahead is restricted to that
first enon-detector step.
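
The effect the second step should have on the stream can be mimicked outside CG. This is a minimal Python sketch, not how cg-proc works internally. The `[<STREAMCMD:SETVAR:non-enon>]` / `[<STREAMCMD:REMVAR:non-enon>]` spelling, the `<enon>` and `<cnjsub>` tags and the function name `drop_enon` are all assumptions made for illustration, and escaped characters in the stream are ignored.

```python
import re

# Assumed spelling of the stream commands inside Apertium superblanks.
SETVAR = re.compile(r"\[<STREAMCMD:SETVAR:([^>]+)>\]")
REMVAR = re.compile(r"\[<STREAMCMD:REMVAR:([^>]+)>\]")
# A token is either a stream command or a lexical unit ^...$.
TOKEN = re.compile(r"\[<STREAMCMD:(?:SETVAR|REMVAR):[^>]+>\]|\^[^$]*\$")


def drop_enon(stream, var="non-enon", tag="<enon>"):
    """Remove readings carrying `tag` from lexical units while `var` is set."""
    active = False
    out = []
    pos = 0
    for m in TOKEN.finditer(stream):
        out.append(stream[pos:m.start()])  # copy blanks unchanged
        pos = m.end()
        text = m.group(0)
        set_m = SETVAR.fullmatch(text)
        rem_m = REMVAR.fullmatch(text)
        if set_m:
            if var in set_m.group(1).split(","):
                active = True
        elif rem_m:
            if var in rem_m.group(1).split(","):
                active = False
        elif active:
            surface, *readings = text[1:-1].split("/")
            kept = [r for r in readings if tag not in r]
            # Like CG's REMOVE, never delete the last remaining reading.
            if kept:
                readings = kept
            text = "^" + "/".join([surface] + readings) + "$"
        out.append(text)
    out.append(stream[pos:])
    return "".join(out)
```

With the variable set before the first "que" and removed before the second, only the first one loses its hypothetical `<enon>` reading:

```python
stream = ("[<STREAMCMD:SETVAR:non-enon>]^que/que<enon>/que<cnjsub>$ "
          "[<STREAMCMD:REMVAR:non-enon>]^que/que<enon>/que<cnjsub>$")
print(drop_enon(stream))
```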




Alternatively, if we assume all the input is of the same language, we
just don't know what language it is ahead of time, then you could
do several passes, where one is a detector pipeline like

lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin

that outputs the STREAMCMD and then Apy would grep for that, and insert
the STREAMCMD at the start of the call to the regular pipeline

lt-proc oci.automorf.bin | cg-proc oci.rlx.bin | …

That won't automatically work in modes files, and won't work for corpus
tests if the corpus has a mix, but OTOH you could use 'export
AP_SETVAR=non-enon' to force the regular pipeline to insert the
STREAMCMD at the start.
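
The two-pass idea could be glued together roughly like this. It is a sketch only: the stream-command spelling is an assumption, `prepare_input` is a made-up helper name, and the detector output is passed in as a string rather than read from a real pipeline.

```python
# Assumed spelling of the stream command the detector pass would emit.
SETVAR_CMD = "[<STREAMCMD:SETVAR:non-enon>]"


def prepare_input(raw_text, detector_output):
    """Prepend the SETVAR command if the detector pass emitted it."""
    if SETVAR_CMD in detector_output:
        return SETVAR_CMD + raw_text
    return raw_text
```

The result of `prepare_input` is what would then be sent to the regular pipeline, so the whole text is treated the same way from the first sentence on.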





Re: [Apertium-stuff] New Occitan-French release

2022-11-01 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> Enunciatives are a kind of adverbs that are put just before verbs in main
> clauses (although they can also be found in subordinate clauses too). For
> affirmative clauses, it works like the English reinforcement "do" in "I do
> like", but it is syntactically compulsory for enunciative users, so it's
> not seen as a reinforcement. The problem is that for affirmative clauses
> the enunciative is "que", which can be cnjsub (=that), rel (=that, which),
> prn.itg (=what, which) and a comparative (=than). Note that cnjsub, rel and
> prn.itg are often right in front of the verb in Occitan too. For negative,
> interrogative and exclamatory clauses other words can be used, but also
> "que"... which makes all the thing a big mess. (And there are more with
> dubitative, emphatic, etc. meanings).
>
> As for your proposal, I do not yet have sufficient knowledge of CG to fully
> understand it. My idea would be to make a first pass through a whole text
> to understand if enunciatives are used in it (for example, recognising
> other, more infrequent, but more easily recognisable enunciatives). In the
> solution you propose, it seems that this knowledge is acquired
> progressively, as sentences are translated. I fear that "que" is so messy
> that at least the first sentences of a text would have the same problems as
> we have now when we translate a Gascon text without enunciatives.

That should be possible too, though I'm not sure how feasible it is to
get CG to go that far into a text. By default, CG keeps a context of two
windows, but that's configurable. It should be possible (perhaps with
minor modifications to cg-proc) to read a bunch of sentences and use
Window Spanning tests https://visl.sdu.dk/cg3/single/#test-spanning

Tino, have you tried looking ahead several paragraphs, are there any
downsides? This should be a fairly simple rule file.

> This sounds perfect for Occitan. Is there a documentation in the wiki?

There is! See:
https://wiki.apertium.org/wiki/Dialectal_or_standard_variation#Overlapping_variants






Re: [Apertium-stuff] Tagset Standardization

2023-03-07 Thread Kevin Brubeck Unhammer
Daniel Swanson wrote:

> Greetings Apertiumers!
>
> This morning I set out to change the Ancient Hebrew analyzer from
> Latin script to Hebrew script (a task I don't wish upon anyone) and in
> the process produced a search-and-replace tool that understands the
> structure of several of our source files:
> https://github.com/mr-martian/apertium-grep

Awesome!

> This script could, without too much trouble, be expanded to cover the
> rest of our source files, at which point I would like to propose that
> we move towards greater standardization of our tagset:
> https://wiki.apertium.org/wiki/List_of_symbols
>
> At minimum, I would like to deal with some of the duplicate tags, like
> impf/imperf, rec/res, v/vblex, pass/pasv, etc.

That would be great! I'll put in a vote for pasv right now.

> My preference would be that we also consider splitting compound tags,
> like the tense+mood (fti, fts, pii, pis) and maybe possessor and
> subject tags (px1sg, s_1sg).

It makes sense to split tense and mood, as well as number and person,
but I doubt it can be done automatically – it will require careful
changes to CG and transfer. Might make sense to try it on one language
pair along with the maintainer and see how it goes.

It would be very dangerous to turn  into  – that would
break lots of CG and transfer rules and possibly lead to more complexity
in tag matching since you now have to always check for the existence of
 wherever you check for  etc.

> And if we wanted to go really crazy we
> could consider a broader rewrite like changing our tags to UD-style
> feature-value pairs (so  becomes ), though I don't
> imagine we actually want to go nearly that far.

In principle I like it, but in practice I find it too noisy to be
worthwhile. Feature-value pairs are very useful if there's several
features that can have the same value, e.g. booleans Seen=true or
numbers Age=65, and for linguistics there are some use-cases like
SubjNumber vs ObjNumber, or  vs . But for the majority of the
values for most languages, the value indicates its feature (only Voice
can have the value Passive, only Case can have the value Genitive etc.).
So we'd end up with longer lines and debug output and lots of docs
needing rewriting and – most importantly – people needing retraining,
only for the sake of avoiding sticking an "apertium-to-ud" script in a
pipeline.
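
To illustrate why the value usually implies its feature, here is a rough Python sketch of what the core of such an "apertium-to-ud" script might look like. The table is hypothetical and tiny; a real converter would need per-language tables and special handling for the cases where the same value belongs to several features.

```python
# Hypothetical tag table: each Apertium tag maps to exactly one
# UD (feature, value) pair, because the value already implies the feature.
TAG_TO_UD = {
    "pasv": ("Voice", "Pass"),
    "gen": ("Case", "Gen"),
    "pl": ("Number", "Plur"),
    "sg": ("Number", "Sing"),
}


def to_ud_features(tags):
    """Convert a list of Apertium tags to a UD-style feature string."""
    pairs = [TAG_TO_UD[t] for t in tags if t in TAG_TO_UD]
    # UD lists features alphabetically by feature name.
    return "|".join(f"{feat}={val}" for feat, val in sorted(pairs))


print(to_ud_features(["pl", "gen"]))  # Case=Gen|Number=Plur
```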




Re: [Apertium-stuff] Regarding GSOC Project

2023-03-23 Thread Kevin Brubeck Unhammer
>  Since there was no coding challenge mentioned on the wiki,
> I'm assuming that there is none

The task just links to the issue tracker
https://github.com/apertium/apertium-python/issues
so I'm guessing a good place to start would be to

1. try out the api, see what features are there

2. try out the rest of apertium, both through beta.apertium.org and the
   command line, try both language pairs and monolingual packages.
   Install apertium using the nightly packages: 
   https://wiki.apertium.org/wiki/Prerequisites_for_Debian

3. look at open issues, see if there's one where you (after having tried
   the existing packages) feel you understand the problem and try
   solving it




Re: [Apertium-stuff] gsoc2023 proposal

2023-03-18 Thread Kevin Brubeck Unhammer
> Hello, I have finished my first draft and I would love to get any feedback
> from potential mentors.
> https://wiki.apertium.org/wiki/User:Eiji

Hi,

This looks promising :) Some thoughts:

You've already made kind of an overview of the possibilities in your
proposal; I would tone down the "investigate possibilities" parts and
instead try to focus on how you're going to implement one of the
methods, using apertium-jpn as a testbed.

Try to make clear deliverables per week or at least every other week,
you should have something like a proof-of-concept by week 2 – especially
if your ambition is to also work on improving the Japanese language
data. You currently have week 6 for testing – but you should be testing
from the start alongside the coding. I would probably plan for 2 weeks
for converting the PoC from Python to C++ and making it usable as a part
of the pipeline. 

(Think about how this will be integrated into apertium – we have a
translation pipeline which expects a certain format
https://wiki.apertium.org/wiki/Apertium_stream_format )
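
As a rough idea of what a tokeniser would have to emit, here is a small Python sketch that reads lexical units of the form ^surface/lemma<tag><tag>/...$ back into a data structure. The function name and the example sentence are made up for illustration, and escaped characters and superblanks are ignored for brevity.

```python
import re

LU = re.compile(r"\^([^$]*)\$")   # one lexical unit: ^...$
TAG = re.compile(r"<([^>]+)>")    # one tag: <...>


def parse_lexical_units(stream):
    """Return [(surface, [(lemma, [tags]), ...]), ...] for a stream."""
    units = []
    for m in LU.finditer(stream):
        surface, *analyses = m.group(1).split("/")
        readings = []
        for analysis in analyses:
            lemma = analysis.split("<", 1)[0]
            readings.append((lemma, TAG.findall(analysis)))
        units.append((surface, readings))
    return units


print(parse_lexical_units("^dogs/dog<n><pl>$ ^run/run<vblex><inf>/run<n><sg>$"))
```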

best regards,
Kevin Brubeck Unhammer 







Re: [Apertium-stuff] Tagset Standardization

2023-03-07 Thread Kevin Brubeck Unhammer
Daniel Swanson wrote:

> To be clear, I meant splitting  into .

 

> One of my ideals for the tagset is that every tag be
> position-independent, so that the only reason I need to care about
> order is because of FST topology (and maybe not even then).

Aren't the tags themselves already position-independent? Both CG and to
a certain extent transfer assume that.




Re: [Apertium-stuff] GSOC2023

2023-02-24 Thread Kevin Brubeck Unhammer
> I'd like to participate in Google Summer of Code 2023 at Apertium.
> In particular, I'm interested in adding new language pair and I am
> thinking to add Japanese-English as I speak Japanese. I took summer
> school at Tokyo University online on natural language processing
> before.
> Could you tell me more about the project?

Hi,

Getting some support for Japanese would be great! I'm not sure if you
saw the whole IRC discussion, but what we really need in that regard is
support for the *tokenisation* step, where our regular methods[1] fail
us, since the text might have no spaces and lots of
tokenisation-ambiguity. There has been some prior work[2] and it's
already listed as a potential GSoC project[3].

Support for anything-Japanese depends on tokenisation. It's also a big
enough job that it would qualify as a full GSoC project, so if you were
hoping for jpn-eng in a summer you will be disappointed (but having a
toy language pair to test with would help!). On the other hand, if we
get good spaceless tokenisation we open up the possibility for not just
Japanese, but Thai, Lao, Chinese etc. – and of course all those writing
systems used before the invention of the space character :)
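
To show why plain left-to-right longest match struggles without spaces, here is a simplified Python sketch of greedy LRLM segmentation over a word list. It is not how lt-proc is implemented, and the toy lexicon is made up.

```python
def lrlm_tokenise(text, lexicon):
    """Greedy left-to-right longest-match segmentation."""
    tokens = []
    i = 0
    max_len = max(map(len, lexicon), default=1)
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it on its own and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Greedy matching commits to "abc" and never considers "ab" + "cd".
print(lrlm_tokenise("abcd", {"ab", "abc", "cd", "d"}))  # ['abc', 'd']
```

A tokeniser for spaceless scripts has to keep such alternative segmentations around and choose between them, which is the part our regular methods don't do.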

regards,
Kevin

[1] https://wiki.apertium.org/wiki/LRLM
[2] http://hdl.handle.net/10066/20002 
[3] https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies




Re: [Apertium-stuff] Error in translation English-Italian

2023-02-24 Thread Kevin Brubeck Unhammer
> I use Metawiki as a translator. Often I find that the English word
> 'for' is translated by Apertium with 'partorisca'. This Italian word
> is a verb meaning 'to give birth'. The correct translation is 'per'.
> Is it possible to fix it?


According to Kartik Mistry, apertium-eng-ita shouldn't be enabled in
Mediawiki Content Translation at all (it's in the incubator, not released).
Though they do have Google Translate and other MT systems doing eng-ita
there.

By Metawiki, do you mean the Wikipedia content translation system
https://www.mediawiki.org/wiki/Content_translation
or something else?






Re: [Apertium-stuff] Transducer contains initial epsilon loop

2023-02-28 Thread Kevin Brubeck Unhammer
Hi,

Cf. http://tinodidriksen.com/pisg/OFTC/logs/%23hfst/2023-02-28.log
perhaps you can make an xfst rule to do the equivalent of

sed 's/\(.*\)/\1/'

?






[Apertium-stuff] Released: dan-nor 1.5.1

2023-02-13 Thread Kevin Brubeck Unhammer
Godmorgen,

New minor release of dan-nor is out, with slightly better support for
Nynorsk this time, and support for genitives of adjectives. This release
also includes the Nynorsk preference list so they can be shown in the
UI.

Some stats:

- Over 5.000 new non-name bidix entries
- Nearly 100 new lexical selection rules
- 34 new separable/mwe entries



best regards,
Kevin Brubeck Unhammer 





Re: [Apertium-stuff] Changes to apertium-preprocess-transfer

2023-06-27 Thread Kevin Brubeck Unhammer
> Greetings Apertiumers!
>
> I recently identified a way that apertium-preprocess-transfer was
> being rather inefficient and today I fixed it, so tomorrow you all
> should be able to update to apertium 3.9.4 and see some improved
> compile times for any pairs not using apertium-recursive, with
> speedups between 10x and 7000x faster on the files I tested.

That's wonderful =D

> And I just wanted to let you all know, in case someone was depending
> on those. To compensate, I added a check to apertium-lint which can
> report roughly the same information:

What's the recommended way of installing apertium-lint on debians?





Re: [Apertium-stuff] Ready to release: spa-arg 0.6.0 and arg-cat 0.3.0

2023-05-05 Thread Kevin Brubeck Unhammer
Congrats, that's great =D

> Dear all,
>
> The pairs Spanish-Aragonese and Aragonese-Catalan are ready to release
> (can anyone tag them?)
>
> apertium-spa-arg 0.6.0 (commit 61048e9) depends on apertium-spa
> (commit d2455cf, needs new tag)  and apertium-arg 0.2.0 (commit
> 0b9f06e).
>
> apertium-arg-cat 0.3.0 (commit 5255af5) depends on apertium-arg 0.2.0
> (commit 0b9f06e) and apertium-cat (commit 201dcec, needs new tag).
>
> Although they include some new entries and paradigms (especially in
> the monolingual apertium-arg), the main reason for the release is that
> both pairs have been adapted to generate Aragonese according to the
> new official spelling system approved by the Academia Aragonesa de la
> Lengua (while still analyzing text with the previous spelling system).
>
> Best,
>
> Juan Pablo
>
>
>





Re: [Apertium-stuff] Transducer contains initial epsilon loop

2023-05-14 Thread Kevin Brubeck Unhammer


> I am looking at this again. Removing the extra tag at the transfer
> stage seems to be too late down the pipeline (I need the adjective to
> match the noun which is done by CG). Actually, surely removing the
> extra tag could be done at the same CG stage?

If you use an xfst rule, that happens on the analyser FST, ie. before CG
and long before transfer.





Re: [Apertium-stuff] (no subject)

2023-05-05 Thread Kevin Brubeck Unhammer
Try https://sourceforge.net/projects/apertium/lists/apertium-stuff/unsubscribe





Re: [Apertium-stuff] Apertium in GSoC 2024?

2024-01-26 Thread Kevin Brubeck Unhammer
> [CC: -stuff and PMC]
>
> Should we apply for Google Summer of Code this year? Deadline Feb 6th.
>
> -- Tino Didriksen

I'd be happy to mentor at least. Some projects that I personally would
love to see happen:

* More dictionaries and language data! Whether from scratch or converting
  sources
* Implement preferences in existing pairs
* Capitalisation handling in existing pairs
* Faster / more robust recursive transfer
* (alternatively / more experimental) CG-based transfer
* Language Server Protocol and/or better editor support
  (newly open sourced Zed editor supports tree-sitter …)
* WASM
* Nice dictionary UI for web (and generally fixing web papercuts)

Also, not fully thought through yet, but I'd love some way of debugging
apertium-separable ("why did this rule apply/not apply"). I suppose any
tooling here would probably also help with regular dix.

Also, is there anything that could make using our data in *other* tools
and systems easier?

best regards,
Kevin





Re: [Apertium-stuff] GSoC 2024 mentors & admins must log in

2024-03-21 Thread Kevin Brubeck Unhammer
> Yes, even if you already registered last year. We just got a warning that
> we only have 1 admin (me), even though I was sure we had 3. So,
> https://summerofcode.withgoogle.com/
>
> "Before you can add an Org Member who has participated in previous programs
> to your organization for 2024, they must first agree to the 2024 Program
> Rules and Org Member agreement by logging into their GSoC dashboard and
> clicking the 2024 and expanding it to see the 2024 Terms."
>
> As soon as possible, or we'll be kicked out of the program.

OK, so under the heading saying "2024 Status: Accepted" you have to
click on the Rules and ToS and hit accept on them too.

(It was not immediately obvious that I still had something unfinished
there. They should have some kind of
Inbox (2) 
to make it clear.)





Re: [Apertium-stuff] Request for Wiki Account

2024-03-25 Thread Kevin Brubeck Unhammer
Check your inbox (and spam folder if not there) for a password.





Re: [Apertium-stuff] Question sus la desambigüizacion dins Apertium

2024-06-15 Thread Kevin Brubeck Unhammer
> On Tue, 4 Jun 2024 at 17:44, Aure Séguier  wrote:

>> Is it possible to write disambiguation rules specific to one
>> variety? For example, in Gascon we have the enunciatives ("que", "ne", etc.),
>> which don't exist in the other varieties. If we change the system for
>> handling varieties, it may no longer be possible to indicate in the
>> monodix that "que" (enunciative) exists only in Gascon. It risks being
>> recognised in Lengadocian and distorting the translation. There are also other
>> specific cases (partitive "de", which is almost never used in Gascon, but
>> always in Lengadocian...).

If you use the "new" system documented at
https://wiki.apertium.org/wiki/Dialectal_or_standard_variation#Overlapping_variants
with AP_SETVAR etc., then the variant info is available in all CG files,
not just the ones that select bidix/generator choices, but also the
disambiguator.

So you could have source variant tags as well as target variant. E.g. if
you want to say that your source language is gascon, you could

export AP_SETVAR='src_gascon'

or something like that, and then in CG, if for example "que" is used as
a personal pronoun only in Gascon, you could do

SELECT pers IF (0 ("que") + (VAR:src_gascon));
REMOVE pers IF (0 ("que")); # not gascon

Or you could make it more nuanced and feature-based like

export AP_SETVAR='src_que_pers,src_other_feature'

SELECT pers IF (0 ("que") + (VAR:src_que_pers));
…

(if, say, both Gascon and Bigourdan use que as personal pronoun, but only
Gascon has other_feature as well)

With this system, the .dix file is more ambiguous, but it's easy to do
early removal of irrelevant stuff from CG.







Re: [Apertium-stuff] Question sus la desambigüizacion dins Apertium

2024-06-19 Thread Kevin Brubeck Unhammer
> Occitan can manage variety in its metadix file. My question is, is
> there a way to manage variety in the .rlx file ?

There is :)

> For instance, we have the word "bad", "evil" which is "mal" in
> lengadocian and "mau" in gascon. But "mau" can also be a conjugated
> verb (a pretty rare one). I wrote this rule in the RLX file: REMOVE V
> IF (0 ("<mau>"i));
> But I would want this rule not to apply to lengadocian, where "mau"
> can only be a conjugated verb.
> Is that possible ? If not, is this something easy to implement ?

Yes. You could for example say that "src_lengadocian" is the variable
that signifies that the source language is lengadocian, and then have 
one rule that picks the verb if source language is lengadocian:

SELECT V IF (0 ("<mau>"i)) (0 (VAR:src_lengadocian)) ;

and one that removes it if not:

REMOVE V IF (0 ("<mau>"i)) (NEGATE 0 (VAR:src_lengadocian)) ;


I can't say for certain if this system makes things simpler or not for
you compared to metadix, but it allows for a lot more flexibility, with
much shorter compile times (since we have just one compiled FST which
contains all the variety).
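
The combined effect of the two rules on "mau" can be modelled in a few lines of Python. This is just a sketch of the logic, not of CG itself: it assumes the set V corresponds to the tag vblex, and the readings in the example are invented.

```python
def disambiguate_mau(readings, variables):
    """readings: list of (lemma, tags); variables: set of variable names."""
    verbs = [r for r in readings if "vblex" in r[1]]
    others = [r for r in readings if "vblex" not in r[1]]
    if "src_lengadocian" in variables:
        # SELECT V: keep only verb readings, if there are any.
        return verbs or readings
    # REMOVE V: drop verb readings unless that would leave nothing.
    return others or readings


# Invented example readings for "mau".
readings = [("mau", ["adj"]), ("maure", ["vblex", "pri"])]
print(disambiguate_mau(readings, {"src_lengadocian"}))  # verb reading only
print(disambiguate_mau(readings, set()))                # adjective reading only
```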





Re: [Apertium-stuff] Question sus la desambigüizacion dins Apertium

2024-06-19 Thread Kevin Brubeck Unhammer
> How can I define src_lengadocian as the variable that means the source
> language is lengadocian ?

Hm, it kind of depends. In general, if you use variables, you can do

export AP_SETVAR=src_lengadocian
echo mau o mal | apertium -d . oci-fra 

and that variable will be available to the CG as VAR:src_lengadocian

If you put it in oci-fra.preferences.xml, it will also show up on the
web like the Preferences d'estil button at
https://beta.apertium.org/index.cat.html#?dir=cat-spa

But maybe these source language differences actually *should* be kept as
separate pipelines, and shown as different source languages in the
language selector in the web UI? In that case, it might actually be
simpler to not do variables at all, and just have a separate CG file
with lengadocian rules that runs before the regular CG. So in your
oci-fra_lengadocian mode in
https://github.com/apertium/apertium-oci-fra/blob/master/modes.xml#L373
instead of

  

  
  

  

you would have the general automorf, but two CG disambiguator steps

  

  
  

  
  

  

and the first CG would just have a few rules for lengadocian-specific
stuff.





