Re: [Apertium-stuff] Tagset Standardization

2023-03-07 Thread Flammie A Pirinen
On Mon, Mar 06, 2023 at 03:35:45PM -0500, Daniel Swanson wrote:

> This script could, without too much trouble, be expanded to cover the
> rest of our source files, at which point I would like to propose that
> we move towards greater standardization of our tagset:
> https://wiki.apertium.org/wiki/List_of_symbols
> 
> At minimum, I would like to deal with some of the duplicate tags, like
> impf/imperf, rec/res, v/vblex, pass/pasv, etc.

Yay. There's probably some ind / indic ~ indv / indef confusions too.

> My preference would be that we also consider splitting compound tags,
> like the tense+mood (fti, fts, pii, pis) and maybe possessor and
> subject tags (px1sg, s_1sg)

That's already harder to implement; people surely have strong opinions
here. Personally, I'm used to having verbal person and number tagged
with only one tag, {sg,du,pl}{1,2,3}, rather than two, but I can see
that some languages can use separate tags for syntax etc. As long as
the tags are standard and have easy 1:n mappings, perhaps even scripts
to switch between them easily, they should be workable.
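
A minimal sketch of such a 1:n mapping script, assuming the combined tags
look like sg1 or pl3 and the split tags are the p1/p2/p3 and sg/du/pl
symbols from the wiki list. The function names are made up for
illustration and are not part of any existing Apertium tool.

```python
import re

# Combined person-number tags such as "sg1", "du2", "pl3".
COMBINED = re.compile(r"^(sg|du|pl)([123])$")
NUMBERS = ("sg", "du", "pl")


def split_person_number(tag):
    """Map one combined tag to two tags (1:n); other tags pass through."""
    match = COMBINED.match(tag)
    if match is None:
        return [tag]
    number, person = match.groups()
    return ["p" + person, number]


def join_person_number(tags):
    """Inverse direction: merge an adjacent person tag and number tag."""
    out = []
    i = 0
    while i < len(tags):
        is_pair = (
            i + 1 < len(tags)
            and re.fullmatch(r"p[123]", tags[i]) is not None
            and tags[i + 1] in NUMBERS
        )
        if is_pair:
            out.append(tags[i + 1] + tags[i][1])
            i += 2
        else:
            out.append(tags[i])
            i += 1
    return out
```

Because both directions are table-free and lossless for these nine
combinations, a pair could keep whichever style suits its transfer rules
and convert at the edges.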

> And if we wanted to go really crazy we
> could consider a broader rewrite like changing our tags to UD-style
> feature-value pairs (so  becomes ), though I don't
> imagine we actually want to go nearly that far.

Yeah, it would be ideal IMO, but it would probably also meet some opposition.
As long as we have a standardised tagset
and a fairly simple remapping to UD (and UniMorph would be nice too), it's
not too bad. Simple meaning a screenful of code in your favourite
scripting language (and a mapping table).
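
For illustration, a sketch of what "a screenful of code and a mapping
table" could look like for the duplicate tags mentioned above. Which
member of each pair becomes canonical is picked arbitrarily here; the
thread does not decide that, and CANONICAL and normalise_tags are
hypothetical names.

```python
import re

# Hypothetical mapping table: duplicate tag -> canonical tag.
# The direction of each mapping is an assumption made for this example.
CANONICAL = {
    "imperf": "impf",
    "res": "rec",
    "v": "vblex",
    "pasv": "pass",
}

TAG = re.compile(r"<([^<>]+)>")


def normalise_tags(text, table=CANONICAL):
    """Rewrite every <tag> found in the table, leave all other text alone."""
    return TAG.sub(lambda m: "<" + table.get(m.group(1), m.group(1)) + ">", text)
```

The same function would work on stream output and on dictionary source
lines, since it only touches text between angle brackets. A remapping to
UD or UniMorph would mostly mean swapping in a different table.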


-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)


signature.asc
Description: PGP signature
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] Regression Testing Now Operational

2021-07-30 Thread Flammie A Pirinen
On Fri, Jul 30, 2021 at 08:03:41AM -0500, Daniel Swanson wrote:
> >
> > Another question is that in lot of expected files there seems to be
> > all-capsed words for fin-* pairs, I am not sure how this has happened?
> > I am guessing my apertium is older and some ICU changes have affected
> > the output, perhaps in some code I've copypasted ununderstandingly to
> > all fin-* pairs.
> >
> 
> For that example in particular (and likely others as well), the t1x
> output code says
> 
> 
> 
> which gives the chunk an all-caps lemma (caseFirstWord is a
> non-existent variable, so it has no effect) and then the default
> behavior of postchunk is to copy the chunk case to the words, so
> 
> ^NP{^koti$}$
> 
> becomes
> 
> ^KOTI$^.$
> 
> so I think this might be a case of inadvertently relying on a bug in
> the old transfer case-handling functions and I'm not quite sure what
> the appropriate solution is.

Mmh so the @case here is overall copypasted from the original template:



I have never touched it or looked it up in the manual, maybe others have
not either, and yeah, I didn't even know chunk names had an effect on the
translation before now.
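
A simplified model of the behaviour Daniel describes, where the case
pattern of the chunk name is copied onto the words inside the chunk. This
is only a sketch of the idea, not the actual postchunk code; the function
names are invented, and the "aa"/"Aa"/"AA" labels follow the usual
Apertium case values.

```python
from typing import List


def case_of(text: str) -> str:
    """Classify casing as 'AA' (all caps), 'Aa' (initial capital) or 'aa'."""
    letters = [ch for ch in text if ch.isalpha()]
    if len(letters) > 1 and all(ch.isupper() for ch in letters):
        return "AA"
    if letters and letters[0].isupper():
        return "Aa"
    return "aa"


def apply_case(case: str, word: str) -> str:
    """Apply a case pattern to a single word."""
    if case == "AA":
        return word.upper()
    if case == "Aa":
        return word[:1].upper() + word[1:]
    return word


def copy_chunk_case(chunk_name: str, words: List[str]) -> List[str]:
    """Copy the chunk name's case pattern to the words in the chunk.

    Simplification: 'Aa' is applied to the first word only.
    """
    case = case_of(chunk_name)
    if case == "Aa":
        return [apply_case(case, w) if i == 0 else w for i, w in enumerate(words)]
    return [apply_case(case, w) for w in words]
```

Under this model a chunk named NP turns koti into KOTI, while a chunk
named np would leave it alone, which matches the symptom in the expected
files.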


-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Regression Testing Now Operational

2021-07-30 Thread Flammie A Pirinen
[for some reason these all went to gmail spam...]

On Tue, Jul 27, 2021 at 08:54:05AM -0500, Daniel Swanson wrote:
> On Tue, Jul 27, 2021 at 8:47 AM Flammie A Pirinen  wrote:
> >
> > perhaps this is different version of python or libraries? I have
> > python 3.9.1 on linux.
> 
> That's a typo for which I pushed a fix ~5 minutes ago.

Cool. I've been working with some language pairs I know, and I have a few
questions on CLI usage and stuff:

Commonly when I test things I get something like:

> Corpus 1 of 5: deu-fin-pending
> 11/27 (40.74%) tests pass (11/11 (100.0%) match gold)

so I start up cli and see:

deu-fin 1 of 1
INPUT:
  Haus
EXPECTED OUTPUT:
  KOTI
ACTUAL OUTPUT:
  koti
IDEAL OUTPUTS:
  talo

If this turns out to be a bug in the (dix/t*x) code, I'm not sure which
command to use to skip it and see the next error?

Another question: in a lot of the expected files there seem to be
all-caps words for fin-* pairs, and I am not sure how this has happened.
I am guessing my apertium is older and some ICU changes have affected
the output, perhaps in some code I've copy-pasted without understanding
into all fin-* pairs.


-- 
Regards, Flammie <https://flammie.github.io>
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Regression Testing Now Operational

2021-07-27 Thread Flammie A Pirinen
On Fri, Jul 23, 2021 at 08:14:01PM -0500, Daniel Swanson wrote:

> At the beginning of last week, 47 languages and pairs had a meaningful
> 'make test', and several of those were failing. As of today, 365 repos
> have 'make test' and virtually all are passing.

That's exciting. Here are some minor suggestions from a first test run:

* make test after git pull failed with "apertium-regtest: command not
  found", which is ok for devs but could be improved by checking for
  regtest at configure time (e.g. with an AM_CONDITIONAL), possibly with
  an informative message
* apertium-regtest has no ./autogen.sh; people have gotten used to that,
  so even just having autoreconf -fvi in it is a good idea

Running the actual tests in apertium-fin-nor I get:

make test
Making all in docs
make[1]: Entering directory '/home/flammie/github/apertium/apertium-fin-nor/docs'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/flammie/github/apertium/apertium-fin-nor/docs'
make[1]: Entering directory '/home/flammie/github/apertium/apertium-fin-nor'
apertium-validate-modes modes.xml
apertium-gen-modes modes.xml
make[1]: Leaving directory '/home/flammie/github/apertium/apertium-fin-nor'
apertium-regtest test
Corpus 1 of 4: fin-nob
Traceback (most recent call last):
  File "/usr/local/bin/apertium-regtest", line 1041, in <module>
    if not static_test(args.ignore_add):
  File "/usr/local/bin/apertium-regtest", line 950, in static_test
    corp.load()
  File "/usr/local/bin/apertium-regtest", line 378, in load
    outdata = load_output(self.out_name(c))
  File "/usr/local/bin/apertium-regtest", line 79, in load_output
    l = int(l[1:])
TypeError: 'int' object is not subscriptable
make: *** [Makefile:1132: test] Error 1

Perhaps this is a different version of python or libraries? I have
python 3.9.1 on linux.
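
For reference, this error message means the value being sliced was
already an int, which does not depend on the Python version. A minimal
reproduction of that class of error follows; it is not the
apertium-regtest source, and the "#12" field format and function names
are invented for the example.

```python
def parse_count(field):
    """Strip a one-character prefix and convert the rest to an int.

    Works only while `field` is still a string such as "#12".
    """
    return int(field[1:])


def parse_count_safely(field):
    """Defensive variant: accept a value that was already converted."""
    if isinstance(field, int):
        return field
    return int(field[1:])


if __name__ == "__main__":
    try:
        parse_count(12)
    except TypeError as err:
        print("TypeError:", err)
```

Reusing one short variable name for both the raw text and the parsed
number is a common way to end up here.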

-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Begin of sentence

2021-02-17 Thread Flammie A Pirinen
On Wed, Feb 17, 2021 at 07:22:36PM +0300, Hèctor Alòs i Font wrote:
> Is there any form to match a "begin of sentence" in lexical selection or in
> transfer? In transfer, usually the point of the previous sentence is used,
> but I want to match even the beginning of the first sentence of the text.

I think it would be possible to add a  tag in VISL CG 3,
that other steps can then refer to. 
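
To make the idea concrete, here is a rough sketch in plain Python (not
CG-3 syntax) of what later steps would see: the first lexical unit of the
stream and every unit following sentence-final punctuation gets an extra
tag. The tag name <bos> is hypothetical, since the original tag name did
not survive in this archive, and the regex ignores escaped characters in
the stream.

```python
import re

# One lexical unit in the Apertium stream: ^ ... $
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def mark_sentence_starts(stream, tag="<bos>"):
    """Append `tag` to the first unit of the stream and of each sentence.

    A sentence is assumed to end at a unit carrying the <sent> tag.
    """
    at_start = True

    def mark(match):
        nonlocal at_start
        body = match.group(1)
        marked = body + tag if at_start else body
        at_start = "<sent>" in body
        return "^" + marked + "$"

    return LEXICAL_UNIT.sub(mark, stream)
```

Lexical selection or transfer rules could then match on the extra tag
instead of on the full stop of the previous sentence, which also covers
the very first sentence of a text.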


-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Proper noun classification considered harmful

2021-02-12 Thread Flammie A Pirinen
I think I've come up with a solution that is minimally intrusive for
existing workflows and usages: allowing selected tags to be made
optional in generation, i.e.:

Analysing:

echo London | lt-proc eng.automorf.bin
^London/London/London$

(I didn't even plan this, it just happened to be ambiguous... :-D)

Generating:

echo '^London$' | lt-proc -g eng.autogen.bin 
London

as before, but also:

echo '^London$' | lt-proc -g eng.autogen.bin 
London

Similarly with HFST:

echo Lontoo | hfst-lookup fin.automorf.hfst
Lontoo  Lontoo120,009995

echo 'Lontoo' | hfst-lookup fin.autogen.hfst 
LontooLontoo  120,009995
echo 'Lontoo' | hfst-lookup fin.autogen.hfst 
Lontoo Lontoo


I have a reference implementation for HFST in apes-fin[1], but I may need
help with the lt-stuff; maybe a new tool in the style of lt-trim? I'd
probably start with hard-coded optional tags, but maybe they should be in
some file, even the monodix itself?
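
A toy model of the idea, assuming a generator that is just a lookup table
from analysis strings to surface forms. This is not how lt-proc or HFST
work internally; OPTIONAL_TAGS, strip_optional and build_generator are
hypothetical names, and the tag set is hard-coded as suggested above.

```python
import re

# Hypothetical hard-coded set of tags that generation may ignore.
OPTIONAL_TAGS = {"top", "ant", "cog", "al", "org", "m", "f"}

TAG = re.compile(r"<([^<>]+)>")


def strip_optional(analysis):
    """Remove optional tags from an analysis string such as London<np><top>."""
    return TAG.sub(
        lambda m: "" if m.group(1) in OPTIONAL_TAGS else m.group(0), analysis
    )


def build_generator(entries):
    """Build a generate() function that matches with optional tags removed.

    `entries` maps full analysis strings to surface forms.
    """
    index = {}
    for analysis, surface in entries.items():
        # First entry wins if two analyses collapse to the same key.
        index.setdefault(strip_optional(analysis), surface)

    def generate(analysis):
        # Mirror the generator's habit of marking failures with '#'.
        return index.get(strip_optional(analysis), "#" + analysis)

    return generate
```

With a single entry for London<np><top><sg>, the same generator would
accept the analysis with the subclass tag, without it, or with a
different subclass. The setdefault line is where this toy loses
information: two paradigms for the same lemma that differ only in an
optional tag would need the tag to be kept.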

[1] 

 

-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Proper noun classification considered harmful

2021-02-06 Thread Flammie A Pirinen
Thank you all for a lively discussion. I'll summarise here and reply to
a few of the comments in the typical inline reply format. As a tl;dr, I
think we agree to some extent that these rich np annotation tags are
specific to language pairs and steps in the pipeline and should not be
hindering unrelated bidixes and stuff...


On Tue, Feb 02, 2021 at 11:34:40AM +0100, Kevin Brubeck Unhammer wrote:
> 
> Genders are useful when anaphora resolving / in transfer, though only on
> person names. [...] 
> The ,  and  tags are used quite a bit in the nob
> disambiguator, but not in transfer.

I think there is an endless amount of lexical information that can be
recorded that is useful for disambiguation or some intermediate step, and
that could be stored in tags as well, but most of it should not bother
e.g. a bidix that has no use for it, or all monodixes.
Traditionally a lot of this is hidden, for
example in CG and other formats, in lists/sets of lexemes.

I do think that gendering first names is getting old-fashioned and is also
unreliable; most of the super common Finnish names I can think of are used
for both genders, either locally or internationally or both, for example.

> I tend to underspecify np's in bidix:
> 
>  IranIran
>  ThielThiel
>  SarumanSaruman
>  ContrasContras

I find this would be the ideal, I even start with  tag...

> so just the monodixen need to be synced.

That's unlikely to happen for all of apertium-langs...

> The remaining problem is when the analyser gives ^Saruman$ and
> you try to send that into a generator that expects ^Saruman$.

Yes and someone else is sure that is ^Saruman$ and maybe
someone else is helpful to say that it is "m" too and... not trying to
be funny, just within last weeks I had to encode something like
'Kristus' as al, ant, ant.m, and cog for such variation in monodixes.

 
> We could perhaps use the Giellatekno solution for that, where dixen have
> RL entries that just contain  (ie., no cog/ant/al), and some
> transfer step cleans off the tags. Should be a fairly simple change, and
> it's tried and tested in giella-pairs. Since lttoolbox is used mostly
> for languages where np pardefs are small, adding the RL's is like max
> 10 extra lines; for languages requiring hfst it's probably a fairly
> simple twol or xfregex rule?
> 

Yeah, I think having an optionalise-filter for those tags in the
generator would be an ok solution. That also allows using the tags if
there is some reason for it; I can perhaps see different paradigms for
the same lemma with different semantics happening...



Héctor said:
> Let's see the example of New-York in
> French. The city is "New-York" without any article but the state in "le
> New-York". The prepositions used in both cases are different in some cases
> (which come to be often in Wikipedia texts). So, they have different
> behaviour in French. In principle, it makes sense to differentiate them in
> the monodix... although I have preferred not to innovate too much, and, as
> you suggest, I've used long def-lists in the transfer files.

This is actually a good example. In theory Finnish has a similar feature:
place names can prefer either the inner or the outer locative case system.
Does this mean that every monodix in apertium should contain a tag on
np.top's for  or ? Of course, language-specific
details are indeed best encoded in e.g. lists and sets.


Bernard said:


> So, an alternative possibility should be to add extra files in language
> branch for when this language is the target language. These files (wordlists)
> could be used in tranfer without making more complicated bidixes. So, the
> same file could be written once and used in a lot of languge pairs. But if
> the wordlist is long, I don't know if that would degrade transfer speed
> performance compared to adding this information in any bidix fo which it is
> useful.

This is what I do with omorfi: a lot of people implementing various
applications have needed different lexical tags, so these can optionally
be joined into the analyses.

-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




[Apertium-stuff] Proper noun classification considered harmful

2021-02-01 Thread Flammie A Pirinen
Hi all,

I've written a handful of apertium-fin-* prototypes and I usually end up
spending way too much time with all the useless subclasses of proper
nouns we have (cogs, ants, als, tops, orgs, and to top all that,
sometimes ms and fs for some extra (mis)gendering). Could we just get
rid of those, or does someone have a good use for them? Most of the time
it's very random anyway and we aren't really doing NER or anything.
I think if these are used in e.g. CG or whatever, we should probably have
a different way of introducing them that doesn't interfere with
analysis-generation stuff, as we discussed in passing in the last
apertium zoom meeting? Or is there some smart way to bypass them that I
haven't thought of (probably)?


-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] An easy tool to report bad translations and propose alternatives

2020-12-05 Thread Flammie A Pirinen
On Sat, Dec 05, 2020 at 12:28:14PM +0300, Hèctor Alòs i Font wrote:
> A Sardinian collaborator commented to me: "Wouldn't it be possible that
> every time there are more possible translations these come out in a little
> window where the user chooses the right solution, as in spell checkers"?

> This could be an idea for a GSoC tool project. Nevertheless, I don't think
> that, as he puts it, this is the best option because, in general, we have
> few multiple options in the bilingual dictionaries. Probably, another type
> of interface would be more appropriate. Is there anything done in the GSoC
> projects that could be used?

Which app are we talking about here? I proposed this kind of thing for
the apertium webpage for GSoC with regard to untranslated words,
and as far as I understood it is done but not enabled? I think it could
be used for this kind of thing as well? I think a "suggest a better
translation" kind of feature there, which can optionally be populated
with the alternative translations, would be quite ok for this.
Google Translate has had some similar stuff too in the past.

One of the good things about stuff like this, as I envision it, is that
we can collect a corpus of good and bad translations that can be useful
for lots of research projects.


-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] let's move the mailing lists to sourcehut

2020-09-21 Thread Flammie A Pirinen
On Sun, Sep 20, 2020 at 07:44:25PM -0700, Samuel Sloniker wrote:
> I've also thought about the possibility of a forum?

Forums can be nice, but I think they have a different function than a
mailing list.

> Discourse looks nice.
> 

Is this discourse.mozilla.org? I really dislike that; last time I was
forced to move a bug report from github issues to discourse, and it
managed to be worse than github issues for discussing.

-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] let's move the mailing lists to sourcehut

2020-09-20 Thread Flammie A Pirinen
On Sun, Sep 20, 2020 at 06:04:35PM +0100, Francis Tyers wrote:
> Sourcehut is a free/open-source "forge" type thing run by Drew DeVault. They
> have
> mailing lists.
> [...] 
> What do people think?

Excellent idea. One suggestion I have is to make sure they get archived
as well as usable through e.g. gmane. Smaller services and self-hosted
lists can sometimes totally disappear from the internet..

-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Fixing Phonological Processes

2020-09-13 Thread Flammie A Pirinen
On Fri, Sep 11, 2020 at 03:18:44PM +0200, Zanga Chimombo wrote:
> Hello again,
> 
> I've had a bit of time to continue looking at this. I've copied over
> something from:
> https://github.com/apertium/apertium-lin/blob/master/apertium-lin.lin.twol
> 
> %{K%}:k <=> :n :0 _ .#. ;
> 
> But it's not working yet and I am not sure how to debug it. Is there
> an intro to twol online?

I think the historical documents from Xerox at fsmbook.com (click on
newSoftware and agree to the terms) and the original dissertation by
Prof. Koskenniemi are quite good for understanding the background.


-- 
Tommi A Pirinen, Computational Linguist, Software engineer, etc.,
Norges arktiske Universitet, Divvun, giellatekno. President of ACL
SIGUR SIG for Uralic languages. I tend to follow inline-posting style
in desktop e-mail messages.




Re: [Apertium-stuff] Apertium's Wider Use & Secondary Tags

2020-06-13 Thread Flammie A Pirinen
On Sat, Jun 13, 2020 at 04:50:48PM +0100, Francis Tyers wrote:
> El 2020-06-13 15:20, Tino Didriksen escribió:
> > I would like everyone to read and seriously consider this thread and
> > give your opinion. This meanders a bit, so please read it all.
> > 
> 
> Here is a non-exhaustive list of potential pitfalls of using the "surface
> form is a tag" thing. As far as I understand the objective is to be able to
> put the original surface form in the output translation as an unknown token
> instead of the lemma.

Yeah, in practice the purpose is to eventually let any module anywhere
use the surface form for anything; this includes giving the option to
print *surfaceform or @lemma without hacking the dictionaries.

> 0) languages without spaces in the writing system:
> 
>what is a surface form here? is it just the longest token matched?

I always very naively thought that when we talk about surface forms we
talk about the span of text in the original source input that the analysis
concerns. There shouldn't be complications to this because the source is
always simply there. Compare to the surface form (FORM) field of CoNLL-U
and its validations.

> 1) compounds
> 
> i)  infrastruktuurontwikkelingsplan, does each part of the compound get
> the surface form tag? if so, one happens if one part of the compound
> is translated but the other parts aren't, e.g. would you get
> *infrastruktuurontwikkelingsplan *infrastruktuurontwikkelingsplan plan?

Once all of this is stored in the stream, the linguist can choose
whichever output is good. Before that, there will be no
regressions in the streams, and that is verified by comprehensive
testing.

> 2) contractions
> 
> i)  chawe - if you attach the surface form to both and both are unknown, do
> you get both in the output? if you only attach it to one, which one do
> you
> attach it to, where is that decision made?
> 
> ii) dárselo - if you attach the surface form to the clitic pronouns in
> addition
>to the verb, what happens if the verb is not in the dictionary but the
> clitic
>pronouns are? do you get the surface form and the translations in the
> output?

I guess I'm starting to see where you predict the problems will be: with
the already slightly dodgy multitoken word features (subwords?) between
apertium and cg streams?

To the question of what happens, I'd want the answer to be that after the
implementation we will by default have the same output as before, and
enough information in the streams for the linguist to make informed
decisions on what to output if they want to output something nicer. I
mean, even with Finnish enclitic particles the answer depends on the
particle.

If there is a limitation in the current stream format ideas preventing
this, we should probably make a test case example of it. I feel like we
can output many good versions with the current idea, but I haven't played
it through on paper.


-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] How useful is eliminating trimming for language developers?

2020-05-25 Thread Flammie A Pirinen
On Mon, May 25, 2020 at 03:10:28PM +0530, Tanmai Khanna wrote:

> *Disadvantages:*
> 1. The monodix has some erroneous analyses - wrong surface forms, wrong
> analyses, or even MWEs that aren't really MWEs and can be translated word
> by word. These are currently removed since bidixes are more carefully
> maintained. If trimming is eliminated, and none of the analyses of a word
> are in the bidix, then one of the analyses will be chosen, and there is a
> chance that it is erroneous. If it's an MWE that doesn't exist in the
> bidix, it won't be translated word by word even though that was ok.

I think the only argument here is that we want to keep having bad stuff
in monodixes. From a software engineering standpoint I find this argument
really problematic: hindering further development of systems because we
want to keep bad, low-quality data in monodixes is not good. As I've
curated and maintained a bunch of stuff though, I can relate to the
sentiment; linguistic data collection, including dictionaries, is not
really a software project that will have a complete and correct version
1.0. But I do think apertium needs to move towards more quality control
and more continuous testing for monodixes, especially for the esteemed
release-quality languages. However, I don't think I have seen a non-MWE
example of how we lose the ability to reconstruct the trimmed output of
the whole pipeline. I am still under the impression that we mostly just
add data to the stream, so there will be more possibilities than before
to programmatically output either the input surface forms, the bad
monodix lemma, or something else?

> 2. If your monodix is used by lots of other pair developers, you don't want
> *your* pair to get messed up because someone somewhere decided "take
> precautions" should be an MWE, and suddenly where your old output had "ta
> forholdsregler" you now get "*take precautions".
> - Unhammer

This MWE problem I do agree is bad and relevant. I usually develop with
a hacked untrimmed setup by default, and doing eng→fin was not
particularly fun like that. My solution would be to never add "take
precautions" to apertium-eng, and to have more automated and social
control over it; that's how other software projects keep codebases clean
enough, but I don't know how to go about it here.

Wasn't there a "separable"-based solution that looked good though?

> 3. Having trimming gives the ability to control the monodix using the bidix
> in your language pair. This ability isn't lost, because we're still
> weighting the monodix, but if the bidix has none of the analyses for a
> word, earlier it was discarded and now it will be retained.

We can still discard it with just a bit more hacking, surely? Am I
missing something here? The stream will contain the surface form as well
as the bad monodix analysis, plus information, however encoded (nonzero
weight, secondary tag, etc.), that it wasn't in the bidix?
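
A small sketch of that argument, using plain dictionaries instead of the
real stream format: if every analysis is kept and merely flagged with
whether the bidix knows it, the old trimmed output can still be
reconstructed afterwards. All names and the in_bidix flag are
hypothetical stand-ins for whatever encoding ends up being used, and MWE
segmentation is deliberately left out.

```python
def trim(analyses, bidix_lemmas):
    """Old behaviour: drop analyses whose lemma the bidix does not have."""
    return [a for a in analyses if a["lemma"] in bidix_lemmas]


def annotate(surface, analyses, bidix_lemmas):
    """Untrimmed behaviour: keep everything, record bidix membership."""
    flagged = [dict(a, in_bidix=a["lemma"] in bidix_lemmas) for a in analyses]
    return {"surface": surface, "analyses": flagged}


def emulate_trimmed_output(unit):
    """Recover what the trimmed pipeline would have passed on.

    Falls back to the unknown-word form (*surface) when no analysis
    is known to the bidix.
    """
    kept = [a["lemma"] for a in unit["analyses"] if a["in_bidix"]]
    return kept if kept else ["*" + unit["surface"]]
```

The point of the sketch is only that discarding becomes a late,
per-pair decision rather than something baked into the compiled
analyser.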

> 4. Weighting the monodix will take more compile time than just trimming it.

Some numbers would be interesting. I think both are quite heavy, and we
don't do much further processing in finite-state algebra (/hfst space),
so the weighted models won't blow up. In any case, people seem to be
happy in 2020 to wait 70 hours for some neural stuff; a few minutes for
weighted automata won't be too bad ;-)


-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Secondary Tag Prefixes

2020-05-10 Thread Flammie A Pirinen
On Fri, May 08, 2020 at 04:50:45PM +0200, Tino Didriksen wrote:
> For khannatanmai's GSoC project, secondary tags will be implemented in a
> backwards compatible manner. That it in itself indisputable. But, there is
> a question of how the initial batch of secondary tags should look.
> 
> I feel they should be in the form of , as in a very short textual
> lower-case prefix, followed by :, followed by whatever value there is. Or
> even an upper-case prefix, as in  or .
> 
> spectie wants symbol prefixes in the form of <%:cdefg>.

I feel like this is just a bikeshed[0] issue, but since I want this
project to succeed I'll give my 2 cents / rants:

I don't personally find the apertium stream format readable. If I need to
make sense of it I will have to preprocess a lot anyway, enough that
I'd say the apertium stream format needs visualisation scripts to be
readable. It's not very hard to have dev scripts for this. That being
said, I don't find the apertium stream format very machine-readable
either: with regexes you need tons of escapes and double escapes, and with
programming languages... well, you have to use regexes, because it's not
a standard format with a readily available parsing library, or a format
neatly designed for python split() or C strtok, or so... I'm fine with
either special symbols or strings for whatever. As a purely personal
preference I've been pro feature=value even before UD times, but that's
not important; as long as stuff is handleable with grep and sed without
convoluted expressions it's all good, no? To that goal, on the question
of having a known set of prefixes: I have always been of the opinion that
any mature release-quality apertium stuff would follow the tags
documentation on the wiki[1], and I would expect something similar to be
true for prefixes as well.

One side note: I think there is a level of abstraction we often overlook
in these developments. A part of the language data developer base will
probably interact with these secondary things through the XML formats, if
I understand correctly? Surely one of the things that can be done,
regardless of what kind of stream format representation the secondary
stuff has, is to make the XML format part more self-documenting and the
stream format more readable? And eventually one could imagine
tooling and visualisations or whatnot to support whatever readable
and parsable formats, if enough stuff is in the XML sources.

So tl;dr: just pick whatever greppable stuff for the apertium stream format.
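
As an illustration of "greppable", a sketch that parses one simple lexical
unit and separates ordinary tags from prefixed secondary tags, assuming
the "short lower-case prefix, then a colon, then the value" shape from
Tino's description. The sf prefix is invented for the example, and the
regexes do not handle escapes, ambiguity (/) or joined units (+).

```python
import re

# ^lemma<tag><tag>...$ with no escapes, no '/' ambiguity, no '+' joins.
UNIT = re.compile(r"\^([^<$]*)((?:<[^<>]*>)*)\$")
TAG = re.compile(r"<([^<>]*)>")
SECONDARY = re.compile(r"^([a-z]+):(.*)$")


def parse_unit(unit):
    """Return (lemma, primary_tags, secondary_tags) for one lexical unit."""
    match = UNIT.fullmatch(unit)
    if match is None:
        raise ValueError("not a simple lexical unit: " + unit)
    lemma, tag_string = match.groups()
    primary = []
    secondary = {}
    for tag in TAG.findall(tag_string):
        prefixed = SECONDARY.match(tag)
        if prefixed:
            secondary[prefixed.group(1)] = prefixed.group(2)
        else:
            primary.append(tag)
    return lemma, primary, secondary
```

With a fixed prefix shape like this, the grep equivalent stays a single
short pattern, which is the property being asked for; a symbol prefix
would work the same way as long as it needs no escaping.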

[0] 
[1] 

-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Registration for wiki page

2020-03-23 Thread Flammie A Pirinen
On Mon, Mar 23, 2020 at 04:46:06PM +0530, Ayush wrote:
> Dear sir,
> Actually I have quite reached nowhere while going through the lttoolbox. Can 
> you please help me with making of schedule for the proposal and also what all 
> thinks I would be working under for the task of robust tokenisation. I know 
> that I have to update lttoolbox to be fully Unicode but how?

Hi,
the lttoolbox part of the code is also not my area of
expertise, and it would be a good thing for the application to recruit a
co-mentor or advisor who knows lttoolbox internals. That said, I would
suggest starting by figuring out just the user's point of view of
tokenisation at the moment: take a handful of languages from the current
apertium set, e.g. English, Finnish, Kazakh, Norwegian, German, and
maybe some script without spaces if there are any. Find test cases
showing how they work currently and where they could improve, and approach
the GSoC schedule as a test-driven software engineering project. It may be
hard to spread such a schedule over a three-month timeline, but when you
have uncovered some targets like this we can discuss which additional
steps are likely to take time.

-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Registration for wiki page

2020-03-20 Thread Flammie A Pirinen
On Fri, Mar 20, 2020 at 11:34:22PM +0530, Ayush wrote:
> Dear sir/ma’am,
> This is to inform that I have successfully completed and submitted my 
> solution to coding task as assigned under the robust tokenisation.
> Link for my solution to challenging task – 
>  https://github.com/git-ayush-pradhan/Apertium_gsoc
> I would like to know, if I am eligible for registration for my wiki page so 
> that I can present my ideas for the robust tokenisation and requesting for 
> the same.
> Thanking you and regards,
> Ayush Pradhan

Hi Ayush, I checked the coding challenge and it seems to work; I'll try
to check in more detail later. The coding style is however a bit
suboptimal. Apertium doesn't have an official style guide or anything,
but it would be desirable to stick to one style per project; check e.g.
clang-format and some style guide.

I've also added some more content to the idea page, maybe it's clearer now.

Please go ahead with learning how tokenisation in the apertium pipeline
works at the moment (inconditionals, dictionary sections etc.), and with
planning the proposal and schedule.


-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Willingness to participate in the project

2020-03-03 Thread Flammie A Pirinen
Hi,

I am on holiday this week with low internet availability, so only a few
quick points. Firstly, I strongly recommend joining the #apertium IRC
channel; I think even non-mentors will have useful clues. For the
tokenisation problem, I think the main resources to understand are the
various Unicode technical reports that describe tokenisation, a C++
library like ICU, and then how Apertium currently does tokenisation and
how this project's code will interact with it. Especially for the last
point, many other people on IRC know it better than me.

Regards,

On Thu, Feb 27, 2020 at 01:45:09PM +0800, 杨伟哲 wrote:
> Hi Francis and Flammie,
> 
> I’m interested in the “Robust tokenisation in lttoolbox”[1] GSoC project.
> And currently I’m writing the proposal.
> 
> I have completed the code challenge listed in the project, which has been
> put on Pastebin[2]. However, I’m not quite clear where this project starting
> with. And I will be much appreciate if you could list somewhere (e.g. GitHub
> repo related to this project) for me to get started with. I will also try to
> learn and solve issues there if possible.
> 
> Bio: I’m Chinese undergraduate in Software Engineering. In my freshman
> year, I joined the high-performance computing center[3] of the university
> as a research assistant. Through research and learning during the period,
> I have a deep understanding of software architecture and open source
> projects.
> 
> 
> [1]
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Robust_tokenisation
> 
> [2] https://github.com/GavinWz/Apertium
> 
> [3] http://cs.wfu.edu.cn/2014/0603/c1227a33048/page.htm
> 
> 
> Regards,
> 
> Weizhe Yang




-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] GSoC 2020 Ideas Page

2020-02-21 Thread Flammie A Pirinen
On Fri, Feb 21, 2020 at 03:10:40PM +0100, Tino Didriksen wrote:
> Apertium is in GSoC 2020!
> 
> Time to update the
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code page.
> 
> What projects were actually completed to the mentors' satisfaction last
> year?

I think at least unsupervised weighting is complete enough that it
cannot be extended under the current description. That's the only task
I know well, but I think recursive transfer and the Python API were
also fairly finished?


-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Lexd: a transducer compiler for prefixes and stuff

2020-02-05 Thread Flammie A Pirinen
On Tue, Feb 04, 2020 at 12:55:55PM -0500, Daniel Swanson wrote:
> > Do you have plans on doing tests
> > on runtime efficiency, i.e. how fast it is to run the automata on texts?
> > One thing that we found with flag diacritics on lexc is that it's
> > kind of possible to abuse them to optimise the compiled stuff and it'd
> > probably be interesting to see here too, I see there's something with
> > flags in the code already?
> 
> It can compile with or without flag diacritics, though the flag mode was
> mostly an afterthought and I haven't really tested it yet.

Yeah so if the flag mode stuff works it can be interesting for testing
if flags optimise certain morphotactics or not.

> For non-flag runtimes, the transducers should be the same as lexc + twoc,
> apart from alignment differences (a:b c:0 vs a:0 c:b) and state numbers, so
> I assumed it would have the same performance, but maybe I should double
> check.

Yeah, the --align option of hfst-lexc is there because alignment
differences are bad in the worst cases, especially as lexc is usually
followed by further processing. I haven't checked the theory, but from
experience I'd estimate it can get exponentially worse; it certainly
made some bigger languages uncompilable.

In the end though this is all usually only noticeable with rather large
old language models like Finnish or North Sámi.
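
To make the alignment point concrete, here is a minimal Python sketch. It
is not hfst or lexd code; the function name project and the pair lists are
made up for illustration. It only shows that a:b c:0 and a:0 c:b spell out
the same string pair while being different sequences of symbol pairs, which
is why the compiled transducers can differ in structure while encoding the
same relation:

```python
EPSILON = "0"  # lexc-style notation for the empty symbol


def project(pairs):
    """Return the (input, output) strings spelled out by a path of symbol pairs."""
    upper = "".join(u for u, _ in pairs if u != EPSILON)
    lower = "".join(l for _, l in pairs if l != EPSILON)
    return upper, lower


# Two alignments of the same mapping "ac" -> "b" (illustrative data only)
left_aligned = [("a", "b"), ("c", EPSILON)]   # a:b c:0
right_aligned = [("a", EPSILON), ("c", "b")]  # a:0 c:b

# Same relation on strings, but different paths through the transducer
same_relation = project(left_aligned) == project(right_aligned)
same_path = left_aligned == right_aligned

print(project(left_aligned), same_relation, same_path)
```

Anything downstream that works on symbol pairs rather than on whole strings
(composition with rules, for instance) sees the two paths as different, so
the choice of alignment can matter for the size of the result.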

-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)




Re: [Apertium-stuff] Apertium Python Module Names

2019-10-13 Thread Flammie A Pirinen
On Sun, Oct 13, 2019 at 12:27:10PM +0200, Tino Didriksen wrote:
> https://www.debian.org/doc/packaging-manuals/python-policy/ch-module_packages.html#s-package_names
> 
> The package python3-apertium must provide the Python module apertium, but
> it provides apertium_core. I can fix this by either adding an alias
> apertium.py with 'from apertium_core import *' or by renaming the package
> to python3-apertium_core.

I'd prefer a package name with just apertium. I don't use Debian-based
systems that often, and whenever I have to, I always struggle to find
packages more than I should :-)
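
For reference, the alias option would behave roughly like the sketch below.
This is only an illustration under assumptions: the real apertium_core is a
compiled module, so a stand-in module with a hypothetical translate function
is created here, and the alias file is simulated with exec instead of being
written to disk:

```python
import sys
import types

# Stand-in for the real compiled module; translate() is a hypothetical
# function used only to have something to re-export.
core = types.ModuleType("apertium_core")
core.translate = lambda text: text.upper()
sys.modules["apertium_core"] = core

# The one line an alias file apertium.py would contain
alias_source = "from apertium_core import *\n"

# Simulate importing that file as the module "apertium"
alias = types.ModuleType("apertium")
exec(alias_source, alias.__dict__)
sys.modules["apertium"] = alias

import apertium

print(apertium.translate("hei"))
```

Note that a star import skips names starting with an underscore unless the
core module defines __all__, so an alias like this only re-exports the
public names.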

> (same issue with python3-cg3 module constraint_grammar, and python3-hfst
> module libhfst)

I think the HFST Python bindings provide a module named hfst nowadays.

-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)

