Re: [Apertium-stuff] Tagset Standardization

2023-03-07 Thread Daniel Swanson
Yes, most of our tools assume that tags are position independent, but
I've come across a handful of languages that treat some tags as
position dependent, and I was more hoping to make it official to make
it less likely that we run into issues with that.

Also, I have an idea for how to make a version of lt-proc -g that
accepts the tags in any order, which might be helpful for reducing
generation errors, though it may turn out to be too much of a slowdown
for production.
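
A rough sketch of what order-independent generation could look like, written against a plain Python dictionary and not against a real transducer. Everything here is an assumption for illustration: the names `canonical_key`, `build_generator` and `generate`, and the two-entry lexicon, are hypothetical. This is not how lt-proc works internally.

```python
import re

# Matches one Apertium-style tag such as <n> or <pl>.
TAG = re.compile(r"<([^<>]+)>")

def canonical_key(lexical_unit):
    """Return (lemma, frozenset of tags), so tag order is ignored."""
    lemma = lexical_unit.split("<", 1)[0]
    return lemma, frozenset(TAG.findall(lexical_unit))

def build_generator(entries):
    """Index (analysis, surface form) pairs by their order-free key."""
    return {canonical_key(analysis): surface for analysis, surface in entries}

def generate(generator, lexical_unit):
    """Look up a surface form; return None when nothing matches."""
    return generator.get(canonical_key(lexical_unit))

# Hypothetical lexicon, for illustration only.
entries = [("house<n><pl>", "houses"), ("house<n><sg>", "house")]
gen = build_generator(entries)
print(generate(gen, "house<pl><n>"))  # houses
```

Using a set as the key only works if every tag really is position independent: repeated tags collapse into one and any meaning carried by order is lost, which is the case the languages mentioned above would break.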

Daniel

On Tue, Mar 7, 2023 at 1:58 PM Kevin Brubeck Unhammer  wrote:
>
> Daniel Swanson wrote:
>
> > To be clear, I meant splitting  into .
>
> 
>
> > One of my ideals for the tagset is that every tag be
> > position-independent, so that the only reason I need to care about
> > order is because of FST topology (and maybe not even then).
>
> Aren't the tags themselves already position-independent? Both CG and to
> a certain extent transfer assume that.
> ___
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff




Re: [Apertium-stuff] Tagset Standardization

2023-03-07 Thread Kevin Brubeck Unhammer
Daniel Swanson wrote:

> To be clear, I meant splitting  into .

 

> One of my ideals for the tagset is that every tag be
> position-independent, so that the only reason I need to care about
> order is because of FST topology (and maybe not even then).

Aren't the tags themselves already position-independent? Both CG and to
a certain extent transfer assume that.




Re: [Apertium-stuff] Tagset Standardization

2023-03-07 Thread Daniel Swanson
On Tue, Mar 7, 2023 at 6:07 AM Kevin Brubeck Unhammer  wrote:
>
> Daniel Swanson wrote:
>
> > Greetings Apertiumers!
> >
> > This morning I set out to change the Ancient Hebrew analyzer from
> > Latin script to Hebrew script (a task I don't wish upon anyone) and in
> > the process produced a search-and-replace tool that understands the
> > structure of several of our source files:
> > https://github.com/mr-martian/apertium-grep
>
> Awesome!
>
> > This script could, without too much trouble, be expanded to cover the
> > rest of our source files, at which point I would like to propose that
> > we move towards greater standardization of our tagset:
> > https://wiki.apertium.org/wiki/List_of_symbols
> >
> > At minimum, I would like to deal with some of the duplicate tags, like
> > impf/imperf, rec/res, v/vblex, pass/pasv, etc.
>
> That would be great! I'll put in a vote for pasv right now.
>
> > My preference would be that we also consider splitting compound tags,
> > like the tense+mood (fti, fts, pii, pis) and maybe possessor and
> > subject tags (px1sg, s_1sg).
>
> It makes sense to split tense and mood, as well as number and person,
> but I doubt it can be done automatically – it will require careful
> changes to CG and transfer. Might make sense to try it on one language
> pair along with the maintainer and see how it goes.
>
> It would be very dangerous to turn  into  – that would
> break lots of CG and transfer rules and possibly lead to more complexity
> in tag matching since you now have to always check for the existence of
>  wherever you check for  etc.

To be clear, I meant splitting  into .

One of my ideals for the tagset is that every tag be
position-independent, so that the only reason I need to care about
order is because of FST topology (and maybe not even then).

Daniel




Re: [Apertium-stuff] Tagset Standardization

2023-03-07 Thread Kevin Brubeck Unhammer
Daniel Swanson wrote:

> Greetings Apertiumers!
>
> This morning I set out to change the Ancient Hebrew analyzer from
> Latin script to Hebrew script (a task I don't wish upon anyone) and in
> the process produced a search-and-replace tool that understands the
> structure of several of our source files:
> https://github.com/mr-martian/apertium-grep

Awesome!

> This script could, without too much trouble, be expanded to cover the
> rest of our source files, at which point I would like to propose that
> we move towards greater standardization of our tagset:
> https://wiki.apertium.org/wiki/List_of_symbols
>
> At minimum, I would like to deal with some of the duplicate tags, like
> impf/imperf, rec/res, v/vblex, pass/pasv, etc.

That would be great! I'll put in a vote for pasv right now.
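
A minimal sketch of what merging duplicate tags might look like on stream-format text. The `CANONICAL` table is an assumption: only pasv has a vote in this thread, the other two directions are picked arbitrarily for illustration, and pairs like rec/res are left out because no preferred form has been stated. `normalize_tags` is a hypothetical helper, not part of apertium-grep.

```python
import re

# Duplicate tag -> assumed canonical tag. Only "pasv" was voted for
# in the thread; the other choices are placeholders.
CANONICAL = {
    "pass": "pasv",
    "imperf": "impf",
    "v": "vblex",
}

# Matches one Apertium-style tag such as <v> or <pass>.
TAG = re.compile(r"<([^<>]+)>")

def normalize_tags(text):
    """Rewrite every known duplicate tag in text to its canonical form."""
    def replace(match):
        tag = match.group(1)
        return "<" + CANONICAL.get(tag, tag) + ">"
    return TAG.sub(replace, text)

print(normalize_tags("^sing<v><pass><imperf>$"))  # ^sing<vblex><pasv><impf>$
```

A regex like this only covers text where tags appear in angle brackets. Dictionaries, CG and transfer files write tags differently, so each would need its own structure-aware handling.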

> My preference would be that we also consider splitting compound tags,
> like the tense+mood (fti, fts, pii, pis) and maybe possessor and
> subject tags (px1sg, s_1sg).

It makes sense to split tense and mood, as well as number and person,
but I doubt it can be done automatically – it will require careful
changes to CG and transfer. Might make sense to try it on one language
pair along with the maintainer and see how it goes.

It would be very dangerous to turn  into  – that would
break lots of CG and transfer rules and possibly lead to more complexity
in tag matching since you now have to always check for the existence of
 wherever you check for  etc.

> And if we wanted to go really crazy we
> could consider a broader rewrite like changing our tags to UD-style
> feature-value pairs (so  becomes ), though I don't
> imagine we actually want to go nearly that far.

In principle I like it, but in practice I find it too noisy to be
worthwhile. Feature-value pairs are very useful if there are several
features that can have the same value, e.g. booleans Seen=true or
numbers Age=65, and for linguistics there are some use-cases like
SubjNumber vs ObjNumber, or  vs . But for the majority of the
values for most languages, the value indicates its feature (only Voice
can have the value Passive, only Case can have the value Genitive etc.).
So we'd end up with longer lines and debug output and lots of docs
needing rewriting and – most importantly – people needing retraining,
only for the sake of avoiding sticking an "apertium-to-ud" script in a
pipeline.




Re: [Apertium-stuff] Tagset Standardization

2023-03-07 Thread Flammie A Pirinen
On Mon, Mar 06, 2023 at 03:35:45PM -0500, Daniel Swanson wrote:

> This script could, without too much trouble, be expanded to cover the
> rest of our source files, at which point I would like to propose that
> we move towards greater standardization of our tagset:
> https://wiki.apertium.org/wiki/List_of_symbols
> 
> At minimum, I would like to deal with some of the duplicate tags, like
> impf/imperf, rec/res, v/vblex, pass/pasv, etc.

Yay. There's probably some ind / indic ~ indv / indef confusions too.

> My preference would be that we also consider splitting compound tags,
> like the tense+mood (fti, fts, pii, pis) and maybe possessor and
> subject tags (px1sg, s_1sg)

That's already harder to implement; people surely have strong opinions
here. Personally, I'd be used to having verbal person numbers tagged
with only one tag too, {sg,du,pl}{1,2,3} rather than two, but I can see
some languages can use separate tags for syntax etc. At least as long as
they are standard and have easy 1:n mappings, perhaps even scripts to
switch between easily, they should be workable.
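
As a rough illustration of such a 1:n mapping with a script to switch both ways, here is a sketch over plain lists of tag names. The split forms (fut, impf, ind, subj) are hypothetical placeholders, not an agreed standard, and `split_tags` / `join_tags` are made-up helper names.

```python
# Compound tag -> assumed split form (tense, mood). The right-hand
# names are placeholders for whatever the standard ends up being.
SPLIT = {
    "fti": ("fut", "ind"),
    "fts": ("fut", "subj"),
    "pii": ("impf", "ind"),
    "pis": ("impf", "subj"),
}

# Reverse table, used to go back to the compound tags.
JOIN = {parts: compound for compound, parts in SPLIT.items()}

def split_tags(tags):
    """Replace each compound tag by its parts; leave other tags alone."""
    out = []
    for tag in tags:
        out.extend(SPLIT.get(tag, (tag,)))
    return out

def join_tags(tags):
    """Merge adjacent tense+mood pairs back into one compound tag."""
    out, i = [], 0
    while i < len(tags):
        pair = tuple(tags[i:i + 2])
        if pair in JOIN:
            out.append(JOIN[pair])
            i += 2
        else:
            out.append(tags[i])
            i += 1
    return out

print(split_tags(["vblex", "fti", "p3", "sg"]))
```

Note that `join_tags` only merges parts that sit next to each other in the expected order, so the round trip depends on tag order even if the split tags themselves are meant to be position independent.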

> And if we wanted to go really crazy we
> could consider a broader rewrite like changing our tags to UD-style
> feature-value pairs (so  becomes ), though I don't
> imagine we actually want to go nearly that far.

Yeah, it would be ideal imo, but it would probably also have some
opposition. As long as we have a standardised tagset and a fairly simple
remapping to UD (and UniMorph would be nice too), it's not too bad.
Simple being a screenful of code in your favourite scripting language
(and a mapping table).
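
For a sense of scale, a sketch of such a remapping in Python. The tables are tiny and illustrative only: the Apertium tag names on the left are assumed to be the standardised ones, the UD values on the right are standard UD feature values, and `to_ud` is a hypothetical helper.

```python
# Apertium part-of-speech tag -> UD UPOS (small illustrative subset).
UPOS = {"n": "NOUN", "vblex": "VERB", "adj": "ADJ"}

# Apertium tag -> (UD feature, UD value) (small illustrative subset).
FEATS = {
    "sg": ("Number", "Sing"),
    "pl": ("Number", "Plur"),
    "nom": ("Case", "Nom"),
    "gen": ("Case", "Gen"),
    "pasv": ("Voice", "Pass"),
}

def to_ud(tags):
    """Map a list of Apertium tags to (UPOS, FEATS string, unmapped tags)."""
    upos = "X"  # UD's tag for "other", used when no POS tag is recognised
    feats = {}
    unmapped = []
    for tag in tags:
        if tag in UPOS:
            upos = UPOS[tag]
        elif tag in FEATS:
            name, value = FEATS[tag]
            feats[name] = value
        else:
            unmapped.append(tag)
    # Features are joined in sorted order, with "_" when there are none.
    feat_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return upos, feat_str, unmapped

print(to_ud(["n", "pl", "gen"]))  # ('NOUN', 'Case=Gen|Number=Plur', [])
```

Returning the unmapped tags separately makes gaps in the mapping table visible instead of silently dropping them.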


-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)

