Re: [Apertium-stuff] Gsoc proposal

2020-03-29 Thread Sushain Cherivirala
Shrey,

Your proposal covers a significant number of features, many of which APy
doesn't support at all. I suggest expanding on how that support would be
built (perhaps with references to supporting papers, etc.).

Really though, we consider your coding challenge *significantly more
important* than your proposal. I suggest sending PRs to either apertium-apy
or apertium-html-tools if you're interested in this project. Your proposal
will not receive any attention without a serious coding challenge attempt
(ideally a merged PR or two).

On Sun, Mar 29, 2020 at 12:29 PM Shrey Modi  wrote:

> Hello Guys and mentors (sushain and friespeaker)
> Please have a look at my proposal and suggest any improvements.
> http://wiki.apertium.org/wiki/User:Shrey1608
> ___
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>


Re: [Apertium-stuff] New release for apertium-fra-cat & por-cat

2020-03-29 Thread Sushain Cherivirala
> The issue was only on a "_" that was used initially. I dropped it, but I
> didn't know that numbers too were not allowed. Thanks for the patching!

Yeah, I think we didn't fully investigate your funky variant :) Excluding
numbers wasn't really intentional. I think including underscores could
cause some downstream issues, so thanks for dropping it!

On Sun, Mar 29, 2020 at 1:47 PM Hèctor Alòs i Font 
wrote:

> Message from Sushain Cherivirala on Sun, 29 March 2020 at 22:24:
>
>> Hèctor,
>>
>> > cat-por_PTpre1990
>>
>> I think that https://github.com/apertium/apertium-apy/issues/141 is
>> relevant here. That particular variant isn't working since it has numbers
>> in it. I hot-patched the regex and it seems to work now:
>> https://www.apertium.org/apy/listPairs. Committed to master:
>> https://github.com/apertium/apertium-apy/commit/2f391281532273cc2d229f645f0e5fdf30cfbe7f
>> .
>>
>
> The issue was only on a "_" that was used initially. I dropped it, but I
> didn't know that numbers too were not allowed. Thanks for the patching!
>
>> This also needed a tweak to the html-tools config to allow the variant. It
>> looks like apertium.org picks it up now.
>>
>
> Yes, it is working! Thanks!
>
>
>> > By the way, "PTpre1990" stands for "European Portuguese (traditional
>> orthography)". It should probably be added to the interface tags, but I
>> don't know where is the file in github where the meaning should be added in
>> several languages.
>>
>> You're looking for this file:
>> https://github.com/apertium/apertium-apy/blob/master/language_names/variants.tsv
>>
>> Feel free to send a pull request my way. We'll have to upgrade APy to
>> pick it up on the other end. I can cut a quick release for that after
>> you've updated it.
>>
>> On Sat, Mar 28, 2020 at 12:06 PM Hèctor Alòs i Font 
>> wrote:
>>
>>> The new release of apertium-fra-cat is already available in apertium.org.
>>> Thanks!
>>> But I'm not sure the new version of apertium-por-cat is. At least modes
>>> cat-por_BR and cat-por_PTpre1990 are not yet. Could someone take a look,
>>> please?
>>> By the way, "PTpre1990" stands for "European Portuguese (traditional
>>> orthography)". It should probably be added to the interface tags, but I
>>> don't know which file on GitHub is where the meaning should be added in
>>> several languages.
>>>
>>> Hèctor
>>>
>>> Message from Hèctor Alòs i Font on Sat, 21 March 2020 at 0:07:
>>>
>>>> Thanks a lot, Tino!
>>>>
>>>> Message from Tino Didriksen on Fri, 20 March 2020 at 23:13:
>>>>
>>>>> Finally got around to packaging fra-cat and por-cat.
>>>>> https://github.com/apertium/apertium-packaging/issues/26 has links to
>>>>> exact commits and release tags.
>>>>>
>>>>> Pushed to Debian:
>>>>> - https://salsa.debian.org/science-team/apertium-fra-cat v1.8.0
>>>>> - https://salsa.debian.org/science-team/apertium-pt-ca v0.10.0
>>>>>
>>>>> And upgraded public APy instance.
>>>>>
>>>>> -- Tino Didriksen
>>>>>
>>>>>
>>>>> On Sat, 8 Feb 2020 at 10:23, Hèctor Alòs i Font wrote:
>>>>>
>>>>>> A new release of apertium-fra-cat is ready to be packaged.
>>>>>>
>>>>>> It mostly contains many new translations in the bidix (more than
>>>>>> 15,000). Besides:
>>>>>> - disambiguation has been improved, especially for French
>>>>>> - dozens of new lexical selection rules have been added
>>>>>> - dozens of new transfer rules have been added, especially for the
>>>>>> cat-fra side
>>>>>>
>>>>>> In any case, the main problem of this language pair is that it is
>>>>>> still using just one step in the transfer. This makes it impossible to
>>>>>> reach 20% WER, especially on the cat-fra side, where quite a lot of
>>>>>> words have to be added or reordered. Unfortunately, I can't find the
>>>>>> time for such a change, which requires a lot of work.
>>>>>>
>>>>>> Please, @Tino Didriksen, could you package the release?
>>>>>>
>>>>>> Hèctor
>>>>>>


Re: [Apertium-stuff] New release for apertium-fra-cat & por-cat

2020-03-29 Thread Hèctor Alòs i Font
Message from Sushain Cherivirala on Sun, 29 March 2020 at 22:24:

> Hèctor,
>
> > cat-por_PTpre1990
>
> I think that https://github.com/apertium/apertium-apy/issues/141 is
> relevant here. That particular variant isn't working since it has numbers
> in it. I hot-patched the regex and it seems to work now:
> https://www.apertium.org/apy/listPairs. Committed to master:
> https://github.com/apertium/apertium-apy/commit/2f391281532273cc2d229f645f0e5fdf30cfbe7f
> .
>

The issue was only on a "_" that was used initially. I dropped it, but I
didn't know that numbers too were not allowed. Thanks for the patching!

> This also needed a tweak to the html-tools config to allow the variant. It
> looks like apertium.org picks it up now.
>

Yes, it is working! Thanks!


> > By the way, "PTpre1990" stands for "European Portuguese (traditional
> orthography)". It should probably be added to the interface tags, but I
> don't know where is the file in github where the meaning should be added in
> several languages.
>
> You're looking for this file:
> https://github.com/apertium/apertium-apy/blob/master/language_names/variants.tsv
>
> Feel free to send a pull request my way. We'll have to upgrade APy to pick
> it up on the other end. I can cut a quick release for that after you've
> updated it.
>
> On Sat, Mar 28, 2020 at 12:06 PM Hèctor Alòs i Font 
> wrote:
>
>> The new release of apertium-fra-cat is already available in apertium.org.
>> Thanks!
>> But I'm not sure the new version of apertium-por-cat is. At least modes
>> cat-por_BR and cat-por_PTpre1990 are not yet. Could someone take a look,
>> please?
>> By the way, "PTpre1990" stands for "European Portuguese (traditional
>> orthography)". It should probably be added to the interface tags, but I
>> don't know which file on GitHub is where the meaning should be added in
>> several languages.
>>
>> Hèctor
>>
>> Message from Hèctor Alòs i Font on Sat, 21 March 2020 at 0:07:
>>
>>> Thanks a lot, Tino!
>>>
>>> Message from Tino Didriksen on Fri, 20 March 2020 at 23:13:
>>>
>>>> Finally got around to packaging fra-cat and por-cat.
>>>> https://github.com/apertium/apertium-packaging/issues/26 has links to
>>>> exact commits and release tags.
>>>>
>>>> Pushed to Debian:
>>>> - https://salsa.debian.org/science-team/apertium-fra-cat v1.8.0
>>>> - https://salsa.debian.org/science-team/apertium-pt-ca v0.10.0
>>>>
>>>> And upgraded public APy instance.
>>>>
>>>> -- Tino Didriksen
>>>>
>>>>
>>>> On Sat, 8 Feb 2020 at 10:23, Hèctor Alòs i Font wrote:
>>>>
>>>>> A new release of apertium-fra-cat is ready to be packaged.
>>>>>
>>>>> It mostly contains many new translations in the bidix (more than
>>>>> 15,000). Besides:
>>>>> - disambiguation has been improved, especially for French
>>>>> - dozens of new lexical selection rules have been added
>>>>> - dozens of new transfer rules have been added, especially for the
>>>>> cat-fra side
>>>>>
>>>>> In any case, the main problem of this language pair is that it is
>>>>> still using just one step in the transfer. This makes it impossible to
>>>>> reach 20% WER, especially on the cat-fra side, where quite a lot of
>>>>> words have to be added or reordered. Unfortunately, I can't find the
>>>>> time for such a change, which requires a lot of work.
>>>>>
>>>>> Please, @Tino Didriksen, could you package the release?
>>>>>
>>>>> Hèctor
>>>>>


[Apertium-stuff] Gsoc proposal

2020-03-29 Thread Shrey Modi
Hello Guys and mentors (sushain and friespeaker)
Please have a look at my proposal and suggest any improvements.
http://wiki.apertium.org/wiki/User:Shrey1608


Re: [Apertium-stuff] New release for apertium-fra-cat & por-cat

2020-03-29 Thread Sushain Cherivirala
Hèctor,

> cat-por_PTpre1990

I think that https://github.com/apertium/apertium-apy/issues/141 is
relevant here. That particular variant isn't working since it has numbers
in it. I hot-patched the regex and it seems to work now:
https://www.apertium.org/apy/listPairs. Committed to master:
https://github.com/apertium/apertium-apy/commit/2f391281532273cc2d229f645f0e5fdf30cfbe7f
.
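
A minimal sketch of the kind of pattern involved, assuming a mode name of
the form src-trg with an optional _Variant suffix. Both patterns and the
helper parse_mode are hypothetical and are not copied from apertium-apy;
they only illustrate why a variant class limited to letters rejects
PTpre1990 while one that also admits digits accepts it.

```python
import re

# Hypothetical patterns, NOT the regex used in apertium-apy.
# Language codes: two or three lowercase letters; variant follows an underscore.
LETTERS_ONLY = re.compile(r"^([a-z]{2,3})-([a-z]{2,3})(?:_([A-Za-z]+))?$")
WITH_DIGITS = re.compile(r"^([a-z]{2,3})-([a-z]{2,3})(?:_([A-Za-z0-9]+))?$")


def parse_mode(name, pattern):
    """Return (source, target, variant), or None if the name does not match."""
    match = pattern.match(name)
    return match.groups() if match else None


if __name__ == "__main__":
    for mode in ("cat-por", "cat-por_BR", "cat-por_PTpre1990"):
        print(mode, parse_mode(mode, LETTERS_ONLY), parse_mode(mode, WITH_DIGITS))
```

Neither pattern admits a second underscore inside the variant, which matches
the concern about underscores causing downstream issues.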

This also needed a tweak to the html-tools config to allow the variant. It
looks like apertium.org picks it up now.

> By the way, "PTpre1990" stands for "European Portuguese (traditional
> orthography)". It should probably be added to the interface tags, but I
> don't know where is the file in github where the meaning should be added in
> several languages.

You're looking for this file:
https://github.com/apertium/apertium-apy/blob/master/language_names/variants.tsv

Feel free to send a pull request my way. We'll have to upgrade APy to pick
it up on the other end. I can cut a quick release for that after you've
updated it.

On Sat, Mar 28, 2020 at 12:06 PM Hèctor Alòs i Font 
wrote:

> The new release of apertium-fra-cat is already available in apertium.org.
> Thanks!
> But I'm not sure the new version of apertium-por-cat is. At least modes
> cat-por_BR and cat-por_PTpre1990 are not yet. Could someone take a look,
> please?
> By the way, "PTpre1990" stands for "European Portuguese (traditional
> orthography)". It should probably be added to the interface tags, but I
> don't know which file on GitHub is where the meaning should be added in
> several languages.
>
> Hèctor
>
> Message from Hèctor Alòs i Font on Sat, 21 March 2020 at 0:07:
>
>> Thanks a lot, Tino!
>>
>> Message from Tino Didriksen on Fri, 20 March 2020 at 23:13:
>>
>>> Finally got around to packaging fra-cat and por-cat.
>>> https://github.com/apertium/apertium-packaging/issues/26 has links to
>>> exact commits and release tags.
>>>
>>> Pushed to Debian:
>>> - https://salsa.debian.org/science-team/apertium-fra-cat v1.8.0
>>> - https://salsa.debian.org/science-team/apertium-pt-ca v0.10.0
>>>
>>> And upgraded public APy instance.
>>>
>>> -- Tino Didriksen
>>>
>>>
>>> On Sat, 8 Feb 2020 at 10:23, Hèctor Alòs i Font wrote:
>>>
>>>> A new release of apertium-fra-cat is ready to be packaged.
>>>>
>>>> It mostly contains many new translations in the bidix (more than
>>>> 15,000). Besides:
>>>> - disambiguation has been improved, especially for French
>>>> - dozens of new lexical selection rules have been added
>>>> - dozens of new transfer rules have been added, especially for the
>>>> cat-fra side
>>>>
>>>> In any case, the main problem of this language pair is that it is still
>>>> using just one step in the transfer. This makes it impossible to reach
>>>> 20% WER, especially on the cat-fra side, where quite a lot of words have
>>>> to be added or reordered. Unfortunately, I can't find the time for such
>>>> a change, which requires a lot of work.
>>>>
>>>> Please, @Tino Didriksen, could you package the release?
>>>>
>>>> Hèctor



Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Just to clarify, in this original example:
^potato/patata$

case refers to capitalisation. Morphological case already has a tag, which
would be primary information so this wouldn't touch that at all. So if it
felt like we're changing the format, we're not and this would continue
to be backwards compatible. :)

Tanmai

On Sun, Mar 29, 2020 at 11:59 PM Tanmai Khanna 
wrote:

> Instead of looking at this as modifying or extending the apertium stream
> format, we could look at this as making tags more versatile by creating a
> new kind of tags which have a feature:value pair. That's all there is to
> it, really. In effect, it allows us to pass an arbitrary amount of info in
> the pipe.
>
> Tanmai
>
> On Sun, Mar 29, 2020 at 8:53 PM Tanmai Khanna 
> wrote:
>
>> Hi Mikel,
>>
>>> (0) No change should be made without proper regression testing. I think
>>> we all agree on that!
>>>
>> Definitely, and this is something I'll add in the proposal.
>>
>>> (1) I still believe that the functionality should be proven without
>>> rewriting the (critical) format parsing portions in Apertium's modules. In
>>> particular, finite-state lexical-processing modules written in C++. I'd
>>> like to hear (for instance) from Sergio first.  There is always the
>>> possibility of having a complete series of modules that allow for
>>> "extended" formats, one per current module, so that each pair works with
>>> what it needs.
>>>
>> As mentioned in the last mail, the functionality will definitely be
>> proven before even one line of code is written.
>>
>>> (2) Extending *is* changing. The payload associated to each lexical unit
>>> would be heavier and will be running down the pipe, where it is necessary
>>> and where it isn't (but see (4)). This does not come without a
>>> computational burden. It is just a matter of comparing it with the burden
>>> of working around the current modules. I would not underestimate the
>>> speed of Unix utilities, pipes, etc. This is code that has been around for
>>> 40 years.
>>>
>> I have regularly seen programs work with larger streams, but if
>> computational burden is the main issue then yes, Tino's idea of a reduced
>> reference payload works. Anyway, I'd see that as more of an optimisation
>> than a different idea.
>>
>>> (99) What's the driving force behind avoiding trimming? I confess I
>>> don't quite get it.
>>>
>> I explain this in
>> http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming#Which_of_the_published_tasks_am_I_interested_in.3F_What_do_I_plan_to_do.3F
>> , but the gist is this: if you trim away source words and their analyses
>> because they aren't present in the bidix, you lose out on better lexical
>> selection, and transfer rules that would otherwise have matched no longer
>> match because the word is now unknown. If we can avoid trimming, we can
>> improve disambiguation and transfer. Monodixes can often be improved
>> individually using several sources and methods, but none of that would
>> matter if we keep trimming all words that aren't in the bidix as well. Just
>> because we cannot translate a word shouldn't mean we can't use it for our
>> benefit.
>>
>> On a more fundamental level, this was also about not trimming away
>> information in general, and propagating important information in the pipe
>> so that each program can use it as it'd like. This would need us to have
>> the surface form in the pipe, and it felt like a more general and future
>> ready solution would be to allow arbitrary information in the pipe. For
>> this project, we can limit this to surface form so that none of the files
>> are modified, and every program works as it did before. But it gives the
>> developers a powerful option for propagating several features, as Tino
>> listed already.
>>
>> Hope this makes clear the rationale for avoiding trimming and for modifying
>> the stream. It would be great to hear from Sergio, as these discussions are
>> how we're going to make this project a success for Apertium.
>>
>> Thanks and Regards,
>> Tanmai Khanna
>>
>>
>> On 29/3/20 at 13:31, Tanmai Khanna wrote:
>>>
>>> Mikel,
>>> This is a preliminary idea and a suggestion that we discussed only
>>> yesterday, but I assure you that it will be justified in an uncontestable
>>> way before even one line of code is written. Ensuring backwards
>>> compatibility is of utmost importance, and because of this, in the proposal
>>> to modify the stream, the current stream format is called primary, whereas
>>> anything else you can add would be secondary and optional. It will also be
>>> presented in a level of formality comparable to that used in the
>>> documentation.
>>>
>>> And of course, as Xavi mentioned, it won't be merged until backwards
>>> compatibility is ensured.
>>>
>>> About the idea of DAG and multiple streams, before we discuss it I just
>>> want to understand whether your main concern is that we should leave
>>> the main stream untouched, or that information shouldn't have to go
>>> 

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Instead of looking at this as modifying or extending the apertium stream
format, we could look at this as making tags more versatile by creating a
new kind of tags which have a feature:value pair. That's all there is to
it, really. In effect, it allows us to pass an arbitrary amount of info in
the pipe.
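
A minimal sketch of what such feature:value tags could look like to a
parser. The <feature:value> spelling, the sample reading and the helper
names are assumptions made for illustration; the thread does not fix a
concrete syntax, and escaping rules of the real stream format are ignored
here.

```python
import re

TAG_RE = re.compile(r"<([^<>]+)>")


def split_tags(reading):
    """Split one reading into (lemma, primary_tags, secondary_features).

    Assumption: a secondary tag is written <feature:value>. This spelling
    is hypothetical and only illustrates the idea discussed in the thread.
    """
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            feature, value = tag.split(":", 1)
            secondary[feature] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


def strip_secondary(reading):
    """Drop secondary tags, leaving what a legacy module would expect."""
    return re.sub(r"<[^<>:]+:[^<>]*>", "", reading)


if __name__ == "__main__":
    sample = "potato<n><sg><case:Aa>"  # invented example reading
    print(split_tags(sample))
    print(strip_secondary(sample))
```

Because the secondary tags can be stripped without touching the primary
ones, a module that does not know about them could still be fed the
reduced reading.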

Tanmai

On Sun, Mar 29, 2020 at 8:53 PM Tanmai Khanna 
wrote:

> Hi Mikel,
>
>> (0) No change should be made without proper regression testing. I think
>> we all agree on that!
>>
> Definitely, and this is something I'll add in the proposal.
>
>> (1) I still believe that the functionality should be proven without
>> rewriting the (critical) format parsing portions in Apertium's modules. In
>> particular, finite-state lexical-processing modules written in C++. I'd
>> like to hear (for instance) from Sergio first.  There is always the
>> possibility of having a complete series of modules that allow for
>> "extended" formats, one per current module, so that each pair works with
>> what it needs.
>>
> As mentioned in the last mail, the functionality will definitely be proven
> before even one line of code is written.
>
>> (2) Extending *is* changing. The payload associated to each lexical unit
>> would be heavier and will be running down the pipe, where it is necessary
>> and where it isn't (but see (4)). This does not come without a
>> computational burden. It is just a matter of comparing it with the burden
>> of working around the current modules. I would not underestimate the
>> speed of Unix utilities, pipes, etc. This is code that has been around for
>> 40 years.
>>
> I have regularly seen programs work with larger streams, but if
> computational burden is the main issue then yes, Tino's idea of a reduced
> reference payload works. Anyway, I'd see that as more of an optimisation
> than a different idea.
>
>> (99) What's the driving force behind avoiding trimming? I confess I don't
>> quite get it.
>>
> I explain this in
> http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming#Which_of_the_published_tasks_am_I_interested_in.3F_What_do_I_plan_to_do.3F
> , but the gist is this: if you trim away source words and their analyses
> because they aren't present in the bidix, you lose out on better lexical
> selection, and transfer rules that would otherwise have matched no longer
> match because the word is now unknown. If we can avoid trimming, we can
> improve disambiguation and transfer. Monodixes can often be improved
> individually using several sources and methods, but none of that would
> matter if we keep trimming all words that aren't in the bidix as well. Just
> because we cannot translate a word shouldn't mean we can't use it for our
> benefit.
>
> On a more fundamental level, this was also about not trimming away
> information in general, and propagating important information in the pipe
> so that each program can use it as it'd like. This would need us to have
> the surface form in the pipe, and it felt like a more general and future
> ready solution would be to allow arbitrary information in the pipe. For
> this project, we can limit this to surface form so that none of the files
> are modified, and every program works as it did before. But it gives the
> developers a powerful option for propagating several features, as Tino
> listed already.
>
> Hope this makes clear the rationale for avoiding trimming and for modifying
> the stream. It would be great to hear from Sergio, as these discussions are
> how we're going to make this project a success for Apertium.
>
> Thanks and Regards,
> Tanmai Khanna
>
>
> On 29/3/20 at 13:31, Tanmai Khanna wrote:
>>
>> Mikel,
>> This is a preliminary idea and a suggestion that we discussed only
>> yesterday, but I assure you that it will be justified in an uncontestable
>> way before even one line of code is written. Ensuring backwards
>> compatibility is of utmost importance, and because of this, in the proposal
>> to modify the stream, the current stream format is called primary, whereas
>> anything else you can add would be secondary and optional. It will also be
>> presented in a level of formality comparable to that used in the
>> documentation.
>>
>> And of course, as Xavi mentioned, it won't be merged until backwards
>> compatibility is ensured.
>>
>> About the idea of DAG and multiple streams, before we discuss it I just
>> want to understand whether your main concern is that we should leave
>> the main stream untouched, or that information shouldn't have to go
>> through programs that don't need it.
>>
>> With all your comments and suggestions, I will make a robust proposal and
>> it will be formalised before it is implemented.
>> Thanks and Regards,
>> Tanmai Khanna
>>
>> On Sun, Mar 29, 2020 at 3:52 PM Mikel L. Forcada  wrote:
>>
>>> Folks:
>>>
>>> The elders in Apertium will not be surprised if I voice my opposition
>>> to changing the formats used between different modules of the Apertium
>>> pipeline. In any case

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Hi Mikel,

> (0) No change should be made without proper regression testing. I think we
> all agree on that!
>
Definitely, and this is something I'll add in the proposal.

> (1) I still believe that the functionality should be proven without
> rewriting the (critical) format parsing portions in Apertium's modules. In
> particular, finite-state lexical-processing modules written in C++. I'd
> like to hear (for instance) from Sergio first.  There is always the
> possibility of having a complete series of modules that allow for
> "extended" formats, one per current module, so that each pair works with
> what it needs.
>
As mentioned in the last mail, the functionality will definitely be proven
before even one line of code is written.

> (2) Extending *is* changing. The payload associated to each lexical unit
> would be heavier and will be running down the pipe, where it is necessary
> and where it isn't (but see (4)). This does not come without a
> computational burden. It is just a matter of comparing it with the burden
> of working around the current modules. I would not underestimate the
> speed of Unix utilities, pipes, etc. This is code that has been around for
> 40 years.
>
I have regularly seen programs work with larger streams, but if
computational burden is the main issue then yes, Tino's idea of a reduced
reference payload works. Anyway, I'd see that as more of an optimisation
than a different idea.

> (99) What's the driving force behind avoiding trimming? I confess I don't
> quite get it.
>
I explain this in
http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming#Which_of_the_published_tasks_am_I_interested_in.3F_What_do_I_plan_to_do.3F
, but the gist is this: if you trim away source words and their analyses
because they aren't present in the bidix, you lose out on better lexical
selection, and transfer rules that would otherwise have matched no longer
match because the word is now unknown. If we can avoid trimming, we can
improve disambiguation and transfer. Monodixes can often be improved
individually using several sources and methods, but none of that would
matter if we keep trimming all words that aren't in the bidix as well. Just
because we cannot translate a word shouldn't mean we can't use it for our
benefit.
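
A toy sketch of that effect, under loud assumptions: the dictionaries, the
single context rule and all function names are invented, and this is not
how Apertium's analyser, tagger or transfer are implemented. It only shows
how a word that is analysable but untranslatable stops being usable as
context once it is trimmed.

```python
# Toy data; every entry is invented for illustration.
MONODIX = {
    "the": [("the", ["det"])],
    "heron": [("heron", ["n", "sg"])],
    "flies": [("fly", ["n", "pl"]), ("fly", ["vblex", "pres"])],
}
BIDIX_LEMMAS = {"the", "fly"}  # "heron" has an analysis but no translation


def analyse(word, trim):
    """Return analyses; with trim=True drop those whose lemma is not in the bidix."""
    readings = MONODIX.get(word, [])
    if trim:
        readings = [r for r in readings if r[0] in BIDIX_LEMMAS]
    return readings  # an empty list means the word is treated as unknown


def disambiguate(sentence, trim):
    """Toy rule: right after a known noun, prefer a verb reading."""
    chosen, previous_is_noun = [], False
    for word in sentence:
        readings = analyse(word, trim)
        if previous_is_noun:
            verbs = [r for r in readings if "vblex" in r[1]]
            readings = verbs or readings
        pick = readings[0] if readings else None
        chosen.append(pick)
        previous_is_noun = pick is not None and "n" in pick[1]
    return chosen


if __name__ == "__main__":
    sentence = ["the", "heron", "flies"]
    print(disambiguate(sentence, trim=False))
    print(disambiguate(sentence, trim=True))
```

Without trimming, "heron" is still recognised as a noun and the rule picks
the verb reading of "flies"; with trimming, "heron" becomes unknown and the
rule never fires.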

On a more fundamental level, this was also about not trimming away
information in general, and propagating important information in the pipe
so that each program can use it as it'd like. This would need us to have
the surface form in the pipe, and it felt like a more general and future
ready solution would be to allow arbitrary information in the pipe. For
this project, we can limit this to surface form so that none of the files
are modified, and every program works as it did before. But it gives the
developers a powerful option for propagating several features, as Tino
listed already.

Hope this makes clear the rationale for avoiding trimming and for modifying
the stream. It would be great to hear from Sergio, as these discussions are
how we're going to make this project a success for Apertium.

Thanks and Regards,
Tanmai Khanna


On 29/3/20 at 13:31, Tanmai Khanna wrote:
>
> Mikel,
> This is a preliminary idea and a suggestion that we discussed only
> yesterday, but I assure you that it will be justified in an uncontestable
> way before even one line of code is written. Ensuring backwards
> compatibility is of utmost importance, and because of this, in the proposal
> to modify the stream, the current stream format is called primary, whereas
> anything else you can add would be secondary and optional. It will also be
> presented in a level of formality comparable to that used in the
> documentation.
>
> And of course, as Xavi mentioned, it won't be merged until backwards
> compatibility is ensured.
>
> About the idea of DAG and multiple streams, before we discuss it I just
> want to understand whether your main concern is that we should leave
> the main stream untouched, or that information shouldn't have to go
> through programs that don't need it.
>
> With all your comments and suggestions, I will make a robust proposal and
> it will be formalised before it is implemented.
> Thanks and Regards,
> Tanmai Khanna
>
> On Sun, Mar 29, 2020 at 3:52 PM Mikel L. Forcada  wrote:
>
>> Folks:
>>
>> The elders in Apertium will not be surprised if I voice my opposition to
>> changing the formats used between different modules of the Apertium
>> pipeline. In any case, this affects the core functionality of
>> Apertium in many ways and its need should be justified in an uncontestable
>> way so that the PMC makes a decision to have a new version of Apertium
>> which should inevitably have paths to backward compatibility so that legacy
>> languages and language pairs work identically and without any loss of
>> performance. I believe we are far from "uncontestability", but that is just
>> my personal opinion.
>>
>> Currently, modes are linear pipelines. Any

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Mikel L. Forcada

Folks:

A quick round of comments after the responses by Tino, Xavi, Tanmai,
and Fran. Did I miss anyone?


(00) I cannot claim to have thoroughly considered all of the details of 
the proposal. Therefore, I can change my mind.


(0) No change should be made without proper regression testing. I think 
we all agree on that!


(1) I still believe that the functionality should be proven without 
rewriting the (critical) format parsing portions in Apertium's modules. 
In particular, finite-state lexical-processing modules written in C++. 
I'd like to hear (for instance) from Sergio first.  There is always the 
possibility of having a complete series of modules that allow for 
"extended" formats, one per current module, so that each pair works with 
what it needs.


(2) Extending *is* changing. The payload associated to each lexical unit 
would be heavier and will be running down the pipe, where it is 
necessary and where it isn't (but see (4)). This does not come without a 
computational burden. It is just a matter of comparing it with the 
burden of working around the current modules. I would not
underestimate the speed of Unix utilities, pipes, etc. This is code that 
has been around for 40 years. [1]


(3) I was not suggesting external forking. Branches inside Apertium are 
fine.


(4) I like Tino's idea of a reduced reference payload and a common 
storage which is referred to.


(99) What's the driving force behind avoiding trimming? I confess I 
don't quite get it.
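
A minimal sketch of the idea in point (4), assuming the stream carries only
a short reference while the heavy payload sits in shared storage. The
<ref:N> spelling, the SideStore class and the helper names are all
hypothetical; a real shared store would more likely be a file or a socket
than an in-memory dictionary.

```python
class SideStore:
    """In-memory stand-in for the common storage (hypothetical)."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def put(self, info):
        """Store a payload and return the integer reference to it."""
        ref = self._next_id
        self._items[ref] = info
        self._next_id += 1
        return ref

    def get(self, ref):
        return self._items[ref]


def attach(reading, info, store):
    """Append a short reference to the reading instead of the full payload."""
    return f"{reading}<ref:{store.put(info)}>"


def resolve(reading, store):
    """Return (reading_without_ref, payload); payload is None if no reference."""
    base, _, tail = reading.rpartition("<ref:")
    if not base:
        return reading, None
    return base, store.get(int(tail.rstrip(">")))


if __name__ == "__main__":
    store = SideStore()
    unit = attach("potato<n><sg>", {"surface": "Potato"}, store)
    print(unit)
    print(resolve(unit, store))
```

Modules that do not need the payload only pay for a few extra characters
per lexical unit; modules that do need it look it up by reference.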


Cheers

Mikel

[1] If it weren't for some stubbornness on my part (ask Sergio), 
Apertium (previously interNOSTRUM) would not be a Unix pipeline, but 
some monolithic object-oriented thing, or whatever was fashionable then. 
Yes, I was a grumpy, stubborn boss back then. Now I am not the boss 
anymore; still grumpy sometimes, and fortunately 3.5 years from voluntary
retirement. Which, hey, I have decided to take.


On 29/3/20 at 13:31, Tanmai Khanna wrote:

Mikel,
This is a preliminary idea and a suggestion that we discussed only 
yesterday, but I assure you that it will be justified in an 
uncontestable way before even one line of code is written. Ensuring 
backwards compatibility is of utmost importance, and because of this, 
in the proposal to modify the stream, the current stream format is 
called primary, whereas anything else you can add would be secondary 
and optional. It will also be presented in a level of formality 
comparable to that used in the documentation.


And of course, as Xavi mentioned, it won't be merged until backwards 
compatibility is ensured.


About the idea of DAG and multiple streams, before we discuss it I
just want to understand whether your main concern is that we
should leave the main stream untouched, or that information
shouldn't have to go through programs that don't need it.


With all your comments and suggestions, I will make a robust proposal 
and it will be formalised before it is implemented.

Thanks and Regards,
Tanmai Khanna

On Sun, Mar 29, 2020 at 3:52 PM Mikel L. Forcada wrote:


Folks:

The elders in Apertium will not be surprised if I voiced my
opposition to changing the format in the Apertium formats used
between different modules of the pipeline. In any case, this
affects the core functionality of Apertium in many ways and its
need should be justified in an uncontestable way so that the PMC
makes a decision to have a new version of Apertium which should
inevitably have paths to backward compatibility so that legacy
languages and language pairs work identically and without any loss
of performance. I believe we are far from "uncontestability", but
that is just my personal opinion.

Currently, modes are linear pipelines. Any functionality requiring
information that is currently ingested by one module and not
passed ahead could be sent to later modules by teeing
(https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc.
We would have a directed acyclic graph, much as in tools like
make, snakemake, dgsh (https://github.com/dspinellis/dgsh/wiki) etc.

Any justification should first prove that the functionality is
robust and needed by working around the current format and
modules, and be presented in a level of formality which is
comparable to that used currently in our documentation.

Having said that, no one can oppose people forking and testing.
If the new thing works, Apertium could bless the fork and merge it
(depending on how the fork handles provisions for legacy Apertium
workflows). But, as I said, this seems premature to me. But I am
usually very conservative.

Cheers

Mikel


On 29/3/20 at 11:41, Tanmai Khanna wrote:

Hey guys,
Here's a draft proposal for this project. Any comments will be
appreciated :)

Thanks,
Tanmai

On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna
mailto:kha

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Mikel,
This is a preliminary idea and a suggestion that we discussed only
yesterday, but I assure you that it will be justified in an uncontestable
way before even one line of code is written. Ensuring backwards
compatibility is of utmost importance, and because of this, in the proposal
to modify the stream, the current stream format is called primary, whereas
anything else you can add would be secondary and optional. It will also be
presented in a level of formality comparable to that used in the
documentation.

And of course, as Xavi mentioned, it won't be merged until backwards
compatibility is ensured.

About the idea of DAG and multiple streams, before we discuss it I just
want to understand whether your main concern is that we should leave
the main stream untouched, or that information shouldn't have to go
through programs that don't need it.

With all your comments and suggestions, I will make a robust proposal and
it will be formalised before it is implemented.
Thanks and Regards,
Tanmai Khanna

On Sun, Mar 29, 2020 at 3:52 PM Mikel L. Forcada  wrote:

> Folks:
>
> The elders in Apertium will not be surprised if I voiced my opposition to
> changing the format in the Apertium formats used between different modules
> of the pipeline. In any case, this affects the core functionality of
> Apertium in many ways and its need should be justified in an uncontestable
> way so that the PMC makes a decision to have a new version of Apertium
> which should inevitably have paths to backward compatibility so that legacy
> languages and language pairs work identically and without any loss of
> performance. I believe we are far from "uncontestability", but that is just
> my personal opinion.
>
> Currently, modes are linear pipelines. Any functionality requiring
> information that is currently ingested by one module and not passed ahead
> could be sent to later modules by teeing (
> https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We would
> have a directed acyclic graph, much as in tools like make, snakemake, dgsh (
> https://github.com/dspinellis/dgsh/wiki) etc.
>
> Any justification should first prove that the functionality is robust and
> needed by working around the current format and modules, and be presented
> in a level of formality which is comparable to that used currently in our
> documentation.
>
> Having said that, no one can oppose people forking and testing. If the
> new thing works, Apertium could bless the fork and merge it (depending on
> how the fork handles provisions for legacy Apertium workflows). But, as I
> said, this seems premature to me. But I am usually very conservative.
>
> Cheers
>
> Mikel
>
>
> On 29/3/20 at 11:41, Tanmai Khanna wrote:
>
> Hey guys,
> Here's a draft proposal for this project. Any comments will be
> appreciated :)
>
> Thanks,
> Tanmai
>
> On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna 
> wrote:
>
>> Hi Hèctor,
>> A fundamental motivation for this proposal is the possibility of giving
>> the power to each program to use and propagate as much information as it
>> needs in the pipeline. In our discussion on the IRC, Tino Didriksen said:
>>
>>> You should see how much secondary information VISL's streams have. Noun
>>> semantics, verb frames, dependency, markup tags, etc. Being able to carry
>>> any information along makes many things possible, often things you can't
>>> imagine because of current limitations.
>>>
>>
>> The example in my original email was, well, just an example, but the idea is
>> that you can add as much information as you want, in the language
>> models or even the translation modules. It's also not just about English
>> words, but about all languages. This is not to say that we have to add case
>> information to every English word. It is optional information which can be
>> added if needed for the translation task.
>>
>> With this proposal we're trying to prepare the apertium stream for the
>> future. Today we realised that we need the surface form in the stream, and
>> tomorrow we might need semantic tags, sentiment tags, etc.* If we don't
>> do this now, we will have to modify all the parsers in the pipeline each
>> time we need more information in the pipe.* This is why it's a good idea
>> to modify the parsers so that it can handle an arbitrary amount of
>> information.
>>
>> Lastly, one point we should discuss is this idea about how any secondary
>> information I add in the monodix would be available for everyone who uses
>> that information. There's several things to say about this:
>>
>>- As long as the information is correct, I don't really see why
>>redundant secondary information should bother anyone. It will be available
>>for anyone who wishes to use it for their task, and if you don't want to
>>use it the programs will ignore it.
>>- Another idea is that secondary information could be put in a
>>separate dix, however this would lead to an unnecessary increase in
>>complexit

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tino Didriksen
It's all transparent. Nobody has to add secondary information to the
stream. All current pipes will continue to work as-is, unmodified. All old
data and files remain valid.

The work is to allow for arbitrary secondary information to be added to the
stream. Initially for use with surface forms, so that we can eliminate the
need for trimming and get input surface forms carried through to output.

Pieces of information that would be useful to have in the stream include,
but are not limited to: surface form, syntactic function, sentiment,
semantics (noun, adjective, verb), roles, verb frames, dependency, markup
tags.

As an example, http://codepad.org/Lq4xwyZr is how VISL + GramTrans' stream
looks, after syntactic transfer. You've got the input tokens, re-arranged,
with target analysis injected, but keeping original analysis and
information about where this token was in the input. This is all needed.
Some of this comes all the way from the input analysis, and is useful in
several places along the way.

It cannot be passed along via a separate channel (DAG-style) - it is
inherently tied to each reading. What you could do, and what GramTrans
does, is to store a lot of information in a separate channel, but still
store handles to this information per-reading. E.g., that's how GramTrans
handles markup tags - separate channel stores the full tag  and stream passes along  - the stream still gets
baseline information that there is an  tag, because that's relevant, but
doesn't need all the attributes.
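The handle idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `SideChannel` class, the helper functions, and the `<markup:NAME:ID>` handle syntax are all hypothetical names invented for the example, not GramTrans' or Apertium's actual format.

```python
import itertools
import re


class SideChannel:
    """Hypothetical out-of-band store: full payloads live here,
    while the stream itself only carries small handles."""

    def __init__(self):
        self._store = {}
        self._next_id = itertools.count(1)

    def put(self, payload):
        handle = next(self._next_id)
        self._store[handle] = payload
        return handle

    def get(self, handle):
        return self._store[handle]


# Assumed handle syntax for this sketch only: <markup:NAME:ID>
HANDLE_RE = re.compile(r"<markup:(\w+):(\d+)>")


def attach_markup(reading, tag_name, full_tag, channel):
    """Store the full tag out of band and append a handle to the reading."""
    handle = channel.put(full_tag)
    return f"{reading}<markup:{tag_name}:{handle}>"


def resolve_markup(reading, channel):
    """Recover (tag name, full tag) from the handle on a reading, if any."""
    match = HANDLE_RE.search(reading)
    if match is None:
        return None
    return match.group(1), channel.get(int(match.group(2)))


channel = SideChannel()
full_tag = '<a href="https://example.org/" class="external">'
reading = attach_markup("link<n><sg>", "a", full_tag, channel)
print(reading)
print(resolve_markup(reading, channel))
```

The reading still says that an `a` tag is present, which is the baseline information a module may care about, while the attributes stay in the separate store until something needs them.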

Appending to each reading also allows for pausing the pipe in any location,
and still retain all the information.

Will Apertium need all that information? Not immediately. But currently
it's impossible to append secondary information to readings, and it's
hindering us. What we do need immediately is surface form and markup tags.
And if we're going to modify the stream to transport this, we might as well
allow anything.

The proposal boils down to: tags with : in them are secondary, and
secondary tags are always trailing. Make secondary tags not break the pipe.
Modify the tokeniser to append surface form via  and generator to
use surface form if there is no translation.
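A minimal sketch of the rule summarised above, assuming readings of the form `lemma<tag><tag>...`. The secondary tag name `sf` and both function names are hypothetical, and escaping of `<` or `>` inside values is not handled here.

```python
import re

# Assumed reading syntax: a lemma followed by <tag> groups, e.g. "dog<n><pl>".
TAG_RE = re.compile(r"<([^<>]+)>")


def split_reading(reading):
    """Split one reading into (lemma, primary_tags, secondary_tags).

    A tag containing ':' is treated as secondary (key:value);
    secondary tags must come after all primary tags.
    """
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            key, value = tag.split(":", 1)
            secondary[key] = value
        elif secondary:
            raise ValueError(f"primary tag <{tag}> after a secondary tag")
        else:
            primary.append(tag)
    return lemma, primary, secondary


def strip_secondary(reading):
    """What a module that ignores secondary tags would work with."""
    lemma, primary, _ = split_reading(reading)
    return lemma + "".join(f"<{t}>" for t in primary)


print(split_reading("dog<n><pl><sf:Dogs>"))
print(strip_secondary("dog<n><pl><sf:Dogs>"))
```

Because secondary tags are always trailing, a module that does not care about them only has to stop reading at the first tag that contains a colon.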

-- Tino Didriksen


On Sun, 29 Mar 2020 at 12:21, Mikel L. Forcada  wrote:

> Folks:
>
> The elders in Apertium will not be surprised if I voiced my opposition to
> changing the format in the Apertium formats used between different modules
> of the pipeline. In any case, this affects the core functionality of
> Apertium in many ways and its need should be justified in an uncontestable
> way so that the PMC makes a decision to have a new version of Apertium
> which should inevitably have paths to backward compatibility so that legacy
> languages and language pairs work identically and without any loss of
> performance. I believe we are far from "uncontestability", but that is just
> my personal opinion.
>
> Currently, modes are linear pipelines. Any functionality requiring
> information that is currently ingested by one module and not passed ahead
> could be sent to later modules by teeing (
> https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We would
> have a directed acyclic graph, much as in tools like make, snakemake, dgsh (
> https://github.com/dspinellis/dgsh/wiki) etc.
>
> Any justification should first prove that the functionality is robust and
> needed by working around the current format and modules, and be presented
> in a level of formality which is comparable to that used currently in our
> documentation.
>
> Having said that, no one can oppose people forking and testing. If the
> new thing works, Apertium could bless the fork and merge it (depending on
> how the fork handles provisions for legacy Apertium workflows). But, as I
> said, this seems premature to me. But I am usually very conservative.
>
> Cheers
>
> Mikel
>


Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Xavi Ivars
Message from Mikel L. Forcada on Sun, 29 March 2020
at 12:22:

> Folks:
>
> The elders in Apertium will not be surprised if I voiced my opposition to
> changing the format in the Apertium formats used between different modules
> of the pipeline. In any case, this affects the core functionality of
> Apertium in many ways and its need should be justified in an uncontestable
> way so that the PMC makes a decision to have a new version of Apertium
> which should inevitably have paths to backward compatibility so that legacy
> languages and language pairs work identically and without any loss of
> performance. I believe we are far from "uncontestability", but that is just
> my personal opinion.
>
Either I'm missing something, or there's something about the proposal that
is not clear enough.

I don't think anyone has suggested "changing" the format used between
modules, only extending it in a way that is fully backwards compatible. One
of the main pieces of the "implementation" part of the proposal is
actually about that: avoiding any type of regression in existing modules
and language pairs.

For someone using the language data with Apertium 3.X (whatever X we currently
have), it will keep working the same way it works today. And for anyone using
the language data with Apertium 3.Y (whatever Y is when this feature gets
merged), it will also work the same way. I agree this is a must, but where does
the proposal say that won't be the case? It actually says the opposite: it
will ensure no regressions.


> Currently, modes are linear pipelines. Any functionality requiring
> information that is currently ingested by one module and not passed ahead
> could be sent to later modules by teeing (
> https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We would
> have a directed acyclic graph, much as in tools like make, snakemake, dgsh (
> https://github.com/dspinellis/dgsh/wiki) etc.
>
One of the (huge) benefits of Apertium's modular architecture is the
ability to place modules at any point of the pipeline to improve the
overall translation, without having to modify the existing modules. Several
modules have been added to different pipelines: move from 1-level to
3-level transfers (with chunking), new morphological analyzers based on
HFST, CG disambiguation, sliding window and perceptron implementation for
taggers, lexical selection, anaphora resolution, recursive transfers,...
even a new DOCX processor a few days ago!

But this extensibility is currently limited by the actual information that
is passed over to the next step. Why isn't the PoS tagger passing the
surface form to the "next" module? Because at the time it was written, the
"next" module didn't need it. And that definitely constrains the ability
to add a module after a certain point that needs information that has
already been discarded.

On the other hand, teeing would actually have a much bigger performance
impact: file systems are much slower than memory. Also, when
thousands of translations are done every minute (apertium.org,
softcatala.org, ...), adding that level of *stress* to the IO of those
servers seems a very bad implementation decision from my point of view.

> Any justification should first prove that the functionality is robust and
> needed by working around the current format and modules, and be presented
> in a level of formality which is comparable to that used currently in our
> documentation.
>
Completely agree. Documentation is something that is already behind the
current implementation, and these changes should be properly documented.

> Having said that, no one can oppose people forking and testing. If the
> new thing works, Apertium could bless the fork and merge it (depending on
> how the fork handles provisions for legacy Apertium workflows). But, as I
> said, this seems premature to me. But I am usually very conservative.
>
I don't think anyone would suggest merging this before actually ensuring it
properly works in a backwards compatible mode. The good thing about how
the Apertium code base is structured in GitHub (and Git in general) is that it
makes fork-and-merge straightforward *inside* Apertium itself (git
branches, pull requests, etc).

And the reason I don't see these discussions as premature is that it'll be
much easier to eventually merge it back to master if there's already an
agreement on the plan. It would be really painful if, once this
work is done and it meets the prerequisites that we all agree on (backwards
compatibility, documentation, ...), it weren't merged because of
disagreements over, for example, how the extension has been done.

So I'd really ask the ones that have more expertise (and, honestly, a much
better formed point of view than I do) on why Apertium is the way it is today
(Mikel & Fran, but also Felipe, Sergio, ...) to please keep challenging
approaches like this, but also to point out the weak points of
the proposal, so they can be reconsidered and improved, 

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Francis Tyers

On 2020-03-29 11:21, Mikel L. Forcada wrote:

Folks:

The elders in Apertium will not be surprised if I voiced my opposition
to changing the format in the Apertium formats used between different
modules of the pipeline. In any case, this affects the core
functionality of Apertium in many ways and its need should be
justified in an uncontestable way so that the PMC makes a decision to
have a new version of Apertium which should inevitably have paths to
backward compatibility so that legacy languages and language pairs
work identically and without any loss of performance. I believe we are
far from "uncontestability", but that is just my personal opinion.


Agree.


Currently, modes are linear pipelines. Any functionality requiring
information that is currently ingested by one module and not passed
ahead could be sent to later modules by teeing
(https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We
would have a directed acyclic graph, much as in tools like make,
snakemake, dgsh (https://github.com/dspinellis/dgsh/wiki) etc.


A DAG would be an interesting way of setting up the pipeline ... a bit
like the "compute graphs" that are used in NNs nowadays ... but I'm
not sure how it could be done in a unixy way.


Any justification should first prove that the functionality is robust
and needed by working around the current format and modules, and be
presented in a level of formality which is comparable to that used
currently in our documentation.


Completely agree.


Having said that, no one can oppose people forking and testing. If
the new thing works, Apertium could bless the fork and merge it
(depending on how the fork handles provisions for legacy Apertium
workflows). But, as I said, this seems premature to me. But I am
usually very conservative.



I think in addition we need to have clear examples of the information
that should be included and a translational motivation for it: how
is the information going to help us, shown with concrete translation
examples?

Fran




Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Mikel L. Forcada

Folks:

The elders in Apertium will not be surprised if I voiced my opposition 
to changing the format in the Apertium formats used between different 
modules of the pipeline. In any case, this affects the core
functionality of Apertium in many ways and its need should be justified 
in an uncontestable way so that the PMC makes a decision to have a new 
version of Apertium which should inevitably have paths to backward 
compatibility so that legacy languages and language pairs work 
identically and without any loss of performance. I believe we are far 
from "uncontestability", but that is just my personal opinion.


Currently, modes are linear pipelines. Any functionality requiring 
information that is currently ingested by one module and not passed 
ahead could be sent to later modules by teeing 
(https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We 
would have a directed acyclic graph, much as in tools like make, 
snakemake, dgsh (https://github.com/dspinellis/dgsh/wiki) etc.
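As a rough sketch of the shape such a graph could take, the following models the modules as plain Python functions rather than real processes connected with tee and named pipes. The toy lexicon, the function names, and the "*" prefix for unknown words are assumptions made only for this illustration.

```python
def analyse(text):
    """Stand-in for a module that emits (surface, analysis) pairs."""
    toy_lexicon = {"dogs": "dog<n><pl>", "bark": "bark<vblex><pres>"}
    return [(w, toy_lexicon.get(w.lower(), "*" + w)) for w in text.split()]


def tee(items):
    """Duplicate a stream so that two branches can consume it."""
    items = list(items)
    return list(items), list(items)


def main_branch(pairs):
    """Main branch: the existing stream content, surface forms dropped."""
    return [analysis for _, analysis in pairs]


def side_branch(pairs):
    """Side branch: carries only what the main branch discards."""
    return [surface for surface, _ in pairs]


def late_module(analyses, surfaces):
    """A later node with two inputs; uses the surface for unknown words."""
    return [s if a.startswith("*") else a for a, s in zip(analyses, surfaces)]


main_in, side_in = tee(analyse("Dogs bark loudly"))
result = late_module(main_branch(main_in), side_branch(side_in))
print(result)
```

Note that the join in `late_module` is positional: it only lines up if the modules between the tee and the join neither reorder, merge, nor split tokens.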


Any justification should first prove that the functionality is robust 
and needed by working around the current format and modules, and be 
presented in a level of formality which is comparable to that used 
currently in our documentation.


Having said that, no one can oppose people forking and testing. If
the new thing works, Apertium could bless the fork and merge it 
(depending on how the fork handles provisions for legacy Apertium 
workflows). But, as I said, this seems premature to me. But I am usually 
very conservative.


Cheers

Mikel


On 29/3/20 at 11:41, Tanmai Khanna wrote:

Hey guys,
Here's a draft proposal for this project, almost done except the work plan. Could you check it out?
<http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming>
Any comments will be appreciated :)


Thanks,
Tanmai

On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna wrote:


Hi Hèctor,
A fundamental motivation for this proposal is the possibility of
giving the power to each program to use and propagate as much
information as it needs in the pipeline. In our discussion on the
IRC, Tino Didriksen said:

You should see how much secondary information VISL's streams
have. Noun semantics, verb frames, dependency, markup tags,
etc. Being able to carry any information along makes many
things possible, often things you can't imagine because of
current limitations.


The example in my original email was, well, just an example, but the idea
is that you can add as much information as you want, in the
language models or even the translation modules. It's also not
just about English words, but about all languages. This is not to
say that we have to add case information to every English word. It
is optional information which can be added if needed for the
translation task.

With this proposal we're trying to prepare the apertium stream for
the future. Today we realised that we need the surface form in the
stream, and tomorrow we might need semantic tags, sentiment tags,
etc. *If we don't do this now, we will have to modify all the
parsers in the pipeline each time we need more information in the
pipe.* This is why it's a good idea to modify the parsers so that
it can handle an arbitrary amount of information.

Lastly, one point we should discuss is this idea about how any
secondary information I add in the monodix would be available for
everyone who uses that information. There's several things to say
about this:

  * As long as the information is correct, I don't really see why
redundant secondary information should bother anyone. It will
be available for anyone who wishes to use it for their task,
and if you don't want to use it the programs will ignore it.
  * Another idea is that secondary information could be put in a
separate dix, however this would lead to an unnecessary
increase in complexity.

Unless the developers of Apertium feel that redundant
information in the stream will be a huge problem, this will allow
each program to access a lot more information and open up
possibilities that we haven't even thought of yet. At the very
least, it will help us to eliminate trimming.

Thanks and Regards,
Tanmai Khanna

On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font wrote:

Hi Tanmai,

I am surprised by this proposal. It involves some very
important changes that should be better justified. I don't
quite understand when one should define the "optional
secondary information" in addition to the current
morphological fields. Will it be in the language module
(apertium-xxx) or in each of the translation modules
(apertium-xxx-yyy)? Part of the problem may be in the example.
 

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tino Didriksen
No dixes will be harmed during this procedure. Nobody has to touch any
existing language files for this work to be incredibly useful. The proposal
is to allow the stream to carry secondary information. This secondary
information can come from anywhere, and will mostly be dynamic.

Initially, the tokeniser will dynamically append the surface form as
secondary information to the output stream, which the generator will use if
no translation was available.
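The flow described here could look roughly like the sketch below. It is an assumption-laden illustration: the `<sf:...>` tag name, the `^...$` unit delimiters as used here, the "@" prefix for a missing translation, the toy dictionaries, and all function names are hypothetical choices for the example, not the actual implementation.

```python
import re

# Hypothetical secondary tag carrying the surface form: <sf:SURFACE>
SF_RE = re.compile(r"<sf:([^<>]*)>")


def append_surface(word, analysis):
    """Tokeniser side: append the surface form as a trailing secondary tag."""
    return f"^{analysis}<sf:{word}>$"


def transfer(unit, bidix):
    """Translate the primary part; mark it with '@' if there is no entry."""
    body = unit[1:-1]
    sf_tag = SF_RE.search(body).group(0)
    primary = SF_RE.sub("", body)
    target = bidix.get(primary)
    if target is None:
        return f"^@{primary}{sf_tag}$"
    return f"^{target}{sf_tag}$"


def generate(unit, gen_dict):
    """Generator side: fall back to the surface form without a translation."""
    body = unit[1:-1]
    surface = SF_RE.search(body).group(1)
    primary = SF_RE.sub("", body)
    if primary.startswith("@"):
        return surface
    return gen_dict.get(primary, surface)


bidix = {"dog<n><pl>": "perro<n><m><pl>"}
gen_dict = {"perro<n><m><pl>": "perros"}

known = append_surface("Dogs", "dog<n><pl>")
unknown = append_surface("Wuffles", "wuffle<n><pl>")
print(generate(transfer(known, bidix), gen_dict))
print(generate(transfer(unknown, bidix), gen_dict))
```

The word without a translation comes out as the original surface form, and no dictionary had to change for that to work.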

In the future, this can be used for so many things, including secondary
information that authors write into the dixes.

-- Tino Didriksen


On Sun, 29 Mar 2020 at 07:07, Hèctor Alòs i Font 
wrote:

> Hi Tanmai,
>
> I am surprised by this proposal. It involves some very important changes
> that should be better justified. I don't quite understand when one should
> define the "optional secondary information" in addition to the current
> morphological fields. Will it be in the language module (apertium-xxx) or
> in each of the translation modules (apertium-xxx-yyy)? Part of the problem
> may be in the example. I can't imagine why information on case should be
> added to every English word (any more than, say, information about
> belonging, which is common in Turkic languages). Will this kind of
> information, unnecessary for everybody or almost everybody, be found
> in every language pair using, say, English, just because someone wants to
> add it for his or her specific purposes? As far as I understand, the
> given project needs to add the surface form of the word. This seems
> quite logical. Moreover, this information may be useful for e.g. lexical
> selection and structural transfer. But more than that seems to me too
> obscure.
>
> Best,
> Hèctor
>


Re: [Apertium-stuff] Registration for wiki page

2020-03-29 Thread Ayush
Dear sir,
I would like to inform you that I have submitted the final proposal for the robust
tokenisation task to the GSoC site and will shortly begin writing the proposal on my
wiki page. Please review my proposal and suggest changes
if needed. Also, for making a wiki page proposal, am I following the right
procedure?
1) clicking on the username at the top right corner
2) clicking on the user page
3) creating a new user page and filling in the proposal details there.

Thanks and regards,
Ayush Pradhan 
From: Flammie A Pirinen
Sent: 23 March 2020 05:56 PM
To: apertium-stuff@lists.sourceforge.net
Subject: Re: [Apertium-stuff] Registration for wiki page

On Mon, Mar 23, 2020 at 04:46:06PM +0530, Ayush wrote:
> Dear sir,
> Actually, I have got almost nowhere while going through lttoolbox. Can
> you please help me with making a schedule for the proposal, and also tell me what
> things I would be working on for the robust tokenisation task? I know
> that I have to update lttoolbox to be fully Unicode, but how?

Hi,
the lttoolbox part of the code is also not my area of
expertise, and it would be a good thing for the application to recruit a
co-mentor or advisor who knows lttoolbox internals. That said, I would
suggest starting by figuring out just the user's point of view of
tokenisation at the moment: take a handful of languages from the current
Apertium set, e.g. English, Finnish, Kazakh, Norwegian, German, and
maybe some spaceless script if there are any. Find test cases showing
how they work currently and where they could improve, and approach the
GSoC schedule as a test-driven software engineering project. It may be
hard to spread such a schedule over a three-month timeline, but when you have
uncovered some targets like this we can discuss what additional steps are
likely to take time.

-- 
Regards, Flammie 
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)



Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
I apologise, it seems like the link got removed when the message was sent.
Here it is:
http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming

Thanks
Tanmai

On Sun, Mar 29, 2020 at 3:11 PM Tanmai Khanna 
wrote:

> Hey guys,
> Here's a draft proposal for this project. Any comments will be
> appreciated :)
>
> Thanks,
> Tanmai
>
> On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna 
> wrote:
>
>> Hi Hèctor,
>> A fundamental motivation for this proposal is the possibility of giving
>> the power to each program to use and propagate as much information as it
>> needs in the pipeline. In our discussion on the IRC, Tino Didriksen said:
>>
>>> You should see how much secondary information VISL's streams have. Noun
>>> semantics, verb frames, dependency, markup tags, etc. Being able to carry
>>> any information along makes many things possible, often things you can't
>>> imagine because of current limitations.
>>>
>>
>> The example in my original email was, well, just an example, but the idea is
>> that you can add as much information as you want, in the language
>> models or even the translation modules. It's also not just about English
>> words, but about all languages. This is not to say that we have to add case
>> information to every English word. It is optional information which can be
>> added if needed for the translation task.
>>
>> With this proposal we're trying to prepare the apertium stream for the
>> future. Today we realised that we need the surface form in the stream, and
>> tomorrow we might need semantic tags, sentiment tags, etc.* If we don't
>> do this now, we will have to modify all the parsers in the pipeline each
>> time we need more information in the pipe.* This is why it's a good idea
>> to modify the parsers so that it can handle an arbitrary amount of
>> information.
>>
>> Lastly, one point we should discuss is this idea about how any secondary
>> information I add in the monodix would be available for everyone who uses
>> that information. There's several things to say about this:
>>
>>- As long as the information is correct, I don't really see why
>>redundant secondary information should bother anyone. It will be available
>>for anyone who wishes to use it for their task, and if you don't want to
>>use it the programs will ignore it.
>>- Another idea is that secondary information could be put in a
>>separate dix, however this would lead to an unnecessary increase in
>>complexity.
>>
>> Unless the developers of Apertium feel that redundant information in
>> the stream will be a huge problem, this will allow each program to access a
>> lot more information and open up possibilities that we haven't even thought
>> of yet. At the very least, it will help us to eliminate trimming.
>>
>> Thanks and Regards,
>> Tanmai Khanna
>>
>> On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font 
>> wrote:
>>
>>> Hi Tanmai,
>>>
>>> I am surprised by this proposal. It involves some very important changes
>>> that should be better justified. I don't quite understand when one should
>>> define the "optional secondary information" in addition to the current
>>> morphological fields. Will it be in the language module (apertium-xxx) or
>>> in each of the translation modules (apertium-xxx-yyy)? Part of the problem
>>> may be in the example. I can't imagine why information on case should be
>>> added to every English word (any more than, say, information about
>>> belonging, which is common in Turkic languages). Will this kind of
>>> information, unnecessary for everybody or almost everybody, be found
>>> in every language pair using, say, English, just because someone wants to
>>> add it for his or her specific purposes? As far as I understand, the
>>> given project needs to add the surface form of the word. This seems
>>> quite logical. Moreover, this information may be useful for e.g. lexical
>>> selection and structural transfer. But more than that seems to me too
>>> obscure.
>>>
>>> Best,
>>> Hèctor
>>>
>>> Message from Tanmai Khanna on Sat, 28 March
>>> 2020 at 23:51:
>>>
>>>> Hey guys,
>>>> As part of the project to eliminate trimming, I had to come up with a
>>>> way to include the surface form in the lexical unit and hence modifying the
>>>> apertium stream format. To do this I would have to modify the parsers of
>>>> every program in the pipeline, and if that has to happen, we discussed on
>>>> the IRC that *it might be a good idea to modify the stream in such a
>>>> way that we can include an arbitrary amount of information in a lexical
>>>> unit, and each program can use whatever information they need.*
>>>>
>>>> The current information in the lexical unit would be primary
>>>> information, and then we would have optional secondary information which
>>>> could contain the surface form, but also literally anything you can think
>>>> of (case, sentiment, pragmatic info, etc.). This would open up a lot of
>>>> possibilities for each program, 

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Hey guys,
Here's a draft proposal <http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming>
for this project. Any comments will be appreciated :)

Thanks,
Tanmai

On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna 
wrote:

> Hi Hèctor,
> A fundamental motivation for this proposal is the possibility of giving
> the power to each program to use and propagate as much information as it
> needs in the pipeline. In our discussion on the IRC, Tino Didriksen said:
>
>> You should see how much secondary information VISL's streams have. Noun
>> semantics, verb frames, dependency, markup tags, etc. Being able to carry
>> any information along makes many things possible, often things you can't
>> imagine because of current limitations.
>>
>
> The example in my original email was, well, just an example, but the idea
> is that you can add as much information as you want, in the language modules
> or even the translation modules. It's also not just about English words,
> but about all languages. This is not to say that we have to add case
> information to every English word. It is optional information which can be
> added if needed for the translation task.
>
> With this proposal we're trying to prepare the apertium stream for the
> future. Today we realised that we need the surface form in the stream, and
> tomorrow we might need semantic tags, sentiment tags, etc. *If we don't
> do this now, we will have to modify all the parsers in the pipeline each
> time we need more information in the pipe.* This is why it's a good idea
> to modify the parsers so that they can handle an arbitrary amount of
> information.
>
> Lastly, one point we should discuss is the idea that any secondary
> information I add in the monodix would be available to everyone who uses
> that monodix. There are several things to say about this:
>
>- As long as the information is correct, I don't really see why
>redundant secondary information should bother anyone. It will be available
>for anyone who wishes to use it for their task, and if you don't want to
>use it the programs will ignore it.
>- Another idea is that secondary information could be put in a
>separate dix; however, this would lead to an unnecessary increase in
>complexity.
>
> Unless the developers of Apertium feel that redundant information in
> the stream will be a huge problem, this will allow each program to access a
> lot more information and open up possibilities that we haven't even thought
> of yet. At the very least, it will help us to eliminate trimming.
>
> Thanks and Regards,
> Tanmai Khanna
>
> On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font 
> wrote:
>
>> Hi Tanmai,
>>
>> I am surprised by this proposal. It involves some very important changes
>> that should be better justified. I don't quite understand when one should
>> define the "optional secondary information" in addition to the current
>> morphological fields. Will it be in the language module (apertium-xxx) or
>> in each of the translation modules (apertium-xxx-yyy)? Part of the problem
>> may be in the example. I can't imagine why information on case should be
>> added to every English word (any more than, say, information about
>> possession, which is common in Turkic languages). Will this kind of
>> information, unnecessary for everybody or almost everybody, be found in
>> every language pair using, say, English, if someone wants to add it for his
>> or her specific purposes? As far as I understand, the given project needs
>> the surface form of the word to be added. This seems quite logical.
>> Moreover, this information may be useful for e.g. lexical selection and
>> structural transfer. But anything more than that seems too obscure to me.
>>
>> Best,
>> Hèctor
>>
>> Missatge de Tanmai Khanna  del dia ds., 28 de
>> març 2020 a les 23:51:
>>
>>> Hey guys,
>>> As part of the project to eliminate trimming, I had to come up with a
>>> way to include the surface form in the lexical unit and hence modify the
>>> apertium stream format. To do this I would have to modify the parsers of
>>> every program in the pipeline, and if that has to happen, we discussed on
>>> the IRC that *it might be a good idea to modify the stream in such a
>>> way that we can include an arbitrary amount of information in a lexical
>>> unit, and each program can use whatever information they need.*
>>>
>>> The current information in the lexical unit would be primary
>>> information, and then we would have optional secondary information which
>>> could contain the surface form, but also literally anything you can think
>>> of (case, sentiment, pragmatic info, etc.). This would open up a lot of
>>> possibilities for each program, and it would strengthen the apertium stream
>>> format considerably.
>>>
>>> We discussed several possible syntaxes for this new stream format, and the
>>> one that seems the best is something like this:
>>>
>>> ^potato/patata$
>>>
>>> This doesn't mess with 

Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

2020-03-29 Thread Tanmai Khanna
Hi Hèctor,
A fundamental motivation for this proposal is the possibility of giving the
power to each program to use and propagate as much information as it needs
in the pipeline. In our discussion on the IRC, Tino Didriksen said:

> You should see how much secondary information VISL's streams have. Noun
> semantics, verb frames, dependency, markup tags, etc. Being able to carry
> any information along makes many things possible, often things you can't
> imagine because of current limitations.
>

The example in my original email was, well, just an example, but the idea is
that you can add as much information as you want, in the language modules
or even the translation modules. It's also not just about English words,
but about all languages. This is not to say that we have to add case
information to every English word. It is optional information which can be
added if needed for the translation task.

With this proposal we're trying to prepare the apertium stream for the
future. Today we realised that we need the surface form in the stream, and
tomorrow we might need semantic tags, sentiment tags, etc. *If we don't do
this now, we will have to modify all the parsers in the pipeline each time
we need more information in the pipe.* This is why it's a good idea to
modify the parsers so that they can handle an arbitrary amount of information.
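
To make the parser change a bit more concrete, here is a rough Python sketch
of what "handling an arbitrary amount of information" could look like. It is
only an illustration, not code from any Apertium tool: the function names, the
`sf` key and the example tags are all made up, and the only thing taken from
the thread is the convention that secondary tags contain a ":" while primary
tags do not.

```python
import re
from typing import Dict, List, NamedTuple

# Matches one <...> tag; tag contents are assumed not to contain '<' or '>'.
TAG_RE = re.compile(r"<([^<>]+)>")


class Reading(NamedTuple):
    lemma: str
    primary: List[str]         # ordinary tags, e.g. n, sg
    secondary: Dict[str, str]  # hypothetical key:value tags, e.g. sf:potatoes


def parse_reading(text: str) -> Reading:
    """Split one reading into lemma, primary tags and secondary tags."""
    first_tag = text.find("<")
    lemma = text if first_tag == -1 else text[:first_tag]
    primary: List[str] = []
    secondary: Dict[str, str] = {}
    for tag in TAG_RE.findall(text):
        key, sep, value = tag.partition(":")
        if sep:
            # Assumption: a ':' inside the tag marks it as secondary.
            secondary[key] = value
        else:
            primary.append(tag)
    return Reading(lemma, primary, secondary)


def parse_lexical_unit(lu: str) -> List[Reading]:
    """Parse '^reading/reading$'. Escaped '/' characters are not handled."""
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError("a lexical unit must start with '^' and end with '$'")
    return [parse_reading(part) for part in lu[1:-1].split("/")]


# Hypothetical lexical unit carrying the surface form as a secondary tag.
readings = parse_lexical_unit("^potato<n><sg><sf:potatoes>/patata<n><f><sg>$")
```

A program that doesn't care about secondary information would just read
`primary` and pass everything else along untouched, which is the "programs
will ignore it" behaviour discussed below.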

Lastly, one point we should discuss is the idea that any secondary
information I add in the monodix would be available to everyone who uses
that monodix. There are several things to say about this:

   - As long as the information is correct, I don't really see why
   redundant secondary information should bother anyone. It will be available
   for anyone who wishes to use it for their task, and if you don't want to
   use it the programs will ignore it.
   - Another idea is that secondary information could be put in a separate
   dix; however, this would lead to an unnecessary increase in complexity.

Unless the developers of Apertium feel that redundant information in the
stream will be a huge problem, this will allow each program to access a lot
more information and open up possibilities that we haven't even thought of
yet. At the very least, it will help us to eliminate trimming.

Thanks and Regards,
Tanmai Khanna

On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font 
wrote:

> Hi Tanmai,
>
> I am surprised by this proposal. It involves some very important changes
> that should be better justified. I don't quite understand when one should
> define the "optional secondary information" in addition to the current
> morphological fields. Will it be in the language module (apertium-xxx) or
> in each of the translation modules (apertium-xxx-yyy)? Part of the problem
> may be in the example. I can't imagine why information on case should be
> added to every English word (any more than, say, information about
> possession, which is common in Turkic languages). Will this kind of
> information, unnecessary for everybody or almost everybody, be found in
> every language pair using, say, English, if someone wants to add it for his
> or her specific purposes? As far as I understand, the given project needs
> the surface form of the word to be added. This seems quite logical.
> Moreover, this information may be useful for e.g. lexical selection and
> structural transfer. But anything more than that seems too obscure to me.
>
> Best,
> Hèctor
>
> Missatge de Tanmai Khanna  del dia ds., 28 de
> març 2020 a les 23:51:
>
>> Hey guys,
>> As part of the project to eliminate trimming, I had to come up with a way
>> to include the surface form in the lexical unit and hence modify the
>> apertium stream format. To do this I would have to modify the parsers of
>> every program in the pipeline, and if that has to happen, we discussed on
>> the IRC that *it might be a good idea to modify the stream in such a way
>> that we can include an arbitrary amount of information in a lexical unit,
>> and each program can use whatever information they need.*
>>
>> The current information in the lexical unit would be primary information,
>> and then we would have optional secondary information which could contain
>> the surface form, but also literally anything you can think of (case,
>> sentiment, pragmatic info, etc.). This would open up a lot of possibilities
>> for each program, and it would strengthen the apertium stream format
>> considerably.
>>
>> We discussed several possible syntaxes for this new stream format, and the
>> one that seems the best is something like this:
>>
>> ^potato/patata$
>>
>> This doesn't mess with the current stream format too much. The number of
>> tags is already arbitrary so that helps. The secondary tags contain a ":"
>> that would help distinguish them from primary tags.
>>
>> To implement this, all the parsers would still need to be modified, but
>> the benefits far outweigh the amount of work needed to pull this off.
>>
>> Since this would be a ma