Re: [Apertium-stuff] [PATCH] supervised weighting of automata

2019-06-18 Thread nlhowell
OK, tried to rewrite this to be (a bit more) readable; sorry if I
over-trimmed somewhere, it was hard keeping track of what the patch
context was. Next time I'll send out in message-per-patch format. (Or
maybe Amr can do it...? `man git-send-mail`)

(I think your mail client *double spaced* the reply! Wow!)

Cheers,
Nick

Amr Mohamed Hosny Anwar wrote:
> Nick Howell  wrote:
> > 
> > Here's my (inline) review of the github PR at
> > 
> > https://github.com/apertium/lttoolbox/pull/55
> > 
> > Cheers,
> > 
> > Nick
> > 
> > --
> > > From 944ed2556c38f058a5118ab5e481b3412aa3e3d8 Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg 
> > > Date: Sat, 11 May 2019 01:30:26 +0200
> > > Subject: [PATCH 01/12] Fix the out of alphabet token handling in analyses
> > >  generation
> > >
> > > Solves #45
> > > Consider alphanumeric characters to be part of the vocabulary.
> > > ---
> > >  lttoolbox/fst_processor.cc | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/lttoolbox/fst_processor.cc b/lttoolbox/fst_processor.cc
> > > index 2be326a..d3d029a 100644
> > > --- a/lttoolbox/fst_processor.cc
> > > +++ b/lttoolbox/fst_processor.cc
> > > @@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const

> > >  bool
> > >  FSTProcessor::isAlphabetic(wchar_t const c) const
> > >  {
> > > -  return alphabetic_chars.find(c) != alphabetic_chars.end();
> > > +  return (bool)std::iswalnum(c) || alphabetic_chars.find(c) !=

> > >alphabetic_chars.end();
> > >  }
> > >
> > >  void
> > > --
> > > 2.21.0
> > 
> > Are we sure that std::iswalnum does what we want? Should it be
> > encoding-dependent? Do we take encoding from the environment?
> >
> This commit isn't part of the PR. A separate issue was filed and can

> be found through: https://github.com/apertium/lttoolbox/issues/45 [ht-

> tps://avatars0.githubusercontent.com/u/4594256?s=400=4] hub.com/apertium/lttoolbox/issues/45> Strange output when transducer

> is compiled from a .att file · Issue #45 ·
> apertium/lttoolbox
> github.com I am trying to add weights to the morphological analyser.

> So while I was checking last year's project (http://wiki.apertium.org-

> /wiki/User:Techievena/GSoC_2018_Work_Product_Submission) I noticed...


OK

> > > From be8673d196772a254e540a3fa92a920da35d8731 Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg 
> > > Date: Sun, 19 May 2019 22:39:07 +0200
> > > Subject: [PATCH 02/12] Add basic structure for the supervised weighting of
> > >  automata
> > >
> > > ---
> > > diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged
> > 
> > What does this file do? These things should be described in the
> > long-form part of the commit message.
> > 
> I deleted these files in a later commit.

OK; can be fixed with rebase then. 

> > > diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin
> > 
> > Please don't add binary blobs to git history; better to add a text

> > representation along with the necessary rules to the build system to

> > reconstruct the binary.
> > 
> - I deleted these files in a later commit.

OK, can be fixed in rebase

> > > diff --git a/scripts/lt-weight b/scripts/lt-weight
> > > new file mode 100755
> > > index 000..a8b7c3e
> > > --- /dev/null
> > > +++ b/scripts/lt-weight
> > > @@ -0,0 +1,2 @@
> > > +#! /bin/sh
> > > +
> > > --
> > > 2.21.0
> > 
> > No point in just adding an empty file; just add the basic
> > implementation in the first step
>
> I thought of using commits to reflect the progress and squashing these

> commits later on merging.

As we discussed, pull requests should be "fully formed" so that
reviewers don't spend time catching bugs that are fixed later in the
series; rebase first so that your PR consists of a sequence of
correct and well-explained patches.

> > > From 8b11bb4e0fc606c465815a3ec5dd15c13a34da8f Mon Sep 17 00:00:00 2001
> > > From: Amr Keleg 
> > > Date: Mon, 20 May 2019 21:52:17 +0200
> > > Subject: [PATCH 03/12] lt-weight: Estimate the weights using unigram 
> > > counts
> > >
> > > ---
> > >  scripts/lt-weight | 5 +
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/scripts/lt-weight b/scripts/lt-weight
> > > index a8b7c3e..d1abd6b 100755
> > > --- a/scripts/lt-weight
> > > +++ b/scripts/lt-weight
> > > @@ -1,2 +1,7 @@
> > >  #! /bin/sh
> > 
> > add help and version output, and copyright/license info
> - Will we end-up using a shell script or we will need to implement a CC file?

That is something we can decide later. If the shell script is correct
(and portable) and not a resource hog, it is probably fine to leave
it.

> I believe we need to decide on the script's language first.

Nothing can be accepted into an upstream repository without
copyright/license information; otherwise the repository becomes
unredistributable. (Default is "all rights reserved", meaning no
copying.)

Help and version output are required so that others can understand how

to 

Re: [Apertium-stuff] [PATCH] supervised weighting of automata

2019-06-15 Thread Amr Mohamed Hosny Anwar


Best Regards,
Amr



From: Nick Howell 
Sent: Thursday, June 13, 2019 3:24 PM
To: [apertium-stuff]
Cc: Amr Mohamed Hosny Anwar; Nick Howell; Tommi A Pirinen; Francis Tyers
Subject: Re: [PATCH] supervised weighting of automata

Here's my (inline) review of the github PR at

https://github.com/apertium/lttoolbox/pull/55



Cheers,

Nick



--



First patch should add yourself to AUTHORS
- Done.


> From 944ed2556c38f058a5118ab5e481b3412aa3e3d8 Mon Sep 17 00:00:00 2001


> From: Amr Keleg 

> Date: Sat, 11 May 2019 01:30:26 +0200

> Subject: [PATCH 01/12] Fix the out of alphabet token handling in analyses

>  generation

>

> Solves #45

> Consider alphanumeric characters to be part of the vocabulary.

> ---

>  lttoolbox/fst_processor.cc | 2 +-

>  1 file changed, 1 insertion(+), 1 deletion(-)

>

> diff --git a/lttoolbox/fst_processor.cc b/lttoolbox/fst_processor.cc


> index 2be326a..d3d029a 100644

> --- a/lttoolbox/fst_processor.cc

> +++ b/lttoolbox/fst_processor.cc

> @@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const

>  bool

>  FSTProcessor::isAlphabetic(wchar_t const c) const

>  {

> -  return alphabetic_chars.find(c) != alphabetic_chars.end();

> +  return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != 
> alphabetic_chars.end();

>  }

>

>  void

> --

> 2.21.0



Are we sure that std::iswalnum does what we want? Should it be

encoding-dependent? Do we take encoding from the environment?
- This commit isn't part of the PR.
A separate issue was filed and can be found through: 
https://github.com/apertium/lttoolbox/issues/45
[https://avatars0.githubusercontent.com/u/4594256?s=400=4]
Strange output when transducer is compiled from a .att file · Issue #45 · 
apertium/lttoolbox
github.com
I am trying to add weights to the morphological analyser. So while I was 
checking last year's project 
(http://wiki.apertium.org/wiki/User:Techievena/GSoC_2018_Work_Product_Submission)
 I noticed...




> From be8673d196772a254e540a3fa92a920da35d8731 Mon Sep 17 00:00:00 2001


> From: Amr Keleg 

> Date: Sun, 19 May 2019 22:39:07 +0200

> Subject: [PATCH 02/12] Add basic structure for the supervised weighting of

>  automata

>

> ---

>  scripts/data/corpus.tagged  |   8 

>  scripts/data/transducer.bin | Bin 0 -> 397 bytes

>  scripts/lt-weight   |   2 ++

>  3 files changed, 10 insertions(+)

>  create mode 100644 scripts/data/corpus.tagged

>  create mode 100644 scripts/data/transducer.bin

>  create mode 100755 scripts/lt-weight

>

> diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged


> new file mode 100644

> index 000..60f60ea

> --- /dev/null

> +++ b/scripts/data/corpus.tagged

> @@ -0,0 +1,8 @@

> +^a/a$

> +^a/a$

> +^b/b$

> +^b/b$

> +^c/d$

> +^c/d$

> +^d/d$

> +^d/d$



What does this file do? These things should be described in the

long-form part of the commit message.

- I deleted these files in a later commit.

> diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin


> new file mode 100644

> index 
> ..d5e4da3b1ba47886cc6701faa52711926d8aee23

> GIT binary patch

> literal 397

> zcmZXQ*)Bs-6o$L^*%g%v9;EN^?W(CgraHzerIV?xQdGhlc?ZNSE{s83AZBr44C2CW

> zab;cnS;<=Y{(mjoar~~|B8A3JH34d 

Re: [Apertium-stuff] [PATCH] supervised weighting of automata

2019-06-13 Thread Kevin Brubeck Unhammer
Nick Howell  čálii:

>> +cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |
>
> unnecessary use of "cat"; instead use <"$CORPUS" (quoting in case of
> whitespace in the filename).

https://www.shellcheck.net/ would've told you that – please everybody
run shellcheck on any shell scripts you commit :)

You can get inline suggestions in Atom, VSCode, Emacs, Vim, Sublime,
Geany and probably other editors:
https://github.com/koalaman/shellcheck#user-content-in-your-editor


signature.asc
Description: PGP signature
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] [PATCH] supervised weighting of automata

2019-06-13 Thread Nick Howell
Here's my (inline) review of the github PR at
https://github.com/apertium/lttoolbox/pull/55

Cheers,
Nick

--

First patch should add yourself to AUTHORS

> From 944ed2556c38f058a5118ab5e481b3412aa3e3d8 Mon Sep 17 00:00:00 2001

> From: Amr Keleg 
> Date: Sat, 11 May 2019 01:30:26 +0200
> Subject: [PATCH 01/12] Fix the out of alphabet token handling in analyses
>  generation
> 
> Solves #45
> Consider alphanumeric characters to be part of the vocabulary.
> ---
>  lttoolbox/fst_processor.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lttoolbox/fst_processor.cc b/lttoolbox/fst_processor.cc

> index 2be326a..d3d029a 100644
> --- a/lttoolbox/fst_processor.cc
> +++ b/lttoolbox/fst_processor.cc
> @@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const
>  bool
>  FSTProcessor::isAlphabetic(wchar_t const c) const
>  {
> -  return alphabetic_chars.find(c) != alphabetic_chars.end();
> +  return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != 
> alphabetic_chars.end();
>  }
>  
>  void
> -- 
> 2.21.0

Are we sure that std::iswalnum does what we want? Should it be
encoding-dependent? Do we take encoding from the environment?

> From be8673d196772a254e540a3fa92a920da35d8731 Mon Sep 17 00:00:00 2001

> From: Amr Keleg 
> Date: Sun, 19 May 2019 22:39:07 +0200
> Subject: [PATCH 02/12] Add basic structure for the supervised weighting of
>  automata
> 
> ---
>  scripts/data/corpus.tagged  |   8 
>  scripts/data/transducer.bin | Bin 0 -> 397 bytes
>  scripts/lt-weight   |   2 ++
>  3 files changed, 10 insertions(+)
>  create mode 100644 scripts/data/corpus.tagged
>  create mode 100644 scripts/data/transducer.bin
>  create mode 100755 scripts/lt-weight
> 
> diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged

> new file mode 100644
> index 000..60f60ea
> --- /dev/null
> +++ b/scripts/data/corpus.tagged
> @@ -0,0 +1,8 @@
> +^a/a$
> +^a/a$
> +^b/b$
> +^b/b$
> +^c/d$ 
> +^c/d$
> +^d/d$
> +^d/d$

What does this file do? These things should be described in the
long-form part of the commit message.

> diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin

> new file mode 100644
> index 
> ..d5e4da3b1ba47886cc6701faa52711926d8aee23
> GIT binary patch
> literal 397
> zcmZXQ*)Bs-6o$L^*%g%v9;EN^?W(CgraHzerIV?xQdGhlc?ZNSE{s83AZBr44C2CW
> zab;cnS;<=Y{(mjoar~~|B8A3JH34d