Re: [Apertium-stuff] [PATCH] supervised weighting of automata

2019-06-13 Thread Nick Howell
Here's my (inline) review of the github PR at
https://github.com/apertium/lttoolbox/pull/55

Cheers,
Nick

--

First patch should add yourself to AUTHORS

> From 944ed2556c38f058a5118ab5e481b3412aa3e3d8 Mon Sep 17 00:00:00 2001

> From: Amr Keleg 
> Date: Sat, 11 May 2019 01:30:26 +0200
> Subject: [PATCH 01/12] Fix the out of alphabet token handling in analyses
>  generation
> 
> Solves #45
> Consider alphanumeric characters to be part of the vocabulary.
> ---
>  lttoolbox/fst_processor.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lttoolbox/fst_processor.cc b/lttoolbox/fst_processor.cc

> index 2be326a..d3d029a 100644
> --- a/lttoolbox/fst_processor.cc
> +++ b/lttoolbox/fst_processor.cc
> @@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const
>  bool
>  FSTProcessor::isAlphabetic(wchar_t const c) const
>  {
> -  return alphabetic_chars.find(c) != alphabetic_chars.end();
> +  return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != 
> alphabetic_chars.end();
>  }
>  
>  void
> -- 
> 2.21.0

Are we sure that std::iswalnum does what we want? Should it be
encoding-dependent? Do we take encoding from the environment?

> From be8673d196772a254e540a3fa92a920da35d8731 Mon Sep 17 00:00:00 2001

> From: Amr Keleg 
> Date: Sun, 19 May 2019 22:39:07 +0200
> Subject: [PATCH 02/12] Add basic structure for the supervised weighting of
>  automata
> 
> ---
>  scripts/data/corpus.tagged  |   8 
>  scripts/data/transducer.bin | Bin 0 -> 397 bytes
>  scripts/lt-weight   |   2 ++
>  3 files changed, 10 insertions(+)
>  create mode 100644 scripts/data/corpus.tagged
>  create mode 100644 scripts/data/transducer.bin
>  create mode 100755 scripts/lt-weight
> 
> diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged

> new file mode 100644
> index 000..60f60ea
> --- /dev/null
> +++ b/scripts/data/corpus.tagged
> @@ -0,0 +1,8 @@
> +^a/a$
> +^a/a$
> +^b/b$
> +^b/b$
> +^c/d$ 
> +^c/d$
> +^d/d$
> +^d/d$

What does this file do? These things should be described in the
long-form part of the commit message.

> diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin

> new file mode 100644
> index 
> ..d5e4da3b1ba47886cc6701faa52711926d8aee23
> GIT binary patch
> literal 397
> zcmZXQ*)Bs-6o$L^*%g%v9;EN^?W(CgraHzerIV?xQdGhlc?ZNSE{s83AZBr44C2CW
> zab;cnS;<=Y{(mjoar~~|B8A3JH34d zvcxhgtg^;B8*H-0HaqOH$36!fa>Oww+N&#dQ1|M%o*_aX{X~foCqa?{1{q?Q5tAcL
> zhEcMNF;0#NCYd77G^d<#&IOlTam@|4+;PtXk38|r3$MKK&Ig}->2h7ALpq`pdX_n#
> z5~QS|W-E+o+L&aF=8RR;XPT7p&HJ#+{tv?*rA)?{vJ4_lT)B#AlO&ZCUbA07QA`>s

> p;LKfFqPF7%{{HF`r5+`OC8Z^ll&a)MgaNuOVOhd0r4%K?^$i0xdb 
> literal 0
> HcmV?d1

Please don't add binary blobs to git history; better to add a text
representation along with the necessary rules to the build system to
reconstruct the binary.

> diff --git a/scripts/lt-weight b/scripts/lt-weight
> new file mode 100755
> index 000..a8b7c3e
> --- /dev/null
> +++ b/scripts/lt-weight
> @@ -0,0 +1,2 @@
> +#! /bin/sh
> +
> -- 
> 2.21.0

No point in just adding an empty file; just add the basic
implementation in the first step

> 
> From 8b11bb4e0fc606c465815a3ec5dd15c13a34da8f Mon Sep 17 00:00:00 2001

> From: Amr Keleg 
> Date: Mon, 20 May 2019 21:52:17 +0200
> Subject: [PATCH 03/12] lt-weight: Estimate the weights using unigram counts
> 
> ---
>  scripts/lt-weight | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index a8b7c3e..d1abd6b 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -1,2 +1,7 @@
>  #! /bin/sh

add help and version output, and copyright/license info

> +CORPUS=$2
> +LINES=$(wc -l $CORPUS | cut -d ' ' -f1)
>  
> +cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |

unnecessary use of "cat"; instead use <"$CORPUS" (quoting in case of
whitespace in the filename).

> +sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' |

Combine the multiple sed commands into one execution

> +awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "::" $1}'

indent for command continuations


> -- 
> 2.21.0
> 
> From 1d78c1698727de7abcd17e3d92a9d099d2f920f7 Mon Sep 17 00:00:00 2001

> From: Amr Keleg 
> Date: Tue, 21 May 2019 00:02:04 +0200
> Subject: [PATCH 04/12] lt-weight: Use HFST to weight the input fst

long-form commit message?

> 
> ---
>  scripts/lt-weight | 30 +-
>  1 file changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index d1abd6b..092fc66 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -1,7 +1,35 @@
>  #! /bin/sh
> +FST=$1
>  CORPUS=$2
> +OUTPUT_FST=$3
> +
> +# Temporary intermediate files
> +ATTFST='transducer.att'
> +HFST_FST='transducer.hfst'
> +WEIGHTED_FST='weighted-pairs.hfst'
> +COMPOSED_FST='weighted-transducer.hfst'
> +MULTICHAR='multichar_symbols'

use a temporary directory for temp files, created using "mktemp"

> +
> +# Convert the inpu

Re: [Apertium-stuff] Separate Corpus Repos

2019-12-15 Thread Nick Howell
Apologies for the ugly e-mail; I am traveling.

On Thursday, December 12, 2019, Francis Tyers  wrote:
> El 2019-12-12 09:16, Kevin Brubeck Unhammer escribió:
>>
>> Tino Didriksen 
>> čálii:
>>
>>> I would like for corpus and other indirect data to go in separate
>>> repositories. Basically, if the data is not used during the build, it
>>> should go elsewhere.
>>
>> What if it's used during `make test`?
>>
>> By the same argument, should we remove scripts that are used during
>> development, but not required for build (stuff that is kept in the dev/
>> subfolder)? If we get too strict on the requirement of "only things
>> necessary for build", people may start just not checking in useful
>> scripts, which to me seems worse. And it's already quite annoying having
>> to check out three repos just to work on one language pair; if
>> development depends on corpora repos, you have not just three, but *six*
>> places where you can forget to git push, or where you have to compare
>> git logs to review changes.
>>
>>> We need corpus data under Apertium's control so that we don't rely on
3rd
>>> parties. However, bundling this data in the languages' and pairs' repos
>>> means that those repos grow unbounded, especially when the data is
changed.
>>
>> I agree that "big" data shouldn't be in the regular repos, since it
>> slows down checking them out. But less than a few megabytes of text
>> won't make much difference to a repo with tens of MB's of .dix entries.
>>
>>> It also messes up the changelog. I use a script to generate AUTHORS from
>>> the changelog, because nobody keeps that up to date. But this gets
muddied
>>> when unnecessary data is in the repo.
>>
>> In general I would want to include annotators as authors, though I can
>> imagine situations where it's not clear-cut, e.g. where the dataset is
>> too large or is not quite relevant for developing the rest of the repo.
>>
>> I think having corpus-xxx and corpus-xxx-yyy repos could be a good
>> thing, but I don't think we should have a hard requirement of moving
>> data over there, especially if the data is useful during testing and
>> development. I do think it makes sense to move larger corpora out, for
>> faster cloning.
>>
>
> I like the idea of not having large corpora in the git repos for
> languages and language pairs.

I feel pretty strongly about this: for the time being, git handles failed
clones by restarting from the beginning. If you have a poor internet
connection, git-clone'ing a massive repo can be nearly impossible, and this
disproportionately harms many of the very communities we want to help.

> I'm not sure if corpora-xxx in the github is the right way to go though.
>
> I think it would be better to store them on a web server and either:
>
> 1) Have apertium-xxx/text that has a script that will download the corpus
> from the server and a gitignore to not have it in the repo.
> 2) Use something like git-annex (this is bit more involved)

git-annex is essentially designed for exactly our use-case. Github and
Gitlab natively speak a protocol called "git LFS" which git-annex supports.
So I would be highly supportive of moving in that direction.

I would be happy to help put together a proposal for what that would look
like, but probably not before the end of the month. Potential problems I
can see with such a plan are:
- git-annex has a heavy build dependency set (Haskell)
- git-annex depends on stable hashes of the corpus data
- git-annex packages can be out-of-date outside of debian

These all have mitigations, which I can describe if there is more interest
in a proposal.

>
> It would be great to e.g. keep updated cleaned versions of Wikipedia
dumps,

> and also be able to store non-redistributable stuff.

About this, I don't understand. Are there corpora we are storing which we
can distribute, but others cannot? Surely we are not distributing without a
license?

Cheers,
Nick

>
> I can expand a bit on this proposal if necessary.
>
> Fran
>
>
> ___
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff