Re: [Apertium-stuff] [PATCH] supervised weighting of automata
Here's my (inline) review of the github PR at https://github.com/apertium/lttoolbox/pull/55 Cheers, Nick -- First patch should add yourself to AUTHORS > From 944ed2556c38f058a5118ab5e481b3412aa3e3d8 Mon Sep 17 00:00:00 2001 > From: Amr Keleg > Date: Sat, 11 May 2019 01:30:26 +0200 > Subject: [PATCH 01/12] Fix the out of alphabet token handling in analyses > generation > > Solves #45 > Consider alphanumeric characters to be part of the vocabulary. > --- > lttoolbox/fst_processor.cc | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/lttoolbox/fst_processor.cc b/lttoolbox/fst_processor.cc > index 2be326a..d3d029a 100644 > --- a/lttoolbox/fst_processor.cc > +++ b/lttoolbox/fst_processor.cc > @@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const > bool > FSTProcessor::isAlphabetic(wchar_t const c) const > { > - return alphabetic_chars.find(c) != alphabetic_chars.end(); > + return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != > alphabetic_chars.end(); > } > > void > -- > 2.21.0 Are we sure that std::iswalnum does what we want? Should it be encoding-dependent? Do we take encoding from the environment? > From be8673d196772a254e540a3fa92a920da35d8731 Mon Sep 17 00:00:00 2001 > From: Amr Keleg > Date: Sun, 19 May 2019 22:39:07 +0200 > Subject: [PATCH 02/12] Add basic structure for the supervised weighting of > automata > > --- > scripts/data/corpus.tagged | 8 > scripts/data/transducer.bin | Bin 0 -> 397 bytes > scripts/lt-weight | 2 ++ > 3 files changed, 10 insertions(+) > create mode 100644 scripts/data/corpus.tagged > create mode 100644 scripts/data/transducer.bin > create mode 100755 scripts/lt-weight > > diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged > new file mode 100644 > index 000..60f60ea > --- /dev/null > +++ b/scripts/data/corpus.tagged > @@ -0,0 +1,8 @@ > +^a/a$ > +^a/a$ > +^b/b$ > +^b/b$ > +^c/d$ > +^c/d$ > +^d/d$ > +^d/d$ What does this file do? These things should be described in the long-form part of the commit message. > diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin > new file mode 100644 > index > ..d5e4da3b1ba47886cc6701faa52711926d8aee23 > GIT binary patch > literal 397 > zcmZXQ*)Bs-6o$L^*%g%v9;EN^?W(CgraHzerIV?xQdGhlc?ZNSE{s83AZBr44C2CW > zab;cnS;<=Y{(mjoar~~|B8A3JH34d zvcxhgtg^;B8*H-0HaqOH$36!fa>Oww+NdQ1|M%o*_aX{X~foCqa?{1{q?Q5tAcL > zhEcMNF;0#NCYd77G^d<#&IOlTam@|4+;PtXk38|r3$MKK&Ig}->2h7ALpq`pdX_n# > z5~QS|W-E+o+L&aF=8RR;XPT7p&HJ#+{tv?*rA)?{vJ4_lT)B#AlO&ZCUbA07QA`>s > p;LKfFqPF7%{{HF`r5+`OC8Z^ll&a)MgaNuOVOhd0r4%K?^$i0xdb > literal 0 > HcmV?d1 Please don't add binary blobs to git history; better to add a text representation along with the necessary rules to the build system to reconstruct the binary. > diff --git a/scripts/lt-weight b/scripts/lt-weight > new file mode 100755 > index 000..a8b7c3e > --- /dev/null > +++ b/scripts/lt-weight > @@ -0,0 +1,2 @@ > +#! /bin/sh > + > -- > 2.21.0 No point in just adding an empty file; just add the basic implementation in the first step > > From 8b11bb4e0fc606c465815a3ec5dd15c13a34da8f Mon Sep 17 00:00:00 2001 > From: Amr Keleg > Date: Mon, 20 May 2019 21:52:17 +0200 > Subject: [PATCH 03/12] lt-weight: Estimate the weights using unigram counts > > --- > scripts/lt-weight | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index a8b7c3e..d1abd6b 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -1,2 +1,7 @@ > #! /bin/sh add help and version output, and copyright/license info > +CORPUS=$2 > +LINES=$(wc -l $CORPUS | cut -d ' ' -f1) > > +cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' | unnecessary use of "cat"; instead use <"$CORPUS" (quoting in case of whitespace in the filename). > +sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' | Combine the multiple sed commands into one execution > +awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "::" $1}' indent for command continuations > -- > 2.21.0 > > From 1d78c1698727de7abcd17e3d92a9d099d2f920f7 Mon Sep 17 00:00:00 2001 > From: Amr Keleg > Date: Tue, 21 May 2019 00:02:04 +0200 > Subject: [PATCH 04/12] lt-weight: Use HFST to weight the input fst long-form commit message? > > --- > scripts/lt-weight | 30 +- > 1 file changed, 29 insertions(+), 1 deletion(-) > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index d1abd6b..092fc66 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -1,7 +1,35 @@ > #! /bin/sh > +FST=$1 > CORPUS=$2 > +OUTPUT_FST=$3 > + > +# Temporary intermediate files > +ATTFST='transducer.att' > +HFST_FST='transducer.hfst' > +WEIGHTED_FST='weighted-pairs.hfst' > +COMPOSED_FST='weighted-transducer.hfst' > +MULTICHAR='multichar_symbols' use a temporary directory for temp files, created using "mktemp" > + > +# Convert the inpu
Re: [Apertium-stuff] Separate Corpus Repos
Apologies for the ugly e-mail; I am traveling. On Thursday, December 12, 2019, Francis Tyers wrote: > El 2019-12-12 09:16, Kevin Brubeck Unhammer escribió: >> >> Tino Didriksen >> čálii: >> >>> I would like for corpus and other indirect data to go in separate >>> repositories. Basically, if the data is not used during the build, it >>> should go elsewhere. >> >> What if it's used during `make test`? >> >> By the same argument, should we remove scripts that are used during >> development, but not required for build (stuff that is kept in the dev/ >> subfolder)? If we get too strict on the requirement of "only things >> necessary for build", people may start just not checking in useful >> scripts, which to me seems worse. And it's already quite annoying having >> to check out three repos just to work on one language pair; if >> development depends on corpora repos, you have not just three, but *six* >> places where you can forget to git push, or where you have to compare >> git logs to review changes. >> >>> We need corpus data under Apertium's control so that we don't rely on 3rd >>> parties. However, bundling this data in the languages' and pairs' repos >>> means that those repos grow unbounded, especially when the data is changed. >> >> I agree that "big" data shouldn't be in the regular repos, since it >> slows down checking them out. But less than a few megabytes of text >> won't make much difference to a repo with tens of MB's of .dix entries. >> >>> It also messes up the changelog. I use a script to generate AUTHORS from >>> the changelog, because nobody keeps that up to date. But this gets muddied >>> when unnecessary data is in the repo. >> >> In general I would want to include annotators as authors, though I can >> imagine situations where it's not clear-cut, e.g. where the dataset is >> too large or is not quite relevant for developing the rest of the repo. >> >> I think having corpus-xxx and corpus-xxx-yyy repos could be a good >> thing, but I don't think we should have a hard requirement of moving >> data over there, especially if the data is useful during testing and >> development. I do think it makes sense to move larger corpora out, for >> faster cloning. >> > > I like the idea of not having large corpora in the git repos for > languages and language pairs. I feel pretty strongly about this: for the time being, git handles failed clones by restarting from the beginning. If you have a poor internet connection, git-clone'ing a massive repo can be nearly impossible, and this disproportionately harms many of the very communities we want to help. > I'm not sure if corpora-xxx in the github is the right way to go though. > > I think it would be better to store them on a web server and either: > > 1) Have apertium-xxx/text that has a script that will download the corpus > from the server and a gitignore to not have it in the repo. > 2) Use something like git-annex (this is bit more involved) git-annex is essentially designed for exactly our use-case. Github and Gitlab natively speak a protocol called "git LFS" which git-annex supports. So I would be highly supportive of moving in that direction. I would be happy to help put together a proposal for what that would look like, but probably not before the end of the month. Potential problems I can see with such a plan are: - git-annex has a heavy build dependency set (Haskell) - git-annex depends on stable hashes of the corpus data - git-annex packages can be out-of-date outside of debian These all have mitigations, which I can describe if there is more interest in a proposal. > > It would be great to e.g. keep updated cleaned versions of Wikipedia dumps, > and also be able to store non-redistributable stuff. About this, I don't understand. Are there corpora we are storing which we can distribute, but others cannot? Surely we are not distributing without a license? Cheers, Nick > > I can expand a bit on this proposal if necessary. > > Fran > > > ___ > Apertium-stuff mailing list > Apertium-stuff@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/apertium-stuff > ___ Apertium-stuff mailing list Apertium-stuff@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/apertium-stuff