Here's my (inline) review of the github PR at https://github.com/apertium/lttoolbox/pull/55
Cheers, Nick -- First patch should add yourself to AUTHORS > From 944ed2556c38f058a5118ab5e481b3412aa3e3d8 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Sat, 11 May 2019 01:30:26 +0200 > Subject: [PATCH 01/12] Fix the out of alphabet token handling in analyses > generation > > Solves #45 > Consider alphanumeric characters to be part of the vocabulary. > --- > lttoolbox/fst_processor.cc | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/lttoolbox/fst_processor.cc b/lttoolbox/fst_processor.cc > index 2be326a..d3d029a 100644 > --- a/lttoolbox/fst_processor.cc > +++ b/lttoolbox/fst_processor.cc > @@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const > bool > FSTProcessor::isAlphabetic(wchar_t const c) const > { > - return alphabetic_chars.find(c) != alphabetic_chars.end(); > + return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != > alphabetic_chars.end(); > } > > void > -- > 2.21.0 Are we sure that std::iswalnum does what we want? Should it be encoding-dependent? Do we take encoding from the environment? > From be8673d196772a254e540a3fa92a920da35d8731 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Sun, 19 May 2019 22:39:07 +0200 > Subject: [PATCH 02/12] Add basic structure for the supervised weighting of > automata > > --- > scripts/data/corpus.tagged | 8 ++++++++ > scripts/data/transducer.bin | Bin 0 -> 397 bytes > scripts/lt-weight | 2 ++ > 3 files changed, 10 insertions(+) > create mode 100644 scripts/data/corpus.tagged > create mode 100644 scripts/data/transducer.bin > create mode 100755 scripts/lt-weight > > diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged > new file mode 100644 > index 0000000..60f60ea > --- /dev/null > +++ b/scripts/data/corpus.tagged > @@ -0,0 +1,8 @@ > +^a/a<n><compound-only-L>$ > +^a/a<pr><compound-only-L>$ > +^b/b<n><compound-R>$ > +^b/b<pr><compound-R>$ > +^c/d<compound-only-L><n>$ > +^c/d<compound-only-L><pr>$ > +^d/d<compound-R><n>$ > +^d/d<compound-R><pr>$ What does this file do? These things should be described in the long-form part of the commit message. > diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin > new file mode 100644 > index > 0000000000000000000000000000000000000000..d5e4da3b1ba47886cc6701faa52711926d8aee23 > GIT binary patch > literal 397 > zcmZXQ*)Bs-6o$L^*%g%v9;EN^?W(CgraHzerIV?xQdGhlc?ZNSE{s83AZBr44C2CW > zab;cnS;<=Y{(mjoar~~|B8A3JH34d<rH*<U2ol1kktUjHp_MRgIJDD2CtY;Y!#oQt > zvcxhgtg^;B8*H-0HaqOH$36!fa>Oww+N&#dQ1|M%o*_aX{X~foCqa?{1{q?Q5tAcL > zhEcMNF;0#NCYd77G^d<#&IOlTam@|4+;PtXk38|r3$MKK&Ig}->2h7ALpq`pdX_n# > z5~QS|W-E+o+L&aF=8RR;XPT7p&HJ#+{tv?*rA)?{vJ4_lT)B#AlO&ZCUbA07QA`>s > p;LKfFqPF7%{{HF`r5+`OC8Z^ll&a)MgaNuOVOhd0r4%K?^$i0xdb<Ds > > literal 0 > HcmV?d00001 Please don't add binary blobs to git history; better to add a text representation along with the necessary rules to the build system to reconstruct the binary. > diff --git a/scripts/lt-weight b/scripts/lt-weight > new file mode 100755 > index 0000000..a8b7c3e > --- /dev/null > +++ b/scripts/lt-weight > @@ -0,0 +1,2 @@ > +#! /bin/sh > + > -- > 2.21.0 No point in just adding an empty file; just add the basic implementation in the first step > > From 8b11bb4e0fc606c465815a3ec5dd15c13a34da8f Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Mon, 20 May 2019 21:52:17 +0200 > Subject: [PATCH 03/12] lt-weight: Estimate the weights using unigram counts > > --- > scripts/lt-weight | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index a8b7c3e..d1abd6b 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -1,2 +1,7 @@ > #! /bin/sh add help and version output, and copyright/license info > +CORPUS=$2 > +LINES=$(wc -l $CORPUS | cut -d ' ' -f1) > > +cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' | unnecessary use of "cat"; instead use <"$CORPUS" (quoting in case of whitespace in the filename). > +sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' | Combine the multiple sed commands into one execution > +awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "::" $1}' indent for command continuations > -- > 2.21.0 > > From 1d78c1698727de7abcd17e3d92a9d099d2f920f7 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Tue, 21 May 2019 00:02:04 +0200 > Subject: [PATCH 04/12] lt-weight: Use HFST to weight the input fst long-form commit message? > > --- > scripts/lt-weight | 30 +++++++++++++++++++++++++++++- > 1 file changed, 29 insertions(+), 1 deletion(-) > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index d1abd6b..092fc66 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -1,7 +1,35 @@ > #! /bin/sh > +FST=$1 > CORPUS=$2 > +OUTPUT_FST=$3 > + > +# Temporary intermediate files > +ATTFST='transducer.att' > +HFST_FST='transducer.hfst' > +WEIGHTED_FST='weighted-pairs.hfst' > +COMPOSED_FST='weighted-transducer.hfst' > +MULTICHAR='multichar_symbols' use a temporary directory for temp files, created using "mktemp" > + > +# Convert the input FST to HFST > +lt-print $FST > $ATTFST > +hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST > + > LINES=$(wc -l $CORPUS | cut -d ' ' -f1) > > +# Prepare the multichar symbols files > +awk -F '[<>]' '{print "<"$(NF-1)">"}' data/corpus.tagged | tr -d , | sort | > uniq > $MULTICHAR > + > +# Generate a weighted FST from the string pairs > cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' | > sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' | > -awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "::" $1}' > +awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "\t" $1}' | > +hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR > + > +# Compose the input FST and the weighted FST > +hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST | hfst-fst2txt > $COMPOSED_FST > + > +# Compile the FST back using lttoolbox > +lt-comp lr $COMPOSED_FST $OUTPUT_FST > + > +# Delete the temporary files > +rm $ATTFST $HFST_FST $WEIGHTED_FST $MULTICHAR $COMPOSED_FST It seems to me that this version and the previous version should be two different scripts: one that reads a corpus and produces weights which can be read by hfst-strings2fst, and one which performs the reweighting of the fst. > -- > 2.21.0 > > From b34b6769980fc2a1cb7d3659eb5c16c3894c8348 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Tue, 21 May 2019 17:59:05 +0200 > Subject: [PATCH 05/12] lt-weight: Update the script to handle non-trivial > corpora/transducers in the long-form message, explain what problems this patch is solving? > > --- > scripts/lt-weight | 34 ++++++++++++++++++++++++---------- > 1 file changed, 24 insertions(+), 10 deletions(-) > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index 092fc66..5420997 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -4,23 +4,31 @@ CORPUS=$2 > OUTPUT_FST=$3 > > # Temporary intermediate files > -ATTFST='transducer.att' > -HFST_FST='transducer.hfst' > -WEIGHTED_FST='weighted-pairs.hfst' > -COMPOSED_FST='weighted-transducer.hfst' > -MULTICHAR='multichar_symbols' > +TEMP_DIR=".tmp" > +mkdir $TEMP_DIR > +ATTFST="$TEMP_DIR/transducer.att" > +HFST_FST="$TEMP_DIR/transducer.hfst" > +WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst" > +COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst" > +MULTICHAR="$TEMP_DIR/multichar-symbols" > +CLEANED_CORPUS="$TEMP_DIR/clean-corpus.tagged" > + > +# Clean the input tagged corpus > +sed -e '/^$/d' $CORPUS > $CLEANED_CORPUS > +sed -i -e :a -e 's/:/__COLON__/;ta' $CLEANED_CORPUS > +sed -i -e :a -e 's/__COLON__/\\:/;ta' $CLEANED_CORPUS cleaning of the corpus should be performed in a separate script > > # Convert the input FST to HFST > -lt-print $FST > $ATTFST > +lt-print $FST | sed -e 's/:/\\:/'| sed -e :a -e 's/ > /\'@_SPACE_@\'/;ta'> $ATTFST > hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST > > -LINES=$(wc -l $CORPUS | cut -d ' ' -f1) > +LINES=$(wc -l $CLEANED_CORPUS | cut -d ' ' -f1) > > # Prepare the multichar symbols files > -awk -F '[<>]' '{print "<"$(NF-1)">"}' data/corpus.tagged | tr -d , | sort | > uniq > $MULTICHAR > +awk -F '[<>]' '{print "<"$(NF-1)">"}' $CLEANED_CORPUS | tr -d , | sort | > uniq > $MULTICHAR > > # Generate a weighted FST from the string pairs > -cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' | > +cat $CLEANED_CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' | > sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' | > awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "\t" $1}' | > hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR > @@ -28,8 +36,14 @@ hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR > # Compose the input FST and the weighted FST > hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST | hfst-fst2txt > $COMPOSED_FST > > +# Clean the composed FST for lt-comp > +# Remove -- at any line and empty lines from the ATT file > +sed -i -e 's/--//' $COMPOSED_FST > +sed -i -e '/^$/d' $COMPOSED_FST > + > # Compile the FST back using lttoolbox > lt-comp lr $COMPOSED_FST $OUTPUT_FST > > # Delete the temporary files > -rm $ATTFST $HFST_FST $WEIGHTED_FST $MULTICHAR $COMPOSED_FST > +rm -rf $TEMP_DIR > + > -- > 2.21.0 > > From 8171aca8e52290e3dac6620aa17d01117486d500 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Fri, 31 May 2019 20:39:57 +0200 > Subject: [PATCH 06/12] Use weighted regexp > > --- > scripts/lt-weight | 53 +++++++++++++++++--------------- > scripts/prepare_regex_strings.py | 41 ++++++++++++++++++++++++ > 2 files changed, 69 insertions(+), 25 deletions(-) > create mode 100644 scripts/prepare_regex_strings.py > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index 5420997..8cfe185 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -3,47 +3,50 @@ FST=$1 > CORPUS=$2 > OUTPUT_FST=$3 > > -# Temporary intermediate files > +# Temporary directory for intermediate files > TEMP_DIR=".tmp" > -mkdir $TEMP_DIR > +# Check if it exists > +if [-d "$TEMP_DIR"]; then > + mkdir $TEMP_DIR > +fi doesn't this only create the directory if it already exists? I don't understand the logic > + > +CLEANED_CORPUS="$TEMP_DIR/clean-corpus.tagged" > + > ATTFST="$TEMP_DIR/transducer.att" > HFST_FST="$TEMP_DIR/transducer.hfst" > + > +WEIGHTED_REGEXP="$TEMP_DIR/weighted-regex" > WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst" > COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst" > -MULTICHAR="$TEMP_DIR/multichar-symbols" > -CLEANED_CORPUS="$TEMP_DIR/clean-corpus.tagged" > - > -# Clean the input tagged corpus > -sed -e '/^$/d' $CORPUS > $CLEANED_CORPUS > -sed -i -e :a -e 's/:/__COLON__/;ta' $CLEANED_CORPUS > -sed -i -e :a -e 's/__COLON__/\\:/;ta' $CLEANED_CORPUS > +COMPOSED_ATTFST="$TEMP_DIR/weighted-transducer.att" > > # Convert the input FST to HFST > -lt-print $FST | sed -e 's/:/\\:/'| sed -e :a -e 's/ > /\'@_SPACE_@\'/;ta'> $ATTFST > -hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST > +lt-print $FST | > +sed -e 's/:/\\:/'| > +sed -e :a -e 's/ /\'@_SPACE_@\'/;ta'> $ATTFST > > -LINES=$(wc -l $CLEANED_CORPUS | cut -d ' ' -f1) > +hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST > > -# Prepare the multichar symbols files > -awk -F '[<>]' '{print "<"$(NF-1)">"}' $CLEANED_CORPUS | tr -d , | sort | > uniq > $MULTICHAR > +# Clean the input tagged corpus > +# REMOVE EMPTY LINES > +sed -e '/^$/d' $CORPUS > $CLEANED_CORPUS > > # Generate a weighted FST from the string pairs > +LINES=$(wc -l $CLEANED_CORPUS | cut -d ' ' -f1) > + > cat $CLEANED_CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' | > -sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' | > -awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "\t" $1}' | > -hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR > +sed -e 's/\$$//' > $WEIGHTED_REGEXP > +python prepare_regex_strings.py $WEIGHTED_REGEXP > +cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST unnecessary "cat" > > # Compose the input FST and the weighted FST > -hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST | hfst-fst2txt > $COMPOSED_FST > - > -# Clean the composed FST for lt-comp > -# Remove -- at any line and empty lines from the ATT file > -sed -i -e 's/--//' $COMPOSED_FST > -sed -i -e '/^$/d' $COMPOSED_FST > +hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST > +hfst-fst2txt -i $COMPOSED_FST -o $COMPOSED_ATTFST > > # Compile the FST back using lttoolbox > -lt-comp lr $COMPOSED_FST $OUTPUT_FST > +lt-comp lr $COMPOSED_ATTFST $OUTPUT_FST > > # Delete the temporary files > -rm -rf $TEMP_DIR > +# rm -rf $TEMP_DIR > > +# ./lt-weight ../../apertium-eng/eng.automorf.bin > ../../apertium-eng/texts/eng.tagged efst.bin don't submit patches with commented-out code :( better to add a "--debug" flag or DEBUG env variable, and lock it behind that > diff --git a/scripts/prepare_regex_strings.py > b/scripts/prepare_regex_strings.py > new file mode 100644 > index 0000000..d2eac86 > --- /dev/null > +++ b/scripts/prepare_regex_strings.py > @@ -0,0 +1,41 @@ > +import re > +import sys > +import numpy as np > +import pandas as pd now we have dependencies on numpy and pandas; maybe it is possible to get away with just vanilla python? > + > +#TODO: HANDLE THE REST OF THE SPECIAL CHARACTERS what does "the rest" mean? > +special_regex_chars = '%,.;!#-—+*:0?[]()~"\'' > + > +def clean_tag_patterns(reg): > + whitesace_free_reg = re.sub(' ', '', reg) ------------^ typo in variable name > + return '%{}%>'.format(whitesace_free_reg[:-1]) > + > +def clean_line(line): > + line = line.strip() > + if line.endswith('$"'): > + # ERROR LINE LIKE ^./.<sent>$" > + line = line[:-2] > + line = re.sub(r'(.)', r'\1 ', line) > + > + for special_char in special_regex_chars: > + line = re.sub('\\{}'.format(special_char), > '%{}'.format(special_char), line) > + > + line = re.sub(r'(\<.*?\>)', lambda m: > + clean_tag_patterns(m.group(0)),line) > + # HANDLE TAGS > + line = line.strip() > + line = '[{}]'.format(line) > + # line = '[?*][{}][?*]'.format(line) > + return line > + > +if __name__ == '__main__': > + FILE_NAME = sys.argv[1] > + with open(FILE_NAME, 'r') as f: > + lines =[clean_line(line) for line in f.readlines() if not > line.startswith('*')] > + > + lines = list(pd.Series(lines).value_counts().reset_index().apply(lambda > r: '{}::{}'.format(r['index'].strip(), -np.log(r[0]/len(lines))), axis=1)) > + > + with open(FILE_NAME, 'w') as f: > + f.write('[?*]::1000000\n') > + for line in lines: > + f.write(line+'\n') > \ No newline at end of file needs "--help", license and copyright info; probably could mark it executable. It's also important to figure out what *exact* dependencies this has; does it work with python2? python3? Does lttoolbox already have a python dependency? If not, better to use something we already depend on. > -- > 2.21.0 > > From 3a746cd0afe16f92ffc9628d71b3e37c95244da0 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Fri, 31 May 2019 21:03:57 +0200 > Subject: [PATCH 07/12] Remove un-necessary data files > > --- > scripts/data/corpus.tagged | 8 -------- > scripts/data/transducer.bin | Bin 397 -> 0 bytes > 2 files changed, 8 deletions(-) > delete mode 100644 scripts/data/corpus.tagged > delete mode 100644 scripts/data/transducer.bin > > diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged > deleted file mode 100644 > index 60f60ea..0000000 > --- a/scripts/data/corpus.tagged > +++ /dev/null > @@ -1,8 +0,0 @@ > -^a/a<n><compound-only-L>$ > -^a/a<pr><compound-only-L>$ > -^b/b<n><compound-R>$ > -^b/b<pr><compound-R>$ > -^c/d<compound-only-L><n>$ > -^c/d<compound-only-L><pr>$ > -^d/d<compound-R><n>$ > -^d/d<compound-R><pr>$ > diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin > deleted file mode 100644 > index > d5e4da3b1ba47886cc6701faa52711926d8aee23..0000000000000000000000000000000000000000 > GIT binary patch > literal 0 > HcmV?d00001 > > literal 397 > zcmZXQ*)Bs-6o$L^*%g%v9;EN^?W(CgraHzerIV?xQdGhlc?ZNSE{s83AZBr44C2CW > zab;cnS;<=Y{(mjoar~~|B8A3JH34d<rH*<U2ol1kktUjHp_MRgIJDD2CtY;Y!#oQt > zvcxhgtg^;B8*H-0HaqOH$36!fa>Oww+N&#dQ1|M%o*_aX{X~foCqa?{1{q?Q5tAcL > zhEcMNF;0#NCYd77G^d<#&IOlTam@|4+;PtXk38|r3$MKK&Ig}->2h7ALpq`pdX_n# > z5~QS|W-E+o+L&aF=8RR;XPT7p&HJ#+{tv?*rA)?{vJ4_lT)B#AlO&ZCUbA07QA`>s > p;LKfFqPF7%{{HF`r5+`OC8Z^ll&a)MgaNuOVOhd0r4%K?^$i0xdb<Ds > OK, when you submit a patch upstream, please clean up the history so that we don't have no-op sequences, like "create data files, remove data files" :) > -- > 2.21.0 > > From d92efe0a07bfb0b7a267f16ccdd26fd3e182ae19 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Tue, 4 Jun 2019 21:17:45 +0200 > Subject: [PATCH 08/12] Fix the repeated analyses bug > > FSTs used to have two paths (weighted and unweighted ones) > for the same analysis. > FST subtraction and disjunction were used to solve the issue. Great that you include a long-form message! But it's not quite detailed enough for me to understand; can you provide some bug report-like data, like "what previously happens" vs "what you want to happen"? And describe how you use FST arithmetic to solve the problem? > --- > scripts/lt-weight | 12 ++++++++---- > scripts/prepare_regex_strings.py | 1 - > 2 files changed, 8 insertions(+), 5 deletions(-) > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index 8cfe185..25f44dd 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -6,7 +6,7 @@ OUTPUT_FST=$3 > # Temporary directory for intermediate files > TEMP_DIR=".tmp" > # Check if it exists > -if [-d "$TEMP_DIR"]; then > +if ! [ -d "$TEMP_DIR" ]; then should fix this bug in the patch it was introduced in > mkdir $TEMP_DIR > fi > > @@ -18,7 +18,9 @@ HFST_FST="$TEMP_DIR/transducer.hfst" > WEIGHTED_REGEXP="$TEMP_DIR/weighted-regex" > WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst" > COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst" > -COMPOSED_ATTFST="$TEMP_DIR/weighted-transducer.att" > +SUBTRACTED_FST="$TEMP_DIR/subtracted-transducer.hfst" > +DISJUNCTED_FST="$TEMP_DIR/disjuncted-weighted-transducer.hfst" > +DISJUNCTED_ATTFST="$TEMP_DIR/weighted-transducer.att" > > # Convert the input FST to HFST > lt-print $FST | > @@ -41,10 +43,12 @@ cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST > > # Compose the input FST and the weighted FST > hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST > -hfst-fst2txt -i $COMPOSED_FST -o $COMPOSED_ATTFST > +hfst-subtract $HFST_FST $COMPOSED_FST -o $SUBTRACTED_FST > +hfst-disjunct $SUBTRACTED_FST $COMPOSED_FST -o $DISJUNCTED_FST > +hfst-fst2txt -i $DISJUNCTED_FST -o $DISJUNCTED_ATTFST > > # Compile the FST back using lttoolbox > -lt-comp lr $COMPOSED_ATTFST $OUTPUT_FST > +lt-comp lr $DISJUNCTED_ATTFST $OUTPUT_FST > > # Delete the temporary files > # rm -rf $TEMP_DIR > diff --git a/scripts/prepare_regex_strings.py > b/scripts/prepare_regex_strings.py > index d2eac86..2f9300a 100644 > --- a/scripts/prepare_regex_strings.py > +++ b/scripts/prepare_regex_strings.py > @@ -36,6 +36,5 @@ if __name__ == '__main__': > lines = list(pd.Series(lines).value_counts().reset_index().apply(lambda > r: '{}::{}'.format(r['index'].strip(), -np.log(r[0]/len(lines))), axis=1)) > > with open(FILE_NAME, 'w') as f: > - f.write('[?*]::1000000\n') you don't mention why this is being removed; can't you just remove it from the earlier patch? > for line in lines: > f.write(line+'\n') > \ No newline at end of file > -- > 2.21.0 > > From 6caa6cd356732dcf3249522e67fe7c03ec2134f9 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Tue, 4 Jun 2019 23:03:01 +0200 > Subject: [PATCH 09/12] Fix the used character to replace a space What was wrong with the old one? Maybe fix it earlier in the patch series? > > --- > scripts/lt-weight | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index 25f44dd..5e7561a 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -25,7 +25,7 @@ DISJUNCTED_ATTFST="$TEMP_DIR/weighted-transducer.att" > # Convert the input FST to HFST > lt-print $FST | > sed -e 's/:/\\:/'| > -sed -e :a -e 's/ /\'@_SPACE_@\'/;ta'> $ATTFST > +sed -e :a -e 's/ /@_SPACE_@/;ta'> $ATTFST > > hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST > > -- > 2.21.0 > > From 96322670d0d957f8b3fae476fe6b255247fe4f56 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Wed, 5 Jun 2019 00:25:40 +0200 > Subject: [PATCH 10/12] Minimize the FST to prevent segmenation faults in > lt-comp why is lt-comp segfaulting? that shouldn't happen; can you describe the problem / link a bug for lt-comp? > --- > scripts/lt-weight | 8 +++++--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index 5e7561a..0fa104c 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -20,7 +20,8 @@ WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst" > COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst" > SUBTRACTED_FST="$TEMP_DIR/subtracted-transducer.hfst" > DISJUNCTED_FST="$TEMP_DIR/disjuncted-weighted-transducer.hfst" > -DISJUNCTED_ATTFST="$TEMP_DIR/weighted-transducer.att" > +MINIMIZED_FST="$TEMP_DIR/minimized-weighted-transducer.hfst" > +MINIMIZED_ATTFST="$TEMP_DIR/weighted-transducer.att" > > # Convert the input FST to HFST > lt-print $FST | > @@ -45,12 +46,13 @@ cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST > hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST > hfst-subtract $HFST_FST $COMPOSED_FST -o $SUBTRACTED_FST > hfst-disjunct $SUBTRACTED_FST $COMPOSED_FST -o $DISJUNCTED_FST > -hfst-fst2txt -i $DISJUNCTED_FST -o $DISJUNCTED_ATTFST > +hfst-minimize $DISJUNCTED_FST -o $MINIMIZED_FST > +hfst-fst2txt -i $MINIMIZED_FST -o $MINIMIZED_ATTFST > > # Compile the FST back using lttoolbox > lt-comp lr $DISJUNCTED_ATTFST $OUTPUT_FST > > # Delete the temporary files > -# rm -rf $TEMP_DIR > +rm -rf $TEMP_DIR > > # ./lt-weight ../../apertium-eng/eng.automorf.bin > ../../apertium-eng/texts/eng.tagged efst.bin > -- > 2.21.0 > > From b638ab547c9af84d390bcf66365ded9607f775e0 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Thu, 6 Jun 2019 22:33:17 +0200 > Subject: [PATCH 11/12] Fix the transducer file name bug > earlier patch should be fixed > --- > scripts/lt-weight | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index 0fa104c..dd4ab69 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -50,7 +50,7 @@ hfst-minimize $DISJUNCTED_FST -o $MINIMIZED_FST > hfst-fst2txt -i $MINIMIZED_FST -o $MINIMIZED_ATTFST > > # Compile the FST back using lttoolbox > -lt-comp lr $DISJUNCTED_ATTFST $OUTPUT_FST > +lt-comp lr $MINIMIZED_ATTFST $OUTPUT_FST > > # Delete the temporary files > rm -rf $TEMP_DIR > -- > 2.21.0 > > From b621ab62c784a09a5dd8721e75b7896168679244 Mon Sep 17 00:00:00 2001 > From: Amr Keleg <amr_moha...@live.com> > Date: Thu, 13 Jun 2019 09:35:04 +0200 > Subject: [PATCH 12/12] Add a large weight to unweighted final states provide some explanation for this? > > --- > scripts/lt-weight | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/scripts/lt-weight b/scripts/lt-weight > index dd4ab69..0d08cea 100755 > --- a/scripts/lt-weight > +++ b/scripts/lt-weight > @@ -19,6 +19,7 @@ WEIGHTED_REGEXP="$TEMP_DIR/weighted-regex" > WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst" > COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst" > SUBTRACTED_FST="$TEMP_DIR/subtracted-transducer.hfst" > +DEFAULT_WEIGHTED_FST="$TEMP_DIR/default-weighted-transducer.hfst" > DISJUNCTED_FST="$TEMP_DIR/disjuncted-weighted-transducer.hfst" > MINIMIZED_FST="$TEMP_DIR/minimized-weighted-transducer.hfst" > MINIMIZED_ATTFST="$TEMP_DIR/weighted-transducer.att" > @@ -45,12 +46,13 @@ cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST > # Compose the input FST and the weighted FST > hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST > hfst-subtract $HFST_FST $COMPOSED_FST -o $SUBTRACTED_FST > -hfst-disjunct $SUBTRACTED_FST $COMPOSED_FST -o $DISJUNCTED_FST > +hfst-reweight -i $SUBTRACTED_FST -o $DEFAULT_WEIGHTED_FST -e -a 1000000000 > +hfst-disjunct $DEFAULT_WEIGHTED_FST $COMPOSED_FST -o $DISJUNCTED_FST > hfst-minimize $DISJUNCTED_FST -o $MINIMIZED_FST > hfst-fst2txt -i $MINIMIZED_FST -o $MINIMIZED_ATTFST > > # Compile the FST back using lttoolbox > -lt-comp lr $MINIMIZED_ATTFST $OUTPUT_FST > +../lttoolbox/lt-comp lr $MINIMIZED_ATTFST $OUTPUT_FST > > # Delete the temporary files > rm -rf $TEMP_DIR > -- > 2.21.0 > _______________________________________________ Apertium-stuff mailing list Apertium-stuff@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/apertium-stuff