Re: [Apertium-stuff] [PATCH] supervised weighting of automata

Nick Howell Thu, 13 Jun 2019 06:25:57 -0700

Here's my (inline) review of the github PR at
https://github.com/apertium/lttoolbox/pull/55


Cheers,
Nick

--

First patch should add yourself to AUTHORS

> From 944ed2556c38f058a5118ab5e481b3412aa3e3d8 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Sat, 11 May 2019 01:30:26 +0200
> Subject: [PATCH 01/12] Fix the out of alphabet token handling in analyses
>  generation
> 
> Solves #45
> Consider alphanumeric characters to be part of the vocabulary.
> ---
>  lttoolbox/fst_processor.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lttoolbox/fst_processor.cc b/lttoolbox/fst_processor.cc

> index 2be326a..d3d029a 100644
> --- a/lttoolbox/fst_processor.cc
> +++ b/lttoolbox/fst_processor.cc
> @@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const
>  bool
>  FSTProcessor::isAlphabetic(wchar_t const c) const
>  {
> -  return alphabetic_chars.find(c) != alphabetic_chars.end();
> +  return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != 
> alphabetic_chars.end();
>  }
>  
>  void
> -- 
> 2.21.0

Are we sure that std::iswalnum does what we want? Should it be
encoding-dependent? Do we take encoding from the environment?

> From be8673d196772a254e540a3fa92a920da35d8731 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Sun, 19 May 2019 22:39:07 +0200
> Subject: [PATCH 02/12] Add basic structure for the supervised weighting of
>  automata
> 
> ---
>  scripts/data/corpus.tagged  |   8 ++++++++
>  scripts/data/transducer.bin | Bin 0 -> 397 bytes
>  scripts/lt-weight           |   2 ++
>  3 files changed, 10 insertions(+)
>  create mode 100644 scripts/data/corpus.tagged
>  create mode 100644 scripts/data/transducer.bin
>  create mode 100755 scripts/lt-weight
> 
> diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged

> new file mode 100644
> index 0000000..60f60ea
> --- /dev/null
> +++ b/scripts/data/corpus.tagged
> @@ -0,0 +1,8 @@
> +^a/a<n><compound-only-L>$
> +^a/a<pr><compound-only-L>$
> +^b/b<n><compound-R>$
> +^b/b<pr><compound-R>$
> +^c/d<compound-only-L><n>$ 
> +^c/d<compound-only-L><pr>$
> +^d/d<compound-R><n>$
> +^d/d<compound-R><pr>$

What does this file do? These things should be described in the
long-form part of the commit message.

> diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin

> new file mode 100644
> index 
> 0000000000000000000000000000000000000000..d5e4da3b1ba47886cc6701faa52711926d8aee23
> GIT binary patch
> literal 397
> zcmZXQ*)Bs-6o$L^*%g%v9;EN^?W(CgraHzerIV?xQdGhlc?ZNSE{s83AZBr44C2CW
> zab;cnS;<=Y{(mjoar~~|B8A3JH34d<rH*<U2ol1kktUjHp_MRgIJDD2CtY;Y!#oQt

> zvcxhgtg^;B8*H-0HaqOH$36!fa>Oww+N&#dQ1|M%o*_aX{X~foCqa?{1{q?Q5tAcL
> zhEcMNF;0#NCYd77G^d<#&IOlTam@|4+;PtXk38|r3$MKK&Ig}->2h7ALpq`pdX_n#
> z5~QS|W-E+o+L&aF=8RR;XPT7p&HJ#+{tv?*rA)?{vJ4_lT)B#AlO&ZCUbA07QA`>s

> p;LKfFqPF7%{{HF`r5+`OC8Z^ll&a)MgaNuOVOhd0r4%K?^$i0xdb<Ds
> 
> literal 0
> HcmV?d00001

Please don't add binary blobs to git history; better to add a text
representation along with the necessary rules to the build system to
reconstruct the binary.

> diff --git a/scripts/lt-weight b/scripts/lt-weight
> new file mode 100755
> index 0000000..a8b7c3e
> --- /dev/null
> +++ b/scripts/lt-weight
> @@ -0,0 +1,2 @@
> +#! /bin/sh
> +
> -- 
> 2.21.0

No point in just adding an empty file; just add the basic
implementation in the first step

> 
> From 8b11bb4e0fc606c465815a3ec5dd15c13a34da8f Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Mon, 20 May 2019 21:52:17 +0200
> Subject: [PATCH 03/12] lt-weight: Estimate the weights using unigram counts
> 
> ---
>  scripts/lt-weight | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index a8b7c3e..d1abd6b 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -1,2 +1,7 @@
>  #! /bin/sh

add help and version output, and copyright/license info

> +CORPUS=$2
> +LINES=$(wc -l $CORPUS | cut -d ' ' -f1)
>  
> +cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |

unnecessary use of "cat"; instead use <"$CORPUS" (quoting in case of
whitespace in the filename).

> +sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' |

Combine the multiple sed commands into one execution

> +awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "::" $1}'

indent for command continuations


> -- 
> 2.21.0
> 
> From 1d78c1698727de7abcd17e3d92a9d099d2f920f7 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Tue, 21 May 2019 00:02:04 +0200
> Subject: [PATCH 04/12] lt-weight: Use HFST to weight the input fst

long-form commit message?

> 
> ---
>  scripts/lt-weight | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index d1abd6b..092fc66 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -1,7 +1,35 @@
>  #! /bin/sh
> +FST=$1
>  CORPUS=$2
> +OUTPUT_FST=$3
> +
> +# Temporary intermediate files
> +ATTFST='transducer.att'
> +HFST_FST='transducer.hfst'
> +WEIGHTED_FST='weighted-pairs.hfst'
> +COMPOSED_FST='weighted-transducer.hfst'
> +MULTICHAR='multichar_symbols'

use a temporary directory for temp files, created using "mktemp"

> +
> +# Convert the input FST to HFST
> +lt-print $FST > $ATTFST
> +hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST
> +
>  LINES=$(wc -l $CORPUS | cut -d ' ' -f1)
>  
> +# Prepare the multichar symbols files
> +awk -F '[<>]' '{print "<"$(NF-1)">"}' data/corpus.tagged | tr -d , | sort | 
> uniq > $MULTICHAR
> +
> +# Generate a weighted FST from the string pairs
>  cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |
>  sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' |
> -awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "::" $1}'
> +awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "\t" $1}' |
> +hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR
> +
> +# Compose the input FST and the weighted FST
> +hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST | hfst-fst2txt > $COMPOSED_FST
> +
> +# Compile the FST back using lttoolbox
> +lt-comp lr $COMPOSED_FST $OUTPUT_FST
> +
> +# Delete the temporary files
> +rm $ATTFST $HFST_FST $WEIGHTED_FST $MULTICHAR $COMPOSED_FST

It seems to me that this version and the previous version should be
two different scripts: one that reads a corpus and produces weights
which can be read by hfst-strings2fst, and one which performs the
reweighting of the fst.

> -- 
> 2.21.0
> 
> From b34b6769980fc2a1cb7d3659eb5c16c3894c8348 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Tue, 21 May 2019 17:59:05 +0200
> Subject: [PATCH 05/12] lt-weight: Update the script to handle non-trivial
>  corpora/transducers

in the long-form message, explain what problems this patch is solving?


> 
> ---
>  scripts/lt-weight | 34 ++++++++++++++++++++++++----------
>  1 file changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index 092fc66..5420997 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -4,23 +4,31 @@ CORPUS=$2
>  OUTPUT_FST=$3
>  
>  # Temporary intermediate files
> -ATTFST='transducer.att'
> -HFST_FST='transducer.hfst'
> -WEIGHTED_FST='weighted-pairs.hfst'
> -COMPOSED_FST='weighted-transducer.hfst'
> -MULTICHAR='multichar_symbols'
> +TEMP_DIR=".tmp"
> +mkdir $TEMP_DIR
> +ATTFST="$TEMP_DIR/transducer.att"
> +HFST_FST="$TEMP_DIR/transducer.hfst"
> +WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst"
> +COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst"
> +MULTICHAR="$TEMP_DIR/multichar-symbols"
> +CLEANED_CORPUS="$TEMP_DIR/clean-corpus.tagged"
> +
> +# Clean the input tagged corpus
> +sed -e '/^$/d' $CORPUS > $CLEANED_CORPUS
> +sed -i -e :a -e 's/:/__COLON__/;ta' $CLEANED_CORPUS
> +sed -i -e :a -e 's/__COLON__/\\:/;ta' $CLEANED_CORPUS

cleaning of the corpus should be performed in a separate script

>  
>  # Convert the input FST to HFST
> -lt-print $FST > $ATTFST
> +lt-print $FST | sed -e 's/:/\\:/'| sed -e :a -e 's/ 
> /\&#39;@_SPACE_@\&#39;/;ta'> $ATTFST
>  hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST
>  
> -LINES=$(wc -l $CORPUS | cut -d ' ' -f1)
> +LINES=$(wc -l $CLEANED_CORPUS | cut -d ' ' -f1)
>  
>  # Prepare the multichar symbols files
> -awk -F '[<>]' '{print "<"$(NF-1)">"}' data/corpus.tagged | tr -d , | sort | 
> uniq > $MULTICHAR
> +awk -F '[<>]' '{print "<"$(NF-1)">"}' $CLEANED_CORPUS | tr -d , | sort | 
> uniq > $MULTICHAR
>  
>  # Generate a weighted FST from the string pairs
> -cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |
> +cat $CLEANED_CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |
>  sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' |
>  awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "\t" $1}' |
>  hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR
> @@ -28,8 +36,14 @@ hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR

>  # Compose the input FST and the weighted FST
>  hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST | hfst-fst2txt > $COMPOSED_FST
>  
> +# Clean the composed FST for lt-comp
> +# Remove -- at any line and empty lines from the ATT file
> +sed -i -e 's/--//' $COMPOSED_FST
> +sed -i -e '/^$/d' $COMPOSED_FST
> +
>  # Compile the FST back using lttoolbox
>  lt-comp lr $COMPOSED_FST $OUTPUT_FST
>  
>  # Delete the temporary files
> -rm $ATTFST $HFST_FST $WEIGHTED_FST $MULTICHAR $COMPOSED_FST
> +rm -rf $TEMP_DIR
> +
> -- 
> 2.21.0
> 
> From 8171aca8e52290e3dac6620aa17d01117486d500 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Fri, 31 May 2019 20:39:57 +0200
> Subject: [PATCH 06/12] Use weighted regexp
> 
> ---
>  scripts/lt-weight                | 53 +++++++++++++++++---------------
>  scripts/prepare_regex_strings.py | 41 ++++++++++++++++++++++++
>  2 files changed, 69 insertions(+), 25 deletions(-)
>  create mode 100644 scripts/prepare_regex_strings.py
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index 5420997..8cfe185 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -3,47 +3,50 @@ FST=$1
>  CORPUS=$2
>  OUTPUT_FST=$3
>  
> -# Temporary intermediate files
> +# Temporary directory for intermediate files
>  TEMP_DIR=".tmp"
> -mkdir $TEMP_DIR
> +# Check if it exists
> +if [-d "$TEMP_DIR"]; then
> +     mkdir $TEMP_DIR
> +fi

doesn't this only create the directory if it already exists? I don't
understand the logic

> +
> +CLEANED_CORPUS="$TEMP_DIR/clean-corpus.tagged"
> +
>  ATTFST="$TEMP_DIR/transducer.att"
>  HFST_FST="$TEMP_DIR/transducer.hfst"
> +
> +WEIGHTED_REGEXP="$TEMP_DIR/weighted-regex"
>  WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst"
>  COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst"
> -MULTICHAR="$TEMP_DIR/multichar-symbols"
> -CLEANED_CORPUS="$TEMP_DIR/clean-corpus.tagged"
> -
> -# Clean the input tagged corpus
> -sed -e '/^$/d' $CORPUS > $CLEANED_CORPUS
> -sed -i -e :a -e 's/:/__COLON__/;ta' $CLEANED_CORPUS
> -sed -i -e :a -e 's/__COLON__/\\:/;ta' $CLEANED_CORPUS
> +COMPOSED_ATTFST="$TEMP_DIR/weighted-transducer.att"
>  
>  # Convert the input FST to HFST
> -lt-print $FST | sed -e 's/:/\\:/'| sed -e :a -e 's/ 
> /\&#39;@_SPACE_@\&#39;/;ta'> $ATTFST
> -hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST
> +lt-print $FST |
> +sed -e 's/:/\\:/'|
> +sed -e :a -e 's/ /\&#39;@_SPACE_@\&#39;/;ta'> $ATTFST
>  
> -LINES=$(wc -l $CLEANED_CORPUS | cut -d ' ' -f1)
> +hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST
>  
> -# Prepare the multichar symbols files
> -awk -F '[<>]' '{print "<"$(NF-1)">"}' $CLEANED_CORPUS | tr -d , | sort | 
> uniq > $MULTICHAR
> +# Clean the input tagged corpus
> +# REMOVE EMPTY LINES
> +sed -e '/^$/d' $CORPUS > $CLEANED_CORPUS
>  
>  # Generate a weighted FST from the string pairs
> +LINES=$(wc -l $CLEANED_CORPUS | cut -d ' ' -f1)
> +
>  cat $CLEANED_CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |
> -sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' |
> -awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "\t" $1}' |
> -hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR
> +sed -e 's/\$$//' > $WEIGHTED_REGEXP
> +python prepare_regex_strings.py $WEIGHTED_REGEXP
> +cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST

unnecessary "cat"

>  
>  # Compose the input FST and the weighted FST
> -hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST | hfst-fst2txt > $COMPOSED_FST
> -
> -# Clean the composed FST for lt-comp
> -# Remove -- at any line and empty lines from the ATT file
> -sed -i -e 's/--//' $COMPOSED_FST
> -sed -i -e '/^$/d' $COMPOSED_FST
> +hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST
> +hfst-fst2txt -i $COMPOSED_FST -o $COMPOSED_ATTFST
>  
>  # Compile the FST back using lttoolbox
> -lt-comp lr $COMPOSED_FST $OUTPUT_FST
> +lt-comp lr $COMPOSED_ATTFST $OUTPUT_FST
>  
>  # Delete the temporary files
> -rm -rf $TEMP_DIR
> +# rm -rf $TEMP_DIR
>  
> +# ./lt-weight ../../apertium-eng/eng.automorf.bin 
> ../../apertium-eng/texts/eng.tagged efst.bin

don't submit patches with commented-out code :( better to add a
"--debug" flag or DEBUG env variable, and lock it behind that

> diff --git a/scripts/prepare_regex_strings.py 
> b/scripts/prepare_regex_strings.py
> new file mode 100644
> index 0000000..d2eac86
> --- /dev/null
> +++ b/scripts/prepare_regex_strings.py
> @@ -0,0 +1,41 @@
> +import re
> +import sys
> +import numpy as np
> +import pandas as pd

now we have dependencies on numpy and pandas; maybe it is possible to
get away with just vanilla python?

> +
> +#TODO: HANDLE THE REST OF THE SPECIAL CHARACTERS
what does "the rest" mean?
> +special_regex_chars = '%,.;!#-—+*:0?[]()~"\''
> +
> +def clean_tag_patterns(reg):
> +     whitesace_free_reg = re.sub(' ', '', reg)
------------^

typo in variable name

> +     return '%{}%>'.format(whitesace_free_reg[:-1])
> +
> +def clean_line(line):
> +     line = line.strip()
> +     if line.endswith('$"'):
> +             # ERROR LINE LIKE ^./.<sent>$"
> +             line = line[:-2]
> +     line = re.sub(r'(.)', r'\1 ', line)
> +
> +     for special_char in special_regex_chars:
> +             line = re.sub('\\{}'.format(special_char), 
> '%{}'.format(special_char), line)
> +     
> +     line = re.sub(r'(\<.*?\>)', lambda m: 
> +             clean_tag_patterns(m.group(0)),line)
> +     # HANDLE TAGS
> +     line = line.strip()
> +     line = '[{}]'.format(line)
> +     # line = '[?*][{}][?*]'.format(line)
> +     return line
> +
> +if __name__ == '__main__':
> +     FILE_NAME = sys.argv[1]
> +     with open(FILE_NAME, 'r') as f:
> +             lines =[clean_line(line) for line in f.readlines() if not 
> line.startswith('*')]
> +
> +     lines = list(pd.Series(lines).value_counts().reset_index().apply(lambda 
> r: '{}::{}'.format(r['index'].strip(), -np.log(r[0]/len(lines))), axis=1))
> +
> +     with open(FILE_NAME, 'w') as f:
> +             f.write('[?*]::1000000\n')
> +             for line in lines:
> +                     f.write(line+'\n')
> \ No newline at end of file

needs "--help", license and copyright info; probably could mark it
executable. It's also important to figure out what *exact*
dependencies this has; does it work with python2? python3? Does
lttoolbox already have a python dependency? If not, better to use
something we already depend on.

> -- 
> 2.21.0
> 
> From 3a746cd0afe16f92ffc9628d71b3e37c95244da0 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Fri, 31 May 2019 21:03:57 +0200
> Subject: [PATCH 07/12] Remove un-necessary data files
> 
> ---
>  scripts/data/corpus.tagged  |   8 --------
>  scripts/data/transducer.bin | Bin 397 -> 0 bytes
>  2 files changed, 8 deletions(-)
>  delete mode 100644 scripts/data/corpus.tagged
>  delete mode 100644 scripts/data/transducer.bin
> 
> diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged

> deleted file mode 100644
> index 60f60ea..0000000
> --- a/scripts/data/corpus.tagged
> +++ /dev/null
> @@ -1,8 +0,0 @@
> -^a/a<n><compound-only-L>$
> -^a/a<pr><compound-only-L>$
> -^b/b<n><compound-R>$
> -^b/b<pr><compound-R>$
> -^c/d<compound-only-L><n>$ 
> -^c/d<compound-only-L><pr>$
> -^d/d<compound-R><n>$
> -^d/d<compound-R><pr>$
> diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin

> deleted file mode 100644
> index 
> d5e4da3b1ba47886cc6701faa52711926d8aee23..0000000000000000000000000000000000000000
> GIT binary patch
> literal 0
> HcmV?d00001
> 
> literal 397
> zcmZXQ*)Bs-6o$L^*%g%v9;EN^?W(CgraHzerIV?xQdGhlc?ZNSE{s83AZBr44C2CW
> zab;cnS;<=Y{(mjoar~~|B8A3JH34d<rH*<U2ol1kktUjHp_MRgIJDD2CtY;Y!#oQt

> zvcxhgtg^;B8*H-0HaqOH$36!fa>Oww+N&#dQ1|M%o*_aX{X~foCqa?{1{q?Q5tAcL
> zhEcMNF;0#NCYd77G^d<#&IOlTam@|4+;PtXk38|r3$MKK&Ig}->2h7ALpq`pdX_n#
> z5~QS|W-E+o+L&aF=8RR;XPT7p&HJ#+{tv?*rA)?{vJ4_lT)B#AlO&ZCUbA07QA`>s

> p;LKfFqPF7%{{HF`r5+`OC8Z^ll&a)MgaNuOVOhd0r4%K?^$i0xdb<Ds
> 

OK, when you submit a patch upstream, please clean up the history so
that we don't have no-op sequences, like "create data files, remove
data files" :)

> -- 
> 2.21.0
> 
> From d92efe0a07bfb0b7a267f16ccdd26fd3e182ae19 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Tue, 4 Jun 2019 21:17:45 +0200
> Subject: [PATCH 08/12] Fix the repeated analyses bug
> 
> FSTs used to have two paths (weighted and unweighted ones)
> for the same analysis.
> FST subtraction and disjunction were used to solve the issue.

Great that you include a long-form message! But it's not quite
detailed enough for me to understand; can you provide some bug
report-like data, like "what previously happens" vs "what you want to
happen"? And describe how you use FST arithmetic to solve the problem?


> ---
>  scripts/lt-weight                | 12 ++++++++----
>  scripts/prepare_regex_strings.py |  1 -
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index 8cfe185..25f44dd 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -6,7 +6,7 @@ OUTPUT_FST=$3
>  # Temporary directory for intermediate files
>  TEMP_DIR=".tmp"
>  # Check if it exists
> -if [-d "$TEMP_DIR"]; then
> +if ! [ -d "$TEMP_DIR" ]; then

should fix this bug in the patch it was introduced in

>       mkdir $TEMP_DIR
>  fi
>  
> @@ -18,7 +18,9 @@ HFST_FST="$TEMP_DIR/transducer.hfst"
>  WEIGHTED_REGEXP="$TEMP_DIR/weighted-regex"
>  WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst"
>  COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst"
> -COMPOSED_ATTFST="$TEMP_DIR/weighted-transducer.att"
> +SUBTRACTED_FST="$TEMP_DIR/subtracted-transducer.hfst"
> +DISJUNCTED_FST="$TEMP_DIR/disjuncted-weighted-transducer.hfst"
> +DISJUNCTED_ATTFST="$TEMP_DIR/weighted-transducer.att"
>  
>  # Convert the input FST to HFST
>  lt-print $FST |
> @@ -41,10 +43,12 @@ cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST
>  
>  # Compose the input FST and the weighted FST
>  hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST
> -hfst-fst2txt -i $COMPOSED_FST -o $COMPOSED_ATTFST
> +hfst-subtract $HFST_FST $COMPOSED_FST -o $SUBTRACTED_FST
> +hfst-disjunct $SUBTRACTED_FST $COMPOSED_FST -o $DISJUNCTED_FST
> +hfst-fst2txt -i $DISJUNCTED_FST -o $DISJUNCTED_ATTFST
>  
>  # Compile the FST back using lttoolbox
> -lt-comp lr $COMPOSED_ATTFST $OUTPUT_FST
> +lt-comp lr $DISJUNCTED_ATTFST $OUTPUT_FST
>  
>  # Delete the temporary files
>  # rm -rf $TEMP_DIR
> diff --git a/scripts/prepare_regex_strings.py 
> b/scripts/prepare_regex_strings.py
> index d2eac86..2f9300a 100644
> --- a/scripts/prepare_regex_strings.py
> +++ b/scripts/prepare_regex_strings.py
> @@ -36,6 +36,5 @@ if __name__ == '__main__':
>       lines = list(pd.Series(lines).value_counts().reset_index().apply(lambda 
> r: '{}::{}'.format(r['index'].strip(), -np.log(r[0]/len(lines))), axis=1))
>  
>       with open(FILE_NAME, 'w') as f:
> -             f.write('[?*]::1000000\n')

you don't mention why this is being removed; can't you just remove it
from the earlier patch?

>               for line in lines:
>                       f.write(line+'\n')
> \ No newline at end of file
> -- 
> 2.21.0
> 
> From 6caa6cd356732dcf3249522e67fe7c03ec2134f9 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Tue, 4 Jun 2019 23:03:01 +0200
> Subject: [PATCH 09/12] Fix the used character to replace a space

What was wrong with the old one? Maybe fix it earlier in the patch
series?

> 
> ---
>  scripts/lt-weight | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index 25f44dd..5e7561a 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -25,7 +25,7 @@ DISJUNCTED_ATTFST="$TEMP_DIR/weighted-transducer.att"
>  # Convert the input FST to HFST
>  lt-print $FST |
>  sed -e 's/:/\\:/'|
> -sed -e :a -e 's/ /\&#39;@_SPACE_@\&#39;/;ta'> $ATTFST
> +sed -e :a -e 's/ /@_SPACE_@/;ta'> $ATTFST
>  
>  hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST
>  
> -- 
> 2.21.0
> 
> From 96322670d0d957f8b3fae476fe6b255247fe4f56 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Wed, 5 Jun 2019 00:25:40 +0200
> Subject: [PATCH 10/12] Minimize the FST to prevent segmenation faults in
>  lt-comp


why is lt-comp segfaulting? that shouldn't happen; can you describe
the problem / link a bug for lt-comp?

> ---
>  scripts/lt-weight | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index 5e7561a..0fa104c 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -20,7 +20,8 @@ WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst"
>  COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst"
>  SUBTRACTED_FST="$TEMP_DIR/subtracted-transducer.hfst"
>  DISJUNCTED_FST="$TEMP_DIR/disjuncted-weighted-transducer.hfst"
> -DISJUNCTED_ATTFST="$TEMP_DIR/weighted-transducer.att"
> +MINIMIZED_FST="$TEMP_DIR/minimized-weighted-transducer.hfst"
> +MINIMIZED_ATTFST="$TEMP_DIR/weighted-transducer.att"
>  
>  # Convert the input FST to HFST
>  lt-print $FST |
> @@ -45,12 +46,13 @@ cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST
>  hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST
>  hfst-subtract $HFST_FST $COMPOSED_FST -o $SUBTRACTED_FST
>  hfst-disjunct $SUBTRACTED_FST $COMPOSED_FST -o $DISJUNCTED_FST
> -hfst-fst2txt -i $DISJUNCTED_FST -o $DISJUNCTED_ATTFST
> +hfst-minimize $DISJUNCTED_FST -o $MINIMIZED_FST
> +hfst-fst2txt -i $MINIMIZED_FST -o $MINIMIZED_ATTFST
>  
>  # Compile the FST back using lttoolbox
>  lt-comp lr $DISJUNCTED_ATTFST $OUTPUT_FST
>  
>  # Delete the temporary files
> -# rm -rf $TEMP_DIR
> +rm -rf $TEMP_DIR
>  
>  # ./lt-weight ../../apertium-eng/eng.automorf.bin 
> ../../apertium-eng/texts/eng.tagged efst.bin
> -- 
> 2.21.0
> 
> From b638ab547c9af84d390bcf66365ded9607f775e0 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Thu, 6 Jun 2019 22:33:17 +0200
> Subject: [PATCH 11/12] Fix the transducer file name bug
> 

earlier patch should be fixed

> ---
>  scripts/lt-weight | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index 0fa104c..dd4ab69 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -50,7 +50,7 @@ hfst-minimize $DISJUNCTED_FST -o $MINIMIZED_FST
>  hfst-fst2txt -i $MINIMIZED_FST -o $MINIMIZED_ATTFST
>  
>  # Compile the FST back using lttoolbox
> -lt-comp lr $DISJUNCTED_ATTFST $OUTPUT_FST
> +lt-comp lr $MINIMIZED_ATTFST $OUTPUT_FST
>  
>  # Delete the temporary files
>  rm -rf $TEMP_DIR
> -- 
> 2.21.0
> 
> From b621ab62c784a09a5dd8721e75b7896168679244 Mon Sep 17 00:00:00 2001

> From: Amr Keleg <amr_moha...@live.com>
> Date: Thu, 13 Jun 2019 09:35:04 +0200
> Subject: [PATCH 12/12] Add a large weight to unweighted final states


provide some explanation for this?

> 
> ---
>  scripts/lt-weight | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/scripts/lt-weight b/scripts/lt-weight
> index dd4ab69..0d08cea 100755
> --- a/scripts/lt-weight
> +++ b/scripts/lt-weight
> @@ -19,6 +19,7 @@ WEIGHTED_REGEXP="$TEMP_DIR/weighted-regex"
>  WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst"
>  COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst"
>  SUBTRACTED_FST="$TEMP_DIR/subtracted-transducer.hfst"
> +DEFAULT_WEIGHTED_FST="$TEMP_DIR/default-weighted-transducer.hfst"

>  DISJUNCTED_FST="$TEMP_DIR/disjuncted-weighted-transducer.hfst"
>  MINIMIZED_FST="$TEMP_DIR/minimized-weighted-transducer.hfst"
>  MINIMIZED_ATTFST="$TEMP_DIR/weighted-transducer.att"
> @@ -45,12 +46,13 @@ cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST
>  # Compose the input FST and the weighted FST
>  hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST
>  hfst-subtract $HFST_FST $COMPOSED_FST -o $SUBTRACTED_FST
> -hfst-disjunct $SUBTRACTED_FST $COMPOSED_FST -o $DISJUNCTED_FST
> +hfst-reweight -i $SUBTRACTED_FST -o $DEFAULT_WEIGHTED_FST -e -a 1000000000
> +hfst-disjunct $DEFAULT_WEIGHTED_FST $COMPOSED_FST -o $DISJUNCTED_FST

>  hfst-minimize $DISJUNCTED_FST -o $MINIMIZED_FST
>  hfst-fst2txt -i $MINIMIZED_FST -o $MINIMIZED_ATTFST
>  
>  # Compile the FST back using lttoolbox
> -lt-comp lr $MINIMIZED_ATTFST $OUTPUT_FST
> +../lttoolbox/lt-comp lr $MINIMIZED_ATTFST $OUTPUT_FST
>  
>  # Delete the temporary files
>  rm -rf $TEMP_DIR
> -- 
> 2.21.0
> 

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff

Re: [Apertium-stuff] [PATCH] supervised weighting of automata

Reply via email to