Re: [Apertium-stuff] [PATCH] supervised weighting of automata

Amr Mohamed Hosny Anwar Sat, 15 Jun 2019 22:35:18 -0700


Best Regards,
Amr



________________________________
From: Nick Howell <nlhow...@gmail.com>
Sent: Thursday, June 13, 2019 3:24 PM
To: [apertium-stuff]
Cc: Amr Mohamed Hosny Anwar; Nick Howell; Tommi A Pirinen; Francis Tyers
Subject: Re: [PATCH] supervised weighting of automata

Here's my (inline) review of the github PR at

https://github.com/apertium/lttoolbox/pull/55



Cheers,

Nick



--



First patch should add yourself to AUTHORS
- Done.


> From 944ed2556c38f058a5118ab5e481b3412aa3e3d8 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Sat, 11 May 2019 01:30:26 +0200

> Subject: [PATCH 01/12] Fix the out of alphabet token handling in analyses

>  generation

>

> Solves #45

> Consider alphanumeric characters to be part of the vocabulary.

> ---

>  lttoolbox/fst_processor.cc | 2 +-

>  1 file changed, 1 insertion(+), 1 deletion(-)

>

> diff --git a/lttoolbox/fst_processor.cc b/lttoolbox/fst_processor.cc


> index 2be326a..d3d029a 100644

> --- a/lttoolbox/fst_processor.cc

> +++ b/lttoolbox/fst_processor.cc

> @@ -837,7 +837,7 @@ FSTProcessor::isEscaped(wchar_t const c) const

>  bool

>  FSTProcessor::isAlphabetic(wchar_t const c) const

>  {

> -  return alphabetic_chars.find(c) != alphabetic_chars.end();

> +  return (bool)std::iswalnum(c) || alphabetic_chars.find(c) != 
> alphabetic_chars.end();

>  }

>

>  void

> --

> 2.21.0



Are we sure that std::iswalnum does what we want? Should it be

encoding-dependent? Do we take encoding from the environment?
- This commit isn't part of the PR.
A separate issue was filed and can be found through: 
https://github.com/apertium/lttoolbox/issues/45
[https://avatars0.githubusercontent.com/u/4594256?s=400&v=4]<https://github.com/apertium/lttoolbox/issues/45>
Strange output when transducer is compiled from a .att file · Issue #45 · 
apertium/lttoolbox<https://github.com/apertium/lttoolbox/issues/45>
github.com
I am trying to add weights to the morphological analyser. So while I was 
checking last year's project 
(http://wiki.apertium.org/wiki/User:Techievena/GSoC_2018_Work_Product_Submission)
 I noticed...




> From be8673d196772a254e540a3fa92a920da35d8731 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Sun, 19 May 2019 22:39:07 +0200

> Subject: [PATCH 02/12] Add basic structure for the supervised weighting of

>  automata

>

> ---

>  scripts/data/corpus.tagged  |   8 ++++++++

>  scripts/data/transducer.bin | Bin 0 -> 397 bytes

>  scripts/lt-weight           |   2 ++

>  3 files changed, 10 insertions(+)

>  create mode 100644 scripts/data/corpus.tagged

>  create mode 100644 scripts/data/transducer.bin

>  create mode 100755 scripts/lt-weight

>

> diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged


> new file mode 100644

> index 0000000..60f60ea

> --- /dev/null

> +++ b/scripts/data/corpus.tagged

> @@ -0,0 +1,8 @@

> +^a/a<n><compound-only-L>$

> +^a/a<pr><compound-only-L>$

> +^b/b<n><compound-R>$

> +^b/b<pr><compound-R>$

> +^c/d<compound-only-L><n>$

> +^c/d<compound-only-L><pr>$

> +^d/d<compound-R><n>$

> +^d/d<compound-R><pr>$



What does this file do? These things should be described in the

long-form part of the commit message.

- I deleted these files in a later commit.

> diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin


> new file mode 100644

> index 
> 0000000000000000000000000000000000000000..d5e4da3b1ba47886cc6701faa52711926d8aee23

> GIT binary patch

> literal 397

> zcmZXQ*)Bs-6o$L^*%g%v9;EN^?W(CgraHzerIV?xQdGhlc?ZNSE{s83AZBr44C2CW

> zab;cnS;<=Y{(mjoar~~|B8A3JH34d<rH*<U2ol1kktUjHp_MRgIJDD2CtY;Y!#oQt


> zvcxhgtg^;B8*H-0HaqOH$36!fa>Oww+N&#dQ1|M%o*_aX{X~foCqa?{1{q?Q5tAcL

> zhEcMNF;0#NCYd77G^d<#&IOlTam@|4+;PtXk38|r3$MKK&Ig}->2h7ALpq`pdX_n#

> z5~QS|W-E+o+L&aF=8RR;XPT7p&HJ#+{tv?*rA)?{vJ4_lT)B#AlO&ZCUbA07QA`>s


> p;LKfFqPF7%{{HF`r5+`OC8Z^ll&a)MgaNuOVOhd0r4%K?^$i0xdb<Ds

>

> literal 0

> HcmV?d00001



Please don't add binary blobs to git history; better to add a text

representation along with the necessary rules to the build system to

reconstruct the binary.

- I deleted these files in a later commit.

> diff --git a/scripts/lt-weight b/scripts/lt-weight

> new file mode 100755

> index 0000000..a8b7c3e

> --- /dev/null

> +++ b/scripts/lt-weight

> @@ -0,0 +1,2 @@

> +#! /bin/sh

> +

> --

> 2.21.0



No point in just adding an empty file; just add the basic

implementation in the first step

- I thought of using commits to reflect the progress and squashing these 
commits later on merging.

>

> From 8b11bb4e0fc606c465815a3ec5dd15c13a34da8f Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Mon, 20 May 2019 21:52:17 +0200

> Subject: [PATCH 03/12] lt-weight: Estimate the weights using unigram counts

>

> ---

>  scripts/lt-weight | 5 +++++

>  1 file changed, 5 insertions(+)

>

> diff --git a/scripts/lt-weight b/scripts/lt-weight

> index a8b7c3e..d1abd6b 100755

> --- a/scripts/lt-weight

> +++ b/scripts/lt-weight

> @@ -1,2 +1,7 @@

>  #! /bin/sh



add help and version output, and copyright/license info
- Will we end-up using a shell script or we will need to implement a CC file?
I believe we need to decide on the script's language first.


> +CORPUS=$2

> +LINES=$(wc -l $CORPUS | cut -d ' ' -f1)

>

> +cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |



unnecessary use of "cat"; instead use <"$CORPUS" (quoting in case of

whitespace in the filename).

- Ok, this cat wasn't really beneficial.

> +sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' |



Combine the multiple sed commands into one execution
- Ok, let me check how to do so.
It should be easy.

> +awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "::" $1}'



indent for command continuations
- Done


> --

> 2.21.0

>

> From 1d78c1698727de7abcd17e3d92a9d099d2f920f7 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Tue, 21 May 2019 00:02:04 +0200

> Subject: [PATCH 04/12] lt-weight: Use HFST to weight the input fst



long-form commit message?



>

> ---

>  scripts/lt-weight | 30 +++++++++++++++++++++++++++++-

>  1 file changed, 29 insertions(+), 1 deletion(-)

>

> diff --git a/scripts/lt-weight b/scripts/lt-weight

> index d1abd6b..092fc66 100755

> --- a/scripts/lt-weight

> +++ b/scripts/lt-weight

> @@ -1,7 +1,35 @@

>  #! /bin/sh

> +FST=$1

>  CORPUS=$2

> +OUTPUT_FST=$3

> +

> +# Temporary intermediate files

> +ATTFST='transducer.att'

> +HFST_FST='transducer.hfst'

> +WEIGHTED_FST='weighted-pairs.hfst'

> +COMPOSED_FST='weighted-transducer.hfst'

> +MULTICHAR='multichar_symbols'



use a temporary directory for temp files, created using "mktemp"
- Nice :)


> +

> +# Convert the input FST to HFST

> +lt-print $FST > $ATTFST

> +hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST

> +

>  LINES=$(wc -l $CORPUS | cut -d ' ' -f1)

>

> +# Prepare the multichar symbols files

> +awk -F '[<>]' '{print "<"$(NF-1)">"}' data/corpus.tagged | tr -d , | sort | 
> uniq > $MULTICHAR

> +

> +# Generate a weighted FST from the string pairs

>  cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |

>  sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' |

> -awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "::" $1}'

> +awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "\t" $1}' |

> +hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR

> +

> +# Compose the input FST and the weighted FST

> +hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST | hfst-fst2txt > $COMPOSED_FST

> +

> +# Compile the FST back using lttoolbox

> +lt-comp lr $COMPOSED_FST $OUTPUT_FST

> +

> +# Delete the temporary files

> +rm $ATTFST $HFST_FST $WEIGHTED_FST $MULTICHAR $COMPOSED_FST



It seems to me that this version and the previous version should be

two different scripts: one that reads a corpus and produces weights

which can be read by hfst-strings2fst, and one which performs the

reweighting of the fst.
- Agreed.

> --

> 2.21.0

>

> From b34b6769980fc2a1cb7d3659eb5c16c3894c8348 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Tue, 21 May 2019 17:59:05 +0200

> Subject: [PATCH 05/12] lt-weight: Update the script to handle non-trivial

>  corpora/transducers



in the long-form message, explain what problems this patch is solving?
- I am checking the git rebase command so that we can discuss how to update the 
commit messages in the best way.


>

> ---

>  scripts/lt-weight | 34 ++++++++++++++++++++++++----------

>  1 file changed, 24 insertions(+), 10 deletions(-)

>

> diff --git a/scripts/lt-weight b/scripts/lt-weight

> index 092fc66..5420997 100755

> --- a/scripts/lt-weight

> +++ b/scripts/lt-weight

> @@ -4,23 +4,31 @@ CORPUS=$2

>  OUTPUT_FST=$3

>

>  # Temporary intermediate files

> -ATTFST='transducer.att'

> -HFST_FST='transducer.hfst'

> -WEIGHTED_FST='weighted-pairs.hfst'

> -COMPOSED_FST='weighted-transducer.hfst'

> -MULTICHAR='multichar_symbols'

> +TEMP_DIR=".tmp"

> +mkdir $TEMP_DIR

> +ATTFST="$TEMP_DIR/transducer.att"

> +HFST_FST="$TEMP_DIR/transducer.hfst"

> +WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst"

> +COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst"

> +MULTICHAR="$TEMP_DIR/multichar-symbols"

> +CLEANED_CORPUS="$TEMP_DIR/clean-corpus.tagged"

> +

> +# Clean the input tagged corpus

> +sed -e '/^$/d' $CORPUS > $CLEANED_CORPUS

> +sed -i -e :a -e 's/:/__COLON__/;ta' $CLEANED_CORPUS

> +sed -i -e :a -e 's/__COLON__/\\:/;ta' $CLEANED_CORPUS



cleaning of the corpus should be performed in a separate script
- Ok.


>

>  # Convert the input FST to HFST

> -lt-print $FST > $ATTFST

> +lt-print $FST | sed -e 's/:/\\:/'| sed -e :a -e 's/ 
> /\&#39;@_SPACE_@\&#39;/;ta'> $ATTFST

>  hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST

>

> -LINES=$(wc -l $CORPUS | cut -d ' ' -f1)

> +LINES=$(wc -l $CLEANED_CORPUS | cut -d ' ' -f1)

>

>  # Prepare the multichar symbols files

> -awk -F '[<>]' '{print "<"$(NF-1)">"}' data/corpus.tagged | tr -d , | sort | 
> uniq > $MULTICHAR

> +awk -F '[<>]' '{print "<"$(NF-1)">"}' $CLEANED_CORPUS | tr -d , | sort | 
> uniq > $MULTICHAR

>

>  # Generate a weighted FST from the string pairs

> -cat $CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |

> +cat $CLEANED_CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |

>  sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' |

>  awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "\t" $1}' |

>  hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR

> @@ -28,8 +36,14 @@ hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR


>  # Compose the input FST and the weighted FST

>  hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST | hfst-fst2txt > $COMPOSED_FST

>

> +# Clean the composed FST for lt-comp

> +# Remove -- at any line and empty lines from the ATT file

> +sed -i -e 's/--//' $COMPOSED_FST

> +sed -i -e '/^$/d' $COMPOSED_FST

> +

>  # Compile the FST back using lttoolbox

>  lt-comp lr $COMPOSED_FST $OUTPUT_FST

>

>  # Delete the temporary files

> -rm $ATTFST $HFST_FST $WEIGHTED_FST $MULTICHAR $COMPOSED_FST

> +rm -rf $TEMP_DIR

> +

> --

> 2.21.0

>

> From 8171aca8e52290e3dac6620aa17d01117486d500 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Fri, 31 May 2019 20:39:57 +0200

> Subject: [PATCH 06/12] Use weighted regexp

>

> ---

>  scripts/lt-weight                | 53 +++++++++++++++++---------------

>  scripts/prepare_regex_strings.py | 41 ++++++++++++++++++++++++

>  2 files changed, 69 insertions(+), 25 deletions(-)

>  create mode 100644 scripts/prepare_regex_strings.py

>

> diff --git a/scripts/lt-weight b/scripts/lt-weight

> index 5420997..8cfe185 100755

> --- a/scripts/lt-weight

> +++ b/scripts/lt-weight

> @@ -3,47 +3,50 @@ FST=$1

>  CORPUS=$2

>  OUTPUT_FST=$3

>

> -# Temporary intermediate files

> +# Temporary directory for intermediate files

>  TEMP_DIR=".tmp"

> -mkdir $TEMP_DIR

> +# Check if it exists

> +if [-d "$TEMP_DIR"]; then

> +     mkdir $TEMP_DIR

> +fi



doesn't this only create the directory if it already exists? I don't

understand the logic

- Solved in a later commit. Sorry for that.


> +

> +CLEANED_CORPUS="$TEMP_DIR/clean-corpus.tagged"

> +

>  ATTFST="$TEMP_DIR/transducer.att"

>  HFST_FST="$TEMP_DIR/transducer.hfst"

> +

> +WEIGHTED_REGEXP="$TEMP_DIR/weighted-regex"

>  WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst"

>  COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst"

> -MULTICHAR="$TEMP_DIR/multichar-symbols"

> -CLEANED_CORPUS="$TEMP_DIR/clean-corpus.tagged"

> -

> -# Clean the input tagged corpus

> -sed -e '/^$/d' $CORPUS > $CLEANED_CORPUS

> -sed -i -e :a -e 's/:/__COLON__/;ta' $CLEANED_CORPUS

> -sed -i -e :a -e 's/__COLON__/\\:/;ta' $CLEANED_CORPUS

> +COMPOSED_ATTFST="$TEMP_DIR/weighted-transducer.att"

>

>  # Convert the input FST to HFST

> -lt-print $FST | sed -e 's/:/\\:/'| sed -e :a -e 's/ 
> /\&#39;@_SPACE_@\&#39;/;ta'> $ATTFST

> -hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST

> +lt-print $FST |

> +sed -e 's/:/\\:/'|

> +sed -e :a -e 's/ /\&#39;@_SPACE_@\&#39;/;ta'> $ATTFST

>

> -LINES=$(wc -l $CLEANED_CORPUS | cut -d ' ' -f1)

> +hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST

>

> -# Prepare the multichar symbols files

> -awk -F '[<>]' '{print "<"$(NF-1)">"}' $CLEANED_CORPUS | tr -d , | sort | 
> uniq > $MULTICHAR

> +# Clean the input tagged corpus

> +# REMOVE EMPTY LINES

> +sed -e '/^$/d' $CORPUS > $CLEANED_CORPUS

>

>  # Generate a weighted FST from the string pairs

> +LINES=$(wc -l $CLEANED_CORPUS | cut -d ' ' -f1)

> +

>  cat $CLEANED_CORPUS | sed -e 's/[ \t]//' | sed -e 's/\^.*\///' |

> -sed -e 's/\$$//' | sort | uniq -c | sed -e 's/^[ \t]*//' |

> -awk -v lines="$LINES" '{$1=-log($1/lines); print $2":" $2 "\t" $1}' |

> -hfst-strings2fst -j -o $WEIGHTED_FST -m $MULTICHAR

> +sed -e 's/\$$//' > $WEIGHTED_REGEXP

> +python prepare_regex_strings.py $WEIGHTED_REGEXP

> +cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST



unnecessary "cat"
- Done.


>

>  # Compose the input FST and the weighted FST

> -hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST | hfst-fst2txt > $COMPOSED_FST

> -

> -# Clean the composed FST for lt-comp

> -# Remove -- at any line and empty lines from the ATT file

> -sed -i -e 's/--//' $COMPOSED_FST

> -sed -i -e '/^$/d' $COMPOSED_FST

> +hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST

> +hfst-fst2txt -i $COMPOSED_FST -o $COMPOSED_ATTFST

>

>  # Compile the FST back using lttoolbox

> -lt-comp lr $COMPOSED_FST $OUTPUT_FST

> +lt-comp lr $COMPOSED_ATTFST $OUTPUT_FST

>

>  # Delete the temporary files

> -rm -rf $TEMP_DIR

> +# rm -rf $TEMP_DIR

>

> +# ./lt-weight ../../apertium-eng/eng.automorf.bin 
> ../../apertium-eng/texts/eng.tagged efst.bin



don't submit patches with commented-out code :( better to add a

"--debug" flag or DEBUG env variable, and lock it behind that

- Could you elaborate more on using a debug flag?
How could it be used?


> diff --git a/scripts/prepare_regex_strings.py 
> b/scripts/prepare_regex_strings.py

> new file mode 100644

> index 0000000..d2eac86

> --- /dev/null

> +++ b/scripts/prepare_regex_strings.py

> @@ -0,0 +1,41 @@

> +import re

> +import sys

> +import numpy as np

> +import pandas as pd



now we have dependencies on numpy and pandas; maybe it is possible to

get away with just vanilla python?

- Yes, vanilla python could do the job.
I just wanted to get the whole script up and running.
In the end, I don't think we can use python (an additional dependency).


> +

> +#TODO: HANDLE THE REST OF THE SPECIAL CHARACTERS

what does "the rest" mean?
- There may exist other special characters that need to be escaped.
I don't know/have an exhaustive list of special characters.

> +special_regex_chars = '%,.;!#-—+*:0?[]()~"\''

> +

> +def clean_tag_patterns(reg):

> +     whitesace_free_reg = re.sub(' ', '', reg)

------------^



typo in variable name
- Done


> +     return '%{}%>'.format(whitesace_free_reg[:-1])

> +

> +def clean_line(line):

> +     line = line.strip()

> +     if line.endswith('$"'):

> +             # ERROR LINE LIKE ^./.<sent>$"

> +             line = line[:-2]

> +     line = re.sub(r'(.)', r'\1 ', line)

> +

> +     for special_char in special_regex_chars:

> +             line = re.sub('\\{}'.format(special_char), 
> '%{}'.format(special_char), line)

> +

> +     line = re.sub(r'(\<.*?\>)', lambda m:

> +             clean_tag_patterns(m.group(0)),line)

> +     # HANDLE TAGS

> +     line = line.strip()

> +     line = '[{}]'.format(line)

> +     # line = '[?*][{}][?*]'.format(line)

> +     return line

> +

> +if __name__ == '__main__':

> +     FILE_NAME = sys.argv[1]

> +     with open(FILE_NAME, 'r') as f:

> +             lines =[clean_line(line) for line in f.readlines() if not 
> line.startswith('*')]

> +

> +     lines = list(pd.Series(lines).value_counts().reset_index().apply(lambda 
> r: '{}::{}'.format(r['index'].strip(), -np.log(r[0]/len(lines))), axis=1))

> +

> +     with open(FILE_NAME, 'w') as f:

> +             f.write('[?*]::1000000\n')

> +             for line in lines:

> +                     f.write(line+'\n')

> \ No newline at end of file



needs "--help", license and copyright info; probably could mark it

executable. It's also important to figure out what *exact*

dependencies this has; does it work with python2? python3? Does

lttoolbox already have a python dependency? If not, better to use

something we already depend on.
- lttoolbox uses python only for testing 
(https://github.com/apertium/lttoolbox/search?l=python) so I believe python 
isn't a dependency.
Let's discuss the used languages for the scripts on the upcoming meeting.

> --

> 2.21.0

>

> From 3a746cd0afe16f92ffc9628d71b3e37c95244da0 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Fri, 31 May 2019 21:03:57 +0200

> Subject: [PATCH 07/12] Remove un-necessary data files

>

> ---

>  scripts/data/corpus.tagged  |   8 --------

>  scripts/data/transducer.bin | Bin 397 -> 0 bytes

>  2 files changed, 8 deletions(-)

>  delete mode 100644 scripts/data/corpus.tagged

>  delete mode 100644 scripts/data/transducer.bin

>

> diff --git a/scripts/data/corpus.tagged b/scripts/data/corpus.tagged


> deleted file mode 100644

> index 60f60ea..0000000

> --- a/scripts/data/corpus.tagged

> +++ /dev/null

> @@ -1,8 +0,0 @@

> -^a/a<n><compound-only-L>$

> -^a/a<pr><compound-only-L>$

> -^b/b<n><compound-R>$

> -^b/b<pr><compound-R>$

> -^c/d<compound-only-L><n>$

> -^c/d<compound-only-L><pr>$

> -^d/d<compound-R><n>$

> -^d/d<compound-R><pr>$

> diff --git a/scripts/data/transducer.bin b/scripts/data/transducer.bin


> deleted file mode 100644

> index 
> d5e4da3b1ba47886cc6701faa52711926d8aee23..0000000000000000000000000000000000000000

> GIT binary patch

> literal 0

> HcmV?d00001

>

> literal 397

> zcmZXQ*)Bs-6o$L^*%g%v9;EN^?W(CgraHzerIV?xQdGhlc?ZNSE{s83AZBr44C2CW

> zab;cnS;<=Y{(mjoar~~|B8A3JH34d<rH*<U2ol1kktUjHp_MRgIJDD2CtY;Y!#oQt


> zvcxhgtg^;B8*H-0HaqOH$36!fa>Oww+N&#dQ1|M%o*_aX{X~foCqa?{1{q?Q5tAcL

> zhEcMNF;0#NCYd77G^d<#&IOlTam@|4+;PtXk38|r3$MKK&Ig}->2h7ALpq`pdX_n#

> z5~QS|W-E+o+L&aF=8RR;XPT7p&HJ#+{tv?*rA)?{vJ4_lT)B#AlO&ZCUbA07QA`>s


> p;LKfFqPF7%{{HF`r5+`OC8Z^ll&a)MgaNuOVOhd0r4%K?^$i0xdb<Ds

>



OK, when you submit a patch upstream, please clean up the history so

that we don't have no-op sequences, like "create data files, remove

data files" :)

- Noted

> --

> 2.21.0

>

> From d92efe0a07bfb0b7a267f16ccdd26fd3e182ae19 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Tue, 4 Jun 2019 21:17:45 +0200

> Subject: [PATCH 08/12] Fix the repeated analyses bug

>

> FSTs used to have two paths (weighted and unweighted ones)

> for the same analysis.

> FST subtraction and disjunction were used to solve the issue.



Great that you include a long-form message! But it's not quite

detailed enough for me to understand; can you provide some bug

report-like data, like "what previously happens" vs "what you want to

happen"? And describe how you use FST arithmetic to solve the problem?

- Noted


> ---

>  scripts/lt-weight                | 12 ++++++++----

>  scripts/prepare_regex_strings.py |  1 -

>  2 files changed, 8 insertions(+), 5 deletions(-)

>

> diff --git a/scripts/lt-weight b/scripts/lt-weight

> index 8cfe185..25f44dd 100755

> --- a/scripts/lt-weight

> +++ b/scripts/lt-weight

> @@ -6,7 +6,7 @@ OUTPUT_FST=$3

>  # Temporary directory for intermediate files

>  TEMP_DIR=".tmp"

>  # Check if it exists

> -if [-d "$TEMP_DIR"]; then

> +if ! [ -d "$TEMP_DIR" ]; then



should fix this bug in the patch it was introduced in

- Is editing the history the better choice? My mindset was to fix bugs in new 
commits other than missing-up with the commits history.


>        mkdir $TEMP_DIR

>  fi

>

> @@ -18,7 +18,9 @@ HFST_FST="$TEMP_DIR/transducer.hfst"

>  WEIGHTED_REGEXP="$TEMP_DIR/weighted-regex"

>  WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst"

>  COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst"

> -COMPOSED_ATTFST="$TEMP_DIR/weighted-transducer.att"

> +SUBTRACTED_FST="$TEMP_DIR/subtracted-transducer.hfst"

> +DISJUNCTED_FST="$TEMP_DIR/disjuncted-weighted-transducer.hfst"

> +DISJUNCTED_ATTFST="$TEMP_DIR/weighted-transducer.att"

>

>  # Convert the input FST to HFST

>  lt-print $FST |

> @@ -41,10 +43,12 @@ cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST

>

>  # Compose the input FST and the weighted FST

>  hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST

> -hfst-fst2txt -i $COMPOSED_FST -o $COMPOSED_ATTFST

> +hfst-subtract $HFST_FST $COMPOSED_FST -o $SUBTRACTED_FST

> +hfst-disjunct $SUBTRACTED_FST $COMPOSED_FST -o $DISJUNCTED_FST

> +hfst-fst2txt -i $DISJUNCTED_FST -o $DISJUNCTED_ATTFST

>

>  # Compile the FST back using lttoolbox

> -lt-comp lr $COMPOSED_ATTFST $OUTPUT_FST

> +lt-comp lr $DISJUNCTED_ATTFST $OUTPUT_FST

>

>  # Delete the temporary files

>  # rm -rf $TEMP_DIR

> diff --git a/scripts/prepare_regex_strings.py 
> b/scripts/prepare_regex_strings.py

> index d2eac86..2f9300a 100644

> --- a/scripts/prepare_regex_strings.py

> +++ b/scripts/prepare_regex_strings.py

> @@ -36,6 +36,5 @@ if __name__ == '__main__':

>        lines = 
> list(pd.Series(lines).value_counts().reset_index().apply(lambda r: 
> '{}::{}'.format(r['index'].strip(), -np.log(r[0]/len(lines))), axis=1))

>

>        with open(FILE_NAME, 'w') as f:

> -             f.write('[?*]::1000000\n')



you don't mention why this is being removed; can't you just remove it

from the earlier patch?
- OOV tokens are weighted in a different way now.


>                for line in lines:

>                        f.write(line+'\n')

> \ No newline at end of file

> --

> 2.21.0

>

> From 6caa6cd356732dcf3249522e67fe7c03ec2134f9 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Tue, 4 Jun 2019 23:03:01 +0200

> Subject: [PATCH 09/12] Fix the used character to replace a space



What was wrong with the old one? Maybe fix it earlier in the patch

series?
- I thought using hfst's special token for a space was better.

>

> ---

>  scripts/lt-weight | 2 +-

>  1 file changed, 1 insertion(+), 1 deletion(-)

>

> diff --git a/scripts/lt-weight b/scripts/lt-weight

> index 25f44dd..5e7561a 100755

> --- a/scripts/lt-weight

> +++ b/scripts/lt-weight

> @@ -25,7 +25,7 @@ DISJUNCTED_ATTFST="$TEMP_DIR/weighted-transducer.att"

>  # Convert the input FST to HFST

>  lt-print $FST |

>  sed -e 's/:/\\:/'|

> -sed -e :a -e 's/ /\&#39;@_SPACE_@\&#39;/;ta'> $ATTFST

> +sed -e :a -e 's/ /@_SPACE_@/;ta'> $ATTFST

>

>  hfst-txt2fst --epsilon=ε -i $ATTFST -o $HFST_FST

>

> --

> 2.21.0

>

> From 96322670d0d957f8b3fae476fe6b255247fe4f56 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Wed, 5 Jun 2019 00:25:40 +0200

> Subject: [PATCH 10/12] Minimize the FST to prevent segmenation faults in

>  lt-comp





why is lt-comp segfaulting? that shouldn't happen; can you describe

the problem / link a bug for lt-comp?

- Large fsts. Didn't really investigate the size of the fsts back then.
Just used minimize to optimize the fsts' sizes.


> ---

>  scripts/lt-weight | 8 +++++---

>  1 file changed, 5 insertions(+), 3 deletions(-)

>

> diff --git a/scripts/lt-weight b/scripts/lt-weight

> index 5e7561a..0fa104c 100755

> --- a/scripts/lt-weight

> +++ b/scripts/lt-weight

> @@ -20,7 +20,8 @@ WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst"

>  COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst"

>  SUBTRACTED_FST="$TEMP_DIR/subtracted-transducer.hfst"

>  DISJUNCTED_FST="$TEMP_DIR/disjuncted-weighted-transducer.hfst"

> -DISJUNCTED_ATTFST="$TEMP_DIR/weighted-transducer.att"

> +MINIMIZED_FST="$TEMP_DIR/minimized-weighted-transducer.hfst"

> +MINIMIZED_ATTFST="$TEMP_DIR/weighted-transducer.att"

>

>  # Convert the input FST to HFST

>  lt-print $FST |

> @@ -45,12 +46,13 @@ cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST

>  hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST

>  hfst-subtract $HFST_FST $COMPOSED_FST -o $SUBTRACTED_FST

>  hfst-disjunct $SUBTRACTED_FST $COMPOSED_FST -o $DISJUNCTED_FST

> -hfst-fst2txt -i $DISJUNCTED_FST -o $DISJUNCTED_ATTFST

> +hfst-minimize $DISJUNCTED_FST -o $MINIMIZED_FST

> +hfst-fst2txt -i $MINIMIZED_FST -o $MINIMIZED_ATTFST

>

>  # Compile the FST back using lttoolbox

>  lt-comp lr $DISJUNCTED_ATTFST $OUTPUT_FST

>

>  # Delete the temporary files

> -# rm -rf $TEMP_DIR

> +rm -rf $TEMP_DIR

>

>  # ./lt-weight ../../apertium-eng/eng.automorf.bin 
> ../../apertium-eng/texts/eng.tagged efst.bin

> --

> 2.21.0

>

> From b638ab547c9af84d390bcf66365ded9607f775e0 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Thu, 6 Jun 2019 22:33:17 +0200

> Subject: [PATCH 11/12] Fix the transducer file name bug

>



earlier patch should be fixed



> ---

>  scripts/lt-weight | 2 +-

>  1 file changed, 1 insertion(+), 1 deletion(-)

>

> diff --git a/scripts/lt-weight b/scripts/lt-weight

> index 0fa104c..dd4ab69 100755

> --- a/scripts/lt-weight

> +++ b/scripts/lt-weight

> @@ -50,7 +50,7 @@ hfst-minimize $DISJUNCTED_FST -o $MINIMIZED_FST

>  hfst-fst2txt -i $MINIMIZED_FST -o $MINIMIZED_ATTFST

>

>  # Compile the FST back using lttoolbox

> -lt-comp lr $DISJUNCTED_ATTFST $OUTPUT_FST

> +lt-comp lr $MINIMIZED_ATTFST $OUTPUT_FST

>

>  # Delete the temporary files

>  rm -rf $TEMP_DIR

> --

> 2.21.0

>

> From b621ab62c784a09a5dd8721e75b7896168679244 Mon Sep 17 00:00:00 2001


> From: Amr Keleg <amr_moha...@live.com>

> Date: Thu, 13 Jun 2019 09:35:04 +0200

> Subject: [PATCH 12/12] Add a large weight to unweighted final states




provide some explanation for this?
- Ok, noted


>

> ---

>  scripts/lt-weight | 6 ++++--

>  1 file changed, 4 insertions(+), 2 deletions(-)

>

> diff --git a/scripts/lt-weight b/scripts/lt-weight

> index dd4ab69..0d08cea 100755

> --- a/scripts/lt-weight

> +++ b/scripts/lt-weight

> @@ -19,6 +19,7 @@ WEIGHTED_REGEXP="$TEMP_DIR/weighted-regex"

>  WEIGHTED_FST="$TEMP_DIR/weighted-pairs.hfst"

>  COMPOSED_FST="$TEMP_DIR/weighted-transducer.hfst"

>  SUBTRACTED_FST="$TEMP_DIR/subtracted-transducer.hfst"

> +DEFAULT_WEIGHTED_FST="$TEMP_DIR/default-weighted-transducer.hfst"


>  DISJUNCTED_FST="$TEMP_DIR/disjuncted-weighted-transducer.hfst"

>  MINIMIZED_FST="$TEMP_DIR/minimized-weighted-transducer.hfst"

>  MINIMIZED_ATTFST="$TEMP_DIR/weighted-transducer.att"

> @@ -45,12 +46,13 @@ cat $WEIGHTED_REGEXP | hfst-regexp2fst -j -o $WEIGHTED_FST

>  # Compose the input FST and the weighted FST

>  hfst-compose -1 $HFST_FST -2 $WEIGHTED_FST -v -o $COMPOSED_FST

>  hfst-subtract $HFST_FST $COMPOSED_FST -o $SUBTRACTED_FST

> -hfst-disjunct $SUBTRACTED_FST $COMPOSED_FST -o $DISJUNCTED_FST

> +hfst-reweight -i $SUBTRACTED_FST -o $DEFAULT_WEIGHTED_FST -e -a 1000000000

> +hfst-disjunct $DEFAULT_WEIGHTED_FST $COMPOSED_FST -o $DISJUNCTED_FST


>  hfst-minimize $DISJUNCTED_FST -o $MINIMIZED_FST

>  hfst-fst2txt -i $MINIMIZED_FST -o $MINIMIZED_ATTFST

>

>  # Compile the FST back using lttoolbox

> -lt-comp lr $MINIMIZED_ATTFST $OUTPUT_FST

> +../lttoolbox/lt-comp lr $MINIMIZED_ATTFST $OUTPUT_FST

>

>  # Delete the temporary files

>  rm -rf $TEMP_DIR

> --

> 2.21.0

>

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff

Re: [Apertium-stuff] [PATCH] supervised weighting of automata

Reply via email to