Hi Amr,

I'm not sure whether this will help you, but apertium-oci/texts contains several Occitan texts that have been manually disambiguated. Approx. 14,000 words. They are:

atom_gascon.tagged.txt
continent.tagged.txt
glacier.tagged.txt
cors_aran.tagged.txt
hlama_coming.tagged.txt
uranus_prov.tagged.txt
Best,
Hector

On Sun, 19 May 2019 at 2:59, Amr Mohamed Hosny Anwar <amr.ke...@eng.asu.edu.eg> wrote:

> Dear maintainers, contributors,
>
> Hope this email finds you well.
>
> This mail can be considered as a status report for detailing next week's
> plan in addition to seeking feedback/suggestions regarding the project.
> After a fruitful discussion with my mentors Nick, Flammie and Francis, we
> have agreed on implementing the supervised way of weighing automata as
> follows:
>
> The command will look like: lt-weight transducer.bin corpus.tagged
>
> transducer.bin: An FST compiled using lttoolbox.
> corpus.tagged: A tagged corpus that will be used to estimate the weights.
>
> The weighting will be done by composing the main "unweighted" FST with a
> set of simple FSTs that are generated for each token.
> A simplified example: if the main FST has an edge a:b::0 and the estimated
> weight for this edge is W, then the main FST will be composed with a simple
> FST with an edge b:b::W, generating a new FST with an edge a:b::W.
>
> To achieve this, I will create a new shell script that makes use of hfst's
> compose (instead of implementing/adding a compose function to
> lttoolbox). We will approve and use this approach if the prototype
> proves to function as expected.
>
> The shell script will work as follows:
> 1) lt-print will be used to convert the FST to AT&T format.
> 2) The weights will be estimated from the tagged corpus by counting the
> unigram lexical forms (a clever set of shell commands can do the job, but I
> am not an expert in shell scripting so it will take me some time - I am
> open to suggestions/sources/examples for doing so).
> 3) For each weighted string, hfst-str2fst (or the corresponding regex
> version) will be used to generate simple FSTs.
> 4) The FSTs will be composed using hfst-compose.
> 5) The final FST will be converted to AT&T format.
> 6) lt-comp will be used to regenerate a weighted FST that is compatible
> with all the tools that rely on apertium.
>
> In this version, we will just use unigram counts for the lexical forms to
> estimate the weights.
> Additionally, the weight will be assigned to the final state and won't be
> distributed among the edges (we will most probably want to change this
> later).
>
> On the other hand, I will try to improve the list of publications/ideas
> that will be used to weigh automata in an unsupervised way.
> I would be grateful if you could share with me resources/ideas regarding
> this part.
>
> Finally, do you have recommendations for tagged corpora that can be used
> throughout the project for benchmarking?
> I am using this English tagged corpus from the apertium-eng repository (
> https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged)
> It would be better if we could do benchmarking on corpora and FSTs of
> different sizes and complexity.
>
> Thanks and looking forward to hearing from you.
> Your suggestions, feedback, feature requests are more than welcome.
>
> Best Regards,
> Amr
>
> ------------------------------
> *From:* Amr Mohamed Hosny Anwar
> *Sent:* Sunday, May 19, 2019 12:50:52 AM
> *To:* apertium-stuff
> *Cc:* nlhow...@gmail.com
> *Subject:* GSoC 19: Unsupervised weighting of automata - Implementing the
> supervised method of weighing autoamata
>
> Dear maintainers, contributors,
>
> Hope this email finds you well.
>
> This mail can be considered as a status report for detailing next week's
> plan in addition to seeking feedback/suggestions regarding the project.
>
> Best Regards,
> Amr Keleg
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
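
On step 2 of the quoted plan (estimating weights by counting unigram lexical forms), a minimal Python sketch might look like the one below. It is only an illustration under stated assumptions, not part of the agreed shell script: it assumes the tagged corpus uses the ^surface/analysis$ token format with a single analysis per token, and it assumes the weight of a lexical form is the negative log of its relative frequency, which the plan does not specify. The function names count_lexical_forms and estimate_weights are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: tokens look like ^surface/analysis$ with exactly one analysis,
# i.e. the corpus is already disambiguated. Tokens that still carry several
# analyses (^a/b/c$) do not match and are skipped. Escaped characters inside
# tokens are not handled in this sketch.
TOKEN_RE = re.compile(r"\^([^/$^]*)/([^/$^]*)\$")


def count_lexical_forms(lines):
    """Count how often each lexical form (analysis) occurs in the corpus."""
    counts = Counter()
    for line in lines:
        for _surface, analysis in TOKEN_RE.findall(line):
            counts[analysis] += 1
    return counts


def estimate_weights(counts):
    """Map each lexical form to -log(relative frequency).

    Assumption: lower weight means more likely; the quoted plan only says
    that unigram counts are used, not how they become weights.
    """
    total = sum(counts.values())
    return {form: -math.log(n / total) for form, n in counts.items()}


if __name__ == "__main__":
    # Made-up sample lines for illustration, not taken from any real corpus.
    sample = [
        "^The/the<det><def><sp>$ ^cat/cat<n><sg>$",
        "^The/the<det><def><sp>$ ^dog/dog<n><sg>$",
    ]
    weights = estimate_weights(count_lexical_forms(sample))
    for form, weight in sorted(weights.items()):
        print(f"{form}\t{weight:.4f}")
```

Each printed line pairs a lexical form with its estimated weight, separated by a tab, which could serve as the list of weighted strings needed in step 3. Forms that never occur in the corpus get no weight at all here, so some smoothing or a default weight would probably be needed in practice.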