Hi Amr,

I'm not sure whether it will help you, but apertium-oci/texts contains several
manually disambiguated Occitan texts, approx. 14,000 words in total. They are:
atom_gascon.tagged.txt
continent.tagged.txt
glacier.tagged.txt
cors_aran.tagged.txt
hlama_coming.tagged.txt
uranus_prov.tagged.txt

Best,
Hector

Message from Amr Mohamed Hosny Anwar <amr.ke...@eng.asu.edu.eg> on Sun.,
19 May 2019 at 2:59:

> Dear maintainers, contributors,
>
> Hope this email finds you well.
>
> This mail can be considered a status report detailing next week's
> plan, in addition to seeking feedback/suggestions regarding the project.
> After a fruitful discussion with my mentors Nick, Flammie and Francis, we
> have agreed on implementing the supervised way of weighing automata as
> follows:
>
> The command will look like: lt-weight transducer.bin corpus.tagged
>
> transducer.bin: An FST compiled using lttoolbox.
> corpus.tagged: A tagged corpus that will be used to estimate the weights.
>
> The weighting will be done by composing the main "unweighted" FST with a
> set of simple FSTs that are generated for each token.
> A simplified example: if the main FST has an edge a:b::0 and the estimated
> weight for this edge is W, then the main FST will be composed with a simple
> FST with an edge b:b::W, generating a new FST with an edge a:b::W.
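
A small sketch of that composition step, in case it helps to check the arithmetic. It assumes epsilon-free transducers stored as plain arc lists and weights that add along a path (tropical semiring); it ignores final states and unreachable states, and the value of W is made up for illustration. It is not how hfst or lttoolbox implement composition.

```python
from itertools import product


def compose(fst1, fst2):
    """Compose two epsilon-free weighted FSTs given as lists of arcs.

    Each arc is (src, dst, input_symbol, output_symbol, weight).
    Weights are added, as in the tropical semiring.
    """
    arcs = []
    for (s1, d1, i1, o1, w1), (s2, d2, i2, o2, w2) in product(fst1, fst2):
        # An arc pair matches when the output of the first FST
        # equals the input of the second FST.
        if o1 == i2:
            arcs.append(((s1, s2), (d1, d2), i1, o2, w1 + w2))
    return arcs


W = 2.5  # hypothetical estimated weight
main_fst = [(0, 1, "a", "b", 0.0)]   # the edge a:b::0
weight_fst = [(0, 1, "b", "b", W)]   # the edge b:b::W

print(compose(main_fst, weight_fst))
```

The single resulting arc reads a, writes b and carries weight 0 + W, which is the a:b::W edge described above.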
>
> To achieve this, I will create a new shell script that makes use of hfst's
> compose (instead of implementing/adding a compose function to
> lttoolbox). We will approve and use this approach if the prototype
> proves to function as expected.
>
> The shell script will work as follows:
> 1) lt-print will be used to convert the FST to AT&T format.
> 2) The weights will be estimated from the tagged corpus by counting the
> unigram lexical forms (a clever set of shell commands can do the job, but I
> am not an expert in shell scripting, so it will take me some time; I am
> open to suggestions/sources/examples for doing so).
> 3) For each weighted string, hfst-str2fst (or the corresponding regex
> version) will be used to generate a simple FST.
> 4) The FSTs will be composed using hfst-compose.
> 5) The final FST will be converted to AT&T format.
> 6) lt-comp will be used to regenerate a weighted FST that is compatible
> with all the tools that rely on apertium.
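
For step 2, here is a rough sketch of the unigram counting in Python rather than shell. It assumes each token in the tagged corpus looks like ^surface/lexical_form$ with exactly one analysis, ignores escaped characters, and uses -log(relative frequency) as the weight; the function name, the sample text and the weighting formula are illustrative assumptions, not something we agreed on.

```python
import math
import re
from collections import Counter

# Assumed token format: ^surface/lexical_form$ with a single analysis.
# Escaped characters such as \/ are not handled in this sketch.
TOKEN_RE = re.compile(r"\^[^/^$]*/([^/^$]+)\$")


def unigram_weights(text):
    """Map each lexical form to -log(count / total) over the corpus."""
    counts = Counter(TOKEN_RE.findall(text))
    total = sum(counts.values())
    return {form: -math.log(n / total) for form, n in counts.items()}


# Made-up sample in the assumed format.
sample = "^the/the<det>$ ^dogs/dog<n><pl>$ ^the/the<det>$ ^cat/cat<n><sg>$"

# Print one "lexical_form<TAB>weight" line per form.
for form, weight in sorted(unigram_weights(sample).items()):
    print(f"{form}\t{weight:.4f}")
```

More frequent lexical forms get smaller weights, so the most common analysis would come out as the cheapest path. The tab-separated lines are only one possible hand-off format for step 3.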
>
> In this version, we will just use unigram counts for the lexical forms to
> estimate the weights.
> Additionally, the weight will be assigned to the final state and won't be
> distributed among the edges (we will most probably want to change this
> later).
>
> On the other hand, I will try to improve the list of publications/ideas
> that will be used to weigh automata in an unsupervised way.
> I would be grateful if you could share resources/ideas regarding
> this part.
>
> Finally, do you have recommendations for tagged corpora that can be used
> throughout the project for benchmarking?
> I am using this English tagged corpus from the apertium-eng repository (
> https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged).
> It would be better if we could benchmark on corpora and FSTs of
> different sizes and complexity.
>
> Thanks and looking forward to hearing from you.
> Your suggestions, feedback, feature requests are more than welcome.
>
> Best Regards,
> Amr
>
> ------------------------------
> *From:* Amr Mohamed Hosny Anwar
> *Sent:* Sunday, May 19, 2019 12:50:52 AM
> *To:* apertium-stuff
> *Cc:* nlhow...@gmail.com
> *Subject:* GSoC 19: Unsupervised weighting of automata - Implementing the
> supervised method of weighing automata
>
>
> Best Regards,
> Amr Keleg
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>