Re: [Apertium-stuff] New tool for DixTools

Per Tunedal Wed, 20 Feb 2013 23:15:43 -0800

Hi,
is this work in progress? Can it be tested?

I'm in favor of all means to facilitate contributions from laymen.
Choosing a proper paradigm is a difficult task and thus all help is
welcome. Your proposal is very interesting as it lets the computer do
a great deal of the work.


Just a thought: as a user, you cannot know which words are already
present in the dictionary. Let the user have a guess, i.e. suggest a
word, and if it's not yet in the monodix, let him/her have another
guess until there's a hit (or he/she has had enough of guessing).

Yours,
Per Tunedal

On Tue, Feb 19, 2013, at 22:03, Xavi Ivars wrote:

  Hi,



I'm working on a new tool (or should I say "option"?) for Apertium
DixTools.



I had this idea while adding some new words to the monolingual
dictionaries, and I think that most non-expert contributors to
language pairs do the same: when they want to add a word, they think
of another word that has the same paradigm, look for it, copy its
entry, replace the lemma and the stem with the ones from the new
word, and save the new dix file.



While expert users may know that, for example, in Catalan the paradigm
for feminine nouns that form the plural by adding an "s" is
"abell/a__n", when I want to add half a dozen words after days (or
weeks) without touching the dictionaries, I have to look up what the
paradigm was called. And to add 6 words I will probably need to look
up 3-4 different paradigms.



There's a paper [1] by Miquel Esplà-Gomis, Víctor M. Sánchez-Cartagena
and Juan Antonio Pérez Ortiz where, with the help of a non-expert user,
and using a corpus, a tool tries to guess the paradigm of an unknown
word to add it to the dictionary. Francis Tyers also proposes [2] a
system to learn those paradigms in an unsupervised environment. But
what I want to achieve is a much easier task: avoid having to manually
look for a paradigm name when adding a new word if you know another
word that follows it. So you could think of a naive/dumb version of
Miquel et al.'s tool.



To do that, I'm planning two "versions" of the tool: a supervised mode
and a batch mode. Right now, I have implemented the supervised mode,
and it works as follows (examples from the Catalan en-ca dictionary):



* The user enters a pair of words: the first is the unknown word and
the second is a word already in the dictionary. I'm using "|",
surrounded by whitespace, as a divider for clarity

** assignatura | barrera

* The tool proposes one candidate (or more, in case the word is
ambiguous, e.g. it's both a noun and an adjective)

** <e lm="barrera"><i>barrer</i><par n="abell/a__n"/></e>

* When the user accepts it, the tool "guesses" the most probable stem
by comparing the new word with the "existing" one and its stem, and
"expands" the new word according to it, asking for confirmation

** assignatura     :assignatura<n><f><s>

** assignatures    :assignatura<n><f><pl>

* If it's incorrect, the tool starts to show all possible stems,
starting with the full lemma and removing one character each time,
until the user accepts one of them

* When accepted, the tool generates the entry and adds it to the
dictionary

** <e lm="assignatura"><i>assignatur</i><par n="abell/a__n"/></e>
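The stem-guessing and entry-generation steps above can be sketched in a
few lines of Python. This is only an illustration, not the DixTools
code: the function names guess_stem, candidate_stems and make_entry are
made up for this example, and the sketch assumes the stem of the known
word is a prefix of its lemma.

```python
from xml.sax.saxutils import escape, quoteattr


def guess_stem(new_lemma, known_lemma, known_stem):
    """Guess the new stem by dropping the ending the known lemma drops.

    Returns None when the known stem is not a prefix of the known lemma,
    or when the new lemma does not share the same ending.
    """
    if not known_lemma.startswith(known_stem):
        return None
    ending = known_lemma[len(known_stem):]
    if not new_lemma.endswith(ending):
        return None
    return new_lemma[:len(new_lemma) - len(ending)]


def candidate_stems(lemma):
    """Fallback list: the full lemma first, then one character shorter each time."""
    return [lemma[:n] for n in range(len(lemma), 0, -1)]


def make_entry(lemma, stem, paradigm):
    """Build a monodix entry string (hypothetical helper, minimal escaping)."""
    return "<e lm=%s><i>%s</i><par n=%s/></e>" % (
        quoteattr(lemma), escape(stem), quoteattr(paradigm))


stem = guess_stem("assignatura", "barrera", "barrer")
print(make_entry("assignatura", stem, "abell/a__n"))
```

If guess_stem returns None, or the user rejects the guess, the tool
would walk through candidate_stems("assignatura") and ask about each
one in turn.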



The batch mode may have some restrictions, e.g. the "existing" word
can't be ambiguous (have more than one PoS), but it will allow adding
a list of words from a text file (CSV, for example).
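A rough sketch of how such a batch mode could behave, again in Python
and again only under assumptions: the CSV layout (new word, known word),
the names index_entries and batch_add, and the rule of skipping any row
whose known word is missing or has more than one entry are all invented
for this example. Only simple <e lm="..."><i>...</i><par n="..."/></e>
entries are handled.

```python
import csv
import io
import xml.etree.ElementTree as ET


def index_entries(dix_xml):
    """Map each lemma to the (stem, paradigm) pairs found for it."""
    index = {}
    for e in ET.fromstring(dix_xml).iter("e"):
        lemma, i, par = e.get("lm"), e.find("i"), e.find("par")
        if lemma is None or i is None or par is None:
            continue  # entry shape this sketch does not handle
        stem = "".join(i.itertext())
        index.setdefault(lemma, []).append((stem, par.get("n")))
    return index


def batch_add(csv_text, index):
    """Return (added, skipped) for rows of the form: new_word,known_word."""
    added, skipped = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            skipped.append(row)
            continue
        new, known = (cell.strip() for cell in row)
        matches = index.get(known, [])
        if len(matches) != 1:
            skipped.append(row)  # unknown or ambiguous known word
            continue
        stem, paradigm = matches[0]
        ending = known[len(stem):]
        if not known.startswith(stem) or not new.endswith(ending):
            skipped.append(row)  # cannot derive a stem automatically
            continue
        added.append((new, new[:len(new) - len(ending)], paradigm))
    return added, skipped
```

Skipped rows could then be handed to the supervised mode instead of
being silently dropped.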



Now my questions are:

* Do you find this is a useful tool?

* Does something similar already exist (that I'm unaware of)?

* Do you have any suggestion for the name of this "tool"?



As I said, this tool may not be used by people who work with the
dictionaries on a daily basis, as I'm pretty sure they'll go faster by
manually editing the dictionaries; it won't be useful for total newbies
either, as (right now) the tool assumes some knowledge about the
dictionary formats. But I guess that with a tool like this, we could
have a bigger base of contributors to some language pairs.



[1] http://www.aclweb.org/anthology/R/R11/R11-1047.pdf

[2] http://wiki.apertium.org/wiki/Improved_corpus-based_paradigm_matching



Regards,

--
< Xavi Ivars >
< http://xavi.ivars.me >

------------------------------------------------------------------------------
Everyone hates slow websites. So do we.
Make your web apps faster with AppDynamics
Download AppDynamics Lite for free today:
http://p.sf.net/sfu/appdyn_d2d_feb
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff

