Hello Francis,
While reading some literature, I got an idea for handling unknown
abbreviations. It may be a little different, but I think it should work.

First, we build a mapping of common substitution patterns by taking a
corpus of, say, 3000 tokens, e.g.:
@ ==> at, 8 ==> ate, yte ==> ight

We also train a character n-gram model.
When an unknown word comes in, we apply all possible patterns to it,
generating candidate words or sequences (which may not exist in reality),
and keep the ones that score highly under the character model.

Let's say we keep the 20 candidates with the highest character n-gram
scores. We can then narrow them down further by checking which of them
appear in the dictionary.
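The dictionary check could be as simple as intersecting the candidate list with a word list (the toy DICTIONARY here is a stand-in for a real lexicon):

```python
# Toy word list standing in for a real monolingual dictionary.
DICTIONARY = {"later", "tonight", "great", "night", "the"}

def filter_by_dictionary(candidates):
    """Keep only candidates found in the dictionary; if none survive,
    fall back to the unfiltered list so we never return nothing."""
    hits = [c for c in candidates if c in DICTIONARY]
    return hits if hits else list(candidates)
```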

Then, by substituting the surviving candidates into the sentence, we can
disambiguate between them with a word-based language model.
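This last step might look like the following sketch, where the word model is a plain add-one-smoothed bigram model and the training sentences and vocabulary-size constant are illustrative:

```python
import math
from collections import Counter

def train_word_bigrams(sentences):
    """Count word bigrams (with sentence boundary markers)."""
    counts, context = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return counts, context

def sentence_logprob(sentence, counts, context, vocab=1000):
    """Add-one-smoothed bigram log-probability of a whole sentence."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((counts[(a, b)] + 1) / (context[a] + vocab))
               for a, b in zip(toks, toks[1:]))

def disambiguate(template, candidates, counts, context):
    """Slot each candidate into the sentence template (which contains one
    '{}' at the unknown word's position) and pick the best-scoring one."""
    return max(candidates,
               key=lambda c: sentence_logprob(template.format(c),
                                              counts, context))
```

So given a sentence like "see you l8r", the word model would prefer the candidate that fits the surrounding context, e.g. "later" over "tonight".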

Any comments or suggestions?

Regards,
Karan
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
