Hello Francis,
While reading some literature, I got an idea for handling unknown
abbreviations. It may be a little different, but I think it should work.

First, we build a mapping of common substitution patterns by taking a
corpus of, say, 3000 tokens, e.g.:
@ ==> at, 8 ==> ate, yte ==> ight

We also train a character n-gram model.
When an unknown word comes in, we apply all possible patterns to it,
generating candidate words or sequences (which may not exist in reality),
and keep the ones that score highly under the character model.

Let's say we keep the 20 candidates with the highest character n-gram
scores. We can then narrow them down further by checking which of them
appear in the dictionary.
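The dictionary check could be as simple as intersecting the candidate list with a word list (the toy DICTIONARY here is a stand-in for a real lexicon):

```python
# Toy word list standing in for a real monolingual dictionary.
DICTIONARY = {"later", "tonight", "great", "night", "the"}

def filter_by_dictionary(candidates):
    """Keep only candidates found in the dictionary; if none survive,
    fall back to the unfiltered list so we never return nothing."""
    hits = [c for c in candidates if c in DICTIONARY]
    return hits if hits else list(candidates)
```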

Then, by substituting the surviving candidates into the sentence, we can
disambiguate between them with a word-based language model.
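This last step might look like the following sketch, where the word model is a plain add-one-smoothed bigram model and the training sentences and vocabulary-size constant are illustrative:

```python
import math
from collections import Counter

def train_word_bigrams(sentences):
    """Count word bigrams (with sentence boundary markers)."""
    counts, context = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return counts, context

def sentence_logprob(sentence, counts, context, vocab=1000):
    """Add-one-smoothed bigram log-probability of a whole sentence."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((counts[(a, b)] + 1) / (context[a] + vocab))
               for a, b in zip(toks, toks[1:]))

def disambiguate(template, candidates, counts, context):
    """Slot each candidate into the sentence template (which contains one
    '{}' at the unknown word's position) and pick the best-scoring one."""
    return max(candidates,
               key=lambda c: sentence_logprob(template.format(c),
                                              counts, context))
```

So given a sentence like "see you l8r", the word model would prefer the candidate that fits the surrounding context, e.g. "later" over "tonight".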

Any comments or suggestions?

Regards,
Karan
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
