Can we have the Hebrew discussion on another thread? FWIW, I do agree
it would be a good thing to add.
Thanks,
Grant
On Oct 1, 2008, at 4:02 PM, Nadav Har'El wrote:
On Tue, Sep 30, 2008, Robert Muir wrote about "Re: [jira] Commented:
(LUCENE-1406) new Arabic Analyzer (Apache license)":
Thanks for the clarification. With this method the Arabic analyzer could
lemmatize, not stem, using the Buckwalter dictionary, and things like
broken plurals will work correctly.
I'm not sure yet if Hspell has this type of information, but it would at
least be a better stem for Hebrew as well.
Indeed Hspell also has this information. You can see for example
http://www.cs.technion.ac.il/~danken/cgi-bin/hspell.cgi?text=%E4%F8%EB%E1%FA&ling=on
(but you'll need to be able to read Hebrew to understand what this
means).
But one thing to remember is that if you use Hspell, or basically any
other dictionary, you are committing yourself to a particular vocabulary
and a particular spelling of it. If your stemmer comes across a word
outside your vocabulary, or spelled a bit differently, it won't know
what to do with it.
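
To make the out-of-vocabulary point concrete, here is a minimal sketch of
a dictionary-backed lemmatizer. Everything in it is an assumption made for
illustration: the toy LEXICON, the lemmatize() name, and the English
irregular plurals standing in for Arabic broken plurals. It is not how
Hspell or the Buckwalter dictionary is actually accessed.

```python
# Hypothetical toy lexicon mapping surface forms to lemmas. A real analyzer
# would load these entries from a dictionary such as Hspell or Buckwalter.
LEXICON = {
    "books": "book",
    "mice": "mouse",       # irregular plural: suffix stripping cannot recover it
    "children": "child",
}


def lemmatize(token, lexicon=LEXICON):
    """Return (lemma, known) for a token.

    Tokens missing from the lexicon are passed through unchanged and
    flagged as unknown, which is exactly the failure mode described above.
    """
    key = token.lower()
    if key in lexicon:
        return lexicon[key], True
    return token, False


if __name__ == "__main__":
    for word in ["mice", "Children", "meece"]:
        print(word, lemmatize(word))
```

The second element of the result lets the caller decide what to do with
unknown words, for example fall back to a rule-based stemmer or index the
token as it stands.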
This problem is particularly visible in Hebrew, because its unvowelled
spelling standard (defined by the Academy of the Hebrew Language) is
not very well known. When I was in school, twenty years ago, it wasn't
even mentioned, let alone taught! As a result, some words have a few
spelling variants in the wild, with each dictionary typically
considering one correct and the others misspellings.
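
One possible way to soften the spelling-variant problem, sketched here
under heavy assumptions, is to fold known variants onto a single form
before the dictionary lookup, so that indexed text and queries agree. The
VARIANTS table and the fold_spelling() name are hypothetical, and English
spelling pairs stand in for Hebrew variants; nothing here comes from
Hspell itself.

```python
# Hypothetical table of spelling variants, each mapped to the one form the
# dictionary accepts. In practice this table would have to be curated.
VARIANTS = {
    "colour": "color",
    "grey": "gray",
}


def fold_spelling(token, variants=VARIANTS):
    """Map a known variant spelling to its canonical form.

    Tokens without an entry are only lowercased, so unknown spellings
    still reach the dictionary lookup unchanged.
    """
    key = token.lower()
    return variants.get(key, key)


if __name__ == "__main__":
    for word in ["Colour", "color", "grey", "unlisted"]:
        print(word, "->", fold_spelling(word))
```

This only helps for variants someone has already listed; a spelling
missing from both the table and the dictionary is still unknown.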
--
Nadav Har'El                 | Wednesday, Oct 1 2008, 3 Tishri 5769
IBM Haifa Research Lab       |-----------------------------------------
                             |The two most common elements in the
http://nadav.harel.org.il    |universe are hydrogen and stupidity.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]