On Jun 13, 2005, at 6:18 AM, Markus Wiederkehr wrote:

I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!

There is a section about the problems that hyphens create in "Foundations of Statistical Natural Language Processing". Not only are the cases numerous, but seemingly simple rules, such as joining hyphenated forms at the ends of lines, do not always work. Sometimes the hyphen was added to break the word; sometimes you are already dealing with a hyphenated form that just happened to occur at the end of a line, so the hyphen serves two purposes. I've toyed with the idea of indexing hyphenated words in their raw as well as split forms, but I think that would wreak havoc on the word position stuff, as well as bloat the index with potentially meaningless gibberish.
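
To make the two problems concrete, here is a rough sketch in plain Python. It is not Lucene code, and the function names join_line_end_hyphens and expand_hyphenated are made up for illustration. The first function applies the naive "join at end of line" rule; the second emits a hyphenated token in both raw and split forms, under the assumption that the raw form shares a position with its first part.

```python
def join_line_end_hyphens(lines):
    """Naively glue a word ending in '-' at a line end to the next line."""
    out = []
    carry = ""  # word fragment carried over from the previous line
    for line in lines:
        line = carry + line.strip()
        carry = ""
        if line.endswith("-"):
            # Split off the last fragment and drop its trailing hyphen.
            head, _, tail = line.rpartition(" ")
            carry = tail[:-1]
            line = head
        if line:
            out.append(line)
    if carry:
        # Text ended on a hyphen with nothing to join to: put it back.
        out.append(carry + "-")
    return " ".join(out)


def expand_hyphenated(tokens):
    """Return (term, position) pairs with raw and split forms of each token.

    Assumption for this sketch: the raw hyphenated form is stacked at the
    position of its first part, and every part advances the position.
    """
    pairs = []
    pos = 0
    for tok in tokens:
        parts = [p for p in tok.split("-") if p]
        if len(parts) > 1:
            pairs.append((tok, pos))
        for part in parts:
            pairs.append((part, pos))
            pos += 1
    return pairs


# The rule works when the hyphen only marks a line break ...
print(join_line_end_hyphens(["the infor-", "mation is"]))
# ... but destroys a real hyphenated form that happened to wrap.
print(join_line_end_hyphens(["a well-", "known case"]))
# Raw plus split forms: more terms, and the raw term spans two positions.
print(expand_hyphenated(["a", "well-known", "case"]))
```

In the second call the rule yields "wellknown", which is the failure case described above. In the third, "well-known" sits at one position while its parts occupy two, so a phrase query that mixes the raw term with its neighbours no longer lines up cleanly.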

Peter

