Hi Andres,

On Fri, 8 Jul 2005, Andres Hohendahl wrote:

> I am working on a (personal) natural language processing project, to
> play around with syntactic, semantic and morphological processing.
You have a big project, I would say :)

> I also want to parse several part-of-speech segments for NL, in order
> to get correct grammar testing using EBNF and C# under the .NET
> framework.
>
> There are lots of mutually exclusive parts when defining the different
> tokens as words,

I don't think there is much use for many different "word" tokens. Just
slice the text into sentences, then into words and punctuation marks.
Then all the fun starts.

> the dictionary is not able nor practical to be loaded as EBNF,

True, and you cannot even use Grammatica with too rich a grammar, since
the generated parser might expand beyond 64K, which is the maximum
method size in Java (though I wouldn't count on it, C# would probably
still work fine =)

> also the natural grammar is heavily context- or inter-token-dependent,

Do you have a good word database? For example, the input "A cat has a
hat." should match a Subject Predicate Object pattern (sorry, my
knowledge of spoken languages and the right terms is limited):

    A, a = noun (the letter of the alphabet) or article
    cat  = noun
    hat  = noun
    has  = verb

The token stream for such a sentence would be:

    word(noun, article), word(noun), word(verb), word(noun, article),
    word(noun)

Well, it sounds like you would need a context-sensitive tokenizer where
different possibilities are tried until the token stream matches.

> To allow this (I guess) I must make the tokenizer somewhat
> context-dependent and tokenize in several alternate ways using
> recursive pattern scanning, allowing it to explore the combinations of
> word functions that best fit a production.
>
> I think this can be done by adding a structure layer on top of the
> Token / Tokenizer classes, producing a callback or event to allow
> external classes and methods to operate on and get the context data
> for this token; finally, there must be a trial-and-error or scoring
> step to select the most appropriate token which fulfills the
> production(s).
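That trial-and-error idea (try each possible word function, backtrack
when the pattern fails) could be sketched roughly like below. This is
only a toy illustration with a made-up lexicon and category names, not
Grammatica's actual Token/Tokenizer API:

```java
import java.util.*;

public class AmbiguousTokenizerSketch {
    // Toy lexicon: each surface word maps to its possible categories.
    static final Map<String, List<String>> LEXICON = Map.of(
        "a",   List.of("article", "noun"),
        "cat", List.of("noun"),
        "has", List.of("verb"),
        "hat", List.of("noun")
    );

    // Try to assign one category per word so the whole sentence
    // matches the pattern; backtrack over the alternatives.
    static List<String> match(List<String> words, List<String> pattern) {
        return match(words, pattern, 0, new ArrayList<>());
    }

    private static List<String> match(List<String> words, List<String> pattern,
                                      int i, List<String> chosen) {
        if (i == words.size()) return new ArrayList<>(chosen);
        for (String cat : LEXICON.getOrDefault(words.get(i), List.of())) {
            if (cat.equals(pattern.get(i))) {
                chosen.add(cat);
                List<String> result = match(words, pattern, i + 1, chosen);
                if (result != null) return result;
                chosen.remove(chosen.size() - 1);  // undo and try next category
            }
        }
        return null;  // no consistent assignment from this point
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "cat", "has", "a", "hat");
        List<String> svo = List.of("article", "noun", "verb", "article", "noun");
        // The ambiguous "a" is resolved to "article" because that is
        // the only choice that lets the full pattern succeed.
        System.out.println(match(words, svo));
    }
}
```

A scoring step would replace the first-match return with collecting all
successful assignments and picking the best one.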
> I have already successfully coded several classes which check the
> functions of a word as a set of types, using affix reduction,
> dictionary lookup and intelligent de-stemming.
>
> Any suggestion or clue?

I couldn't follow the idea in the last three paragraphs, but I wish you
the best of luck with your project. Maybe I could understand it if you
provide some examples.

-Matti

_______________________________________________
Grammatica-users mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/grammatica-users
