Paul McGuire wrote: > On Aug 26, 8:05 pm, "Ryan Ginstrom" <[EMAIL PROTECTED]> wrote: >> The only caveat being that since Chinese and Japanese scripts don't >> typically delimit "words" with spaces, I think you'd have to pass the text >> through a tokenizer (like ChaSen for Japanese) before using PyParsing. > > Did you think pyparsing is so mundane as to require spaces between > tokens? Pyparsing has been doing this type of token-recognition since > Day 1. Looking for tokens without delimiting spaces was one of the > first applications for pyparsing. This issue is not unique to Chinese > or Japanese text. Pyparsing will easily find the tokens in this > string: > > y=a*x**2+b*x+c > > as > > ['y','=','a','*','x','**','2','+','b','*','x','+','c']
The difference is that in the expression above (and in many other tokenization problems) you can determine "word" boundaries by looking at the class of character, e.g. alphanumeric vs. punctuation vs. whatever. In Japanese and Chinese tokenization, word boundaries are not marked by different classes of characters. They only exist in the mind of the reader who knows which sequences of characters could be words given the context, and which sequences of characters couldn't. The closest analog would be to ask pyparsing to find the words in the following sentence: ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode. Most approaches that have been even marginally successful on these kinds of tasks have used statistical machine learning approaches. STeVe -- http://mail.python.org/mailman/listinfo/python-list