There are people here at Twitter who know this stuff inside and out. I just haven't roped them in for a fix yet. Once we have a fix in hand, we'll publish recommendations for everyone. Whatever our streaming servers have to do, your streaming clients have to do, and we might as well pool our efforts.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Wed, Apr 7, 2010 at 12:08 PM, <zn...@comcast.net> wrote:
>
> ----- "John Kalucki" <j...@twitter.com> wrote:
>
> > We break the status text into tokens by whitespace and punctuation,
> > then apply the tokens to a hashmap of tracked terms. If the language
> > doesn't have whitespace, the only thing that will match is the entire
> > Tweet.
> >
> > I know that Search has struggled with this as well. I take it that the
> > solutions aren't easy. At some point we'll have to figure something
> > similar out for Streaming. I've filed a story to add support for these
> > languages in Track.
> >
> > -John Kalucki
> > http://twitter.com/jkalucki
> > Infrastructure Twitter Inc.
>
> Thanks! I was just about to add CJK (Chinese - Japanese - Korean) regular
> expressions to my list of research topics! ;-) There must be something in
> the open source world we can (to use the tired old cliché) "leverage off
> of." ;-) Oniguruma?? Namazu?
>
> I suppose we need to look at Cyrillic and right-to-left (Arabic and Hebrew)
> too?
>
> --
> M. Edward (Ed) Borasky
> http://borasky-research.net/smart-at-znmeb
>
> "A mathematician is a device for turning coffee into theorems." ~ Paul
> Erdős
>

--
To unsubscribe, reply using "remove me" as the subject.
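
For anyone following along, here is a rough sketch of the matching behavior described in the quoted message: split the status text on whitespace and punctuation, then look each token up in a hashmap of tracked terms. This is not the actual Track code. The names `tokenize`, `matched_terms`, and the client ids are hypothetical, and the lowercasing and the use of Python's `\w` class as the token definition are assumptions made for illustration.

```python
import re

# Assumption: a "token" is a maximal run of Unicode word characters, so any
# whitespace or punctuation acts as a separator. The real tokenizer may differ.
_SEPARATORS = re.compile(r"[^\w]+")


def tokenize(status_text):
    """Split status text into lowercase tokens on whitespace and punctuation."""
    return [tok.lower() for tok in _SEPARATORS.split(status_text) if tok]


def matched_terms(status_text, tracked):
    """Return the tracked terms that appear as whole tokens in status_text.

    `tracked` is a hypothetical hashmap: lowercase term -> list of client ids.
    """
    hits = {}
    for token in tokenize(status_text):
        if token in tracked:
            hits[token] = tracked[token]
    return hits


tracked = {"twitter": ["client-1"], "東京": ["client-2"]}

# Whitespace-delimited text: the term is found as its own token.
english_hits = matched_terms("Loving Twitter, really!", tracked)

# No whitespace or punctuation: the whole Tweet is a single token,
# so the tracked term inside it is never looked up on its own.
cjk_sentence_hits = matched_terms("東京に行きます", tracked)

# The term matches only when it is the entire Tweet.
cjk_exact_hits = matched_terms("東京", tracked)
```

In this sketch the Japanese sentence produces no match even though it contains the tracked term, which is the limitation described above. Handling it would need a word-segmentation step in place of the simple separator split.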