Ted Zlatanov <[EMAIL PROTECTED]> writes:
> On Wed, 05 Nov 2008 14:14:31 -0600 "Robert D. Crawford" <[EMAIL PROTECTED]>
> wrote:
>
> RDC> Ted Zlatanov <[EMAIL PROTECTED]> writes:
>>> (string-match "[^\\000-\\1ff]" "hello") ;; OK
>>>
>>> This will match character values over 0x1FF, which is the limit of
>>> extended ASCII.  Does that work for you?
>
> RDC> Will this match the unicode double ">" and the like?  Some people
> RDC> feel the need to use these in their breadcrumbs and such.  If
> RDC> there is no way to just filter out the foreign characters, I will
> RDC> use it.
>
> You can just try it!
>
> (string-match "[^\\000-\\1ff]" "»") ;; returns 0, meaning it's a match
> (string-match "[^\\000-\\1ff]" ">>") ;; returns nil, meaning it's not a match
>
> RDC> The other possibility is to lower permanently on each character that is
> RDC> read to me, but this seems tedious and time consuming on my part and
> RDC> likely slow for gnus to score.
>
> Nah, the above should work.  You will need a single backslash instead of
> two, though (the doubling is needed to tell Emacs Lisp that's a real
> backslash inside the string when it reads it in).
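
On the single versus doubled backslash question, here is a small sketch
that can be evaluated in *scratch* to see the difference.  It assumes
two things about Emacs: the Lisp reader turns an octal escape such as
"\177" into one character, and a backslash is not special inside a
regexp bracket expression.  The names my-non-ascii-re, my-doubled-re
and my-first-non-ascii are made up for illustration only.

```elisp
;; One backslash: the Lisp reader converts the octal escapes, so the
;; regexp holds the real characters NUL and DEL (U+007F) as the
;; endpoints of the range.
(defconst my-non-ascii-re "[^\0-\177]"
  "Hypothetical name: matches one character outside the ASCII range.")

;; Two backslashes: the regexp holds a literal backslash followed by
;; digits.  A backslash is not special inside [...], so this describes
;; a different, accidental character set.
(defconst my-doubled-re "[^\\0-\\177]"
  "Hypothetical name: the doubled-backslash variant, for comparison.")

(defun my-first-non-ascii (string)
  "Return the index of the first non-ASCII character in STRING, or nil."
  (string-match my-non-ascii-re string))

(length "\177")    ; 1: a single DEL character
(length "\\177")   ; 4: backslash, 1, 7, 7

(my-first-non-ascii "hello")          ; nil
(my-first-non-ascii "a \u00bb b")     ; 2, the position of U+00BB
(string-match my-doubled-re "hello")  ; 0, although "hello" is pure ASCII
```

The last form shows why the doubled variant is misleading in Lisp code:
it reports a match on plain ASCII text, because "h" is simply not one
of the characters listed between the brackets.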
"<<" and ">>" have codes U+00AB and U+00BB so that's why they match but there are plenty of other characters which may show up in an English text, like (I'll use a (sequence of) ASCII characters which resembles the proper unicode character) "`" (U+2018), "'" (U+2019), "``" (U+201C) , "''" (U+201D) or "..." (U+2026) which will cause the entry to be filtered out. Besides, I think what you really meant was: (string-match "[^\\0-\\177]" "string") since "1ff" is not a valid octal number. I think that taking the title of the entry and checking if at least 90% are ASCII characters would be sufficient to filter out Asian texts. You can also try taking first 100 (or so) characters of the body. I think you could use replace-regexp-in-string for that purpose: (defun mn-non-english-p (string) (> (* (length (replace-regexp-in-string "[^\\0-\\77]" "" string)) 10) (* (length string) 9))) -- Best regards, _ _ .o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o ..o | Computer Science, Michal "mina86" Nazarewicz (o o) ooo +--<mina86*tlen.pl>--<jid:mina86*jabber.org>--ooO--(_)--Ooo-- _______________________________________________ info-gnus-english mailing list info-gnus-english@gnu.org http://lists.gnu.org/mailman/listinfo/info-gnus-english