On 1 Sep 2006, at 08:22, Prof Brian Ripley wrote:

> On Wed, 30 Aug 2006, Hans-Joerg Bibiko wrote:
>
>> If you are using 'only' English then
>>
>> str <- "dog"
>> strsplit(str,NULL)[[1]]
>>
>> works perfectly and it is fast.
>
> It does also work 'perfectly' and fast in 'Unicode' in all major
> European and CJK languages (and many others): extending the iconv
> example
>
YES, of course, you are right. R supports Unicode and other encodings
very well. This is one of the reasons why I chose R for my purposes.
If you look at my first example on the Rwiki site, it contains Russian,
German, and two Chinese characters to illustrate that the R function
strsplit handles this perfectly.

When I wrote about 'English' and 'Unicode', my only intention was to put
it simply. In my experience, if I write about 'combining diacritics' or
'combining vowels' etc., some people don't understand these topics. If I
write about 'Unicode', some at least have a vague association with what I
mean. Of course, in a scientific context this is absolutely wrong and
misleading!

> http://www.unicode.org/charts/, so your understanding of 'character'
> seems to differ from Unicode's.

Well, the term 'character' is highly ambiguous. A better term would be
'glyph', to emphasise that I mean a representation of a grapheme. But
even the terms 'glyph', 'grapheme', 'phoneme', etc. are ambiguous. Of
course, my fault was that I didn't clarify my terminology beforehand.

> You write about 'combined Unicode diacritics (accents)', which is
> misleading, as these are not accents (and it is 'combining' not
> 'combined', a crucial difference).

This was my grammatical fault. Sorry. I corrected this.

> To quote Alan Wood
> (http://www.alanwood.net/unicode/combining_diacritical_marks.html)
>   The _characters_ in this range are designed to be used in combination
>   with alphanumeric _characters_, to produce a character+diacritic that
>   is not present in any of the Unicode ranges. For example, ả
>   to produce a lower case "a" with a hook above.

Yes! This is right, but ... To illustrate MY problem I use your French
example with 'façile'.

>> xx
> [1] "façile"
>> strsplit(xx, NULL)
> [[1]]
> [1] "f" "a" "ç" "i" "l" "e"
>> charToRaw(strsplit(xx, NULL)[[1]][3])
> [1] c3 a7
>
> on a UTF-8 system.

There are two possibilities for writing 'façile' in Unicode:

1) "f" "a" "ç" "i" "l" "e"
2) "f" "a" "c" "combining cedilla (\u0327)" "i" "l" "e"

If I now use the R function strsplit, I get two different results.

> a <- "façile"
> strsplit(a,NULL)
[[1]]
[1] "f" "a" "ç" "i" "l" "e"

> b <- "façile"
> strsplit(b,NULL)
[[1]]
[1] "f" "a" "c" "̧" "i" "l" "e"

On the computer screen you don't see any difference between 1) and 2)
{if your system supports this rendering}. The questions are always:
'What do I want to split?' and 'What is a character/glyph in my context?'

I added another nice example to the wiki site:
http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:decomposestring

> So they are used for very rare glyphs made up from two Unicode
> characters, and R correctly views them as two characters.

R views them correctly if a character is defined as a single code point.
On the other hand, in my research I'm working with hundreds of languages
that use these 'rare' glyphs!

To summarise:
- My intention was only to keep it short and simple.
- It was NOT my intention to state that the R function strsplit doesn't
  support Unicode. The R developers did, and are still doing, a great job!
  Thank you so much!
- Last but not least, SORRY for my incompleteness!

With regards,

Hans
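
For anyone who wants to reproduce the two spellings of 'façile' discussed above without depending on how an editor or mail client stores the text, here is a minimal sketch. It assumes an R session that handles UTF-8 strings. The names `precomposed`, `decomposed` and `split_with_marks` are illustrative only and do not come from the thread or the wiki page. Treating only U+0300..U+036F as combining marks is a simplifying assumption, so this is not a full grapheme segmentation.

```r
# Two spellings of the same word, written with \u escapes so that the
# code points are unambiguous regardless of the editor or mail client.
precomposed <- "fa\u00e7ile"   # U+00E7, c with cedilla as one code point
decomposed  <- "fac\u0327ile"  # "c" followed by U+0327, combining cedilla

# Number of code points in each spelling
length(utf8ToInt(precomposed))
length(utf8ToInt(decomposed))

# Hypothetical helper: split a single string into base characters with
# their following combining marks attached.
# Assumption: only the block U+0300..U+036F is treated as combining.
split_with_marks <- function(x) {
  cp <- utf8ToInt(x)
  is_mark <- cp >= 0x0300 & cp <= 0x036F
  # Every non-mark code point opens a new group; marks join the group
  # of the code point that precedes them.
  group <- cumsum(!is_mark)
  unname(vapply(split(cp, group), intToUtf8, character(1)))
}

split_with_marks(precomposed)
split_with_marks(decomposed)
```

With this grouping, both spellings split into six pieces, although the third piece of the decomposed spelling still consists of two code points. Whether that is the right unit depends on the question raised above: what counts as a character/glyph in the given context.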