Re: [Rd] R 2.9.2 crashes when sorting latin1-encoded strings

Simon Urbanek Wed, 30 Sep 2009 07:55:41 -0700

Stefan,

On Sep 30, 2009, at 5:11 , Stefan Evert wrote:

Hi everyone!

I think I stumbled over a bug in the latest R 2.9.2 patched for OS X:
R version 2.9.2 Patched (2009-09-24 r49861)
i386-apple-darwin9.8.0
When I try to sort latin1-encoded character vectors, R sometimes crashes with a segmentation fault. I'm running OS X 10.5.8 and have observed this behaviour both with the i386 and x86_64 builds, in the R.app GUI as well as on the command line.
Here's a minimal example that reliably triggers the crash on my machine:
=====
print(sessionInfo())

words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
str(words)

print(table(Encoding(words)))
Encoding(words) <- "latin1"  # this is the correct encoding!
print(table(Encoding(words)))

N <- 1000
words <- rep(words, length.out=N)

print(N)
for (i in 1:N) {
  x <- words[1:i]
  # the following line will crash for some i, depending on the particular
  # strings in <words> and the subset selected for <x> above
  order(x)
}
=====
The output I get from this code is appended at the end of the mail. Note that R incorrectly declares the latin1 strings in <words> to have UTF-8 encoding (this seems wrong to me because the \x escapes insert raw bytes into the string).

It is correct, because you're in a UTF-8 locale (see l10n_info()) so all strings are UTF-8 by default - you're just manually creating a string that is not valid in UTF-8.
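The point about the locale and invalid UTF-8 can be checked directly. The following is a minimal sketch, not part of the original exchange; it assumes R >= 3.3.0 (for validUTF8()) and builds the string from raw bytes instead of \x escapes so that the source stays pure ASCII. What l10n_info() reports depends on the locale R was started in.

```r
# Bytes 0x61 0xFC spell "a" + u-umlaut in latin1, but are not valid UTF-8.
s <- rawToChar(as.raw(c(0x61, 0xfc)))

# TRUE when R runs in a UTF-8 locale (locale dependent).
print(l10n_info()[["UTF-8"]])

# The byte sequence is not well-formed UTF-8, whatever the locale.
print(validUTF8(s))

# Declaring the encoding changes only the mark on the string, not its bytes.
Encoding(s) <- "latin1"
print(Encoding(s))
print(charToRaw(s))
```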

The crash only occurs if the correct "latin1" encoding (or "unknown") is explicitly specified. Otherwise the string handling code appears to ignore everything after the first invalid multibyte character.
I haven't been able to trigger the bug without some kind of loop. The crash always occurs at the same iteration, but this changes depending on the contents of <words> and the specific subset selected in each loop iteration. Also note that the 64-bit version of R gives a different error message. If I omit the unrelated statement "print(N)", the 64-bit version segfaults and the 32-bit version just hangs with high CPU load. All this suggests to me that there must be some insidious memory corruption or stack/range overflow in the internal ordering code.


Yup:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000

0x0000000100167e0d in R_gc_internal (size_needed=1) at ../../../../R-2.9-branch/src/main/memory.c:1327

1327        PROCESS_NODES();
(gdb) bt

#0 0x0000000100167e0d in R_gc_internal (size_needed=1) at ../../../../R-2.9-branch/src/main/memory.c:1327
#1 0x000000010016a2bf in Rf_allocVector (type=607, length=0) at ../../../../R-2.9-branch/src/main/memory.c:1991
#2 0x000000010016aa65 in R_alloc (nelem=<value temporarily unavailable, due to optimizations>, eltsize=<value temporarily unavailable, due to optimizations>) at ../../../../R-2.9-branch/src/main/memory.c:1669
#3 0x000000010020f316 in Rf_translateCharUTF8 (x=<value temporarily unavailable, due to optimizations>) at ../../../../R-2.9-branch/src/main/sysutils.c:858
#4 0x0000000100216140 in Rf_Scollate (a=0x1023c1518, b=0x0) at ../../../../R-2.9-branch/src/main/util.c:1691
#5 0x00000001001f894e in orderVector1 (indx=<value temporarily unavailable, due to optimizations>, n=<value temporarily unavailable, due to optimizations>, key=0x11b024c00, nalast=TRUE, decreasing=FALSE, rho=0x1020a4778) at ../../../../R-2.9-branch/src/main/sort.c:846
#6 0x00000001001f9605 in orderVector [inlined] () at ../../../../R-2.9-branch/src/main/sort.c:888
#7 do_order (call=<value temporarily unavailable, due to optimizations>, op=<value temporarily unavailable, due to optimizations>, args=0x11843fc38, rho=<value temporarily unavailable, due to optimizations>) at ../../../../R-2.9-branch/src/main/sort.c:891

Note that b=0x0 in the call to Rf_Scollate -- seems like some array overflow in the sorting code... will need some more investigation ...

In the meantime I can offer you a work-around -- working with non-native strings (latin1 in your case) is very expensive because they get converted all the time into the native locale, so you want to run

words<-iconv(words,"latin1","")

and then proceed - it's faster and doesn't crash ;).
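A minimal sketch of this work-around on the test vector from the report, written for illustration only. It builds the strings from raw bytes and converts to "UTF-8" explicitly, where the mail uses "" (the native encoding), so that the result does not depend on the locale; the names latin1_bytes and words_utf8 are made up for this example.

```r
# The seven test strings from the report, given as latin1 byte values.
latin1_bytes <- list(c(0x61, 0x61), c(0x61, 0x62), c(0x61, 0xfc),
                     c(0x61, 0xe4), c(0x62, 0xe4), c(0x62, 0xfc),
                     c(0xe4, 0xfc))
words <- vapply(latin1_bytes, function(b) rawToChar(as.raw(b)), character(1))
Encoding(words) <- "latin1"

# Re-encode once, up front, instead of letting every comparison in
# order() translate the strings again.
words_utf8 <- iconv(words, from = "latin1", to = "UTF-8")

print(Encoding(words_utf8))   # pure-ASCII elements stay "unknown"
print(order(words_utf8))      # the resulting order depends on the collation locale
```

After the conversion the vector can be repeated and sorted as in the original loop; only the one-off iconv() call is added.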

Can other people reproduce this problem on different platforms and possibly with different versions of R?
BTW, I ran into the crash when trying to read.delim() a file in latin1 encoding, using either encoding="latin1" or fileEncoding="latin1", and then converting it back and forth between a character vector and a factor. I still don't understand what's going on there. The behaviour of read.delim() seems to depend very much on my locale settings when running R, which is rather unpleasant.

?? The whole point of a locale is that it declares how you are going to interact with the system. Handling of strings is entirely different depending on the encoding used by the locale - and that is the point of locales. When you are dealing with text (e.g. as files) you must always take the encoding into account and by default they are assumed to be in the same encoding as your locale - you really wouldn't want R to suddenly read all files as let's say eucJP even though your locale is UTF-8 ...
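For files, the two read.delim() arguments mentioned above behave differently: per ?read.table, fileEncoding= re-encodes the input into the native encoding while reading, whereas encoding= only marks the strings that were read. The sketch below is an illustration under assumptions (a temporary file with a single made-up column named word); it uses encoding= and then converts explicitly with enc2utf8().

```r
# Write a tiny latin1-encoded, tab-delimited file: header "word", one row.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x77, 0x6f, 0x72, 0x64, 0x0a,   # "word\n"
                  0x61, 0xe4, 0x0a)),             # "a" + a-umlaut (latin1) + "\n"
         path)

# encoding = "latin1" marks the strings as latin1; the bytes are not converted.
d <- read.delim(path, encoding = "latin1", stringsAsFactors = FALSE)
print(Encoding(d$word))

# Convert explicitly once the declared encoding is known.
print(charToRaw(enc2utf8(d$word)))

unlink(path)
```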

Is there a way to find out how strings are stored internally (i.e. getting the exact byte representation) and whether R believes them to be in UTF-8 or latin1 encoding?

charToRaw() will show you the raw bytes and you define using Encoding() how you want the string to be interpreted (supported is UTF-8, latin1 and unknown). If the encoding is known, R will convert it where needed. Normally R uses the native encoding of the locale you're running in. If you are dealing with files from other locales, you have to tell R accordingly - in most cases it's better to re-encode the strings (?iconv) than to work with the foreign encoding.
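As a small illustration of the two functions named above, the following sketch compares the byte representation of the same text in latin1 and in UTF-8. The variable names are hypothetical, and the string is built from raw bytes to keep the source ASCII-only.

```r
# "a" + a-umlaut, stored as latin1 bytes and declared as such.
s_latin1 <- rawToChar(as.raw(c(0x61, 0xe4)))
Encoding(s_latin1) <- "latin1"

# Because the encoding is declared, R can convert the string where needed.
s_utf8 <- enc2utf8(s_latin1)

print(charToRaw(s_latin1))            # 61 e4
print(charToRaw(s_utf8))              # 61 c3 a4
print(Encoding(c(s_latin1, s_utf8)))  # "latin1" "UTF-8"
```

Both strings represent the same characters; only the stored bytes and the encoding mark differ.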


Cheers,
Simon



Output of sample code on my machine:

> print(sessionInfo())
R version 2.9.2 Patched (2009-09-24 r49861)
i386-apple-darwin9.8.0

locale:
en_GB/en_GB/C/C/en_GB/en_GB

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
>

> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")

> str(words)
chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
> print(table(Encoding(words)))

unknown   UTF-8
      2       5
>
> Encoding(words) <- "latin1"  # this is the correct encoding!
> print(table(Encoding(words)))

 latin1 unknown
      5       2
>
> N <- 1000
> words <- rep(words, length.out=N)
>
> print(N)
[1] 1000
> for (i in 1:N) {
+   x <- words[1:i]
+   # the following line will crash for some i, depending on the particular
+   # strings in <words> and the subset selected for <x> above
+   order(x)
+ }

*** caught bus error ***
address 0x86, cause 'non-existent physical address'

Traceback:
1: order(x)
aborting ...
Bus error


64-bit version:

> print(sessionInfo())
R version 2.9.2 Patched (2009-09-24 r49861)
x86_64-apple-darwin9.8.0

locale:
en_GB/en_GB/C/C/en_GB/en_GB

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
>

> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")

> str(words)
chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
> print(table(Encoding(words)))

unknown   UTF-8
      2       5
>
> Encoding(words) <- "latin1"  # this is the correct encoding!
> print(table(Encoding(words)))

 latin1 unknown
      5       2
>
> N <- 1000
> words <- rep(words, length.out=N)
>
> print(N)
[1] 1000
> for (i in 1:N) {
+   x <- words[1:i]
+   # the following line will crash for some i, depending on the particular
+   # strings in <words> and the subset selected for <x> above
+   order(x)
+ }
Error in order(x) : 'translateCharUTF8' must be called on a CHARSXP
Execution halted


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] R 2.9.2 crashes when sorting latin1-encoded strings

Reply via email to