[Rd] R 2.9.2 crashes when sorting latin1-encoded strings

Stefan Evert Wed, 30 Sep 2009 02:12:29 -0700

Hi everyone!

I think I stumbled over a bug in the latest R 2.9.2 patched for OS X:

R version 2.9.2 Patched (2009-09-24 r49861)
i386-apple-darwin9.8.0

When I try to sort latin1-encoded character vectors, R sometimes crashes with a segmentation fault. I'm running OS X 10.5.8 and have observed this behaviour both with the i386 and x86_64 builds, in the R.app GUI as well as on the command line.


Here's a minimal example that reliably triggers the crash on my machine:

=====
print(sessionInfo())

words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
str(words)

print(table(Encoding(words)))
Encoding(words) <- "latin1"  # this is the correct encoding!
print(table(Encoding(words)))

N <- 1000
words <- rep(words, length.out=N)

print(N)
for (i in 1:N) {
  x <- words[1:i]
  # the following line will crash for some i, depending on the particular
  # strings in <words> and the subset selected for <x> above
  order(x)
}
=====

The output I get from this code is appended at the end of the mail. Note that R incorrectly declares the latin1 strings in <words> to have UTF-8 encoding (this seems wrong to me because the \x escapes insert raw bytes into the string). The crash only occurs if the correct "latin1" encoding (or "unknown") is explicitly specified. Otherwise the string handling code appears to ignore everything after the first invalid multibyte character.

I haven't been able to trigger the bug without some kind of loop. The crash always occurs at the same iteration, but this changes depending on the contents of <words> and the specific subset selected in each loop iteration. Also note that the 64-bit version of R gives a different error message. If I omit the unrelated statement "print(N)", the 64-bit version segfaults and the 32-bit version just hangs with high CPU load. All this suggests to me that there must be some insidious memory corruption or stack/range overflow in the internal ordering code.
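
To find out which iteration fails on a given build, a variant of the loop can report its progress before each call. This is only a sketch under assumptions: the seven strings are rebuilt from raw bytes so that the encoding of the script file cannot interfere, and the names bytes and last_ok are illustrative, not part of the example above. On an unaffected build the loop should simply run to completion.

```r
# Same seven strings as in the example, built from explicit byte values
bytes <- list(c(0x61, 0x61), c(0x61, 0x62), c(0x61, 0xfc), c(0x61, 0xe4),
              c(0x62, 0xe4), c(0x62, 0xfc), c(0xe4, 0xfc))
words <- sapply(bytes, function(b) rawToChar(as.raw(b)))
Encoding(words) <- "latin1"            # mark the non-ASCII strings as latin1
words <- rep(words, length.out = 1000)

last_ok <- 0L                          # highest i for which order() returned
for (i in seq_along(words)) {
  # Print and flush before the call, so that the last line shown
  # identifies the iteration in which a crash happened
  cat("ordering first", i, "strings\n")
  flush(stdout())
  o <- order(words[1:i])
  stopifnot(length(o) == i)            # order() must return one index per element
  last_ok <- i
}
```

If the process dies inside order(), the last "ordering first ..." line printed gives the failing value of i.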

Can other people reproduce this problem on different platforms and possibly with different versions of R?

BTW, I ran into the crash when trying to read.delim() a file in latin1 encoding, using either encoding="latin1" or fileEncoding="latin1", and then converting it back and forth between a character vector and a factor. I still don't understand what's going on there. The behaviour of read.delim() seems to depend very much on my locale settings when running R, which is rather unpleasant. Is there a way to find out how strings are stored internally (i.e. getting the exact byte representation) and whether R believes them to be in UTF-8 or latin1 encoding?
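
One possible way to check this, as a minimal sketch rather than a definitive answer: charToRaw() returns the bytes R stores for a string, and Encoding() reports the encoding mark R has attached to it. The variable names x and y are illustrative and not taken from the example above; the string is built from raw bytes so that the encoding of the script file plays no role.

```r
# "a" followed by the single latin1 byte for u-umlaut (0xfc)
x <- rawToChar(as.raw(c(0x61, 0xfc)))
Encoding(x) <- "latin1"     # declare the encoding; the stored bytes do not change
print(Encoding(x))          # the mark R has attached to the string
print(charToRaw(x))         # stored bytes: 61 fc

# Converting to UTF-8 changes the bytes: 0xfc becomes the two bytes c3 bc
y <- iconv(x, from = "latin1", to = "UTF-8")
print(Encoding(y))
print(charToRaw(y))         # stored bytes: 61 c3 bc
```

Comparing the raw bytes with the declared mark shows whether the two agree: a single byte 0xfc is consistent with latin1, whereas the same character in UTF-8 would be stored as c3 bc.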



Best regards,
Stefan Evert

[ stefan.ev...@uos.de | http://purl.org/stefan.evert ]





Output of sample code on my machine:

> print(sessionInfo())
R version 2.9.2 Patched (2009-09-24 r49861)
i386-apple-darwin9.8.0

locale:
en_GB/en_GB/C/C/en_GB/en_GB

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
>

> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")

> str(words)
chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
> print(table(Encoding(words)))

unknown   UTF-8
    2       5
>
> Encoding(words) <- "latin1"  # this is the correct encoding!
> print(table(Encoding(words)))

latin1 unknown
    5       2
>
> N <- 1000
> words <- rep(words, length.out=N)
>
> print(N)
[1] 1000
> for (i in 1:N) {
+   x <- words[1:i]
+   # the following line will crash for some i, depending on the particular
+   # strings in <words> and the subset selected for <x> above
+   order(x)
+ }

*** caught bus error ***
address 0x86, cause 'non-existent physical address'

Traceback:
1: order(x)
aborting ...
Bus error


64-bit version:

> print(sessionInfo())
R version 2.9.2 Patched (2009-09-24 r49861)
x86_64-apple-darwin9.8.0

locale:
en_GB/en_GB/C/C/en_GB/en_GB

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
>

> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")

> str(words)
chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
> print(table(Encoding(words)))

unknown   UTF-8
    2       5
>
> Encoding(words) <- "latin1"  # this is the correct encoding!
> print(table(Encoding(words)))

latin1 unknown
    5       2
>
> N <- 1000
> words <- rep(words, length.out=N)
>
> print(N)
[1] 1000
> for (i in 1:N) {
+   x <- words[1:i]
+   # the following line will crash for some i, depending on the particular
+   # strings in <words> and the subset selected for <x> above
+   order(x)
+ }
Error in order(x) : 'translateCharUTF8' must be called on a CHARSXP
Execution halted


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
