[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135945798

@shivaram, @sun-rui Thank you guys for the work. It would be great to go through it more, as you said.
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on a diff in the pull request: https://github.com/apache/spark/pull/7494#discussion_r38078086

--- Diff: R/pkg/inst/tests/test_sparkSQL.R ---

```
@@ -417,6 +417,32 @@ test_that("collect() and take() on a DataFrame return the same number of rows an
   expect_equal(ncol(collect(df)), ncol(take(df, 10)))
 })

+test_that("collect() support Unicode characters", {
+  markUtf8 <- function(s) {
+    Encoding(s) <- "UTF-8"
+    s
+  }
+
+  lines <- c("{\"name\":\"안녕하세요\"}",
+             "{\"name\":\"您好\", \"age\":30}",
+             "{\"name\":\"こんにちは\", \"age\":19}",
+             "{\"name\":\"Xin chào\"}")
+
```

--- End diff --

I guess this is related to the R interpreter. I ran some tests to see the behavior of the R interpreter under different locales. You can see that R adds 80 before 09 under the UTF-8 locale. This looks like the reason why the test case passed under UTF-8 without using `markUtf8`.

test.R

```
a<-"가"
Encoding(a)
print(serialize(a, connection=NULL))
```

With the UTF-8 locale the output is

```
$ r -f test.R
> a<-"가"
> Encoding(a)
[1] "UTF-8"
> print(serialize(a, connection=NULL))
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 80
[26] 09 00 00 00 03 ea b0 80
```

With the C (ASCII) locale the output is

```
$ r -f test.R
> a<-"가"
> Encoding(a)
[1] "unknown"
> print(serialize(a, connection=NULL))
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00
[26] 09 00 00 00 03 ea b0 80
```
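For reference, the flag byte that differs between the two runs can be located programmatically; a minimal standalone sketch (not part of the PR), using a `\u` escape so the literal is UTF-8 regardless of the source file's encoding, and pinning serialization format version 2 to match the dumps above:

```
# Serialize the same bytes once with a declared UTF-8 encoding and once without,
# and report which byte positions differ (the CHARSXP encoding flag, 0x80 vs 0x00).
a <- "\uAC00"                 # the Hangul syllable "가"; \u escapes are stored as marked UTF-8
b <- a
Encoding(b) <- "unknown"      # drop the encoding mark, keep the same UTF-8 bytes

sa <- serialize(a, connection = NULL, version = 2)
sb <- serialize(b, connection = NULL, version = 2)

diffPos <- which(sa != sb)
diffPos                       # expected: 25, the flag byte shown in the dumps above
sa[diffPos]                   # 80 when the encoding is marked UTF-8
sb[diffPos]                   # 00 when it is not
```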
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134893165

Timeout occurred while fetching from the origin.

```
 > git config remote.origin.url https://github.com/apache/spark.git # timeout=10
Fetching upstream changes from https://github.com/apache/spark.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/7494/*:refs/remotes/origin/pr/7494/* # timeout=15
ERROR: Timeout after 15 minutes
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from https://github.com/apache/spark.git
  at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:735)
  at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:983)
  at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1016)
  at hudson.scm.SCM.checkout(SCM.java:485)
  at hudson.model.AbstractProject.checkout(AbstractProject.java:1282)
  at
```
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134832178

@sun-rui I call `writeBin` with `useBytes=TRUE`, which defaults to FALSE, referring to the documentation below.

https://stat.ethz.ch/R-manual/R-devel/library/base/html/writeLines.html

```
useBytes is for expert use. Normally (when false) character strings with marked
encodings are converted to the current encoding before being passed to the
connection (which might do further re-encoding). useBytes = TRUE suppresses the
re-encoding of marked strings so they are passed byte-by-byte to the connection:
this can be useful when strings have already been re-encoded by e.g. iconv. (It
is invoked automatically for strings with marked encoding "bytes".)
```

It seems to work. Also, calling `markUtf8` is not necessary in the testcase when I set the locale to UTF-8.

@shivaram Do you agree on using `rawToChar`? It appears to be okay to use here.

```
writeString <- function(con, value) {
  utfVal <- enc2utf8(value)
  writeInt(con, as.integer(nchar(utfVal, type = "bytes") + 1))
  writeBin(utfVal, con, endian = "big", useBytes=TRUE)
}
```

```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  string
}
```

```
test_that("collect() support Unicode characters", {
  markUtf8 <- function(s) {
    Encoding(s) <- "UTF-8"
    s
  }

  lines <- c("{\"name\":\"안녕하세요\"}",
             "{\"name\":\"您好\", \"age\":30}",
             "{\"name\":\"こんにちは\", \"age\":19}",
             "{\"name\":\"Xin chào\"}")

  jsonPath <- tempfile(pattern="sparkr-test", fileext=".tmp")
  writeLines(lines, jsonPath)

  df <- read.df(sqlContext, jsonPath, "json")
  rdf <- collect(df)
  expect_true(is.data.frame(rdf))
  expect_equal(rdf$name[1], markUtf8("안녕하세요"))
  expect_equal(rdf$name[2], markUtf8("您好"))
  expect_equal(rdf$name[3], markUtf8("こんにちは"))
  expect_equal(rdf$name[4], markUtf8("Xin chào"))

  df1 <- createDataFrame(sqlContext, rdf)
  expect_equal(collect(where(df1, df1$name == markUtf8("您好")))$name, markUtf8("您好"))
})
```
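For what it's worth, the two helpers can be exercised locally over an in-memory connection before wiring them to the JVM. A minimal sketch, assuming `writeInt`/`readInt` are big-endian 32-bit integers (simplified stand-ins for the SparkR helpers):

```
# Simplified stand-ins for SparkR's writeInt()/readInt(): big-endian 32-bit integers.
writeInt <- function(con, value) writeBin(as.integer(value), con, endian = "big")
readInt  <- function(con) readBin(con, integer(), n = 1L, endian = "big")

writeString <- function(con, value) {
  utfVal <- enc2utf8(value)
  writeInt(con, as.integer(nchar(utfVal, type = "bytes") + 1))  # +1 for the trailing NUL
  writeBin(utfVal, con, endian = "big", useBytes = TRUE)        # writeBin() NUL-terminates strings
}

readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)        # rawToChar() drops trailing NUL bytes
  Encoding(string) <- "UTF-8"
  string
}

# Round trip through an in-memory connection.
con <- rawConnection(raw(0L), "r+")
writeString(con, "\uC548\uB155\uD558\uC138\uC694")    # "안녕하세요"
seek(con, 0L)                                         # rewind and read it back
res <- readString(con)
close(con)

Encoding(res)                                         # "UTF-8"
identical(res, "\uC548\uB155\uD558\uC138\uC694")      # TRUE
```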
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134795102

Sorry for the late reply. I got this error message in `SerDe.scala`. The byte sequence sent from `worker.R` is not null-terminated. When I tried appending '\0' at the end of the byte sequence, it passed for this string, but the bytes of the next call caused the same failure.

```
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:165)
  at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:100)
  at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:110)
  at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:49)
  at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:37)
  at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
  at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
  at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
  at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
  at java.lang.Thread.run(Thread.java:745)
[1] "returnStatus"
integer(0)
Error in readTypedObject(con, type) : Unsupported type for deserialization
Calls: test_package ... callJStatic -> invokeJava -> readObject -> readTypedObject
```

SerDe.scala

```
def readStringBytes(in: DataInputStream, len: Int): String = {
  val bytes = new Array[Byte](len)
  in.readFully(bytes)
  println("bytes is " + bytes.map("%02x".format(_)).mkString(" "))
  assert(bytes(len - 1) == 0)
  val str = new String(bytes.dropRight(1), "UTF-8")
  str
}
```

The output:

```
bytes is 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 72 6b 2e 61 70 69 2e 72 2e 52 52 44 44
```
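A side note on the assertion: when the R side writes a character value with `writeBin`, R itself appends the terminating NUL, which is what the `+ 1` in `writeString` and the `assert(bytes(len - 1) == 0)` above count on. The failing bytes shown above spell `org.apache.spark.api.r.RRDD` with no trailing `00`. A quick standalone check of the `writeBin` behavior (not part of the PR):

```
# writeBin() writes character strings as C strings, i.e. with a trailing NUL byte.
con <- rawConnection(raw(0L), "r+")
writeBin("RRDD", con, useBytes = TRUE)
seek(con, 0L)
bytes <- readBin(con, raw(), n = 5L)
close(con)

bytes                               # 52 52 44 44 00
bytes[length(bytes)] == as.raw(0)   # TRUE: the NUL terminator is included
```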
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-129846643

For example, declare a variable with the string "가":

```
> a<-"가"
```

With `LC_ALL=C` and `Encoding(a) <- "bytes"`:

```
> Sys.setlocale("LC_ALL", "C")
[1] "C/C/C/C/C/ko_KR.UTF-8"
> Encoding(a) <- "bytes"
> serialize(a, connection=NULL)
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 20
[26] 09 00 00 00 03 ea b0 80
> a
[1] "\xea\xb0\x80"
```

With `LC_ALL=C` and `Encoding(a) <- "UTF-8"`:

```
> Encoding(a) <- "UTF-8"
> serialize(a, connection=NULL)
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 80
[26] 09 00 00 00 03 ea b0 80
> a
[1] ""
```

With `LC_ALL=UTF-8` and `Encoding(a) <- "bytes"`:

```
> Sys.setlocale("LC_ALL", "ko_KR.UTF-8")
[1] "ko_KR.UTF-8/ko_KR.UTF-8/ko_KR.UTF-8/C/ko_KR.UTF-8/ko_KR.UTF-8"
> Encoding(a) <- "bytes"
> serialize(a, connection=NULL)
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 20
[26] 09 00 00 00 03 ea b0 80
> a
[1] "\xea\xb0\x80"
```

With `LC_ALL=UTF-8` and `Encoding(a) <- "UTF-8"`:

```
> Encoding(a) <- "UTF-8"
> serialize(a, connection=NULL)
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 80
[26] 09 00 00 00 03 ea b0 80
> a
[1] "가"
```
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-129171042

Okay, using `iconv` works nicely. But it still requires modifying `context.R` to get rid of the leading UTF-8 indicator bit in a string before sending it to Spark. However, do we need to consider using `Sys.setlocale()` to set the locale back to what it was before the testcase, because of its side effect? Or is there any other way to handle it?

deserialize.R

```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  iconv(list(raw), to="UTF-8")
}
```

context.R

```
sliceLen <- ceiling(length(coll) / numSlices)
slices <- split(coll, rep(1:(numSlices + 1), each = sliceLen)[1:length(coll)])

# Remove the leading UTF-8 indicating bit
removeUtf8EncodingBit <- function(s) {
  Encoding(s) <- "bytes"
  s
}
slices_ <- rapply(slices, function(x) ifelse(is.character(x), removeUtf8EncodingBit(x), x), how="list")

# Serialize each slice: obtain a list of raws, or a list of lists (slices) of
# 2-tuples of raws
serializedSlices <- lapply(slices_, serialize, connection = NULL)
```

```
test_that("collect() support Unicode characters", {
  locale <- Sys.getlocale()
  Sys.setlocale("LC_ALL", "en_US.UTF-8")

  lines <- c("{\"name\":\"안녕하세요\"}",
             "{\"name\":\"您好\", \"age\":30}",
             "{\"name\":\"こんにちは\", \"age\":19}",
             "{\"name\":\"Xin chào\"}")

  jsonPath <- tempfile(pattern="sparkr-test", fileext=".tmp")
  writeLines(lines, jsonPath)

  df <- read.df(sqlContext, jsonPath, "json")
  rdf <- collect(df)
  expect_true(is.data.frame(rdf))
  expect_equal(rdf$name[1], "안녕하세요")
  expect_equal(rdf$name[2], "您好")
  expect_equal(rdf$name[3], "こんにちは")
  expect_equal(rdf$name[4], "Xin chào")

  df1 <- createDataFrame(sqlContext, rdf)
  expect_equal(collect(where(df1, df1$name == "您好"))$name, "您好")

  Sys.setlocale("LC_ALL", locale)
})
```
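On restoring the locale: one common pattern is to register the restore with `on.exit()`, so the original locale comes back even if an expectation fails partway through. A rough sketch (not from the PR; the helper name is made up, and restoring a composite `LC_ALL` string can behave slightly differently across platforms):

```
# Run `code` under a UTF-8 locale and restore the caller's locale afterwards,
# even if `code` throws an error.
with_utf8_locale <- function(code) {
  old <- Sys.getlocale("LC_ALL")
  on.exit(Sys.setlocale("LC_ALL", old), add = TRUE)   # restore on exit, error or not
  Sys.setlocale("LC_ALL", "en_US.UTF-8")
  force(code)
}

# Example: the body of the test case could be wrapped like this.
with_utf8_locale({
  s <- enc2utf8("\uC548\uB155")   # "안녕"
  Encoding(s)                     # "UTF-8"
})
```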
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-128335460

The testcase passed with `Sys.setlocale("LC_ALL", "en_US.UTF-8")`. But I had to modify context.R to clear out the UTF-8 indicator bit. It turns out that the UTF-8 indicator bit set by `Encoding(x) <- "UTF-8"` in R causes an error when passing a string to Spark. I hope the code below supports and preserves UTF-8 encoding in R under `LC_ALL=C`.

context.R

```
# Serialize each slice: obtain a list of raws, or a list of lists (slices) of
# 2-tuples of raws
removeUtf8EncodingBit <- function(s) {
  Encoding(s) <- "bytes"
  s
}
slices_ <- rapply(slices, function(x) ifelse(is.character(x), removeUtf8EncodingBit(x), x), how="list")
serializedSlices <- lapply(slices_, serialize, connection = NULL)

jrdd <- callJStatic("org.apache.spark.api.r.RRDD",
                    "createRDDFromArray", sc, serializedSlices)

RDD(jrdd, "byte")
```

deserialize.R

```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  string
}
```

testcase

```
test_that("collect() support Unicode characters", {
  convertToUtf8 <- function(s) {
    Encoding(s) <- "UTF-8"
    s
  }

  Sys.setlocale("LC_ALL", "en_US.UTF-8")

  lines <- c("{\"name\":\"안녕하세요\"}",
             "{\"name\":\"您好\", \"age\":30}",
             "{\"name\":\"こんにちは\", \"age\":19}",
             "{\"name\":\"Xin chào\"}")

  jsonPath <- tempfile(pattern="sparkr-test", fileext=".tmp")
  writeLines(lines, jsonPath)

  df <- read.df(sqlContext, jsonPath, "json")
  rdf <- collect(df)
  expect_true(is.data.frame(rdf))
  expect_equal(rdf$name[1], convertToUtf8("안녕하세요"))
  expect_equal(rdf$name[2], convertToUtf8("您好"))
  expect_equal(rdf$name[3], convertToUtf8("こんにちは"))
  expect_equal(rdf$name[4], convertToUtf8("Xin chào"))

  df1 <- createDataFrame(sqlContext, rdf)
  expect_equal(collect(where(df1, df1$name == "您好"))$name, convertToUtf8("您好"))
})
```
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-127481181

Firstly, I am not able to enter multibyte character sequences in the R shell under `LC_ALL=C` either; it seems that R does not support it. What I am trying to do is enable UTF-8 characters, sent from Java, to be used in R no matter what locale a user has. Basically, the testcase passed on my local UTF-8 machine, too, but unfortunately it failed on Jenkins (I guess its locale is not UTF-8). So I started to look for a way in which R supports and also preserves UTF-8 characters in various locales, e.g. `LC_ALL=C`, as @sun-rui suggested.
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-127034019

@shivaram The testcase passed under a UTF-8 locale (e.g. `LC_ALL=ko_KR.UTF-8`). But, unfortunately, `iconv()` returns NA under `LC_ALL=C`, which makes the testcase fail.
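For reference, that failure mode can be checked directly with a one-liner (a standalone sketch, not from the test; the exact result depends on the active locale and the platform's iconv):

```
# Converting a non-ASCII UTF-8 string to the native encoding:
# under LC_ALL=C there is no valid representation and iconv() reports NA,
# while under a UTF-8 locale the string comes back unchanged.
s <- "\uC548\uB155\uD558\uC138\uC694"   # "안녕하세요", marked UTF-8
iconv(s, from = "UTF-8", to = "")       # "" means the current native encoding
```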
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on a diff in the pull request: https://github.com/apache/spark/pull/7494#discussion_r35721159

--- Diff: R/pkg/R/deserialize.R ---

```
@@ -56,8 +56,10 @@ readTypedObject <- function(con, type) {
 readString <- function(con) {
   stringLen <- readInt(con)
-  string <- readBin(con, raw(), stringLen, endian = "big")
-  rawToChar(string)
+  raw <- readBin(con, raw(), stringLen, endian = "big")
+  string <- rawToChar(raw)
+  Encoding(string) <- "UTF-8"
+  enc2native(string)
```

--- End diff --

Yes, preserving UTF-8 encodings sounds much better. Do you mean that enc2native should be removed, like below?

```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  string
}
```

But this causes an error when calling `createDataFrame()`, which converts a local R data frame to a Spark DataFrame. I tried to find out the reason, and I guess the MSB in the string's flag byte is set when I use `Encoding(string) <- "UTF-8"`, and is not set otherwise. I printed out `serializedSlices` in `parallelize()`, line 132. The result is below.

context.R

```
102 parallelize <- function(sc, coll, numSlices = 1) {
103   # TODO: bound/safeguard numSlices
104   # TODO: unit tests for if the split works for all primitives
105   # TODO: support matrix, data frame, etc
106   if ((!is.list(coll) && !is.vector(coll)) || is.data.frame(coll)) {
107     if (is.data.frame(coll)) {
108       message(paste("context.R: A data frame is parallelized by columns."))
109     } else {
110       if (is.matrix(coll)) {
111         message(paste("context.R: A matrix is parallelized by elements."))
112       } else {
113         message(paste("context.R: parallelize() currently only supports lists and vectors.",
114                       "Calling as.list() to coerce coll into a list."))
115       }
116     }
117     coll <- as.list(coll)
118   }
119
120   print(coll)
121
122   if (numSlices > length(coll))
123     numSlices <- length(coll)
124
125   sliceLen <- ceiling(length(coll) / numSlices)
126   slices <- split(coll, rep(1:(numSlices + 1), each = sliceLen)[1:length(coll)])
127
128   # Serialize each slice: obtain a list of raws, or a list of lists (slices) of
129   # 2-tuples of raws
130   serializedSlices <- lapply(slices, serialize, connection = NULL)
131
132   print(serializedSlices)
133   jrdd <- callJStatic("org.apache.spark.api.r.RRDD",
134                       "createRDDFromArray", sc, serializedSlices)
135
136   RDD(jrdd, "byte")
137 }
```

Case 1. With `Encoding(string) <- "UTF-8"`:

```
  [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 13 00 00 00 04 00 00 00
 [26] 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00 07 a2 00 00 00 10
 [51] 00 00 00 01 00 00 80 09 00 00 00 0f ec 95 88 eb 85 95 ed 95 98 ec 84 b8 ec
 [76] 9a 94 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 3e 00 00 00 00 00
[101] 00 00 00 00 10 00 00 00 01 00 00 80 09 00 00 00 06 e6 82 a8 e5 a5 bd 00 00
[126] 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 33 00 00 00 00 00 00 00 00 00
[151] 10 00 00 00 01 00 00 80 09 00 00 00 0f e3 81 93 e3 82 93 e3 81 ab e3 81 a1
[176] e3 81 af 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00
[201] 07 a2 00 00 00 10 00 00 00 01 00 00 80 09 00 00 00 09 58 69 6e 20 63 68 c3
[226] a0 6f
```

Case 2. Without `Encoding(string) <- "UTF-8"`:

```
  [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 13 00 00 00 04 00 00 00
 [26] 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00 07 a2 00 00 00 10
 [51] 00 00 00 01 00 00 00 09 00 00 00 0f ec 95 88 eb 85 95 ed 95 98 ec 84 b8 ec
 [76] 9a 94 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 3e 00 00 00 00 00
[101] 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 e6 82 a8 e5 a5 bd 00 00
[126] 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 33 00 00 00 00 00 00 00 00 00
[151] 10 00 00 00 01 00 00 00 09 00 00 00 0f e3 81 93 e3 82 93 e3 81 ab e3 81 a1
[176] e3 81 af 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00
[201] 07 a2 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 09 58 69 6e 20 63 68 c3
[226] a0 6f
```

You can see that the rows starting at [51], [101], [151], [201] are different. There is a leading 80 before 09 with `Encoding()`, which, I guess, causes the error. I think this is the encoding indication bit in R, according to the link you shared.
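As a small aside (not from the thread's code): R's own `serialize()`/`unserialize()` round-trips the marked string without trouble, so the extra 80 flag byte is only something the receiving side has to understand when it parses these bytes. A quick check:

```
# The CHARSXP encoding flag survives R's own round trip; same bytes, still marked UTF-8.
x <- "\uC548\uB155\uD558\uC138\uC694"        # "안녕하세요"
y <- unserialize(serialize(x, connection = NULL))

identical(charToRaw(x), charToRaw(y))        # TRUE: payload bytes unchanged
Encoding(y)                                  # "UTF-8": the declared encoding is preserved
```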
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124344827

good job, jenkins.
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124172706

Adding a zero doesn't solve the problem either; it seems to have the same effect as not adding it. But the code below works when the locale is "C", not UTF-8: I set the test strings' encoding to UTF-8 and convert them to the native form with enc2native(). The testcase passed. Also, convertToNative() is not necessary under a UTF-8 locale; the testcase passed without it.

```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  enc2native(string)
}
```

```
convertToNative <- function(s) {
  Encoding(s) <- "UTF-8"
  enc2native(s)
}

test_that("collect() support Unicode characters", {
  lines <- c("{\"name\":\"안녕하세요\"}",
             "{\"name\":\"您好\", \"age\":30}",
             "{\"name\":\"こんにちは\", \"age\":19}",
             "{\"name\":\"Xin chào\"}")

  jsonPath <- tempfile(pattern="sparkr-test", fileext=".tmp")
  writeLines(lines, jsonPath)

  df <- read.df(sqlContext, jsonPath, "json")
  rdf <- collect(df)
  print(head(rdf))
  print(convertToNative("안녕하세요"))
  expect_true(is.data.frame(rdf))
  expect_equal(rdf$name[1], convertToNative("안녕하세요"))
  expect_equal(rdf$name[2], convertToNative("您好"))
  expect_equal(rdf$name[3], convertToNative("こんにちは"))
  expect_equal(rdf$name[4], convertToNative("Xin chào"))

  df1 <- createDataFrame(sqlContext, rdf)
  print(head(df1))
  expect_equal(collect(where(df1, df1$name == convertToNative("您好")))$name, convertToNative("您好"))
})
```

```
$) export LC_ALL=C
$) ./run-tests.sh
```

Output (the non-ASCII names do not display under the C locale):

```
SparkSQL functions : age name
1  NA
2  30
3  19
4  NA Xin cho
[1] ""
.
  age name
1  NA
2  30
3  19
4  NA Xin cho
```
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122939278

Unfortunately, no. I guess the strings in the testcase should be changed to the native form.

```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  enc2native(string)
}
```

```
1. Failure (at test_sparkSQL.R#432): collect() support Unicode characters --
rdf$name[1] not equal to "\354\225\210\353\205\225\355\225\230\354\204\270\354\232\224"
1 string mismatches:
x[1]: "\354\225\210\353\205\225\355\225\230\354\204\270\354\232\224"
y[1]: ""

2. Failure (at test_sparkSQL.R#433): collect() support Unicode characters --
rdf$name[2] not equal to "\346\202\250\345\245\275"
1 string mismatches:
x[1]: "\346\202\250\345\245\275"
y[1]: ""

3. Failure (at test_sparkSQL.R#434): collect() support Unicode characters --
rdf$name[3] not equal to "\343\201\223\343\202\223\343\201\253\343\201\241\343\201\257"
1 string mismatches:
x[1]: "\343\201\223\343\202\223\343\201\253\343\201\241\343\201\257"
y[1]: ""

4. Failure (at test_sparkSQL.R#435): collect() support Unicode characters --
rdf$name[4] not equal to "Xin ch\303\240o"
1 string mismatches:
x[1]: "Xin ch\303\240o"
y[1]: "Xin cho"

5. Error: collect() support Unicode characters - Unsupported type for deserialization
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls,
       message = function(c) invokeRestart("muffleMessage"),
       warning = function(c) invokeRestart("muffleWarning"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(where(df2, df2$name == "\346\202\250\345\245\275"))$name, "\346\202\250\345\245\275") at test_sparkSQL.R:438
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(expected, actual, ...)
8: compare.character(expected, actual, ...)
9: identical(x, y)
10: collect(where(df2, df2$name == "\346\202\250\345\245\275"))
11: collect(where(df2, df2$name == "\346\202\250\345\245\275"))
12: .local(x, ...)
13: lapply(listCols, function(col) {
        objRaw <- rawConnection(col)
        numRows <- readInt(objRaw)
        col <- readCol(objRaw, numRows)
        close(objRaw)
        col
    })
14: lapply(listCols, function(col) {
        objRaw <- rawConnection(col)
        numRows <- readInt(objRaw)
        col <- readCol(objRaw, numRows)
        close(objRaw)
        col
    })
15: FUN(X[[i]], ...)
16: readCol(objRaw, numRows)
17: do.call(c, lapply(1:numRows, function(x) {
        value <- readObject(inputCon)
        if (is.null(value)) NA else value
    }))
18: lapply(1:numRows, function(x) {
        value <- readObject(inputCon)
        if (is.null(value)) NA else value
    })
19: lapply(1:numRows, function(x) {
        value <- readObject(inputCon)
        if (is.null(value)) NA else value
    })
20: FUN(X[[i]], ...)
21: readObject(inputCon)
22: readTypedObject(con, type)
23: stop(paste("Unsupported type for deserialization", type))
24: .handleSimpleError(function (e) {
        e$calls <- head(sys.calls()[-seq_len(frame + 7)], -2)
        signalCondition(e)
    }, "Unsupported type for deserialization ", quote(readTypedObject(con, type)))

Error: Test failures
```
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122866982

@sun-rui I reproduced the same error as the testcase's with

```
$) export LC_ALL=C
$) ./run-tests.sh
```

It needs to call rawToChar(); otherwise R gives the error message below.

```
functions on binary files : Error in `Encoding<-`(`*tmp*`, value = "UTF-8") :
  a character vector argument expected
Calls: test_package ... readTypedObject -> getJobj -> jobj -> readString -> Encoding<-
```
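That error is just `Encoding<-` rejecting a raw vector: it only accepts character vectors, which is why `rawToChar()` has to run first. A minimal reproduction outside Spark:

```
raw <- charToRaw("abc")      # a raw vector, like the one returned by readBin(con, raw(), ...)

# Encoding(raw) <- "UTF-8"   # would fail: "a character vector argument expected"

string <- rawToChar(raw)     # convert the bytes to a character string first
Encoding(string) <- "UTF-8"  # now the encoding can be declared
Encoding(string)             # "UTF-8"
```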
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122643950

I am not sure about `readString`, but the testcase, which verifies the intactness of Unicode characters in a native data frame making a round trip to Spark's DataFrame, failed. There is something going on underneath.

```
1. Failure(@test_sparkSQL.R#438): collect() support Unicode characters -
collect(where(df2, df2$name == "\346\202\250\345\245\275"))[[2]] not equal to "\346\202\250\345\245\275"
1 string mismatches:
x[1]: "\346\202\250\345\245\275"
y[1]: "<82>"
```
[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...
GitHub user CHOIJAEHONG1 opened a pull request:

https://github.com/apache/spark/pull/7494

[SPARK-8951][SparkR] support Unicode characters in collect()

Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters. I changed SerDe.scala so that Spark supports Unicode characters when writing a string to R.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CHOIJAEHONG1/spark SPARK-8951

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7494.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #7494

commit 5325cef60dfd00e0f7fa3156678677f1b3a2ef1a
Author: CHOIJAEHONG
Date: 2015-07-18T11:57:26Z

    [SPARK-8951] support Unicode characters in collect()