[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-28 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-135945798
  
@shivaram, @sun-rui 
Thank you both for the work. It would be great to go through it in more detail, as you said.





[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-27 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on a diff in the pull request:

https://github.com/apache/spark/pull/7494#discussion_r38078086
  
--- Diff: R/pkg/inst/tests/test_sparkSQL.R ---
@@ -417,6 +417,32 @@ test_that("collect() and take() on a DataFrame return the same number of rows an
   expect_equal(ncol(collect(df)), ncol(take(df, 10)))
 })
 
+test_that("collect() support Unicode characters", {
+  markUtf8 <- function(s) {
+    Encoding(s) <- "UTF-8"
+    s
+  }
+
+  lines <- c("{\"name\":\"안녕하세요\"}",
+             "{\"name\":\"您好\", \"age\":30}",
+             "{\"name\":\"こんにちは\", \"age\":19}",
+             "{\"name\":\"Xin chào\"}")
+
--- End diff --

I guess this is related to the R interpreter. I ran a small test to see the behavior of the R interpreter with different locales.
You can see that R adds 80 before 09 with the UTF-8 locale. This looks like the reason why the test case passed under UTF-8 without using `markUtf8`.

test.R
```
a<-"가"
Encoding(a)
print(serialize(a, connection=NULL))
```

With the UTF-8 locale the output is:
```
$ r -f test.R
> a<-"가"
> Encoding(a)
[1] "UTF-8"
> print(serialize(a, connection=NULL))
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 80
[26] 09 00 00 00 03 ea b0 80
```

With the C (ASCII) locale the output is:
```
$ r -f test.R
> a<-"가"
> Encoding(a)
[1] "unknown"
> print(serialize(a, connection=NULL))
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00
[26] 09 00 00 00 03 ea b0 80
```
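
As a follow-up illustration (not from the thread; it assumes the literal is entered under a UTF-8 locale): that 0x80 flag byte is part of the serialized data itself, so a UTF-8 mark set on a string survives an `unserialize()` round trip regardless of the locale in which the bytes are later read.

```
a <- "가"
Encoding(a) <- "UTF-8"          # set the UTF-8 mark explicitly, as markUtf8 does
b <- unserialize(serialize(a, connection = NULL))
Encoding(b)                     # "UTF-8": the mark travels with the bytes
```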







[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-134893165
  
Timeout occurred while fetching from the origin.

```
> git config remote.origin.url https://github.com/apache/spark.git # timeout=10
Fetching upstream changes from https://github.com/apache/spark.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/7494/*:refs/remotes/origin/pr/7494/* # timeout=15
ERROR: Timeout after 15 minutes
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from https://github.com/apache/spark.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:735)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:983)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1016)
at hudson.scm.SCM.checkout(SCM.java:485)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1282)
at 
```





[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-25 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-134832178
  
@sun-rui 

I call `writeBin` with `useBytes = TRUE` (the default is `FALSE`), referring to the documentation below:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/writeLines.html
```
useBytes is for expert use. Normally (when false) character strings with 
marked encodings are converted to the current encoding before being passed to 
the connection (which might do further re-encoding). useBytes = TRUE suppresses 
the re-encoding of marked strings so they are passed byte-by-byte to the 
connection: this can be useful when strings have already been re-encoded by 
e.g. iconv. (It is invoked automatically for strings with marked encoding 
"bytes".)
```

It seems to work. Also, calling `markUtf8` is not necessary in the testcase when I set the locale to UTF-8.

@shivaram
Do you agree with using `rawToChar`? It appears to be okay to use here.

```
writeString <- function(con, value) {
  utfVal <- enc2utf8(value)
  writeInt(con, as.integer(nchar(utfVal, type = "bytes") + 1))
  writeBin(utfVal, con, endian = "big", useBytes=TRUE)
}
```
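
The `+ 1` here reads as accounting for the NUL terminator that `writeBin()` appends when writing a character vector, while `nchar(type = "bytes")` counts the UTF-8 bytes. A quick check of that reading (an illustration only, not part of the PR; assumes the literal is UTF-8):

```
s <- enc2utf8("가")
nchar(s, type = "bytes")            # 3 UTF-8 bytes
con <- rawConnection(raw(0), "wb")
writeBin(s, con, useBytes = TRUE)   # writes the bytes plus a trailing NUL
length(rawConnectionValue(con))     # 4
close(con)
```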

```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  string
}
```

```
test_that("collect() support Unicode characters", {
  markUtf8 <- function(s) {
    Encoding(s) <- "UTF-8"
    s
  }

  lines <- c("{\"name\":\"안녕하세요\"}",
             "{\"name\":\"您好\", \"age\":30}",
             "{\"name\":\"こんにちは\", \"age\":19}",
             "{\"name\":\"Xin chào\"}")

  jsonPath <- tempfile(pattern="sparkr-test", fileext=".tmp")
  writeLines(lines, jsonPath)

  df <- read.df(sqlContext, jsonPath, "json")
  rdf <- collect(df)
  expect_true(is.data.frame(rdf))
  expect_equal(rdf$name[1], markUtf8("안녕하세요"))
  expect_equal(rdf$name[2], markUtf8("您好"))
  expect_equal(rdf$name[3], markUtf8("こんにちは"))
  expect_equal(rdf$name[4], markUtf8("Xin chào"))

  df1 <- createDataFrame(sqlContext, rdf)
  expect_equal(collect(where(df1, df1$name == markUtf8("您好")))$name,
               markUtf8("您好"))
})
```






[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-25 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-134795102
  

Sorry for being late. 

I got this error message in `SerDe.scala`. The byte sequence sent from `worker.R` is not null-terminated.
I tried appending '\0' at the end of the byte sequence; it passed for this string, but the bytes of the next call caused the same failure.

```
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:100)
at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:110)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:49)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:37)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
[1] "returnStatus"
integer(0)
Error in readTypedObject(con, type) :
  Unsupported type for deserialization
Calls: test_package ... callJStatic -> invokeJava -> readObject -> readTypedObject
```

SerDe.scala
```
  def readStringBytes(in: DataInputStream, len: Int): String = {
val bytes = new Array[Byte](len)
in.readFully(bytes)
println("bytes is " + bytes.map("%02x".format(_)).mkString(" "))
assert(bytes(len - 1) == 0)
val str = new String(bytes.dropRight(1), "UTF-8")
str
  }
```

The output.
```
bytes is 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 72 6b 2e 61 70 69 2e 72 2e 52 52 44 44
```
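
One reading of the failed assertion (illustration only; this is not the actual worker.R code): writing a raw vector with `writeBin()` emits only the bytes themselves and never appends a NUL, so `assert(bytes(len - 1) == 0)` on the Scala side cannot hold for such a payload.

```
s <- "org.apache.spark.api.r.RRDD"
con <- rawConnection(raw(0), "wb")
writeBin(charToRaw(s), con)      # raw input: no NUL terminator is appended
bytes <- rawConnectionValue(con)
close(con)
length(bytes)                    # 27, matching the 27 bytes printed above
tail(bytes, 1)                   # 0x44 ("D"), not 0x00
```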







[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-11 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-129846643
  

For example, declare a variable with a string "가"
```
> a<-"가"
```
With `LC_ALL=C` and `Encoding(a) <- "bytes"`:
```
> Sys.setlocale("LC_ALL", "C")
[1] "C/C/C/C/C/ko_KR.UTF-8"
> Encoding(a) <- "bytes"
> serialize(a, connection=NULL)
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 20
[26] 09 00 00 00 03 ea b0 80
> a
[1] "\\xea\\xb0\\x80"
```
With `LC_ALL=C` and `Encoding(a) <- "UTF-8"`:
```
> Encoding(a) <- "UTF-8"
> serialize(a, connection=NULL)
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 80
[26] 09 00 00 00 03 ea b0 80
> a
[1] ""
```

With `LC_ALL=UTF-8` and `Encoding(a) <- "bytes"`:
```
> Sys.setlocale("LC_ALL", "ko_KR.UTF-8")
[1] "ko_KR.UTF-8/ko_KR.UTF-8/ko_KR.UTF-8/C/ko_KR.UTF-8/ko_KR.UTF-8"
> Encoding(a) <- "bytes"
> serialize(a, connection=NULL)
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 20
[26] 09 00 00 00 03 ea b0 80
> a
[1] "\\xea\\xb0\\x80"
```

With `LC_ALL=UTF-8` and `Encoding(a) <- "UTF-8"`:
```
> Encoding(a) <- "UTF-8"
> serialize(a, connection=NULL)
 [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 80
[26] 09 00 00 00 03 ea b0 80
> a
[1] "가"
```
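
A small helper for reading that flag directly (a sketch, not part of the PR): in the version-2 serialization format shown in these dumps, byte 25 is the high flag byte of the CHARSXP header for a single length-1 character vector, and 0x80 there is the UTF-8 mark (0x20 is the "bytes" mark).

```
utf8FlagSet <- function(s) {
  # force version 2 so the offset matches the dumps above
  bytes <- serialize(s, connection = NULL, version = 2)
  bitwAnd(as.integer(bytes[25]), 0x80L) != 0L
}

a <- "가"
Encoding(a) <- "UTF-8"
utf8FlagSet(a)   # TRUE

Encoding(a) <- "bytes"
utf8FlagSet(a)   # FALSE
```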






[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-09 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-129171042
  
Okay, using `iconv` works nicely. But it still needs a change in `context.R` to get rid of the leading UTF-8 indicating bit in a string before sending it to Spark.
However, do we need to use `Sys.setlocale()` to set the locale back to what it was before the testcase, because of the side effect, or is there any other way to handle it? (See the sketch after the testcase below.)

deserialize.R
```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  iconv(list(raw), to="UTF-8")
}
```

context.R
```
124   sliceLen <- ceiling(length(coll) / numSlices)
125   slices <- split(coll, rep(1:(numSlices + 1), each = sliceLen)[1:length(coll)])
126
127   # Remove the leading UTF-8 indicating bit
128   removeUtf8EncodingBit <- function(s) {
129     Encoding(s) <- "bytes"
130     s
131   }
132   slices_ <- rapply(slices, function(x) ifelse(is.character(x), removeUtf8EncodingBit(x), x), how="list")
133
134   # Serialize each slice: obtain a list of raws, or a list of lists (slices) of
135   # 2-tuples of raws
136   serializedSlices <- lapply(slices_, serialize, connection = NULL)
```

```
test_that("collect() support Unicode characters", {
  locale <- Sys.getlocale()
  Sys.setlocale("LC_ALL", "en_US.UTF-8")

  lines <- c("{\"name\":\"안녕하세요\"}",
             "{\"name\":\"您好\", \"age\":30}",
             "{\"name\":\"こんにちは\", \"age\":19}",
             "{\"name\":\"Xin chào\"}")
  jsonPath <- tempfile(pattern="sparkr-test", fileext=".tmp")
  writeLines(lines, jsonPath)

  df <- read.df(sqlContext, jsonPath, "json")
  rdf <- collect(df)
  expect_true(is.data.frame(rdf))
  expect_equal(rdf$name[1], "안녕하세요")
  expect_equal(rdf$name[2], "您好")
  expect_equal(rdf$name[3], "こんにちは")
  expect_equal(rdf$name[4], "Xin chào")

  df1 <- createDataFrame(sqlContext, rdf)
  expect_equal(collect(where(df1, df1$name == "您好"))$name, "您好")
  Sys.setlocale("LC_ALL", locale)
})
```
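
On the locale-restoration question above, one option is to keep the same save/restore as in this testcase but wrap it with `on.exit()`, so the locale is restored even if an expectation fails. A sketch (the helper name `withUtf8Locale` is made up here):

```
withUtf8Locale <- function(expr) {
  locale <- Sys.getlocale()
  on.exit(Sys.setlocale("LC_ALL", locale), add = TRUE)
  Sys.setlocale("LC_ALL", "en_US.UTF-8")
  force(expr)
}

# usage:
# withUtf8Locale({
#   expect_equal(rdf$name[2], "您好")
# })
```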





[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-06 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-128335460
  
The testcase passed with `Sys.setlocale("LC_ALL", "en_US.UTF-8")`. But I had to modify context.R to clear out the UTF-8 indicating bit. It turns out that the UTF-8 indicating bit set in R with `Encoding(x) <- "UTF-8"` causes an error when passing a string to Spark.
I hope the code below supports and preserves UTF-8 encoding in R under `LC_ALL=C`.

context.R
```
127   # Serialize each slice: obtain a list of raws, or a list of lists (slices) of
128   # 2-tuples of raws
129   removeUtf8EncodingBit <- function(s) {
130     Encoding(s) <- "bytes"
131     s
132   }
133   slices_ <- rapply(slices, function(x) ifelse(is.character(x), removeUtf8EncodingBit(x), x), how="list")
134   serializedSlices <- lapply(slices_, serialize, connection = NULL)
135
136   jrdd <- callJStatic("org.apache.spark.api.r.RRDD",
137                       "createRDDFromArray", sc, serializedSlices)
138
139   RDD(jrdd, "byte")
140 }
```

deserialize.R
```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  string
}
```

testcase
```
test_that("collect() support Unicode characters", {
  convertToUtf8 <- function(s) {
    Encoding(s) <- "UTF-8"
    s
  }
  Sys.setlocale("LC_ALL", "en_US.UTF-8")

  lines <- c("{\"name\":\"안녕하세요\"}",
             "{\"name\":\"您好\", \"age\":30}",
             "{\"name\":\"こんにちは\", \"age\":19}",
             "{\"name\":\"Xin chào\"}")
  jsonPath <- tempfile(pattern="sparkr-test", fileext=".tmp")
  writeLines(lines, jsonPath)

  df <- read.df(sqlContext, jsonPath, "json")
  rdf <- collect(df)
  expect_true(is.data.frame(rdf))
  expect_equal(rdf$name[1], convertToUtf8("안녕하세요"))
  expect_equal(rdf$name[2], convertToUtf8("您好"))
  expect_equal(rdf$name[3], convertToUtf8("こんにちは"))
  expect_equal(rdf$name[4], convertToUtf8("Xin chào"))

  df1 <- createDataFrame(sqlContext, rdf)
  expect_equal(collect(where(df1, df1$name == "您好"))$name,
               convertToUtf8("您好"))
})
```








[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-03 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-127481181
  
Firstly, I am not able to enter multibyte character sequences in the R shell under `LC_ALL=C` either. It seems that R does not support it. What I am trying to do is to enable UTF-8 characters sent from Java to be used in R, no matter what locale a user has.
Basically, the testcase passed on my local UTF-8 machine, too. But it failed on Jenkins, unfortunately (I guess its locale is not UTF-8). So I started to look for a way in which R supports and also preserves UTF-8 characters under various locales, e.g. `LC_ALL=C`, as sun-rui suggested.







[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-02 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-127034019
  
@shivaram 
The testcase passed under a UTF-8 locale (e.g. `LC_ALL=ko_KR.UTF-8`). But, unfortunately, `iconv()` returns NA under `LC_ALL=C`, which makes the testcase fail.
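
For reference, a minimal reproduction of that `iconv()` behaviour (a sketch, not from the thread): with no `from` encoding given the input is assumed to be in the native encoding, and under a C locale the multibyte bytes are invalid, so the conversion returns NA.

```
Sys.setlocale("LC_ALL", "C")
x <- "\xea\xb0\x80"                      # the UTF-8 bytes of "가"
iconv(x, to = "UTF-8")                   # NA: bytes are invalid in the C locale
iconv(x, from = "UTF-8", to = "UTF-8")   # succeeds once the source encoding is given
```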





[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-28 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on a diff in the pull request:

https://github.com/apache/spark/pull/7494#discussion_r35721159
  
--- Diff: R/pkg/R/deserialize.R ---
@@ -56,8 +56,10 @@ readTypedObject <- function(con, type) {
 
 readString <- function(con) {
   stringLen <- readInt(con)
-  string <- readBin(con, raw(), stringLen, endian = "big")
-  rawToChar(string)
+  raw <- readBin(con, raw(), stringLen, endian = "big")
+  string <- rawToChar(raw)
+  Encoding(string) <- "UTF-8"
+  enc2native(string)
--- End diff --

Yes, preserving UTF-8 encodings sounds much better.
Do you mean that `enc2native` should be removed, like below?
```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  string
}
```
But this causes an error when calling `createDataFrame()`, which converts a local R data frame to a Spark DataFrame.
I tried to find out the reason, and I guess the MSB in the string's encoding flag is set when I use `Encoding(string) <- "UTF-8"`, and is not set otherwise.

I printed out `serializedSlices` in `parallelize()`, line 132. The result is like the below.
context.R
```
102 parallelize <- function(sc, coll, numSlices = 1) {
103   # TODO: bound/safeguard numSlices
104   # TODO: unit tests for if the split works for all primitives
105   # TODO: support matrix, data frame, etc
106   if ((!is.list(coll) && !is.vector(coll)) || is.data.frame(coll)) {
107     if (is.data.frame(coll)) {
108       message(paste("context.R: A data frame is parallelized by columns."))
109     } else {
110       if (is.matrix(coll)) {
111         message(paste("context.R: A matrix is parallelized by elements."))
112       } else {
113         message(paste("context.R: parallelize() currently only supports lists and vectors.",
114                       "Calling as.list() to coerce coll into a list."))
115       }
116     }
117     coll <- as.list(coll)
118   }
119
120   print(coll)
121
122   if (numSlices > length(coll))
123     numSlices <- length(coll)
124
125   sliceLen <- ceiling(length(coll) / numSlices)
126   slices <- split(coll, rep(1:(numSlices + 1), each = sliceLen)[1:length(coll)])
127
128   # Serialize each slice: obtain a list of raws, or a list of lists (slices) of
129   # 2-tuples of raws
130   serializedSlices <- lapply(slices, serialize, connection = NULL)
131
132   print(serializedSlices)
133   jrdd <- callJStatic("org.apache.spark.api.r.RRDD",
134                       "createRDDFromArray", sc, serializedSlices)
135
136   RDD(jrdd, "byte")
137 }
```
Case 1. with Encoding(string) <- "UTF-8"
```
  [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 13 00 00 00 04 00 00 00
 [26] 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00 07 a2 00 00 00 10
 [51] 00 00 00 01 00 00 80 09 00 00 00 0f ec 95 88 eb 85 95 ed 95 98 ec 84 b8 ec
 [76] 9a 94 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 3e 00 00 00 00 00
[101] 00 00 00 00 10 00 00 00 01 00 00 80 09 00 00 00 06 e6 82 a8 e5 a5 bd 00 00
[126] 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 33 00 00 00 00 00 00 00 00 00
[151] 10 00 00 00 01 00 00 80 09 00 00 00 0f e3 81 93 e3 82 93 e3 81 ab e3 81 a1
[176] e3 81 af 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00
[201] 07 a2 00 00 00 10 00 00 00 01 00 00 80 09 00 00 00 09 58 69 6e 20 63 68 c3
[226] a0 6f
```

Case 2. without Encoding(string) <- "UTF-8"
```
  [1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 13 00 00 00 04 00 00 00
 [26] 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00 07 a2 00 00 00 10
 [51] 00 00 00 01 00 00 00 09 00 00 00 0f ec 95 88 eb 85 95 ed 95 98 ec 84 b8 ec
 [76] 9a 94 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 3e 00 00 00 00 00
[101] 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 e6 82 a8 e5 a5 bd 00 00
[126] 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 33 00 00 00 00 00 00 00 00 00
[151] 10 00 00 00 01 00 00 00 09 00 00 00 0f e3 81 93 e3 82 93 e3 81 ab e3 81 a1
[176] e3 81 af 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00
[201] 07 a2 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 09 58 69 6e 20 63 68 c3
[226] a0 6f
```

You can see that [51], [101], [151], and [201] are different. There is a leading 80 before 09 with `Encoding()`, which, I guess, causes the error.
I think this is the encoding indication bit in R, according to the link you shared.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-124344827
  
Good job, Jenkins.





[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-124172706
  
Adding a zero doesn't solve the problem either. It seems to have the same effect as the version without it.
But the code below works when the locale is "C", not UTF-8.
I set the test strings' encoding to UTF-8 and convert them to the native form with `enc2native()`. The testcase passed.
Also, it is not necessary to use `convertToNative()` under a UTF-8 locale; the testcase passed without it.

```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  enc2native(string)
}
```

```
convertToNative <- function(s) {
  Encoding(s) <- "UTF-8"
  enc2native(s)
}

test_that("collect() support Unicode characters", {
  lines <- c("{\"name\":\"안녕하세요\"}",
 "{\"name\":\"您好\", \"age\":30}",
 "{\"name\":\"こんにちは\", \"age\":19}",
 "{\"name\":\"Xin chào\"}")
  jsonPath <- tempfile(pattern="sparkr-test", fileext=".tmp")
  writeLines(lines, jsonPath)

  df <- read.df(sqlContext, jsonPath, "json")
  rdf <- collect(df)
  print(head(rdf))
  print(convertToNative("안녕하세요"))
  expect_true(is.data.frame(rdf))
  expect_equal(rdf$name[1], convertToNative("안녕하세요"))
  expect_equal(rdf$name[2], convertToNative("您好"))
  expect_equal(rdf$name[3], convertToNative("こんにちは"))
  expect_equal(rdf$name[4], convertToNative("Xin chào"))

  df1 <- createDataFrame(sqlContext, rdf)
  print(head(df1))
  expect_equal(collect(where(df1, df1$name == convertToNative("您好")))$name,
               convertToNative("您好"))
})
```

```
$) export LC_ALL=C
$) ./run-tests.sh
```
```
SparkSQL functions :   age name
1  NA 
2  30 
3  19 
4  NA  Xin cho
[1] ""
.  age name
1  NA 
2  30 
3  19 
4  NA  Xin cho
```







[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-20 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-122939278
  
Unfortunately, not. I guess the strings in the testcase should be converted to native form.

```
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  Encoding(string) <- "UTF-8"
  enc2native(string)
}
```

```
1. Failure (at test_sparkSQL.R#432): collect() support Unicode characters --
rdf$name[1] not equal to "\354\225\210\353\205\225\355\225\230\354\204\270\354\232\224"
1 string mismatches:
x[1]: "\354\225\210\353\205\225\355\225\230\354\204\270\354\232\224"
y[1]: ""

2. Failure (at test_sparkSQL.R#433): collect() support Unicode characters --
rdf$name[2] not equal to "\346\202\250\345\245\275"
1 string mismatches:
x[1]: "\346\202\250\345\245\275"
y[1]: ""

3. Failure (at test_sparkSQL.R#434): collect() support Unicode characters --
rdf$name[3] not equal to "\343\201\223\343\202\223\343\201\253\343\201\241\343\201\257"
1 string mismatches:
x[1]: "\343\201\223\343\202\223\343\201\253\343\201\241\343\201\257"
y[1]: ""

4. Failure (at test_sparkSQL.R#435): collect() support Unicode characters --
rdf$name[4] not equal to "Xin ch\303\240o"
1 string mismatches:
x[1]: "Xin ch\303\240o"
y[1]: "Xin cho"

5. Error: collect() support Unicode characters -
Unsupported type for deserialization
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"),
   warning = function(c) invokeRestart("muffleWarning"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(where(df2, df2$name == "\346\202\250\345\245\275"))$name, "\346\202\250\345\245\275") at test_sparkSQL.R:438
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(expected, actual, ...)
8: compare.character(expected, actual, ...)
9: identical(x, y)
10: collect(where(df2, df2$name == "\346\202\250\345\245\275"))
11: collect(where(df2, df2$name == "\346\202\250\345\245\275"))
12: .local(x, ...)
13: lapply(listCols, function(col) {
   objRaw <- rawConnection(col)
   numRows <- readInt(objRaw)
   col <- readCol(objRaw, numRows)
   close(objRaw)
   col
   })
14: lapply(listCols, function(col) {
   objRaw <- rawConnection(col)
   numRows <- readInt(objRaw)
   col <- readCol(objRaw, numRows)
   close(objRaw)
   col
   })
15: FUN(X[[i]], ...)
16: readCol(objRaw, numRows)
17: do.call(c, lapply(1:numRows, function(x) {
   value <- readObject(inputCon)
   if (is.null(value))
   NA
   else value
   }))
18: lapply(1:numRows, function(x) {
   value <- readObject(inputCon)
   if (is.null(value))
   NA
   else value
   })
19: lapply(1:numRows, function(x) {
   value <- readObject(inputCon)
   if (is.null(value))
   NA
   else value
   })
20: FUN(X[[i]], ...)
21: readObject(inputCon)
22: readTypedObject(con, type)
23: stop(paste("Unsupported type for deserialization", type))
24: .handleSimpleError(function (e)
   {
   e$calls <- head(sys.calls()[-seq_len(frame + 7)], -2)
   signalCondition(e)
   }, "Unsupported type for deserialization ", quote(readTypedObject(con, 
type)))
Error: Test failures
```





[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-20 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-122866982
  
@sun-rui 
I reproduced the same error as the testcase's with
```
$) export LC_ALL=C
$) ./run-tests.sh
```
`rawToChar()` needs to be called; otherwise R gives the error message below.
```
functions on binary files : Error in `Encoding<-`(`*tmp*`, value = "UTF-8") :
  a character vector argument expected
Calls: test_package ... readTypedObject -> getJobj -> jobj -> readString -> Encoding<-
```
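
For context, the reason `rawToChar()` has to come first (a small illustration): `Encoding<-` only accepts a character vector, while `readBin(con, raw(), n)` returns a raw vector.

```
r <- charToRaw("hi")
# Encoding(r) <- "UTF-8"   # errors: "a character vector argument expected"
s <- rawToChar(r)
Encoding(s) <- "UTF-8"     # now valid, because s is a character vector
```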









[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-19 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request:

https://github.com/apache/spark/pull/7494#issuecomment-122643950
  
I am not sure about `readString`, but the testcase, which verifies the intactness of Unicode characters in a native data frame making a round trip to Spark's DataFrame, failed. There is something going on underneath.

```
1. Failure(@test_sparkSQL.R#438): collect() support Unicode characters -
collect(where(df2, df2$name == "\346\202\250\345\245\275"))[[2]] not equal to "\346\202\250\345\245\275"
1 string mismatches:
x[1]: "\346\202\250\345\245\275"
y[1]: "<82>"
```





[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread CHOIJAEHONG1
GitHub user CHOIJAEHONG1 opened a pull request:

https://github.com/apache/spark/pull/7494

[SPARK-8951][SparkR] support Unicode characters in collect()

Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
I changed SerDe.scala so that Spark supports Unicode characters when writing a string to R.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/CHOIJAEHONG1/spark SPARK-8951

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7494.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7494


commit 5325cef60dfd00e0f7fa3156678677f1b3a2ef1a
Author: CHOIJAEHONG 
Date:   2015-07-18T11:57:26Z

[SPARK-8951] support Unicode characters in collect()



