[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-09-03 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7494 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enab

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-09-03 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-137567784 @CHOIJAEHONG1 Sorry for the delay. I finally got a chance to try this out on my machine and it seems to work fine with UTF-8 strings with both the locale set to `C` and

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-28 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135945798 @shivaram, @sun-rui Thank you both for the work. It would be great to go through it more, as you said.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-28 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135905717 Thanks @CHOIJAEHONG1 and @sun-rui -- I just want to test this / go through this a bit carefully once more as it's a pretty fundamental change in how we handle strings. W

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-27 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on a diff in the pull request: https://github.com/apache/spark/pull/7494#discussion_r38078086 --- Diff: R/pkg/inst/tests/test_sparkSQL.R --- @@ -417,6 +417,32 @@ test_that("collect() and take() on a DataFrame return the same number of rows an

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread sun-rui
Github user sun-rui commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135265160 @CHOIJAEHONG1, basically LGTM. Some minor comments.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread sun-rui
Github user sun-rui commented on a diff in the pull request: https://github.com/apache/spark/pull/7494#discussion_r38058562 --- Diff: R/pkg/inst/tests/test_sparkSQL.R --- @@ -417,6 +417,32 @@ test_that("collect() and take() on a DataFrame return the same number of rows an ex

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135132119 Merged build finished. Test PASSed.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135132120 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135131987 [Test build #41629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41629/console) for PR 7494 at commit [`3686f15`](https://github.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135077981 [Test build #41629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41629/consoleFull) for PR 7494 at commit [`3686f15`](https://gith

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135077370 Merged build triggered.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135077396 Merged build started.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-135075607 Jenkins retest this please

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134893165 Timeout occurred while fetching from the origin. ``` > git config remote.origin.url https://github.com/apache/spark.git # timeout=10 Fetching upstre

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134887390 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134887387 Merged build finished. Test FAILed.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134878845 Merged build triggered.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134878904 Merged build started.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-25 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134832178 @sun-rui I call `writeBin` with `useBytes=TRUE` (it defaults to FALSE), referring to the documentation below. https://stat.ethz.ch/R-manual/R-devel/library/base/h
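A minimal sketch of the `useBytes` behavior mentioned in this comment, not the code from the PR: with `useBytes = TRUE`, `writeBin` emits a character string's stored bytes as they are, followed by a terminating nul, instead of translating them to the native encoding first. The sample string and variable names are illustrative assumptions.

```r
# Illustrative only; not taken from the SparkR sources.
# Build "가" (U+AC00) from its UTF-8 bytes so the script itself stays ASCII.
utf8_bytes <- as.raw(c(0xea, 0xb0, 0x80))
s <- rawToChar(utf8_bytes)
Encoding(s) <- "UTF-8"  # mark the stored bytes as UTF-8

# With a raw vector as `con`, writeBin returns the bytes it would have written.
# useBytes = TRUE writes the stored bytes as-is rather than re-encoding them
# for the current locale; character data gets a terminating nul byte.
out <- writeBin(s, raw(), useBytes = TRUE)
print(out)
```

Because the bytes are not re-encoded, the output should not depend on whether the session runs under `LC_ALL=C` or a UTF-8 locale.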

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-25 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134795102 Sorry for being late. I got this error message in `SerDe.scala`. The byte sequence sent from `worker.R` is not null-terminated. I tried to append

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-25 Thread sun-rui
Github user sun-rui commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-134789531 @CHOIJAEHONG1, any update on this?

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-13 Thread sun-rui
Github user sun-rui commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-130662126 @CHOIJAEHONG1, @shivaram: 1. R worker can be in any locale, because R can recognize UTF-8 and preserve UTF-8 encoding when manipulating strings. The root cause o

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-12 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-130482232 Yeah I think the right thing to do is to make sure `worker.R` responds with UTF-8 strings always. Given the discussion in this thread I think the problem is that we use

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-11 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-129846643 For example, declare a variable with a string "가" ``` > a<-"가" ``` With `LC_ALL=C` and `Encoding(a)<-"bytes"` ``` > Sys.setlocale("LC_

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-09 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-129227462 @CHOIJAEHONG1 This diff is looking good. Can you update the PR with this? BTW, why do we need to clear the encoding bit in `context.R`? Does the serialization not work

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-09 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-129171042 Okay, using `iconv` works nicely. But `context.R` still needs to be modified to get rid of the leading UTF-8 indicating bit in a string before sending it to the Spark t

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-06 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-128447851 Can we still use the `iconv` solution with `Sys.setlocale` in the test case? It doesn't seem right to use `rawToChar` when we are decoding UTF-8 strings

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-06 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-128335460 The testcase passed with `Sys.setlocale("LC_ALL", "en_US.UTF-8")`. But, I had to modify context.R to clear out the UTF-8 indicating bit. It seems to turn out that t

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-04 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-127663833 I see - so the problem here is how to write a unit test that uses UTF-8 and works with `LC_ALL=C`? One simple thing we might be able to do is to set the locale insi

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-03 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-127481181 Firstly, I am not able to enter multibyte character sequences in the R shell under `LC_ALL=C` either. It seems that R does not support it. What I am trying to do is

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-02 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-127059644 I'm not sure why we expect the tests to pass with `LC_ALL=C`. In that case I am not even able to enter the characters into the console (i.e. `parse` or `scan` in R won

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-08-02 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-127034019 @shivaram The test case passed under a UTF-8 locale (e.g. LC_ALL=ko_KR.UTF). Unfortunately, iconv() returns NA under "LC_ALL=C", which makes the testcase f

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-29 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-126061999 @sun-rui @CHOIJAEHONG1 I took a look at this and it looks like the right way to do this would be to use `iconv` and not use `rawToChar` which doesn't parse UTF-8 correc

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-29 Thread shivaram
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/7494#discussion_r35788337 --- Diff: R/pkg/inst/tests/test_sparkSQL.R --- @@ -417,6 +417,32 @@ test_that("collect() and take() on a DataFrame return the same number of rows an e

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-28 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on a diff in the pull request: https://github.com/apache/spark/pull/7494#discussion_r35721159 --- Diff: R/pkg/R/deserialize.R --- @@ -56,8 +56,10 @@ readTypedObject <- function(con, type) { readString <- function(con) { stringLen

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-24 Thread sun-rui
Github user sun-rui commented on a diff in the pull request: https://github.com/apache/spark/pull/7494#discussion_r35407059 --- Diff: R/pkg/R/deserialize.R --- @@ -56,8 +56,10 @@ readTypedObject <- function(con, type) { readString <- function(con) { stringLen <- re

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124344827 good job, jenkins.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124343874 Merged build finished. Test PASSed.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124343702 [Test build #38304 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38304/console) for PR 7494 at commit [`bc469d8`](https://github.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124324608 Merged build finished. Test FAILed.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124322251 [Test build #38304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38304/consoleFull) for PR 7494 at commit [`bc469d8`](https://gith

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124321558 Merged build triggered.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124321636 Merged build started.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124319611 Merged build started.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124319573 Merged build triggered.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-23 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-124172706 Adding a zero doesn't solve the problem either. It seems to have the same effect as the one without it. But, the below works in which the locale is "C", not UTF-

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-20 Thread sun-rui
Github user sun-rui commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-123162043 Could you try adding a zero as done previously in writeString(): val utf8 = value.getBytes("UTF-8") val len = utf8.length out.writeInt(len + 1
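One way to picture the framing suggested in this comment, seen from the R side. The snippet above is truncated, so this sketch assumes the frame is a 4-byte big-endian length (`len + 1`), then the UTF-8 bytes, then a single trailing zero byte; the layout details and all variable names are assumptions for illustration, not the actual SparkR wire format.

```r
# Hypothetical frame layout: int32 length (payload + 1), payload bytes, 0x00.
payload <- as.raw(c(0xea, 0xb0, 0x80))  # UTF-8 bytes of "가" (U+AC00)
frame <- c(writeBin(length(payload) + 1L, raw(), endian = "big"),
           payload, as.raw(0x00))

# Read the frame back: first the length, then that many raw bytes.
con <- rawConnection(frame, open = "rb")
len <- readBin(con, what = "integer", n = 1L, endian = "big")
bytes <- readBin(con, what = "raw", n = len)
close(con)

# Drop the trailing zero before converting, then mark the result as UTF-8.
decoded <- rawToChar(bytes[-length(bytes)])
Encoding(decoded) <- "UTF-8"
```

Reading the payload as raw bytes and converting afterwards keeps the length arithmetic in bytes, which matters because a multibyte character occupies more than one byte.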

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-20 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122939278 Unfortunately, not. I guess the string in the test case should be changed to native form. ``` readString <- function(con) { stringLen <- read

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-20 Thread sun-rui
Github user sun-rui commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122867464 yeah, rawToChar() is needed. Then does it work now?

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-20 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122866982 @sun-rui I reproduced the same error as the testcase's with ``` $) export LC_ALL=C $) ./run-tests.sh ``` It needs to call rawToChar(), R give

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-20 Thread sun-rui
Github user sun-rui commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122837016 I think readString() in deserialize.R should be updated accordingly. Could you try: string <- readBin(...) Encoding(string) <- "UTF-8" string <- enc2nat
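A small sketch of the encoding mark this comment relies on, assuming the bytes received are already valid UTF-8. It is not the `readString()` from `deserialize.R`, and the variable names are hypothetical. `Encoding<-` only changes how R interprets the stored bytes; it does not convert them.

```r
# Illustrative only: the UTF-8 bytes of "가" (U+AC00), as they might arrive.
raw_bytes <- as.raw(c(0xea, 0xb0, 0x80))

string <- rawToChar(raw_bytes)
before <- Encoding(string)       # rawToChar leaves the encoding undeclared

Encoding(string) <- "UTF-8"      # declare the bytes to be UTF-8; no conversion
after <- Encoding(string)

# The stored bytes are unchanged by the mark.
byte_count <- nchar(string, type = "bytes")

# enc2native(string) would then translate to the session's native encoding;
# what it produces depends on the locale, so it is not asserted here.
```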

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-19 Thread CHOIJAEHONG1
Github user CHOIJAEHONG1 commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122643950 I am not sure about `readString`, but the test case, which verifies the intactness of unicode characters in a native dataframe making a round trip to Spark's DataFra

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122581352 Merged build finished. Test FAILed.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122581339 [Test build #37723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37723/console) for PR 7494 at commit [`5325cef`](https://github.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122564017 [Test build #37723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37723/consoleFull) for PR 7494 at commit [`5325cef`](https://gith

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122563983 @CHOIJAEHONG1 Thanks for the PR. Are we sure we don't need to change `readString` in https://github.com/apache/spark/blob/1017908205b7690dc0b0ed4753b36fab5641f7ac/R/pkg

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122563838 Merged build started.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122563832 Merged build triggered.

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122563657 Jenkins, ok to test

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122535673 Can one of the admins verify this patch?

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7494#issuecomment-122535662 Can one of the admins verify this patch?

[GitHub] spark pull request: [SPARK-8951][SparkR] support Unicode character...

2015-07-18 Thread CHOIJAEHONG1
GitHub user CHOIJAEHONG1 opened a pull request: https://github.com/apache/spark/pull/7494 [SPARK-8951][SparkR] support Unicode characters in collect() Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters. I cha