Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/7494
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-137567784
@CHOIJAEHONG1 Sorry for the delay. I finally got a chance to try this out
on my machine and it seems to work fine with UTF-8 strings with both the locale
set to `C` and
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-135945798
@shivaram, @sun-rui
Thank you guys for the work. It would be great to go through it more like you
said.
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-135905717
Thanks @CHOIJAEHONG1 and @sun-rui -- I just want to test this / go through
this a bit carefully once more, as it's a pretty fundamental change in how we
handle strings. W
Github user CHOIJAEHONG1 commented on a diff in the pull request:
https://github.com/apache/spark/pull/7494#discussion_r38078086
--- Diff: R/pkg/inst/tests/test_sparkSQL.R ---
@@ -417,6 +417,32 @@ test_that("collect() and take() on a DataFrame return
the same number of rows an
Github user sun-rui commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-135265160
@CHOIJAEHONG1, basically LGTM. Some minor comments.
Github user sun-rui commented on a diff in the pull request:
https://github.com/apache/spark/pull/7494#discussion_r38058562
--- Diff: R/pkg/inst/tests/test_sparkSQL.R ---
@@ -417,6 +417,32 @@ test_that("collect() and take() on a DataFrame return
the same number of rows an
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-135132119
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-135132120
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-135131987
[Test build #41629 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41629/console)
for PR 7494 at commit
[`3686f15`](https://github.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-135077981
[Test build #41629 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41629/consoleFull)
for PR 7494 at commit
[`3686f15`](https://gith
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-135077370
Merged build triggered.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-135077396
Merged build started.
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-135075607
Jenkins retest this please
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-134893165
Timeout occurred while fetching from the origin.
```
> git config remote.origin.url https://github.com/apache/spark.git #
timeout=10
Fetching upstre
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-134887390
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-134887387
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-134878845
Merged build triggered.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-134878904
Merged build started.
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-134832178
@sun-rui
I call `writeBin` with `useBytes=TRUE` (it defaults to `FALSE`), referring
to the documentation below.
https://stat.ethz.ch/R-manual/R-devel/library/base/h
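The point of `useBytes=TRUE` is that the raw UTF-8 bytes should pass through untouched, without being re-encoded into the current locale. A minimal Python sketch of the same idea, writing a length-prefixed UTF-8 payload (the helper name `write_string` is illustrative, not SparkR's):

```python
import struct

def write_string(buf: bytearray, s: str) -> None:
    """Append a big-endian 4-byte length prefix followed by the raw UTF-8 bytes."""
    data = s.encode("utf-8")             # encode once, explicitly, to UTF-8
    buf += struct.pack(">i", len(data))  # length prefix
    buf += data                          # bytes as-is: the useBytes=TRUE analogue

buf = bytearray()
write_string(buf, "가나다")  # three Hangul syllables, three UTF-8 bytes each
```

Because the bytes are never re-interpreted in the writer's locale, the reader can decode them back regardless of its own locale settings.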
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-134795102
Sorry for being late.
I got this error message in `SerDe.scala`. The byte sequence sent from
`worker.R` is not null-terminated.
I tried to append
Github user sun-rui commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-134789531
@CHOIJAEHONG1 , any update on this?
Github user sun-rui commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-130662126
@CHOIJAEHONG1, @shivaram:
1. R worker can be in any locale, because R can recognize UTF-8 and
preserve UTF-8 encoding when manipulating strings. The root cause o
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-130482232
Yeah I think the right thing to do is to make sure `worker.R` responds with
UTF-8 strings always. Given the discussion in this thread I think the problem
is that we use
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-129846643
For example, declare a variable with a string "가"
```
> a<-"가"
```
With `LC_ALL=C` and `Encoding(a)<-"bytes"`
```
> Sys.setlocale("LC_
```
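The garbling discussed here comes from the UTF-8 bytes of 가 (U+AC00, encoded as EA B0 80) being reinterpreted one byte at a time by a single-byte locale, which renders them as something like "ê°". A short Python illustration of that reinterpretation:

```python
s = "가"                          # U+AC00
raw = s.encode("utf-8")
assert raw == b"\xea\xb0\x80"     # the three UTF-8 bytes
mojibake = raw.decode("latin-1")  # roughly what a single-byte locale does
# mojibake is 'ê°\x80': \xea -> 'ê', \xb0 -> '°', \x80 -> a control char
roundtrip = mojibake.encode("latin-1").decode("utf-8")  # bytes survive intact
```

The bytes themselves are never damaged; only the display is wrong, which is why decoding them properly recovers the original string.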
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-129227462
@CHOIJAEHONG1 This diff is looking good. Can you update the PR with this?
BTW, why do we need to clear the encoding bit in `context.R`? Does the
serialization not work
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-129171042
Okay, using `iconv` works nicely. But it still needs a change to `context.R`
to get rid of the leading UTF-8 encoding bit in a string before sending to
the Spark t
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-128447851
Can we still use the `iconv` solution with `Sys.setlocale` in the
test case? It doesn't seem right to use `rawToChar` when we are decoding UTF-8
strings
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-128335460
The testcase passed with `Sys.setlocale("LC_ALL", "en_US.UTF-8")`. But I
had to modify `context.R` to clear out the UTF-8 encoding bit. It seems to turn
out that t
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-127663833
I see - so the problem here is how to write a unit test that uses UTF-8
and works with `LC_ALL=C`? One simple thing we might be able to do is to set
the locale insi
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-127481181
Firstly, I am not able to enter multibyte character sequences in the R
shell under `LC_ALL=C` either. It seems that R does not support it. What I am
trying to do is
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-127059644
I'm not sure why we expect the tests to pass with `LC_ALL=C`. In that case
I am not even able to enter the characters into the console (i.e. `parse` or
`scan` in R won
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-127034019
@shivaram
The testcase passed under a UTF-8 locale (e.g. `LC_ALL=ko_KR.UTF-8`). But,
unfortunately, `iconv()` returns `NA` under `LC_ALL=C`, which makes the
testcase f
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-126061999
@sun-rui @CHOIJAEHONG1 I took a look at this and it looks like the right
way to do this would be to use `iconv` and not use `rawToChar` which doesn't
parse UTF-8 correc
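The `iconv`-vs-`rawToChar` distinction has a direct analogue in most languages: a decode must name the source encoding, while a plain byte-to-character reinterpretation silently produces the wrong code points. A hedged Python sketch of the two paths:

```python
raw = bytes([0xEA, 0xB0, 0x80])       # UTF-8 encoding of "가"
decoded = raw.decode("utf-8")         # explicit decode: the iconv-style path
naive = "".join(chr(b) for b in raw)  # one char per byte: rawToChar in a C locale
# decoded is the single character "가"; naive is three wrong characters
```

The naive path never errors, which is what makes the bug easy to miss until CJK data appears.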
Github user shivaram commented on a diff in the pull request:
https://github.com/apache/spark/pull/7494#discussion_r35788337
--- Diff: R/pkg/inst/tests/test_sparkSQL.R ---
@@ -417,6 +417,32 @@ test_that("collect() and take() on a DataFrame return
the same number of rows an
Github user CHOIJAEHONG1 commented on a diff in the pull request:
https://github.com/apache/spark/pull/7494#discussion_r35721159
--- Diff: R/pkg/R/deserialize.R ---
@@ -56,8 +56,10 @@ readTypedObject <- function(con, type) {
readString <- function(con) {
stringLen
Github user sun-rui commented on a diff in the pull request:
https://github.com/apache/spark/pull/7494#discussion_r35407059
--- Diff: R/pkg/R/deserialize.R ---
@@ -56,8 +56,10 @@ readTypedObject <- function(con, type) {
readString <- function(con) {
stringLen <- re
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-124344827
good job, jenkins.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-124343874
Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-124343702
[Test build #38304 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38304/console)
for PR 7494 at commit
[`bc469d8`](https://github.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-124324608
Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-124322251
[Test build #38304 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38304/consoleFull)
for PR 7494 at commit
[`bc469d8`](https://gith
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-124321558
Merged build triggered.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-124321636
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-124319611
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-124319573
Merged build triggered.
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-124172706
Adding a zero doesn't solve the problem either. It seems to have the same
effect as without it.
But the below works, in which the locale is "C", not UTF-
Github user sun-rui commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-123162043
Could you try adding a zero as done previously in `writeString()`:
```
val utf8 = value.getBytes("UTF-8")
val len = utf8.length
out.writeInt(len + 1
```
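The zero-terminated layout suggested here (a length that counts the terminator, the UTF-8 bytes, then a zero byte) can be sketched end to end in Python. The helper names are illustrative; the big-endian 4-byte length matches Java's `DataOutputStream.writeInt` convention:

```python
import io
import struct

def write_string(out, value: str) -> None:
    """Write a zero-terminated, length-prefixed UTF-8 string."""
    utf8 = value.encode("utf-8")
    out.write(struct.pack(">i", len(utf8) + 1))  # length counts the terminator
    out.write(utf8)
    out.write(b"\x00")                           # trailing zero byte

def read_string(inp) -> str:
    """Read the matching format back: strip the terminator, then decode."""
    (n,) = struct.unpack(">i", inp.read(4))
    return inp.read(n)[:-1].decode("utf-8")

buf = io.BytesIO()
write_string(buf, "가나다")
buf.seek(0)
result = read_string(buf)
```

The key invariant is that writer and reader agree on whether the length includes the terminator; a one-byte mismatch there corrupts every string that follows in the stream.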
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122939278
Unfortunately, not.
I guess the string in the testcase should be changed to native form.
```
readString <- function(con) {
stringLen <- read
```
Github user sun-rui commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122867464
yeah, rawToChar() is needed. Then does it work now?
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122866982
@sun-rui
I reproduced the same error as in the testcase with
```
$) export LC_ALL=C
$) ./run-tests.sh
```
It needs to call `rawToChar()`; R give
Github user sun-rui commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122837016
I think `readString()` in deserialize.R should be updated accordingly. Could
you try:
```
string <- readBin(...)
Encoding(string) <- "UTF-8"
string <- enc2nat
```
Github user CHOIJAEHONG1 commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122643950
I am not sure about `readString`, but the testcase, which verifies the
intactness of Unicode characters in a native dataframe making a round trip to
Spark's DataFra
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122581352
Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122581339
[Test build #37723 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37723/console)
for PR 7494 at commit
[`5325cef`](https://github.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122564017
[Test build #37723 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37723/consoleFull)
for PR 7494 at commit
[`5325cef`](https://gith
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122563983
@CHOIJAEHONG1 Thanks for the PR. Are we sure we don't need to change
`readString` in
https://github.com/apache/spark/blob/1017908205b7690dc0b0ed4753b36fab5641f7ac/R/pkg
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122563838
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122563832
Merged build triggered.
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122563657
Jenkins, ok to test
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122535673
Can one of the admins verify this patch?
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7494#issuecomment-122535662
Can one of the admins verify this patch?
GitHub user CHOIJAEHONG1 opened a pull request:
https://github.com/apache/spark/pull/7494
[SPARK-8951][SparkR] support Unicode characters in collect()
Spark gives an error message and does not show the output when a field of
the result DataFrame contains CJK characters.
I cha