[jira] [Commented] (SPARK-8951) support CJK characters in collect()
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731151#comment-14731151 ]

Shivaram Venkataraman commented on SPARK-8951:
----------------------------------------------

Ah, I should have retested this before merging - I'll send a PR to fix this now.

> support CJK characters in collect()
> -----------------------------------
>
>                 Key: SPARK-8951
>                 URL: https://issues.apache.org/jira/browse/SPARK-8951
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Jaehong Choi
>            Assignee: Jaehong Choi
>            Priority: Minor
>             Fix For: 1.6.0
>
>         Attachments: SerDe.scala.diff
>
>
> Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
> I found out that SerDe in the R API only supports ASCII strings right now, as noted in a comment in the source code.
> So I changed SerDe.scala a little to support CJK, as in the attached file.
> I did not care about efficiency; I just wanted to see if it works.
>
> {noformat}
> people.json
> {"name":"가나"}
> {"name":"테스트123", "age":30}
> {"name":"Justin", "age":19}
>
> df <- read.df(sqlContext, "./people.json", "json")
> head(df)
> Error in rawToChar(string) : embedded nul in string : '\0 \x98'
> {noformat}
>
> {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
> // NOTE: Only works for ASCII right now
> def writeString(out: DataOutputStream, value: String): Unit = {
>   val len = value.length
>   out.writeInt(len + 1) // For the \0
>   out.writeBytes(value)
>   out.writeByte(0)
> }
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
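The error quoted in the description follows directly from the quoted `writeString`: `String.length` counts UTF-16 chars (not encoded bytes), and `DataOutputStream.writeBytes` keeps only the low byte of each char. For "가나" (U+AC00, U+B098) the low bytes are 0x00 and 0x98, which is exactly the `'\0 \x98'` that R's `rawToChar` rejects. A standalone Java sketch of the buggy behavior (`WriteBytesBug` and `buggyWrite` are hypothetical names for illustration, not Spark code):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class WriteBytesBug {
    // Mimics the buggy SerDe.writeString: the length prefix counts
    // UTF-16 chars, and writeBytes drops the high byte of each char.
    static byte[] buggyWrite(String value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(value.length() + 1); // char count, not byte count
        out.writeBytes(value);            // low byte of each char only
        out.writeByte(0);                 // trailing \0 terminator
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String cjk = "가나";                                 // U+AC00 U+B098
        byte[] utf8 = cjk.getBytes(StandardCharsets.UTF_8); // needs 6 bytes
        byte[] written = buggyWrite(cjk);                   // 4-byte len + 2 bytes + \0
        System.out.println("UTF-8 bytes needed: " + utf8.length);
        System.out.println("payload bytes written: " + (written.length - 4));
        // Low byte of U+AC00 is 0x00 - the "embedded nul" rawToChar rejects.
        System.out.println("first payload byte: " + written[4]);
    }
}
```

Two of the three payload bytes are garbage: the nul from U+AC00 and 0x98 from U+B098, matching the error message in the report.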
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731155#comment-14731155 ]

Shivaram Venkataraman commented on SPARK-8951:
----------------------------------------------

Sent https://github.com/apache/spark/pull/8601 to fix this.
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731140#comment-14731140 ]

Jihong MA commented on SPARK-8951:
----------------------------------

This commit causes an R style check failure:

{noformat}
Running R style checks
Loading required package: methods

Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

    filter, na.omit

The following objects are masked from 'package:base':

    intersect, rbind, sample, subset, summary, table, transform

Attaching package: 'testthat'

The following object is masked from 'package:SparkR':

    describe

R/deserialize.R:63:9: style: Trailing whitespace is superfluous.
    string
        ^
lintr checks failed.
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; received return code 1
Archiving unit tests logs...
> No log files found.
Attempting to post to Github...
> Post successful.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
Test FAILed.
{noformat}

Refer to this link for build results (access rights to CI server needed):
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632408#comment-14632408 ]

Apache Spark commented on SPARK-8951:
-------------------------------------

User 'CHOIJAEHONG1' has created a pull request for this issue:
https://github.com/apache/spark/pull/7494
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621834#comment-14621834 ]

Jaehong Choi commented on SPARK-8951:
-------------------------------------

Thanks for your advice. You're right - this is indeed about supporting Unicode. I'll open a PR for this issue. I didn't think much about the null terminator; I am going to figure that out as well.
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620809#comment-14620809 ]

Shivaram Venkataraman commented on SPARK-8951:
----------------------------------------------

Yeah, the right solution here is to use UTF-8 for serialization and deserialization. We already do this in the readStringBytes function (https://github.com/apache/spark/blob/a1964e9d902bb31f001893da8bc81f6dce08c908/core/src/main/scala/org/apache/spark/api/r/SerDe.scala#L92), but {{writeString}} needs to be fixed. Feel free to open a PR for this, as specified in the contribution guide.
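The fix described here - encode to UTF-8 first and make the length prefix count encoded bytes - can be sketched as follows. This is a minimal standalone version for illustration (class name and `main` demo are hypothetical); the actual merged patch in SerDe.scala may differ in details such as how the terminator is handled:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class WriteStringUtf8 {
    // UTF-8-aware writeString: the length prefix counts encoded bytes
    // (plus the trailing \0), and every byte is written intact, so
    // multi-byte CJK characters survive the round trip to R.
    static void writeString(DataOutputStream out, String value) throws IOException {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length + 1); // byte count, +1 for the \0
        out.write(utf8);               // raw UTF-8 bytes, nothing dropped
        out.writeByte(0);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeString(new DataOutputStream(buf), "가나");
        // 4-byte length prefix + 6 UTF-8 bytes + \0 terminator
        System.out.println("total bytes: " + buf.toByteArray().length);
    }
}
```

The corresponding reader (as readStringBytes already does on the deserialization side) must then treat the length as a byte count and decode the bytes as UTF-8 rather than assuming one byte per character.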