[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620809#comment-14620809 ]
Shivaram Venkataraman commented on SPARK-8951:
----------------------------------------------

Yeah, the right solution here is to use UTF-8 for serialization and deserialization. We already do this in the readStringBytes function (https://github.com/apache/spark/blob/a1964e9d902bb31f001893da8bc81f6dce08c908/core/src/main/scala/org/apache/spark/api/r/SerDe.scala#L92), but `writeString` needs to be fixed. Feel free to open a PR for this as specified in the contribution guide.

> support CJK characters in collect()
> -----------------------------------
>
>                 Key: SPARK-8951
>                 URL: https://issues.apache.org/jira/browse/SPARK-8951
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Jaehong Choi
>            Priority: Minor
>         Attachments: SerDe.scala.diff
>
> Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
> I found out that SerDe in the R API currently supports only ASCII strings, as noted in a comment in the source code.
> So I modified SerDe.scala slightly to support CJK, as in the attached file.
> I did not care about efficiency; I just wanted to see if it works.
> {noformat}
> people.json
> {"name":"가나"}
> {"name":"테스트123", "age":30}
> {"name":"Justin", "age":19}
>
> df <- read.df(sqlContext, "./people.json", "json")
> head(df)
> Error in rawToChar(string) : embedded nul in string : '\0 \x98'
> {noformat}
> {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
> // NOTE: Only works for ASCII right now
> def writeString(out: DataOutputStream, value: String): Unit = {
>   val len = value.length
>   out.writeInt(len + 1) // For the \0
>   out.writeBytes(value)
>   out.writeByte(0)
> }
> {code}
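For reference, here is a minimal sketch of what a UTF-8-aware `writeString` could look like. It assumes the existing wire format is kept (an int length prefix followed by a trailing \0), and that the R side decodes the declared number of bytes as UTF-8, which readStringBytes already does per the comment above. The key change is that the buggy version counts characters via `value.length`, which undercounts multi-byte CJK strings; the fix is to encode first and count bytes. This is an illustration, not the final patch:

{code:title=UTF-8-aware writeString (sketch)}
import java.io.DataOutputStream
import java.nio.charset.StandardCharsets

def writeString(out: DataOutputStream, value: String): Unit = {
  // Encode first, then measure: utf8.length is the number of bytes
  // actually written, whereas value.length is the character count and
  // diverges for multi-byte characters such as CJK.
  val utf8 = value.getBytes(StandardCharsets.UTF_8)
  val len = utf8.length
  out.writeInt(len + 1)    // For the \0
  out.write(utf8, 0, len)  // Raw UTF-8 bytes, not out.writeBytes
  out.writeByte(0)
}
{code}

Note that `out.writeBytes(value)` is also replaced: DataOutputStream.writeBytes truncates each char to its low byte, which mangles any non-Latin-1 text even if the length were correct.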