[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jaehong Choi updated SPARK-8951:
--------------------------------
    Description: 
Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
I found that SerDe in the R API only supports ASCII right now, so I changed SerDe.scala slightly to support CJK; see the attached patch.
\\
{noformat}
people.json
{"name":"가나"}
{"name":"테스트123", "age":30}
{"name":"Justin", "age":19}

df <- read.df(sqlContext, "./people.json", "json")
head(df)

Error in rawToChar(string) : embedded nul in string : '\0 \x98'
{noformat}
{code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
  // NOTE: Only works for ASCII right now
  def writeString(out: DataOutputStream, value: String): Unit = {
    val len = value.length
    out.writeInt(len + 1) // For the \0
    out.writeBytes(value)
    out.writeByte(0)
  }
{code}

  was:
Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
I found that SerDe in the R API only supports ASCII right now, so I changed SerDe.scala slightly to support CJK; see the attached patch.
{noformat}
people.json
{"name":"가나"}
{"name":"테스트123", "age":30}
{"name":"Justin", "age":19}

df <- read.df(sqlContext, "./people.json", "json")
head(df)

Error in rawToChar(string) : embedded nul in string : '\0 \x98'
{noformat}
{code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
  // NOTE: Only works for ASCII right now
  def writeString(out: DataOutputStream, value: String): Unit = {
    val len = value.length
    out.writeInt(len + 1) // For the \0
    out.writeBytes(value)
    out.writeByte(0)
  }
{code}


> support CJK characters in collect()
> -----------------------------------
>
>                 Key: SPARK-8951
>                 URL: https://issues.apache.org/jira/browse/SPARK-8951
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Jaehong Choi
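
For context on why the reported error shows '\0 \x98': java.io.DataOutputStream.writeBytes writes only the low-order byte of each char, and the two characters in "가나" are U+AC00 and U+B098, so the payload becomes the bytes 0x00 0x98 and R sees an embedded nul. The following is a minimal sketch of that behavior next to one possible UTF-8-aware variant. It is not the attached patch; the object name Utf8WriteStringSketch and the helpers writeStringLegacy, writeStringUtf8, encodeLegacy, and encodeUtf8 are hypothetical names introduced only for illustration, and the sketch assumes the R side would also be changed to read the payload as UTF-8.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

// Hypothetical illustration only; not the code from the attached patch.
object Utf8WriteStringSketch {

  // Same logic as the writeString quoted in the issue: the length is the
  // number of UTF-16 chars, and writeBytes keeps only the low byte of each.
  def writeStringLegacy(out: DataOutputStream, value: String): Unit = {
    val len = value.length
    out.writeInt(len + 1) // For the \0
    out.writeBytes(value)
    out.writeByte(0)
  }

  // Assumed alternative: encode to UTF-8 first, then send the byte count
  // (plus one for the trailing \0) followed by the encoded bytes.
  def writeStringUtf8(out: DataOutputStream, value: String): Unit = {
    val utf8 = value.getBytes(StandardCharsets.UTF_8)
    out.writeInt(utf8.length + 1) // For the \0
    out.write(utf8, 0, utf8.length)
    out.writeByte(0)
  }

  // Helper: run a writer against an in-memory stream and return the bytes.
  private def capture(write: DataOutputStream => Unit): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    write(out)
    out.flush()
    buffer.toByteArray
  }

  def encodeLegacy(value: String): Array[Byte] =
    capture(out => writeStringLegacy(out, value))

  def encodeUtf8(value: String): Array[Byte] =
    capture(out => writeStringUtf8(out, value))
}
```

With this sketch, encodeLegacy("가나") produces a 4-byte length followed by the payload bytes 0x00 0x98 and the terminator, which matches the string shown in the error message. encodeUtf8("가나") instead carries the six UTF-8 bytes of the two characters, with the length field counting bytes rather than chars.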