[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-09-04 Thread Shivaram Venkataraman (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731151#comment-14731151 ]

Shivaram Venkataraman commented on SPARK-8951:
--

Ah, I should have retested this before merging. I'll send a PR to fix this now.

> support CJK characters in collect()
> ---
>
> Key: SPARK-8951
> URL: https://issues.apache.org/jira/browse/SPARK-8951
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Jaehong Choi
>Assignee: Jaehong Choi
>Priority: Minor
> Fix For: 1.6.0
>
> Attachments: SerDe.scala.diff
>
>
> Spark gives an error message and does not show the output when a field of the 
> result DataFrame contains CJK characters.
> I found out that the SerDe in the R API currently supports only ASCII strings, 
> as noted in a comment in the source code.
> So I modified SerDe.scala slightly to support CJK; see the attached file.
> I did not worry about efficiency; I just wanted to see whether it works.
> {noformat}
> people.json
> {"name":"가나"}
> {"name":"테스트123", "age":30}
> {"name":"Justin", "age":19}
> df <- read.df(sqlContext, "./people.json", "json")
> head(df)
> Error in rawToChar(string) : embedded nul in string : '\0 \x98'
> {noformat}
> {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
>   // NOTE: Only works for ASCII right now
>   def writeString(out: DataOutputStream, value: String): Unit = {
>     val len = value.length
>     out.writeInt(len + 1) // For the \0
>     out.writeBytes(value)
>     out.writeByte(0)
>   }
> {code}
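For context, the root cause is twofold: `DataOutputStream.writeBytes` discards the high byte of each UTF-16 char (for 가, U+AC00, the low byte is 0x00 — hence the "embedded nul"; for 나, U+B098, it is 0x98, matching the '\0 \x98' in the error above), and `value.length` counts chars, not encoded bytes. A minimal, self-contained Java sketch of the failure and of a UTF-8-based fix in the spirit of the eventual patch (class and method names here are illustrative, not the actual Spark code):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

public class Utf8SerDeDemo {
    // Buggy: mirrors the original writeString. writeBytes() keeps only the
    // low 8 bits of each UTF-16 char, mangling any non-ASCII text.
    static byte[] writeAscii(String value) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(value.length() + 1); // char count, not byte count
        out.writeBytes(value);            // truncates each char to one byte
        out.writeByte(0);
        return buf.toByteArray();
    }

    // Sketch of a fix: encode to UTF-8 first and send the byte length.
    static byte[] writeUtf8(String value) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length + 1);   // byte count, plus the trailing \0
        out.write(bytes, 0, bytes.length);
        out.writeByte(0);
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String s = "가나";                 // 2 UTF-16 chars, 6 UTF-8 bytes
        byte[] bad = writeAscii(s);
        byte[] good = writeUtf8(s);
        System.out.println("buggy payload bytes: " + (bad.length - 4));  // 3
        System.out.println("utf8  payload bytes: " + (good.length - 4)); // 7
        // Decoding the UTF-8 payload (skip the 4-byte length, drop the \0)
        // round-trips the original string.
        String back = new String(good, 4, good.length - 5, StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + back.equals(s));
    }
}
```

The R side already reads length-prefixed bytes (see readStringBytes, referenced later in this thread), so once the writer sends a correct UTF-8 byte count, the reader only needs to decode those bytes as UTF-8.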



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-09-04 Thread Shivaram Venkataraman (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731155#comment-14731155 ]

Shivaram Venkataraman commented on SPARK-8951:
--

Sent https://github.com/apache/spark/pull/8601 to fix this.




[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-09-04 Thread Jihong MA (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731140#comment-14731140 ]

Jihong MA commented on SPARK-8951:
--

This commit caused an R style check failure.


Running R style checks

Loading required package: methods

Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

filter, na.omit

The following objects are masked from 'package:base':

intersect, rbind, sample, subset, summary, table, transform


Attaching package: 'testthat'

The following object is masked from 'package:SparkR':

describe

R/deserialize.R:63:9: style: Trailing whitespace is superfluous.
  string 
^
lintr checks failed.
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-r ; 
received return code 1
Archiving unit tests logs...
> No log files found.
Attempting to post to Github...
 > Post successful.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
ERROR: Publisher 'Publish JUnit test result report' failed: No test report 
files were found. Configuration error?
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
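The lintr failure above is just a trailing space on line 63 of R/deserialize.R. A quick way to reproduce the check and apply the usual fix, shown here on a scratch file (GNU sed syntax assumed):

```shell
# Make a scratch file with a trailing space, like the flagged deserialize.R line.
printf 'string \n' > demo.R
# Detect it the way lintr's trailing-whitespace check does.
grep -n '[[:space:]]$' demo.R
# Strip trailing whitespace in place (GNU sed; BSD sed needs `sed -i ''`).
sed -i 's/[[:space:]]*$//' demo.R
```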




[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-07-18 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632408#comment-14632408 ]

Apache Spark commented on SPARK-8951:
-

User 'CHOIJAEHONG1' has created a pull request for this issue:
https://github.com/apache/spark/pull/7494




[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-07-10 Thread Jaehong Choi (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621834#comment-14621834 ]

Jaehong Choi commented on SPARK-8951:
-

Thanks for your advice.
You're right: this is indeed about supporting Unicode. I'll open a PR for this 
issue.
I hadn't thought much about the null terminator; I am going to figure that out 
as well.




[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-07-09 Thread Shivaram Venkataraman (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620809#comment-14620809 ]

Shivaram Venkataraman commented on SPARK-8951:
--

Yeah, the right solution here is to use UTF-8 for both serialization and 
deserialization. We already do this in the readStringBytes function 
https://github.com/apache/spark/blob/a1964e9d902bb31f001893da8bc81f6dce08c908/core/src/main/scala/org/apache/spark/api/r/SerDe.scala#L92
but `writeString` needs to be fixed as well. Feel free to open a PR for this, as 
described in the contribution guide.
