[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2017-05-09 Thread cjuexuan
Github user cjuexuan closed the pull request at:

https://github.com/apache/spark/pull/16428


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273866
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* indicates a timestamp format. Custom date formats follow the formats 
at
* `java.text.SimpleDateFormat`. This applies to timestamp type.
* 
+   * `encoding`(default `utf-8`) save dataFrame 2 csv by giving 
encoding
--- End diff --

Oh, also, it seems the newly added option here should be put in ..

```

...

```

so that this can be rendered fine in Java API documentation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread cjuexuan
Github user cjuexuan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273798
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -905,4 +906,21 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
   checkAnswer(df, Row(1, null))
 }
   }
+
+  test("save data with gb18030") {
+withTempPath{ path =>
+  Seq(("1", "中文"))
+.toDF("num", "lanaguage")
+.write
+.option("encoding", "GB18030")
+.option("header", "true")
+.csv(path.getAbsolutePath)
+  val df = spark.read
+.option("header", "true")
+.option("encoding", "GB18030")
+.csv(path.getAbsolutePath)
+
+  checkAnswer(df, Row("1", "中文"))
--- End diff --

sounds good


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273737
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* indicates a timestamp format. Custom date formats follow the formats 
at
* `java.text.SimpleDateFormat`. This applies to timestamp type.
* 
+   * `encoding`(default `utf-8`) save dataFrame 2 csv by giving 
encoding
--- End diff --

looks good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273685
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -905,4 +906,21 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
   checkAnswer(df, Row(1, null))
 }
   }
+
+  test("save data with gb18030") {
+withTempPath{ path =>
+  Seq(("1", "中文"))
+.toDF("num", "lanaguage")
+.write
+.option("encoding", "GB18030")
+.option("header", "true")
+.csv(path.getAbsolutePath)
+  val df = spark.read
+.option("header", "true")
+.option("encoding", "GB18030")
+.csv(path.getAbsolutePath)
+
+  checkAnswer(df, Row("1", "中文"))
--- End diff --

Could we write this something like as below:

```scala
// scalastyle:off
val df = Seq(("1", "中文")).toDF("num", "lanaguage")
// scalastyle:on
df.write
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv(path.getAbsolutePath)

val readBack = spark.read
  .option("header", "true")
  .option("encoding", "GB18030")
  .csv(path.getAbsolutePath)
 
checkAnswer(df, readBack)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread cjuexuan
Github user cjuexuan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273679
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* indicates a timestamp format. Custom date formats follow the formats 
at
* `java.text.SimpleDateFormat`. This applies to timestamp type.
* 
+   * `encoding`(default `utf-8`) save dataFrame 2 csv by giving 
encoding
--- End diff --

@HyukjinKwon  what about
 ```
 `encoding` (default `UTF-8`): encodes the CSV files by the given 
encoding
   * type.  
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273677
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -677,6 +677,8 @@ def csv(self, path, mode=None, compression=None, 
sep=None, quote=None, escape=No
 snappy and deflate).
 :param sep: sets the single character as a separator for each 
field and value. If None is
 set, it uses the default value, ``,``.
+:param encoding: sets writer CSV files by the given encoding type. 
If None is set,
+ it uses the default value, ``UTF-8``.
--- End diff --

Here too, let's resemble the one in `DataFrameReader` above in this file.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273678
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -905,4 +906,21 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
   checkAnswer(df, Row(1, null))
 }
   }
+
+  test("save data with gb18030") {
+withTempPath{ path =>
--- End diff --

nit: it should be `withTempPath { path =>`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread cjuexuan
Github user cjuexuan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273572
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* indicates a timestamp format. Custom date formats follow the formats 
at
* `java.text.SimpleDateFormat`. This applies to timestamp type.
* 
+   * `encoding`(default `utf-8`) save dataFrame 2 csv by giving 
encoding
--- End diff --

@HyukjinKwon  ok,I refine it 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273548
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* indicates a timestamp format. Custom date formats follow the formats 
at
* `java.text.SimpleDateFormat`. This applies to timestamp type.
* 
+   * `encoding`(default `utf-8`) save dataFrame 2 csv by giving 
encoding
--- End diff --

Could we just resemble the documentation in `DataFrameReader` just for 
consistency?

```
 `encoding` (default `UTF-8`): decodes the CSV files by the given 
encoding
   * type.
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273531
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -33,6 +33,7 @@ import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
 import org.apache.spark.sql.types._
 
+//noinspection ScalaStyle
--- End diff --

We can disable only the lines with the block as below if you need this for 
non-ascii characters: 

```scala
// scalastyle:off
...
// scalastyle:on
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94273423
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -659,7 +659,7 @@ def text(self, path, compression=None):
 self._jwrite.text(path)
 
 @since(2.0)
-def csv(self, path, mode=None, compression=None, sep=None, quote=None, 
escape=None,
+def csv(self, path, mode=None, compression=None, sep=None, 
encoding=None, quote=None, escape=None,
--- End diff --

We need to place this new option at the end. Otherwise, it will breaks 
existing codes that use this options as a positional argument (not keyword 
argument).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-30 Thread cjuexuan
Github user cjuexuan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94239683
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* indicates a timestamp format. Custom date formats follow the formats 
at
* `java.text.SimpleDateFormat`. This applies to timestamp type.
* 
+   * `writeEncoding`(default `utf-8`) save dataFrame 2 csv by giving 
encoding
--- End diff --

ok,I will write my unit test and modify this pull request


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94239452
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* indicates a timestamp format. Custom date formats follow the formats 
at
* `java.text.SimpleDateFormat`. This applies to timestamp type.
* 
+   * `writeEncoding`(default `utf-8`) save dataFrame 2 csv by giving 
encoding
--- End diff --

We also should add the same documentation in `readwriter.py`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-30 Thread cjuexuan
Github user cjuexuan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94239100
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -71,7 +71,9 @@ private[csv] class CSVOptions(@transient private val 
parameters: CaseInsensitive
   val delimiter = CSVTypeCast.toChar(
 parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))
   private val parseMode = parameters.getOrElse("mode", "PERMISSIVE")
-  val charset = parameters.getOrElse("encoding",
+  val readCharSet = parameters.getOrElse("encoding",
+parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))
+  val writeCharSet = parameters.getOrElse("writeEncoding",
--- End diff --

@HyukjinKwon I think so


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16428#discussion_r94238157
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -71,7 +71,9 @@ private[csv] class CSVOptions(@transient private val 
parameters: CaseInsensitive
   val delimiter = CSVTypeCast.toChar(
 parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))
   private val parseMode = parameters.getOrElse("mode", "PERMISSIVE")
-  val charset = parameters.getOrElse("encoding",
+  val readCharSet = parameters.getOrElse("encoding",
+parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))
+  val writeCharSet = parameters.getOrElse("writeEncoding",
--- End diff --

I think we should not necessarily introduce additional option. We could 
just use `charset` variable because other options such as `nullValue` are 
already applied to both reading and writing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param

2016-12-28 Thread cjuexuan
GitHub user cjuexuan opened a pull request:

https://github.com/apache/spark/pull/16428

[SPARK-19018][SQL] ADD csv write charset param

## What changes were proposed in this pull request?

add csv write charset support 
[jira](https://issues.apache.org/jira/browse/SPARK-19018)

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cjuexuan/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16428.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16428


commit fafbb0682c44c02a86b7597b42cf4b407a030761
Author: todd.chen 
Date:   2016-12-29T02:12:25Z

[SPARK-19018][SQL] ADD csv write charset param




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org