[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user cjuexuan closed the pull request at: https://github.com/apache/spark/pull/16428 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273866 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * indicates a timestamp format. Custom date formats follow the formats at * `java.text.SimpleDateFormat`. This applies to timestamp type. * + * `encoding`(default `utf-8`) save dataFrame 2 csv by giving encoding --- End diff -- Oh, also, it seems the newly added option here should be put in .. ``` ... ``` so that this can be rendered fine in Java API documentation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user cjuexuan commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273798 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -905,4 +906,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { checkAnswer(df, Row(1, null)) } } + + test("save data with gb18030") { +withTempPath{ path => + Seq(("1", "ä¸æ")) +.toDF("num", "lanaguage") +.write +.option("encoding", "GB18030") +.option("header", "true") +.csv(path.getAbsolutePath) + val df = spark.read +.option("header", "true") +.option("encoding", "GB18030") +.csv(path.getAbsolutePath) + + checkAnswer(df, Row("1", "ä¸æ")) --- End diff -- sounds good --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273737 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * indicates a timestamp format. Custom date formats follow the formats at * `java.text.SimpleDateFormat`. This applies to timestamp type. * + * `encoding`(default `utf-8`) save dataFrame 2 csv by giving encoding --- End diff -- looks good. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273685 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -905,4 +906,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { checkAnswer(df, Row(1, null)) } } + + test("save data with gb18030") { +withTempPath{ path => + Seq(("1", "ä¸æ")) +.toDF("num", "lanaguage") +.write +.option("encoding", "GB18030") +.option("header", "true") +.csv(path.getAbsolutePath) + val df = spark.read +.option("header", "true") +.option("encoding", "GB18030") +.csv(path.getAbsolutePath) + + checkAnswer(df, Row("1", "ä¸æ")) --- End diff -- Could we write this something like as below: ```scala // scalastyle:off val df = Seq(("1", "ä¸æ")).toDF("num", "lanaguage") // scalastyle:on df.write .option("header", "true") .option("encoding", "GB18030") .csv(path.getAbsolutePath) val readBack = spark.read .option("header", "true") .option("encoding", "GB18030") .csv(path.getAbsolutePath) checkAnswer(df, readBack) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user cjuexuan commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273679 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * indicates a timestamp format. Custom date formats follow the formats at * `java.text.SimpleDateFormat`. This applies to timestamp type. * + * `encoding`(default `utf-8`) save dataFrame 2 csv by giving encoding --- End diff -- @HyukjinKwon what about ``` `encoding` (default `UTF-8`): encodes the CSV files by the given encoding * type. ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273677 --- Diff: python/pyspark/sql/readwriter.py --- @@ -677,6 +677,8 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No snappy and deflate). :param sep: sets the single character as a separator for each field and value. If None is set, it uses the default value, ``,``. +:param encoding: sets writer CSV files by the given encoding type. If None is set, + it uses the default value, ``UTF-8``. --- End diff -- Here too, let's resemble the one in `DataFrameReader` above in this file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273678 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -905,4 +906,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { checkAnswer(df, Row(1, null)) } } + + test("save data with gb18030") { +withTempPath{ path => --- End diff -- nit: it should be `withTempPath { path =>`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user cjuexuan commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273572 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * indicates a timestamp format. Custom date formats follow the formats at * `java.text.SimpleDateFormat`. This applies to timestamp type. * + * `encoding`(default `utf-8`) save dataFrame 2 csv by giving encoding --- End diff -- @HyukjinKwon ok,I refine it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273548 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * indicates a timestamp format. Custom date formats follow the formats at * `java.text.SimpleDateFormat`. This applies to timestamp type. * + * `encoding`(default `utf-8`) save dataFrame 2 csv by giving encoding --- End diff -- Could we just resemble the documentation in `DataFrameReader` just for consistency? ``` `encoding` (default `UTF-8`): decodes the CSV files by the given encoding * type. ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273531 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -33,6 +33,7 @@ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils} import org.apache.spark.sql.types._ +//noinspection ScalaStyle --- End diff -- We can disable only the lines with the block as below if you need this for non-ascii characters: ```scala // scalastyle:off ... // scalastyle:on ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94273423 --- Diff: python/pyspark/sql/readwriter.py --- @@ -659,7 +659,7 @@ def text(self, path, compression=None): self._jwrite.text(path) @since(2.0) -def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=None, +def csv(self, path, mode=None, compression=None, sep=None, encoding=None, quote=None, escape=None, --- End diff -- We need to place this new option at the end. Otherwise, it will breaks existing codes that use this options as a positional argument (not keyword argument). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user cjuexuan commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94239683 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * indicates a timestamp format. Custom date formats follow the formats at * `java.text.SimpleDateFormat`. This applies to timestamp type. * + * `writeEncoding`(default `utf-8`) save dataFrame 2 csv by giving encoding --- End diff -- ok,I will write my unit test and modify this pull request --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94239452 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -573,6 +573,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * indicates a timestamp format. Custom date formats follow the formats at * `java.text.SimpleDateFormat`. This applies to timestamp type. * + * `writeEncoding`(default `utf-8`) save dataFrame 2 csv by giving encoding --- End diff -- We also should add the same documentation in `readwriter.py`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user cjuexuan commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94239100 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala --- @@ -71,7 +71,9 @@ private[csv] class CSVOptions(@transient private val parameters: CaseInsensitive val delimiter = CSVTypeCast.toChar( parameters.getOrElse("sep", parameters.getOrElse("delimiter", ","))) private val parseMode = parameters.getOrElse("mode", "PERMISSIVE") - val charset = parameters.getOrElse("encoding", + val readCharSet = parameters.getOrElse("encoding", +parameters.getOrElse("charset", StandardCharsets.UTF_8.name())) + val writeCharSet = parameters.getOrElse("writeEncoding", --- End diff -- @HyukjinKwon I think so --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16428#discussion_r94238157 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala --- @@ -71,7 +71,9 @@ private[csv] class CSVOptions(@transient private val parameters: CaseInsensitive val delimiter = CSVTypeCast.toChar( parameters.getOrElse("sep", parameters.getOrElse("delimiter", ","))) private val parseMode = parameters.getOrElse("mode", "PERMISSIVE") - val charset = parameters.getOrElse("encoding", + val readCharSet = parameters.getOrElse("encoding", +parameters.getOrElse("charset", StandardCharsets.UTF_8.name())) + val writeCharSet = parameters.getOrElse("writeEncoding", --- End diff -- I think we should not necessarily introduce additional option. We could just use `charset` variable because other options such as `nullValue` are already applied to both reading and writing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16428: [SPARK-19018][SQL] ADD csv write charset param
GitHub user cjuexuan opened a pull request: https://github.com/apache/spark/pull/16428 [SPARK-19018][SQL] ADD csv write charset param ## What changes were proposed in this pull request? add csv write charset support [jira](https://issues.apache.org/jira/browse/SPARK-19018) (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cjuexuan/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16428.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16428 commit fafbb0682c44c02a86b7597b42cf4b407a030761 Author: todd.chen Date: 2016-12-29T02:12:25Z [SPARK-19018][SQL] ADD csv write charset param --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org