Repository: spark Updated Branches: refs/heads/branch-2.0 a6428292f -> 705172202
[SPARK-13425][SQL] Documentation for CSV datasource options ## What changes were proposed in this pull request? This PR adds the explanation and documentation for CSV options for reading and writing. ## How was this patch tested? Style tests with `./dev/run_tests` for documentation style. Author: hyukjinkwon <gurwls...@gmail.com> Author: Hyukjin Kwon <gurwls...@gmail.com> Closes #12817 from HyukjinKwon/SPARK-13425. (cherry picked from commit a832cef11233c6357c7ba7ede387b432e6b0ed71) Signed-off-by: Reynold Xin <r...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/70517220 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/70517220 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/70517220 Branch: refs/heads/branch-2.0 Commit: 7051722023b98f1720142c7b3b41948d275ea455 Parents: a642829 Author: hyukjinkwon <gurwls...@gmail.com> Authored: Sun May 1 19:05:20 2016 -0700 Committer: Reynold Xin <r...@databricks.com> Committed: Sun May 1 19:05:32 2016 -0700 ---------------------------------------------------------------------- python/pyspark/sql/readwriter.py | 52 ++++++++++++++++++++ .../org/apache/spark/sql/DataFrameReader.scala | 47 ++++++++++++++++-- .../org/apache/spark/sql/DataFrameWriter.scala | 8 +++ 3 files changed, 103 insertions(+), 4 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/python/pyspark/sql/readwriter.py ---------------------------------------------------------------------- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index ed9e716..cc5e93d 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -282,6 +282,45 @@ class DataFrameReader(object): :param paths: string, or list of strings, for input path(s). 
+ You can set the following CSV-specific options to deal with CSV files: + * ``sep`` (default ``,``): sets the single character as a separator \ + for each field and value. + * ``charset`` (default ``UTF-8``): decodes the CSV files by the given \ + encoding type. + * ``quote`` (default ``"``): sets the single character used for escaping \ + quoted values where the separator can be part of the value. + * ``escape`` (default ``\``): sets the single character used for escaping quotes \ + inside an already quoted value. + * ``comment`` (default empty string): sets the single character used for skipping \ + lines beginning with this character. By default, it is disabled. + * ``header`` (default ``false``): uses the first line as names of columns. + * ``ignoreLeadingWhiteSpace`` (default ``false``): defines whether or not leading \ + whitespaces from values being read should be skipped. + * ``ignoreTrailingWhiteSpace`` (default ``false``): defines whether or not trailing \ + whitespaces from values being read should be skipped. + * ``nullValue`` (default empty string): sets the string representation of a null value. + * ``nanValue`` (default ``NaN``): sets the string representation of a non-number \ + value. + * ``positiveInf`` (default ``Inf``): sets the string representation of a positive \ + infinity value. + * ``negativeInf`` (default ``-Inf``): sets the string representation of a negative \ + infinity value. + * ``dateFormat`` (default ``None``): sets the string that indicates a date format. \ + Custom date formats follow the formats at ``java.text.SimpleDateFormat``. This \ + applies to both date type and timestamp type. By default, it is None which means \ + trying to parse times and date by ``java.sql.Timestamp.valueOf()`` and \ + ``java.sql.Date.valueOf()``. + * ``maxColumns`` (default ``20480``): defines a hard limit of how many columns \ + a record can have. 
+ * ``maxCharsPerColumn`` (default ``1000000``): defines the maximum number of \ + characters allowed for any given value being read. + * ``mode`` (default ``PERMISSIVE``): allows a mode for dealing with corrupt records \ + during parsing. + * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted record. \ + When a schema is set by user, it sets ``null`` for extra fields. + * ``DROPMALFORMED`` : ignores the whole corrupted records. + * ``FAILFAST`` : throws an exception when it meets corrupted records. + >>> df = sqlContext.read.csv('python/test_support/sql/ages.csv') >>> df.dtypes [('C0', 'string'), ('C1', 'string')] @@ -663,6 +702,19 @@ class DataFrameWriter(object): known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate). + You can set the following CSV-specific options to deal with CSV files: + * ``sep`` (default ``,``): sets the single character as a separator \ + for each field and value. + * ``quote`` (default ``"``): sets the single character used for escaping \ + quoted values where the separator can be part of the value. + * ``escape`` (default ``\``): sets the single character used for escaping quotes \ + inside an already quoted value. + * ``header`` (default ``false``): writes the names of columns as the first line. + * ``nullValue`` (default empty string): sets the string representation of a null value. + * ``compression``: compression codec to use when saving to file. This can be one of \ + the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and \ + deflate). 
+ >>> df.write.csv(os.path.join(tempfile.mkdtemp(), 'data')) """ self.mode(mode) http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---------------------------------------------------------------------- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 3d43f20..2d4a68f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -290,7 +290,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers * (e.g. 00012)</li> * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records - * during parsing.<li> + * during parsing.</li> * <ul> * <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the * malformed string into a new field configured by `columnNameOfCorruptRecord`. When @@ -300,7 +300,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * </ul> * <li>`columnNameOfCorruptRecord` (default `_corrupt_record`): allows renaming the new field * having malformed string created by `PERMISSIVE` mode. 
This overrides - * `spark.sql.columnNameOfCorruptRecord`.<li> + * `spark.sql.columnNameOfCorruptRecord`.</li> * * @since 1.4.0 */ @@ -326,7 +326,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all * character using backslash quoting mechanism</li> * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records - * during parsing.<li> + * during parsing.</li> * <ul> * <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the * malformed string into a new field configured by `columnNameOfCorruptRecord`. When @@ -336,7 +336,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * </ul> * <li>`columnNameOfCorruptRecord` (default `_corrupt_record`): allows renaming the new field * having malformed string created by `PERMISSIVE` mode. This overrides - * `spark.sql.columnNameOfCorruptRecord`.<li> + * `spark.sql.columnNameOfCorruptRecord`.</li> * * @since 1.6.0 */ @@ -393,6 +393,45 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * This function goes through the input once to determine the input schema. To avoid going * through the entire data once, specify the schema explicitly using [[schema]]. 
 * + * You can set the following CSV-specific options to deal with CSV files: + * <li>`sep` (default `,`): sets the single character as a separator for each + * field and value.</li> + * <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding + * type.</li> + * <li>`quote` (default `"`): sets the single character used for escaping quoted values where + * the separator can be part of the value.</li> + * <li>`escape` (default `\`): sets the single character used for escaping quotes inside + * an already quoted value.</li> + * <li>`comment` (default empty string): sets the single character used for skipping lines + * beginning with this character. By default, it is disabled.</li> + * <li>`header` (default `false`): uses the first line as names of columns.</li> + * <li>`ignoreLeadingWhiteSpace` (default `false`): defines whether or not leading whitespaces + * from values being read should be skipped.</li> + * <li>`ignoreTrailingWhiteSpace` (default `false`): defines whether or not trailing + * whitespaces from values being read should be skipped.</li> + * <li>`nullValue` (default empty string): sets the string representation of a null value.</li> + * <li>`nanValue` (default `NaN`): sets the string representation of a non-number value.</li> + * <li>`positiveInf` (default `Inf`): sets the string representation of a positive infinity + * value.</li> + * <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity + * value.</li> + * <li>`dateFormat` (default `null`): sets the string that indicates a date format. Custom date + * formats follow the formats at `java.text.SimpleDateFormat`. This applies to both date type + * and timestamp type. 
By default, it is `null` which means trying to parse times and date by + * `java.sql.Timestamp.valueOf()` and `java.sql.Date.valueOf()`.</li> + * <li>`maxColumns` (default `20480`): defines a hard limit of how many columns + * a record can have.</li> + * <li>`maxCharsPerColumn` (default `1000000`): defines the maximum number of characters allowed + * for any given value being read.</li> + * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records + * during parsing.</li> + * <ul> + * <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record. When + * a schema is set by user, it sets `null` for extra fields.</li> + * <li>`DROPMALFORMED` : ignores the whole corrupted records.</li> + * <li>`FAILFAST` : throws an exception when it meets corrupted records.</li> + * </ul> + * * @since 2.0.0 */ @scala.annotation.varargs http://git-wip-us.apache.org/repos/asf/spark/blob/70517220/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---------------------------------------------------------------------- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala index 28f5ccd..a57d47d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala @@ -606,6 +606,14 @@ final class DataFrameWriter private[sql](df: DataFrame) { * }}} * * You can set the following CSV-specific option(s) for writing CSV files: + * <li>`sep` (default `,`): sets the single character as a separator for each + * field and value.</li> + * <li>`quote` (default `"`): sets the single character used for escaping quoted values where + * the separator can be part of the value.</li> + * <li>`escape` (default `\`): sets the single character used for escaping quotes inside + * an already quoted value.</li> + * <li>`header` (default `false`): writes the names of columns as the 
first line.</li> + * <li>`nullValue` (default empty string): sets the string representation of a null value.</li> * <li>`compression` (default `null`): compression codec to use when saving to file. This can be * one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`, * `snappy` and `deflate`). </li> --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
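Editor's note: the `nullValue` option and the `PERMISSIVE`/`DROPMALFORMED`/`FAILFAST` parse modes documented in the patch above can be illustrated outside Spark. The sketch below mimics those documented semantics with Python's stdlib `csv` module; it is NOT Spark's actual CSV parser, and `parse_csv` is a hypothetical helper, not a Spark API.

```python
import csv
import io

def parse_csv(text, sep=",", quote='"', null_value="", mode="PERMISSIVE"):
    """Parse CSV text, applying nullValue substitution and a parse mode.

    Hypothetical illustration of the documented option semantics only.
    """
    rows = []
    width = None
    for record in csv.reader(io.StringIO(text), delimiter=sep, quotechar=quote):
        if width is None:
            # Infer the expected column count from the first record.
            width = len(record)
        # nullValue: fields equal to this string are read back as null (None).
        record = [None if field == null_value else field for field in record]
        if len(record) != width:  # corrupted record
            if mode == "FAILFAST":
                # FAILFAST: throw an exception on a corrupted record.
                raise ValueError("malformed record: %r" % (record,))
            if mode == "DROPMALFORMED":
                # DROPMALFORMED: ignore the whole corrupted record.
                continue
            # PERMISSIVE: keep the record, setting the other fields to null.
            record = (record + [None] * width)[:width]
        rows.append(record)
    return rows

# The short third row is kept under PERMISSIVE, padded with None:
print(parse_csv("a,b,c\n1,,3\nx,y\n"))
```

Under `DROPMALFORMED` the short row would be skipped entirely, and under `FAILFAST` it would raise, matching the three modes listed in the reader docstrings above.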