Repository: spark
Updated Branches:
  refs/heads/master a6428292f -> a832cef11


[SPARK-13425][SQL] Documentation for CSV datasource options

## What changes were proposed in this pull request?

This PR adds explanations and documentation for the CSV data source options for reading and writing.

## How was this patch tested?

Style tests with `./dev/run-tests` for documentation style.

Author: hyukjinkwon <gurwls...@gmail.com>
Author: Hyukjin Kwon <gurwls...@gmail.com>

Closes #12817 from HyukjinKwon/SPARK-13425.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a832cef1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a832cef1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a832cef1

Branch: refs/heads/master
Commit: a832cef11233c6357c7ba7ede387b432e6b0ed71
Parents: a642829
Author: hyukjinkwon <gurwls...@gmail.com>
Authored: Sun May 1 19:05:20 2016 -0700
Committer: Reynold Xin <r...@databricks.com>
Committed: Sun May 1 19:05:20 2016 -0700

----------------------------------------------------------------------
 python/pyspark/sql/readwriter.py                | 52 ++++++++++++++++++++
 .../org/apache/spark/sql/DataFrameReader.scala  | 47 ++++++++++++++++--
 .../org/apache/spark/sql/DataFrameWriter.scala  |  8 +++
 3 files changed, 103 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/a832cef1/python/pyspark/sql/readwriter.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index ed9e716..cc5e93d 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -282,6 +282,45 @@ class DataFrameReader(object):
 
         :param paths: string, or list of strings, for input path(s).
 
+        You can set the following CSV-specific options to deal with CSV files:
+            * ``sep`` (default ``,``): sets the single character as a separator \
+                for each field and value.
+            * ``charset`` (default ``UTF-8``): decodes the CSV files by the given \
+                encoding type.
+            * ``quote`` (default ``"``): sets the single character used for escaping \
+                quoted values where the separator can be part of the value.
+            * ``escape`` (default ``\``): sets the single character used for escaping quotes \
+                inside an already quoted value.
+            * ``comment`` (default empty string): sets the single character used for skipping \
+                lines beginning with this character. By default, it is disabled.
+            * ``header`` (default ``false``): uses the first line as names of columns.
+            * ``ignoreLeadingWhiteSpace`` (default ``false``): defines whether or not leading \
+                whitespaces from values being read should be skipped.
+            * ``ignoreTrailingWhiteSpace`` (default ``false``): defines whether or not trailing \
+                whitespaces from values being read should be skipped.
+            * ``nullValue`` (default empty string): sets the string representation of a null value.
+            * ``nanValue`` (default ``NaN``): sets the string representation of a non-number \
+                value.
+            * ``positiveInf`` (default ``Inf``): sets the string representation of a positive \
+                infinity value.
+            * ``negativeInf`` (default ``-Inf``): sets the string representation of a negative \
+                infinity value.
+            * ``dateFormat`` (default ``None``): sets the string that indicates a date format. \
+                Custom date formats follow the formats at ``java.text.SimpleDateFormat``. This \
+                applies to both date type and timestamp type. By default, it is ``None``, which \
+                means trying to parse times and dates by ``java.sql.Timestamp.valueOf()`` and \
+                ``java.sql.Date.valueOf()``.
+            * ``maxColumns`` (default ``20480``): defines a hard limit of how many columns \
+                a record can have.
+            * ``maxCharsPerColumn`` (default ``1000000``): defines the maximum number of \
+                characters allowed for any given value being read.
+            * ``mode`` (default ``PERMISSIVE``): allows a mode for dealing with corrupt records \
+                during parsing.
+                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted record. \
+                    When a schema is set by a user, it sets ``null`` for extra fields.
+                * ``DROPMALFORMED`` : ignores whole corrupted records.
+                * ``FAILFAST`` : throws an exception when it meets corrupted records.
+
         >>> df = sqlContext.read.csv('python/test_support/sql/ages.csv')
         >>> df.dtypes
         [('C0', 'string'), ('C1', 'string')]
@@ -663,6 +702,19 @@ class DataFrameWriter(object):
                             known case-insensitive shortened names (none, bzip2, gzip, lz4,
                             snappy and deflate).
 
+        You can set the following CSV-specific options to deal with CSV files:
+            * ``sep`` (default ``,``): sets the single character as a separator \
+                for each field and value.
+            * ``quote`` (default ``"``): sets the single character used for escaping \
+                quoted values where the separator can be part of the value.
+            * ``escape`` (default ``\``): sets the single character used for escaping quotes \
+                inside an already quoted value.
+            * ``header`` (default ``false``): writes the names of columns as the first line.
+            * ``nullValue`` (default empty string): sets the string representation of a null value.
+            * ``compression``: compression codec to use when saving to file. This can be one of \
+                the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and \
+                deflate).
+
         >>> df.write.csv(os.path.join(tempfile.mkdtemp(), 'data'))
         """
         self.mode(mode)
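
To make the reader options documented above concrete, here is a minimal PySpark sketch; the application name, file path, and option values are illustrative assumptions, not part of this patch:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="csv-options-demo")   # hypothetical app name
    sqlContext = SQLContext(sc)

    # Read a semicolon-separated file with a header row; "NA" strings become nulls.
    df = (sqlContext.read
          .option("sep", ";")
          .option("header", "true")
          .option("nullValue", "NA")
          .csv("/tmp/people.csv"))                  # hypothetical input path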

http://git-wip-us.apache.org/repos/asf/spark/blob/a832cef1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 3d43f20..2d4a68f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -290,7 +290,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   * <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
   * (e.g. 00012)</li>
   * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
-   * during parsing.<li>
+   * during parsing.</li>
   * <ul>
   *  <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the
   *  malformed string into a new field configured by `columnNameOfCorruptRecord`. When
@@ -300,7 +300,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   * </ul>
   * <li>`columnNameOfCorruptRecord` (default `_corrupt_record`): allows renaming the new field
    * having malformed string created by `PERMISSIVE` mode. This overrides
-   * `spark.sql.columnNameOfCorruptRecord`.<li>
+   * `spark.sql.columnNameOfCorruptRecord`.</li>
    *
    * @since 1.4.0
    */
@@ -326,7 +326,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   * <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
   * characters using the backslash quoting mechanism</li>
   * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
-   * during parsing.<li>
+   * during parsing.</li>
   * <ul>
   *  <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the
   *  malformed string into a new field configured by `columnNameOfCorruptRecord`. When
@@ -336,7 +336,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   * </ul>
   * <li>`columnNameOfCorruptRecord` (default `_corrupt_record`): allows renaming the new field
    * having malformed string created by `PERMISSIVE` mode. This overrides
-   * `spark.sql.columnNameOfCorruptRecord`.<li>
+   * `spark.sql.columnNameOfCorruptRecord`.</li>
    *
    * @since 1.6.0
    */
@@ -393,6 +393,45 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   * This function goes through the input once to determine the input schema. To avoid going
   * through the entire data once, specify the schema explicitly using [[schema]].
    *
+   * You can set the following CSV-specific options to deal with CSV files:
+   * <li>`sep` (default `,`): sets the single character as a separator for each
+   * field and value.</li>
+   * <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding
+   * type.</li>
+   * <li>`quote` (default `"`): sets the single character used for escaping quoted values where
+   * the separator can be part of the value.</li>
+   * <li>`escape` (default `\`): sets the single character used for escaping quotes inside
+   * an already quoted value.</li>
+   * <li>`comment` (default empty string): sets the single character used for skipping lines
+   * beginning with this character. By default, it is disabled.</li>
+   * <li>`header` (default `false`): uses the first line as names of columns.</li>
+   * <li>`ignoreLeadingWhiteSpace` (default `false`): defines whether or not leading whitespaces
+   * from values being read should be skipped.</li>
+   * <li>`ignoreTrailingWhiteSpace` (default `false`): defines whether or not trailing
+   * whitespaces from values being read should be skipped.</li>
+   * <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
+   * <li>`nanValue` (default `NaN`): sets the string representation of a non-number value.</li>
+   * <li>`positiveInf` (default `Inf`): sets the string representation of a positive infinity
+   * value.</li>
+   * <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity
+   * value.</li>
+   * <li>`dateFormat` (default `null`): sets the string that indicates a date format. Custom date
+   * formats follow the formats at `java.text.SimpleDateFormat`. This applies to both date type
+   * and timestamp type. By default, it is `null`, which means trying to parse times and dates by
+   * `java.sql.Timestamp.valueOf()` and `java.sql.Date.valueOf()`.</li>
+   * <li>`maxColumns` (default `20480`): defines a hard limit of how many columns
+   * a record can have.</li>
+   * <li>`maxCharsPerColumn` (default `1000000`): defines the maximum number of characters allowed
+   * for any given value being read.</li>
+   * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
+   *    during parsing.</li>
+   * <ul>
+   *   <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record. When
+   *     a schema is set by a user, it sets `null` for extra fields.</li>
+   *   <li>`DROPMALFORMED` : ignores whole corrupted records.</li>
+   *   <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
+   * </ul>
+   *
    * @since 2.0.0
    */
   @scala.annotation.varargs
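
A sketch of the three `mode` behaviours documented above, again in PySpark and reusing the `sqlContext` from the earlier sketch; the input path is a hypothetical stand-in for a file that may contain malformed rows:

    # PERMISSIVE (default): fields of a corrupted record come back as null.
    permissive_df = (sqlContext.read
                     .option("mode", "PERMISSIVE")
                     .csv("/tmp/maybe_corrupt.csv"))   # hypothetical path

    # DROPMALFORMED: corrupted records are dropped entirely.
    clean_df = (sqlContext.read
                .option("mode", "DROPMALFORMED")
                .csv("/tmp/maybe_corrupt.csv"))

    # FAILFAST: the first corrupted record raises an exception at parse time.
    strict_df = (sqlContext.read
                 .option("mode", "FAILFAST")
                 .csv("/tmp/maybe_corrupt.csv"))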

http://git-wip-us.apache.org/repos/asf/spark/blob/a832cef1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index 28f5ccd..a57d47d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -606,6 +606,14 @@ final class DataFrameWriter private[sql](df: DataFrame) {
    * }}}
    *
    * You can set the following CSV-specific option(s) for writing CSV files:
+   * <li>`sep` (default `,`): sets the single character as a separator for each
+   * field and value.</li>
+   * <li>`quote` (default `"`): sets the single character used for escaping quoted values where
+   * the separator can be part of the value.</li>
+   * <li>`escape` (default `\`): sets the single character used for escaping quotes inside
+   * an already quoted value.</li>
+   * <li>`header` (default `false`): writes the names of columns as the first line.</li>
+   * <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
   * <li>`compression` (default `null`): compression codec to use when saving to file. This can be
   * one of the known case-insensitive shortened names (`none`, `bzip2`, `gzip`, `lz4`,
   * `snappy` and `deflate`). </li>
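
And a matching writer sketch using the write-side options documented above; the output directory and the choice of separator and codec are assumptions for illustration:

    # Write tab-separated output with a header line, gzip-compressed.
    (df.write
       .option("sep", "\t")
       .option("header", "true")
       .option("compression", "gzip")
       .csv("/tmp/people_out"))                      # hypothetical output dir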

