This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 73d4f67  [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page
73d4f67 is described below

commit 73d4f67145dd7fbad282a9608ac2ac0f31c4b385
Author: itholic <haejoon....@databricks.com>
AuthorDate: Tue Jun 1 10:58:49 2021 +0900

[SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

### What changes were proposed in this pull request?

This PR proposes moving the CSV data source options from the Python, Scala and Java API documentation into a single page.

### Why are the changes needed?

So far, the documentation for the CSV data source options has been split across separate pages in each language's API documentation. This makes the many options inconvenient to maintain, so it is more efficient to manage all options in a single page and link to that page from the API documentation of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will appear as shown below after this change:

- "CSV Files" page
<img width="970" alt="Screen Shot 2021-05-27 at 12 35 36 PM" src="https://user-images.githubusercontent.com/44108233/119762269-586a8c80-bee8-11eb-8443-ae5b3c7a685c.png">

- Python
<img width="785" alt="Screen Shot 2021-05-25 at 4 12 10 PM" src="https://user-images.githubusercontent.com/44108233/119455390-83cc6a80-bd74-11eb-9156-65785ae27db0.png">

- Scala
<img width="718" alt="Screen Shot 2021-05-25 at 4 12 39 PM" src="https://user-images.githubusercontent.com/44108233/119455414-89c24b80-bd74-11eb-9775-aeda549d081e.png">

- Java
<img width="667" alt="Screen Shot 2021-05-25 at 4 13 09 PM" src="https://user-images.githubusercontent.com/44108233/119455422-8d55d280-bd74-11eb-97e8-86c1eabeadc2.png">

### How was this patch tested?

Manually built the docs and confirmed the pages.

Closes #32658 from itholic/SPARK-35433.
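
For readers of this notification, the following is a minimal sketch of how options from the consolidated "Data Source Option" table might be passed to the reader through `.options`. It is not part of the commit. The helper `read_csv`, the dictionary `CSV_READ_OPTIONS`, and the chosen option values are hypothetical and exist only for illustration; the option names (`sep`, `header`, `inferSchema`, `mode`, `nullValue`) are taken from the table added in this change.

```python
# Illustrative only: a few read-scope CSV options from the consolidated table,
# set to non-default values. The names below are assumptions for this sketch.
CSV_READ_OPTIONS = {
    "sep": ";",               # field separator (default is ",")
    "header": "true",         # use the first line as column names
    "inferSchema": "true",    # costs one extra pass over the data
    "mode": "DROPMALFORMED",  # ignore corrupted records instead of keeping them
    "nullValue": "NA",        # string that should be read as null
}


def read_csv(spark, path, options=None):
    """Hypothetical helper: read `path` as CSV with the options above.

    `spark` is expected to be an active SparkSession; entries in `options`
    override the defaults defined in CSV_READ_OPTIONS.
    """
    merged = dict(CSV_READ_OPTIONS)
    if options:
        merged.update(options)
    # DataFrameReader.options accepts the options as keyword arguments.
    return spark.read.options(**merged).csv(path)
```

With an active session, a call such as `read_csv(spark, "people.csv", {"sep": "|"})` would return a DataFrame; the file name here is a placeholder. Keeping the options in one dictionary mirrors the intent of the change: a single place to look up and adjust them.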
Authored-by: itholic <haejoon....@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls...@apache.org> --- docs/sql-data-sources-csv.md | 216 +++++++++++++++- docs/sql-data-sources-text.md | 2 +- python/pyspark/sql/readwriter.py | 277 +++------------------ python/pyspark/sql/streaming.py | 170 ++----------- .../org/apache/spark/sql/DataFrameReader.scala | 116 +-------- .../org/apache/spark/sql/DataFrameWriter.scala | 45 +--- .../scala/org/apache/spark/sql/functions.scala | 28 ++- .../spark/sql/streaming/DataStreamReader.scala | 99 +------- 8 files changed, 314 insertions(+), 639 deletions(-) diff --git a/docs/sql-data-sources-csv.md b/docs/sql-data-sources-csv.md index d5390e5..2fe8f77 100644 --- a/docs/sql-data-sources-csv.md +++ b/docs/sql-data-sources-csv.md @@ -21,8 +21,6 @@ license: | Spark SQL provides `spark.read().csv("file_name")` to read a file or directory of files in CSV format into Spark DataFrame, and `dataframe.write().csv("path")` to write to a CSV file. Function `option()` can be used to customize the behavior of reading or writing, such as controlling behavior of the header, delimiter character, character set, and so on. -<!--TODO: add `option()` document reference--> - <div class="codetabs"> <div data-lang="scala" markdown="1"> @@ -38,3 +36,217 @@ Spark SQL provides `spark.read().csv("file_name")` to read a file or directory o </div> </div> + +## Data Source Option + +Data source options of CSV can be set via: +* the `.option`/`.options` methods of + * `DataFrameReader` + * `DataFrameWriter` + * `DataStreamReader` + * `DataStreamWriter` +* the built-in functions below + * `from_csv` + * `to_csv` + * `schema_of_csv` +* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) + + +<table class="table"> + <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr> + <tr> + <td><code>sep</code></td> + <td>,</td> + <td>Sets a separator for each field and value. 
This separator can be one or more characters.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>encoding</code></td> + <td>UTF-8</td> + <td>For reading, decodes the CSV files by the given encoding type. For writing, specifies encoding (charset) of saved CSV files</td> + <td>read/write</td> + </tr> + <tr> + <td><code>quote</code></td> + <td>"</td> + <td>Sets a single character used for escaping quoted values where the separator can be part of the value. For reading, if you would like to turn off quotations, you need to set not <code>null</code> but an empty string. For writing, if an empty string is set, it uses <code>u0000</code> (null character).</td> + <td>read/write</td> + </tr> + <tr> + <td><code>quoteAll</code></td> + <td>false</td> + <td>A flag indicating whether all values should always be enclosed in quotes. Default is to only escape values containing a quote character.</td> + <td>write</td> + </tr> + <tr> + <td><code>escape</code></td> + <td>\</td> + <td>Sets a single character used for escaping quotes inside an already quoted value.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>escapeQuotes</code></td> + <td>true</td> + <td>A flag indicating whether values containing quotes should always be enclosed in quotes. Default is to escape all values containing a quote character.</td> + <td>write</td> + </tr> + <tr> + <td><code>comment</code></td> + <td></td> + <td>Sets a single character used for skipping lines beginning with this character. By default, it is disabled.</td> + <td>read</td> + </tr> + <tr> + <td><code>header</code></td> + <td>false</td> + <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>inferSchema</code></td> + <td>false</td> + <td>Infers the input schema automatically from data. 
It requires one extra pass over the data.</td> + <td>read</td> + </tr> + <tr> + <td><code>enforceSchema</code></td> + <td>true</td> + <td>If it is set to <code>true</code>, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to <code>false</code>, the schema will be validated against all headers in CSV files in the case when the <code>header</code> option is set to <code>true</code>. Field names in the schema and column names in CSV headers are checked by their positions taking into account <code>spark.sql.caseSensitive</code> [...] + <td>read</td> + </tr> + <tr> + <td><code>ignoreLeadingWhiteSpace</code></td> + <td><code>false</code> (for reading), <code>true</code> (for writing)</td> + <td>A flag indicating whether or not leading whitespaces from values being read/written should be skipped.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>ignoreTrailingWhiteSpace</code></td> + <td><code>false</code> (for reading), <code>true</code> (for writing)</td> + <td>A flag indicating whether or not trailing whitespaces from values being read/written should be skipped.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>nullValue</code></td> + <td></td> + <td>Sets the string representation of a null value. 
Since 2.0.1, this <code>nullValue</code> param applies to all supported types including the string type.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>nanValue</code></td> + <td>NaN</td> + <td>Sets the string representation of a non-number value.</td> + <td>read</td> + </tr> + <tr> + <td><code>positiveInf</code></td> + <td>Inf</td> + <td>Sets the string representation of a positive infinity value.</td> + <td>read</td> + </tr> + <tr> + <td><code>negativeInf</code></td> + <td>-Inf</td> + <td>Sets the string representation of a negative infinity value.</td> + <td>read</td> + </tr> + <tr> + <td><code>dateFormat</code></td> + <td>yyyy-MM-dd</td> + <td>Sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime Patterns</a>. This applies to date type.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>timestampFormat</code></td> + <td>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</td> + <td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime Patterns</a>. This applies to timestamp type.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>maxColumns</code></td> + <td>20480</td> + <td>Defines a hard limit of how many columns a record can have.</td> + <td>read</td> + </tr> + <tr> + <td><code>maxCharsPerColumn</code></td> + <td>-1</td> + <td>Defines the maximum number of characters allowed for any given value being read. By default, it is -1 meaning unlimited length</td> + <td>read</td> + </tr> + <tr> + <td><code>mode</code></td> + <td>PERMISSIVE</td> + <td>Allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes. Note that Spark tries to parse only required columns in CSV under column pruning. Therefore, corrupt records can be different based on required set of fields. 
This behavior can be controlled by <code>spark.sql.csv.parser.columnPruning.enabled</code> (enabled by default).<br> + <ul> + <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, an user can set a string type field named <code>columnNameOfCorruptRecord</code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with less/more tokens than schema is not a corrupted record to [...] + <li><code>DROPMALFORMED</code>: ignores the whole corrupted records.</li> + <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li> + </ul> + </td> + <td>read</td> + </tr> + <tr> + <td><code>columnNameOfCorruptRecord</code></td> + <td>The value specified in <code>spark.sql.columnNameOfCorruptRecord</code></td> + <td>Allows renaming the new field having malformed string created by <code>PERMISSIVE</code> mode. This overrides <code>spark.sql.columnNameOfCorruptRecord</code>.</td> + <td>read</td> + </tr> + <tr> + <td><code>multiLine</code></td> + <td>false</td> + <td>Parse one record, which may span multiple lines, per file.</td> + <td>read</td> + </tr> + <tr> + <td><code>charToEscapeQuoteEscaping</code></td> + <td><code>escape</code> or <code>\0</code></td> + <td>Sets a single character used for escaping the escape for the quote character. 
The default value is escape character when escape and quote characters are different, <code>\0</code> otherwise.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>samplingRatio</code></td> + <td>1.0</td> + <td>Defines fraction of rows used for schema inferring.</td> + <td>read</td> + </tr> + <tr> + <td><code>emptyValue</code></td> + <td><code></code> (for reading), <code>""</code> (for writing)</td> + <td>Sets the string representation of an empty value.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>locale</code></td> + <td>en-US</td> + <td>Sets a locale as language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps.</td> + <td>read</td> + </tr> + <tr> + <td><code>lineSep</code></td> + <td><code>\r</code>, <code>\r\n</code> and <code>\n</code> (for reading), <code>\n</code> (for writing)</td> + <td>Defines the line separator that should be used for parsing/writing. Maximum length is 1 character.</td> + <td>read/write</td> + </tr> + <tr> + <td><code>unescapedQuoteHandling</code></td> + <td>STOP_AT_DELIMITER</td> + <td>Defines how the CsvParser will handle values with unescaped quotes.<br> + <ul> + <li><code>STOP_AT_CLOSING_QUOTE</code>: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.</li> + <li><code>BACK_TO_DELIMITER</code>: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.</li> + <li><code>STOP_AT_DELIMITER</code>: If unescaped quotes are found in the input, consider the value as an unquoted value. 
This will make the parser accumulate all characters until the delimiter or a line ending is found in the input.</li> + <li><code>SKIP_VALUE</code>: If unescaped quotes are found in the input, the content parsed for the given value will be skipped and the value set in nullValue will be produced instead.</li> + <li><code>RAISE_ERROR</code>: If unescaped quotes are found in the input, a TextParsingException will be thrown.</li> + </ul> + </td> + <td>read</td> + </tr> + <tr> + <td><code>compression</code></td> + <td>(none)</td> + <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (<code>none</code>, <code>bzip2</code>, <code>gzip</code>, <code>lz4</code>, <code>snappy</code> and <code>deflate</code>).</td> + <td>write</td> + </tr> +</table> +Other generic options can be found in <a href="https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html">Generic File Source Options</a>. diff --git a/docs/sql-data-sources-text.md b/docs/sql-data-sources-text.md index d72b543..fac874a 100644 --- a/docs/sql-data-sources-text.md +++ b/docs/sql-data-sources-text.md @@ -45,7 +45,7 @@ Data source options of text can be set via: * `DataFrameWriter` * `DataStreamReader` * `DataStreamWriter` - * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) +* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) <table class="table"> <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr> diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 7719d48..f9e3734 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -195,9 +195,11 @@ class DataFrameReader(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option 
<https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_ in the version you use. + .. # noqa + Examples -------- >>> df1 = spark.read.json('python/test_support/sql/people.json') @@ -273,9 +275,11 @@ class DataFrameReader(OptionUtils): ---------------- **options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option>`_ in the version you use. + .. # noqa + Examples -------- >>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned') @@ -318,9 +322,11 @@ class DataFrameReader(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ in the version you use. + .. # noqa + Examples -------- >>> df = spark.read.text('python/test_support/sql/text-test.txt') @@ -364,172 +370,15 @@ class DataFrameReader(OptionUtils): schema : :class:`pyspark.sql.types.StructType` or str, optional an optional :class:`pyspark.sql.types.StructType` for the input schema or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``). - sep : str, optional - sets a separator (one or more characters) for each field and value. If None is - set, it uses the default value, ``,``. - encoding : str, optional - decodes the CSV files by the given encoding type. If None is set, - it uses the default value, ``UTF-8``. - quote : str, optional - sets a single character used for escaping quoted values where the - separator can be part of the value. If None is set, it uses the default - value, ``"``. 
If you would like to turn off quotations, you need to set an - empty string. - escape : str, optional - sets a single character used for escaping quotes inside an already - quoted value. If None is set, it uses the default value, ``\``. - comment : str, optional - sets a single character used for skipping lines beginning with this - character. By default (None), it is disabled. - header : str or bool, optional - uses the first line as names of columns. If None is set, it uses the - default value, ``false``. - - .. note:: if the given path is a RDD of Strings, this header - option will remove all lines same with the header if exists. - - inferSchema : str or bool, optional - infers the input schema automatically from data. It requires one extra - pass over the data. If None is set, it uses the default value, ``false``. - enforceSchema : str or bool, optional - If it is set to ``true``, the specified or inferred schema will be - forcibly applied to datasource files, and headers in CSV files will be - ignored. If the option is set to ``false``, the schema will be - validated against all headers in CSV files or the first header in RDD - if the ``header`` option is set to ``true``. Field names in the schema - and column names in CSV headers are checked by their positions - taking into account ``spark.sql.caseSensitive``. If None is set, - ``true`` is used by default. Though the default value is ``true``, - it is recommended to disable the ``enforceSchema`` option - to avoid incorrect results. - ignoreLeadingWhiteSpace : str or bool, optional - A flag indicating whether or not leading whitespaces from - values being read should be skipped. If None is set, it - uses the default value, ``false``. - ignoreTrailingWhiteSpace : str or bool, optional - A flag indicating whether or not trailing whitespaces from - values being read should be skipped. If None is set, it - uses the default value, ``false``. 
- nullValue : str, optional - sets the string representation of a null value. If None is set, it uses - the default value, empty string. Since 2.0.1, this ``nullValue`` param - applies to all supported types including the string type. - nanValue : str, optional - sets the string representation of a non-number value. If None is set, it - uses the default value, ``NaN``. - positiveInf : str, optional - sets the string representation of a positive infinity value. If None - is set, it uses the default value, ``Inf``. - negativeInf : str, optional - sets the string representation of a negative infinity value. If None - is set, it uses the default value, ``Inf``. - dateFormat : str, optional - sets the string that indicates a date format. Custom date formats - follow the formats at - `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa - This applies to date type. If None is set, it uses the - default value, ``yyyy-MM-dd``. - timestampFormat : str, optional - sets the string that indicates a timestamp format. - Custom date formats follow the formats at - `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa - This applies to timestamp type. If None is set, it uses the - default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``. - maxColumns : str or int, optional - defines a hard limit of how many columns a record can have. If None is - set, it uses the default value, ``20480``. - maxCharsPerColumn : str or int, optional - defines the maximum number of characters allowed for any given - value being read. If None is set, it uses the default value, - ``-1`` meaning unlimited length. - maxMalformedLogPerPartition : str or int, optional - this parameter is no longer used since Spark 2.2.0. - If specified, it is ignored. - mode : str, optional - allows a mode for dealing with corrupt records during parsing. If None is - set, it uses the default value, ``PERMISSIVE``. 
Note that Spark tries to - parse only required columns in CSV under column pruning. Therefore, corrupt - records can be different based on required set of fields. This behavior can - be controlled by ``spark.sql.csv.parser.columnPruning.enabled`` - (enabled by default). - - * ``PERMISSIVE``: when it meets a corrupted record, puts the malformed string \ - into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \ - fields to ``null``. To keep corrupt records, an user can set a string type \ - field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \ - schema does not have the field, it drops corrupt records during parsing. \ - A record with less/more tokens than schema is not a corrupted record to CSV. \ - When it meets a record having fewer tokens than the length of the schema, \ - sets ``null`` to extra fields. When the record has more tokens than the \ - length of the schema, it drops extra tokens. - * ``DROPMALFORMED``: ignores the whole corrupted records. - * ``FAILFAST``: throws an exception when it meets corrupted records. - - columnNameOfCorruptRecord : str, optional - allows renaming the new field having malformed string - created by ``PERMISSIVE`` mode. This overrides - ``spark.sql.columnNameOfCorruptRecord``. If None is set, - it uses the value specified in - ``spark.sql.columnNameOfCorruptRecord``. - multiLine : str or bool, optional - parse records, which may span multiple lines. If None is - set, it uses the default value, ``false``. - charToEscapeQuoteEscaping : str, optional - sets a single character used for escaping the escape for - the quote character. If None is set, the default value is - escape character when escape and quote characters are - different, ``\0`` otherwise. - samplingRatio : str or float, optional - defines fraction of rows used for schema inferring. - If None is set, it uses the default value, ``1.0``. - emptyValue : str, optional - sets the string representation of an empty value. 
If None is set, it uses - the default value, empty string. - locale : str, optional - sets a locale as language tag in IETF BCP 47 format. If None is set, - it uses the default value, ``en-US``. For instance, ``locale`` is used while - parsing dates and timestamps. - lineSep : str, optional - defines the line separator that should be used for parsing. If None is - set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``. - Maximum length is 1 character. - pathGlobFilter : str or bool, optional - an optional glob pattern to only include files with paths matching - the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`. - It does not change the behavior of - `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa - recursiveFileLookup : str or bool, optional - recursively scan a directory for files. Using this option disables - `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa - - modification times occurring before the specified time. The provided timestamp - must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00) - modifiedBefore (batch only) : an optional timestamp to only include files with - modification times occurring before the specified time. The provided timestamp - must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00) - modifiedAfter (batch only) : an optional timestamp to only include files with - modification times occurring after the specified time. The provided timestamp - must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00) - unescapedQuoteHandling : str, optional - defines how the CsvParser will handle values with unescaped quotes. If None is - set, it uses the default value, ``STOP_AT_DELIMITER``. 
- - * ``STOP_AT_CLOSING_QUOTE``: If unescaped quotes are found in the input, accumulate - the quote character and proceed parsing the value as a quoted value, until a closing - quote is found. - * ``BACK_TO_DELIMITER``: If unescaped quotes are found in the input, consider the value - as an unquoted value. This will make the parser accumulate all characters of the current - parsed value until the delimiter is found. If no delimiter is found in the value, the - parser will continue accumulating characters from the input until a delimiter or line - ending is found. - * ``STOP_AT_DELIMITER``: If unescaped quotes are found in the input, consider the value - as an unquoted value. This will make the parser accumulate all characters until the - delimiter or a line ending is found in the input. - * ``SKIP_VALUE``: If unescaped quotes are found in the input, the content parsed - for the given value will be skipped and the value set in nullValue will be produced - instead. - * ``RAISE_ERROR``: If unescaped quotes are found in the input, a TextParsingException - will be thrown. + + Other Parameters + ---------------- + Extra options + For the extra options, refer to + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option>`_ + in the version you use. + + .. # noqa Examples -------- @@ -595,9 +444,11 @@ class DataFrameReader(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_ in the version you use. + .. 
# noqa + Examples -------- >>> df = spark.read.orc('python/test_support/sql/orc_partitioned') @@ -963,9 +814,11 @@ class DataFrameWriter(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_ in the version you use. + .. # noqa + Examples -------- >>> df.write.json(os.path.join(tempfile.mkdtemp(), 'data')) @@ -1000,9 +853,11 @@ class DataFrameWriter(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option>`_ in the version you use. + .. # noqa + Examples -------- >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data')) @@ -1028,9 +883,11 @@ class DataFrameWriter(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ in the version you use. + .. # noqa + The DataFrame must have only one column that is of string type. Each row becomes a new line in the output file. """ @@ -1058,68 +915,14 @@ class DataFrameWriter(OptionUtils): * ``error`` or ``errorifexists`` (default case): Throw an exception if data already \ exists. - compression : str, optional - compression codec to use when saving to file. This can be one of the - known case-insensitive shorten names (none, bzip2, gzip, lz4, - snappy and deflate). - sep : str, optional - sets a separator (one or more characters) for each field and value. 
If None is - set, it uses the default value, ``,``. - quote : str, optional - sets a single character used for escaping quoted values where the - separator can be part of the value. If None is set, it uses the default - value, ``"``. If an empty string is set, it uses ``u0000`` (null character). - escape : str, optional - sets a single character used for escaping quotes inside an already - quoted value. If None is set, it uses the default value, ``\`` - escapeQuotes : str or bool, optional - a flag indicating whether values containing quotes should always - be enclosed in quotes. If None is set, it uses the default value - ``true``, escaping all values containing a quote character. - quoteAll : str or bool, optional - a flag indicating whether all values should always be enclosed in - quotes. If None is set, it uses the default value ``false``, - only escaping values containing a quote character. - header : str or bool, optional - writes the names of columns as the first line. If None is set, it uses - the default value, ``false``. - nullValue : str, optional - sets the string representation of a null value. If None is set, it uses - the default value, empty string. - dateFormat : str, optional - sets the string that indicates a date format. Custom date formats follow - the formats at - `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa - This applies to date type. If None is set, it uses the - default value, ``yyyy-MM-dd``. - timestampFormat : str, optional - sets the string that indicates a timestamp format. - Custom date formats follow the formats at - `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa - This applies to timestamp type. If None is set, it uses the - default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``. - ignoreLeadingWhiteSpace : str or bool, optional - a flag indicating whether or not leading whitespaces from - values being written should be skipped. 
If None is set, it - uses the default value, ``true``. - ignoreTrailingWhiteSpace : str or bool, optional - a flag indicating whether or not trailing whitespaces from - values being written should be skipped. If None is set, it - uses the default value, ``true``. - charToEscapeQuoteEscaping : str, optional - sets a single character used for escaping the escape for - the quote character. If None is set, the default value is - escape character when escape and quote characters are - different, ``\0`` otherwise.. - encoding : str, optional - sets the encoding (charset) of saved csv files. If None is set, - the default UTF-8 charset will be used. - emptyValue : str, optional - sets the string representation of an empty value. If None is set, it uses - the default value, ``""``. - lineSep : str, optional - defines the line separator that should be used for writing. If None is - set, it uses the default value, ``\\n``. Maximum length is 1 character. + Other Parameters + ---------------- + Extra options + For the extra options, refer to + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option>`_ + in the version you use. + + .. # noqa Examples -------- @@ -1159,9 +962,11 @@ class DataFrameWriter(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_ in the version you use. + .. 
# noqa + Examples -------- >>> orc_df = spark.read.orc('python/test_support/sql/orc_partitioned') diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py index f7ec69a..08c8934 100644 --- a/python/pyspark/sql/streaming.py +++ b/python/pyspark/sql/streaming.py @@ -484,9 +484,11 @@ class DataStreamReader(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_ in the version you use. + .. # noqa + Notes ----- This API is evolving. @@ -524,9 +526,11 @@ class DataStreamReader(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_ in the version you use. + .. # noqa + Examples -------- >>> orc_sdf = spark.readStream.schema(sdf_schema).orc(tempfile.mkdtemp()) @@ -558,9 +562,11 @@ class DataStreamReader(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option>`_. # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option>`_. in the version you use. + .. 
# noqa + Examples -------- >>> parquet_sdf = spark.readStream.schema(sdf_schema).parquet(tempfile.mkdtemp()) @@ -598,9 +604,11 @@ class DataStreamReader(OptionUtils): ---------------- Extra options For the extra options, refer to - `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ # noqa + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ in the version you use. + .. # noqa + Notes ----- This API is evolving. @@ -642,154 +650,18 @@ class DataStreamReader(OptionUtils): schema : :class:`pyspark.sql.types.StructType` or str, optional an optional :class:`pyspark.sql.types.StructType` for the input schema or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``). - sep : str, optional - sets a separator (one or more characters) for each field and value. If None is - set, it uses the default value, ``,``. - encoding : str, optional - decodes the CSV files by the given encoding type. If None is set, - it uses the default value, ``UTF-8``. - quote : str, optional sets a single character used for escaping quoted values where the - separator can be part of the value. If None is set, it uses the default - value, ``"``. If you would like to turn off quotations, you need to set an - empty string. - escape : str, optional - sets a single character used for escaping quotes inside an already - quoted value. If None is set, it uses the default value, ``\``. - comment : str, optional - sets a single character used for skipping lines beginning with this - character. By default (None), it is disabled. - header : str or bool, optional - uses the first line as names of columns. If None is set, it uses the - default value, ``false``. - inferSchema : str or bool, optional - infers the input schema automatically from data. It requires one extra - pass over the data. If None is set, it uses the default value, ``false``. 
- enforceSchema : str or bool, optional - If it is set to ``true``, the specified or inferred schema will be - forcibly applied to datasource files, and headers in CSV files will be - ignored. If the option is set to ``false``, the schema will be - validated against all headers in CSV files or the first header in RDD - if the ``header`` option is set to ``true``. Field names in the schema - and column names in CSV headers are checked by their positions - taking into account ``spark.sql.caseSensitive``. If None is set, - ``true`` is used by default. Though the default value is ``true``, - it is recommended to disable the ``enforceSchema`` option - to avoid incorrect results. - ignoreLeadingWhiteSpace : str or bool, optional - a flag indicating whether or not leading whitespaces from - values being read should be skipped. If None is set, it - uses the default value, ``false``. - ignoreTrailingWhiteSpace : str or bool, optional - a flag indicating whether or not trailing whitespaces from - values being read should be skipped. If None is set, it - uses the default value, ``false``. - nullValue : str, optional - sets the string representation of a null value. If None is set, it uses - the default value, empty string. Since 2.0.1, this ``nullValue`` param - applies to all supported types including the string type. - nanValue : str, optional - sets the string representation of a non-number value. If None is set, it - uses the default value, ``NaN``. - positiveInf : str, optional - sets the string representation of a positive infinity value. If None - is set, it uses the default value, ``Inf``. - negativeInf : str, optional - sets the string representation of a negative infinity value. If None - is set, it uses the default value, ``Inf``. - dateFormat : str, optional - sets the string that indicates a date format. Custom date formats - follow the formats at - `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. 
# noqa - This applies to date type. If None is set, it uses the - default value, ``yyyy-MM-dd``. - timestampFormat : str, optional - sets the string that indicates a timestamp format. - Custom date formats follow the formats at - `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa - This applies to timestamp type. If None is set, it uses the - default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``. - maxColumns : str or int, optional - defines a hard limit of how many columns a record can have. If None is - set, it uses the default value, ``20480``. - maxCharsPerColumn : str or int, optional - defines the maximum number of characters allowed for any given - value being read. If None is set, it uses the default value, - ``-1`` meaning unlimited length. - maxMalformedLogPerPartition : str or int, optional - this parameter is no longer used since Spark 2.2.0. - If specified, it is ignored. - mode : str, optional - allows a mode for dealing with corrupt records during parsing. If None is - set, it uses the default value, ``PERMISSIVE``. - - * ``PERMISSIVE``: when it meets a corrupted record, puts the malformed string \ - into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \ - fields to ``null``. To keep corrupt records, an user can set a string type \ - field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \ - schema does not have the field, it drops corrupt records during parsing. \ - A record with less/more tokens than schema is not a corrupted record to CSV. \ - When it meets a record having fewer tokens than the length of the schema, \ - sets ``null`` to extra fields. When the record has more tokens than the \ - length of the schema, it drops extra tokens. - * ``DROPMALFORMED``: ignores the whole corrupted records. - * ``FAILFAST``: throws an exception when it meets corrupted records. 
- - columnNameOfCorruptRecord : str, optional - allows renaming the new field having malformed string - created by ``PERMISSIVE`` mode. This overrides - ``spark.sql.columnNameOfCorruptRecord``. If None is set, - it uses the value specified in - ``spark.sql.columnNameOfCorruptRecord``. - multiLine : str or bool, optional - parse one record, which may span multiple lines. If None is - set, it uses the default value, ``false``. - charToEscapeQuoteEscaping : str, optional - sets a single character used for escaping the escape for - the quote character. If None is set, the default value is - escape character when escape and quote characters are - different, ``\0`` otherwise. - emptyValue : str, optional - sets the string representation of an empty value. If None is set, it uses - the default value, empty string. - locale : str, optional - sets a locale as language tag in IETF BCP 47 format. If None is set, - it uses the default value, ``en-US``. For instance, ``locale`` is used while - parsing dates and timestamps. - lineSep : str, optional - defines the line separator that should be used for parsing. If None is - set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``. - Maximum length is 1 character. - pathGlobFilter : str or bool, optional - an optional glob pattern to only include files with paths matching - the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`. - It does not change the behavior of - `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa - recursiveFileLookup : str or bool, optional - recursively scan a directory for files. Using this option disables - `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa - unescapedQuoteHandling : str, optional - defines how the CsvParser will handle values with unescaped quotes. If None is - set, it uses the default value, ``STOP_AT_DELIMITER``. 
- - * ``STOP_AT_CLOSING_QUOTE``: If unescaped quotes are found in the input, accumulate - the quote character and proceed parsing the value as a quoted value, until a closing - quote is found. - * ``BACK_TO_DELIMITER``: If unescaped quotes are found in the input, consider the value - as an unquoted value. This will make the parser accumulate all characters of the current - parsed value until the delimiter is found. If no delimiter is found in the value, the - parser will continue accumulating characters from the input until a delimiter or line - ending is found. - * ``STOP_AT_DELIMITER``: If unescaped quotes are found in the input, consider the value - as an unquoted value. This will make the parser accumulate all characters until the - delimiter or a line ending is found in the input. - * ``SKIP_VALUE``: If unescaped quotes are found in the input, the content parsed - for the given value will be skipped and the value set in nullValue will be produced - instead. - * ``RAISE_ERROR``: If unescaped quotes are found in the input, a TextParsingException - will be thrown. .. versionadded:: 2.0.0 + Other Parameters + ---------------- + Extra options + For the extra options, refer to + `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option>`_ + in the version you use. + + .. # noqa + Notes ----- This API is evolving. diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index ea84785..8a066bf 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -556,119 +556,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * is enabled. To avoid going through the entire data once, disable `inferSchema` option or * specify the schema explicitly using `schema`. 
* - * You can set the following CSV-specific options to deal with CSV files: - * <ul> - * <li>`sep` (default `,`): sets a separator for each field and value. This separator can be one - * or more characters.</li> - * <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding - * type.</li> - * <li>`quote` (default `"`): sets a single character used for escaping quoted values where - * the separator can be part of the value. If you would like to turn off quotations, you need to - * set not `null` but an empty string. This behaviour is different from - * `com.databricks.spark.csv`.</li> - * <li>`escape` (default `\`): sets a single character used for escaping quotes inside - * an already quoted value.</li> - * <li>`charToEscapeQuoteEscaping` (default `escape` or `\0`): sets a single character used for - * escaping the escape for the quote character. The default value is escape character when escape - * and quote characters are different, `\0` otherwise.</li> - * <li>`comment` (default empty string): sets a single character used for skipping lines - * beginning with this character. By default, it is disabled.</li> - * <li>`header` (default `false`): uses the first line as names of columns.</li> - * <li>`enforceSchema` (default `true`): If it is set to `true`, the specified or inferred schema - * will be forcibly applied to datasource files, and headers in CSV files will be ignored. - * If the option is set to `false`, the schema will be validated against all headers in CSV files - * in the case when the `header` option is set to `true`. Field names in the schema - * and column names in CSV headers are checked by their positions taking into account - * `spark.sql.caseSensitive`. Though the default value is true, it is recommended to disable - * the `enforceSchema` option to avoid incorrect results.</li> - * <li>`inferSchema` (default `false`): infers the input schema automatically from data. 
It - * requires one extra pass over the data.</li> - * <li>`samplingRatio` (default is 1.0): defines fraction of rows used for schema inferring.</li> - * <li>`ignoreLeadingWhiteSpace` (default `false`): a flag indicating whether or not leading - * whitespaces from values being read should be skipped.</li> - * <li>`ignoreTrailingWhiteSpace` (default `false`): a flag indicating whether or not trailing - * whitespaces from values being read should be skipped.</li> - * <li>`nullValue` (default empty string): sets the string representation of a null value. Since - * 2.0.1, this applies to all supported types including the string type.</li> - * <li>`emptyValue` (default empty string): sets the string representation of an empty value.</li> - * <li>`nanValue` (default `NaN`): sets the string representation of a non-number" value.</li> - * <li>`positiveInf` (default `Inf`): sets the string representation of a positive infinity - * value.</li> - * <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity - * value.</li> - * <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format. - * Custom date formats follow the formats at - * <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> - * Datetime Patterns</a>. - * This applies to date type.</li> - * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that - * indicates a timestamp format. Custom date formats follow the formats at - * <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> - * Datetime Patterns</a>. - * This applies to timestamp type.</li> - * <li>`maxColumns` (default `20480`): defines a hard limit of how many columns - * a record can have.</li> - * <li>`maxCharsPerColumn` (default `-1`): defines the maximum number of characters allowed - * for any given value being read. 
By default, it is -1 meaning unlimited length</li> - * <li>`unescapedQuoteHandling` (default `STOP_AT_DELIMITER`): defines how the CsvParser - * will handle values with unescaped quotes. - * <ul> - * <li>`STOP_AT_CLOSING_QUOTE`: If unescaped quotes are found in the input, accumulate - * the quote character and proceed parsing the value as a quoted value, until a closing - * quote is found.</li> - * <li>`BACK_TO_DELIMITER`: If unescaped quotes are found in the input, consider the value - * as an unquoted value. This will make the parser accumulate all characters of the current - * parsed value until the delimiter is found. If no - * delimiter is found in the value, the parser will continue accumulating characters from - * the input until a delimiter or line ending is found.</li> - * <li>`STOP_AT_DELIMITER`: If unescaped quotes are found in the input, consider the value - * as an unquoted value. This will make the parser accumulate all characters until the - * delimiter or a line ending is found in the input.</li> - * <li>`SKIP_VALUE`: If unescaped quotes are found in the input, the content parsed - * for the given value will be skipped and the value set in nullValue will be produced - * instead.</li> - * <li>`RAISE_ERROR`: If unescaped quotes are found in the input, a TextParsingException - * will be thrown.</li> - * </ul> - * </li> - * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records - * during parsing. It supports the following case-insensitive modes. Note that Spark tries - * to parse only required columns in CSV under column pruning. Therefore, corrupt records - * can be different based on required set of fields. This behavior can be controlled by - * `spark.sql.csv.parser.columnPruning.enabled` (enabled by default). - * <ul> - * <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a - * field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. 
- * To keep corrupt records, an user can set a string type field named - * `columnNameOfCorruptRecord` in an user-defined schema. If a schema does not have - * the field, it drops corrupt records during parsing. A record with less/more tokens - * than schema is not a corrupted record to CSV. When it meets a record having fewer - * tokens than the length of the schema, sets `null` to extra fields. When the record - * has more tokens than the length of the schema, it drops extra tokens.</li> - * <li>`DROPMALFORMED` : ignores the whole corrupted records.</li> - * <li>`FAILFAST` : throws an exception when it meets corrupted records.</li> - * </ul> - * </li> - * <li>`columnNameOfCorruptRecord` (default is the value specified in - * `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string - * created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li> - * <li>`multiLine` (default `false`): parse one record, which may span multiple lines.</li> - * <li>`locale` (default is `en-US`): sets a locale as language tag in IETF BCP 47 format. - * For instance, this is used while parsing dates and timestamps.</li> - * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator - * that should be used for parsing. Maximum length is 1 character.</li> - * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching - * the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>. - * It does not change the behavior of partition discovery.</li> - * <li>`modifiedBefore` (batch only): an optional timestamp to only include files with - * modification times occurring before the specified Time. The provided timestamp - * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li> - * <li>`modifiedAfter` (batch only): an optional timestamp to only include files with - * modification times occurring after the specified Time. 
The provided timestamp - * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li> - * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option - * disables partition discovery</li> - * </ul> + * You can find the CSV-specific options for reading CSV files in + * <a href="https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option"> + * Data Source Option</a> in the version you use. * * @since 2.0.0 */ diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala index cb10295..a8af7c8 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala @@ -850,48 +850,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * format("csv").save(path) * }}} * - * You can set the following CSV-specific option(s) for writing CSV files: - * <ul> - * <li>`sep` (default `,`): sets a single character as a separator for each - * field and value.</li> - * <li>`quote` (default `"`): sets a single character used for escaping quoted values where - * the separator can be part of the value. If an empty string is set, it uses `u0000` - * (null character).</li> - * <li>`escape` (default `\`): sets a single character used for escaping quotes inside - * an already quoted value.</li> - * <li>`charToEscapeQuoteEscaping` (default `escape` or `\0`): sets a single character used for - * escaping the escape for the quote character. The default value is escape character when escape - * and quote characters are different, `\0` otherwise.</li> - * <li>`escapeQuotes` (default `true`): a flag indicating whether values containing - * quotes should always be enclosed in quotes. 
Default is to escape all values containing - * a quote character.</li> - * <li>`quoteAll` (default `false`): a flag indicating whether all values should always be - * enclosed in quotes. Default is to only escape values containing a quote character.</li> - * <li>`header` (default `false`): writes the names of columns as the first line.</li> - * <li>`nullValue` (default empty string): sets the string representation of a null value.</li> - * <li>`emptyValue` (default `""`): sets the string representation of an empty value.</li> - * <li>`encoding` (by default it is not set): specifies encoding (charset) of saved csv - * files. If it is not set, the UTF-8 charset will be used.</li> - * <li>`compression` (default `null`): compression codec to use when saving to file. This can be - * one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`, - * `snappy` and `deflate`). </li> - * <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format. - * Custom date formats follow the formats at - * <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> - * Datetime Patterns</a>. - * This applies to date type.</li> - * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that - * indicates a timestamp format. Custom date formats follow the formats at - * <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> - * Datetime Patterns</a>. - * This applies to timestamp type.</li> - * <li>`ignoreLeadingWhiteSpace` (default `true`): a flag indicating whether or not leading - * whitespaces from values being written should be skipped.</li> - * <li>`ignoreTrailingWhiteSpace` (default `true`): a flag indicating defines whether or not - * trailing whitespaces from values being written should be skipped.</li> - * <li>`lineSep` (default `\n`): defines the line separator that should be used for writing. 
- * Maximum length is 1 character.</li> - * </ul> + * You can find the CSV-specific options for writing CSV files in + * <a href="https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option"> + * Data Source Option</a> in the version you use. * * @since 2.0.0 */ diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala index 8a278a5..c446d6b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala @@ -4607,6 +4607,7 @@ object functions { @scala.annotation.varargs def map_concat(cols: Column*): Column = withExpr { MapConcat(cols.map(_.expr)) } + // scalastyle:off line.size.limit /** * Parses a column containing a CSV string into a `StructType` with the specified schema. * Returns `null`, in the case of an unparseable string. @@ -4615,15 +4616,21 @@ object functions { * @param schema the schema to use when parsing the CSV string * @param options options to control how the CSV is parsed. accepts the same options and the * CSV data source. + * See + * <a href= + * "https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option"> + * Data Source Option</a> in the version you use. * * @group collection_funcs * @since 3.0.0 */ + // scalastyle:on line.size.limit def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column = withExpr { val replaced = CharVarcharUtils.failIfHasCharVarchar(schema).asInstanceOf[StructType] CsvToStructs(replaced, options, e.expr) } + // scalastyle:off line.size.limit /** * (Java-specific) Parses a column containing a CSV string into a `StructType` * with the specified schema. Returns `null`, in the case of an unparseable string. @@ -4632,10 +4639,15 @@ object functions { * @param schema the schema to use when parsing the CSV string * @param options options to control how the CSV is parsed. 
accepts the same options and the * CSV data source. + * See + * <a href= + * "https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option"> + * Data Source Option</a> in the version you use. * * @group collection_funcs * @since 3.0.0 */ + // scalastyle:on line.size.limit def from_csv(e: Column, schema: Column, options: java.util.Map[String, String]): Column = { withExpr(new CsvToStructs(e.expr, schema.expr, options.asScala.toMap)) } @@ -4660,32 +4672,44 @@ object functions { */ def schema_of_csv(csv: Column): Column = withExpr(new SchemaOfCsv(csv.expr)) + // scalastyle:off line.size.limit /** * Parses a CSV string and infers its schema in DDL format using options. * * @param csv a foldable string column containing a CSV string. * @param options options to control how the CSV is parsed. accepts the same options and the - * json data source. See [[DataFrameReader#csv]]. + * CSV data source. + * See + * <a href= + * "https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option"> + * Data Source Option</a> in the version you use. * @return a column with string literal containing schema in DDL format. * * @group collection_funcs * @since 3.0.0 */ + // scalastyle:on line.size.limit def schema_of_csv(csv: Column, options: java.util.Map[String, String]): Column = { withExpr(SchemaOfCsv(csv.expr, options.asScala.toMap)) } + // scalastyle:off line.size.limit /** * (Java-specific) Converts a column containing a `StructType` into a CSV string with * the specified schema. Throws an exception, in the case of an unsupported type. * * @param e a column containing a struct. * @param options options to control how the struct column is converted into a CSV string. - * It accepts the same options and the json data source. + * It accepts the same options and the CSV data source. + * See + * <a href= + * "https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option"> + * Data Source Option</a> in the version you use. 
* * @group collection_funcs * @since 3.0.0 */ + // scalastyle:on line.size.limit def to_csv(e: Column, options: java.util.Map[String, String]): Column = withExpr { StructsToCsv(options.asScala.toMap, e.expr) } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala index 6c3fbaf..e6e65cd 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala @@ -239,105 +239,16 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo * is enabled. To avoid going through the entire data once, disable `inferSchema` option or * specify the schema explicitly using `schema`. * - * You can set the following CSV-specific options to deal with CSV files: + * You can set the following option(s): * <ul> * <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be * considered in every trigger.</li> - * <li>`sep` (default `,`): sets a single character as a separator for each - * field and value.</li> - * <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding - * type.</li> - * <li>`quote` (default `"`): sets a single character used for escaping quoted values where - * the separator can be part of the value. If you would like to turn off quotations, you need to - * set not `null` but an empty string. This behaviour is different form - * `com.databricks.spark.csv`.</li> - * <li>`escape` (default `\`): sets a single character used for escaping quotes inside - * an already quoted value.</li> - * <li>`charToEscapeQuoteEscaping` (default `escape` or `\0`): sets a single character used for - * escaping the escape for the quote character. 
The default value is escape character when escape - * and quote characters are different, `\0` otherwise.</li> - * <li>`comment` (default empty string): sets a single character used for skipping lines - * beginning with this character. By default, it is disabled.</li> - * <li>`header` (default `false`): uses the first line as names of columns.</li> - * <li>`inferSchema` (default `false`): infers the input schema automatically from data. It - * requires one extra pass over the data.</li> - * <li>`ignoreLeadingWhiteSpace` (default `false`): a flag indicating whether or not leading - * whitespaces from values being read should be skipped.</li> - * <li>`ignoreTrailingWhiteSpace` (default `false`): a flag indicating whether or not trailing - * whitespaces from values being read should be skipped.</li> - * <li>`nullValue` (default empty string): sets the string representation of a null value. Since - * 2.0.1, this applies to all supported types including the string type.</li> - * <li>`emptyValue` (default empty string): sets the string representation of an empty value.</li> - * <li>`nanValue` (default `NaN`): sets the string representation of a non-number" value.</li> - * <li>`positiveInf` (default `Inf`): sets the string representation of a positive infinity - * value.</li> - * <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity - * value.</li> - * <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format. - * Custom date formats follow the formats at - * <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> - * Datetime Patterns</a>. - * This applies to date type.</li> - * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that - * indicates a timestamp format. Custom date formats follow the formats at - * <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> - * Datetime Patterns</a>. 
- * This applies to timestamp type.</li> - * <li>`maxColumns` (default `20480`): defines a hard limit of how many columns - * a record can have.</li> - * <li>`maxCharsPerColumn` (default `-1`): defines the maximum number of characters allowed - * for any given value being read. By default, it is -1 meaning unlimited length</li> - * <li>`unescapedQuoteHandling` (default `STOP_AT_DELIMITER`): defines how the CsvParser - * will handle values with unescaped quotes. - * <ul> - * <li>`STOP_AT_CLOSING_QUOTE`: If unescaped quotes are found in the input, accumulate - * the quote character and proceed parsing the value as a quoted value, until a closing - * quote is found.</li> - * <li>`BACK_TO_DELIMITER`: If unescaped quotes are found in the input, consider the value - * as an unquoted value. This will make the parser accumulate all characters of the current - * parsed value until the delimiter is found. If no delimiter is found in the value, the - * parser will continue accumulating characters from the input until a delimiter or line - * ending is found.</li> - * <li>`STOP_AT_DELIMITER`: If unescaped quotes are found in the input, consider the value - * as an unquoted value. This will make the parser accumulate all characters until the - * delimiter or a line ending is found in the input.</li> - * <li>`SKIP_VALUE`: If unescaped quotes are found in the input, the content parsed - * for the given value will be skipped and the value set in nullValue will be produced - * instead.</li> - * <li>`RAISE_ERROR`: If unescaped quotes are found in the input, a TextParsingException - * will be thrown.</li> - * </ul> - * </li> - * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records - * during parsing. It supports the following case-insensitive modes. - * <ul> - * <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a - * field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. 
- * To keep corrupt records, an user can set a string type field named - * `columnNameOfCorruptRecord` in an user-defined schema. If a schema does not have - * the field, it drops corrupt records during parsing. A record with less/more tokens - * than schema is not a corrupted record to CSV. When it meets a record having fewer - * tokens than the length of the schema, sets `null` to extra fields. When the record - * has more tokens than the length of the schema, it drops extra tokens.</li> - * <li>`DROPMALFORMED` : ignores the whole corrupted records.</li> - * <li>`FAILFAST` : throws an exception when it meets corrupted records.</li> - * </ul> - * </li> - * <li>`columnNameOfCorruptRecord` (default is the value specified in - * `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string - * created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li> - * <li>`multiLine` (default `false`): parse one record, which may span multiple lines.</li> - * <li>`locale` (default is `en-US`): sets a locale as language tag in IETF BCP 47 format. - * For instance, this is used while parsing dates and timestamps.</li> - * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator - * that should be used for parsing. Maximum length is 1 character.</li> - * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching - * the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>. - * It does not change the behavior of partition discovery.</li> - * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option - * disables partition discovery</li> * </ul> * + * You can find the CSV-specific options for reading CSV file stream in + * <a href="https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option"> + * Data Source Option</a> in the version you use. 
+ * * @since 2.0.0 */ def csv(path: String): DataFrame = format("csv").load(path)