[GitHub] [spark] cchighman commented on a change in pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

GitBox Wed, 04 Nov 2020 04:51:57 -0800


cchighman commented on a change in pull request #28841:
URL: https://github.com/apache/spark/pull/28841#discussion_r517320323




##########
File path: python/pyspark/sql/readwriter.py
##########
@@ -263,62 +263,79 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             allows a mode for dealing with corrupt records during parsing. If None is
                      set, it uses the default value, ``PERMISSIVE``.
 
-                * ``PERMISSIVE``: when it meets a corrupted record, puts the malformed string \
-                  into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \
-                  fields to ``null``. To keep corrupt records, an user can set a string type \
-                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
-                  schema does not have the field, it drops corrupt records during parsing. \
-                  When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
-                  field in an output schema.
-                *  ``DROPMALFORMED``: ignores the whole corrupted records.
-                *  ``FAILFAST``: throws an exception when it meets corrupted records.
-
-        :param columnNameOfCorruptRecord: allows renaming the new field having malformed string
-                                          created by ``PERMISSIVE`` mode. This overrides
-                                          ``spark.sql.columnNameOfCorruptRecord``. If None is set,
-                                          it uses the value specified in
-                                          ``spark.sql.columnNameOfCorruptRecord``.
-        :param dateFormat: sets the string that indicates a date format. Custom date formats
-                           follow the formats at `datetime pattern`_.
-                           This applies to date type. If None is set, it uses the
-                           default value, ``yyyy-MM-dd``.
-        :param timestampFormat: sets the string that indicates a timestamp format.
-                                Custom date formats follow the formats at `datetime pattern`_.
-                                This applies to timestamp type. If None is set, it uses the
-                                default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``.
-        :param multiLine: parse one record, which may span multiple lines, per file. If None is
-                          set, it uses the default value, ``false``.
-        :param allowUnquotedControlChars: allows JSON Strings to contain unquoted control
-                                          characters (ASCII characters with value less than 32,
-                                          including tab and line feed characters) or not.
-        :param encoding: allows to forcibly set one of standard basic or extended encoding for
-                         the JSON files. For example UTF-16BE, UTF-32LE. If None is set,
-                         the encoding of input JSON will be detected automatically
-                         when the multiLine option is set to ``true``.
-        :param lineSep: defines the line separator that should be used for parsing. If None is
-                        set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
-        :param samplingRatio: defines fraction of input JSON objects used for schema inferring.
-                              If None is set, it uses the default value, ``1.0``.
-        :param dropFieldIfAllNull: whether to ignore column of all null values or empty
-                                   array/struct during schema inference. If None is set, it
-                                   uses the default value, ``false``.
-        :param locale: sets a locale as language tag in IETF BCP 47 format. If None is set,
-                       it uses the default value, ``en-US``. For instance, ``locale`` is used while
-                       parsing dates and timestamps.
-        :param pathGlobFilter: an optional glob pattern to only include files with paths matching
-                               the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
-                               It does not change the behavior of `partition discovery`_.
-        :param modifiedBefore: an optional timestamp to only include files with
-                    modification times occurring before the specified time. The provided timestamp
-                    must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-        :param modifiedAfter: an optional timestamp to only include files with
-                    modification times occurring after the specified time. The provided timestamp
-                    must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-        :param recursiveFileLookup: recursively scan a directory for files. Using this option
-                                    disables `partition discovery`_.
-        :param allowNonNumericNumbers: allows JSON parser to recognize set of "Not-a-Number" (NaN)
-                                       tokens as legal floating number values. If None is set,
-                                       it uses the default value, ``true``.
+            * ``PERMISSIVE``: when it meets a corrupted record, puts the malformed string \
+              into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \
+              fields to ``null``. To keep corrupt records, an user can set a string type \
+              field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
+              schema does not have the field, it drops corrupt records during parsing. \
+              When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
+              field in an output schema.
+            *  ``DROPMALFORMED``: ignores the whole corrupted records.
+            *  ``FAILFAST``: throws an exception when it meets corrupted records.
+
+        columnNameOfCorruptRecord: str, optional
+            allows renaming the new field having malformed string
+            created by ``PERMISSIVE`` mode. This overrides
+            ``spark.sql.columnNameOfCorruptRecord``. If None is set,
+            it uses the value specified in
+            ``spark.sql.columnNameOfCorruptRecord``.
+        dateFormat : str, optional
+            sets the string that indicates a date format. Custom date formats
+            follow the formats at
+            `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_.  # noqa
+            This applies to date type. If None is set, it uses the
+            default value, ``yyyy-MM-dd``.
+        timestampFormat : str, optional
+            sets the string that indicates a timestamp format.
+            Custom date formats follow the formats at
+            `datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_.  # noqa
+            This applies to timestamp type. If None is set, it uses the
+            default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``.
+        multiLine : str or bool, optional
+            parse one record, which may span multiple lines, per file. If None is
+            set, it uses the default value, ``false``.
+        allowUnquotedControlChars : str or bool, optional
+            allows JSON Strings to contain unquoted control
+            characters (ASCII characters with value less than 32,
+            including tab and line feed characters) or not.
+        encoding : str or bool, optional
+            allows to forcibly set one of standard basic or extended encoding for
+            the JSON files. For example UTF-16BE, UTF-32LE. If None is set,
+            the encoding of input JSON will be detected automatically
+            when the multiLine option is set to ``true``.
+        lineSep : str, optional
+            defines the line separator that should be used for parsing. If None is
+            set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
+        samplingRatio : str or float, optional
+            defines fraction of input JSON objects used for schema inferring.
+            If None is set, it uses the default value, ``1.0``.
+        dropFieldIfAllNull : str or bool, optional
+            whether to ignore column of all null values or empty
+            array/struct during schema inference. If None is set, it
+            uses the default value, ``false``.
+        locale : str, optional
+            sets a locale as language tag in IETF BCP 47 format. If None is set,
+            it uses the default value, ``en-US``. For instance, ``locale`` is used while
+            parsing dates and timestamps.
+        pathGlobFilter : str or bool, optional
+            an optional glob pattern to only include files with paths matching
+            the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
+            It does not change the behavior of
+            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
+        modifiedBefore : an optional timestamp to only include files with

Review comment:
       The net difference is the addition of lines 325-330.
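
       For readers following the diff: the `modifiedBefore` and `modifiedAfter` docstrings above require a timestamp of the form YYYY-MM-DDTHH:mm:ss. As a minimal sketch, a caller could check a value against that form before handing it to the reader. The helper name `is_valid_modified_timestamp` is hypothetical and not part of PySpark, and Python's `strptime` is somewhat more lenient than the documented form (it accepts non-zero-padded fields), so treat this as an illustration rather than the validation Spark itself performs.

```python
from datetime import datetime

# Python strptime equivalent of the documented YYYY-MM-DDTHH:mm:ss form.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def is_valid_modified_timestamp(value):
    """Hypothetical helper: True if value parses as YYYY-MM-DDTHH:mm:ss."""
    try:
        datetime.strptime(value, TIMESTAMP_FORMAT)
        return True
    except (TypeError, ValueError):
        # TypeError: value is not a string; ValueError: wrong layout.
        return False


if __name__ == "__main__":
    # The example value from the docstring in the diff.
    print(is_valid_modified_timestamp("2020-06-01T13:00:00"))
    # A space instead of the 'T' separator does not match the format.
    print(is_valid_modified_timestamp("2020-06-01 13:00:00"))
```

       Going by the docstring in this diff, a value that passes the check would then be supplied as the `modifiedBefore` or `modifiedAfter` argument of `json()`; that call is left out here because it needs a running Spark session.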




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org