This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new f4efcbf  [SPARK-28058][DOC] Add a note to doc of mode of CSV for column pruning
f4efcbf is described below

commit f4efcbf367b23e0e3e85cd3e094641c70eb17463
Author: Liang-Chi Hsieh <vii...@gmail.com>
AuthorDate: Tue Jun 18 13:48:32 2019 +0900

    [SPARK-28058][DOC] Add a note to doc of mode of CSV for column pruning

    ## What changes were proposed in this pull request?

    When using `DROPMALFORMED` mode, corrupted records are not dropped if the
    malformed columns are not read. This behavior is due to CSV parser column
    pruning. The current doc of `DROPMALFORMED` doesn't mention the effect of
    column pruning, so users can be confused when `DROPMALFORMED` mode doesn't
    work as expected. Column pruning also affects other modes. This doc
    improvement adds a note to the doc of `mode` to explain it.

    ## How was this patch tested?

    N/A. This is just a doc change.

    Closes #24894 from viirya/SPARK-28058.

    Authored-by: Liang-Chi Hsieh <vii...@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
    (cherry picked from commit b7bdc3111ec2778d7d54d09ba339d893250aa65d)
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
---
 python/pyspark/sql/readwriter.py                                   | 6 +++++-
 sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 5 ++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index c25426c..ea7cc80 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -417,7 +417,11 @@ class DataFrameReader(OptionUtils):
         :param maxMalformedLogPerPartition: this parameter is no longer used since Spark 2.2.0.
                                             If specified, it is ignored.
         :param mode: allows a mode for dealing with corrupt records during parsing. If None is
-                     set, it uses the default value, ``PERMISSIVE``.
+                     set, it uses the default value, ``PERMISSIVE``. Note that Spark tries to
+                     parse only required columns in CSV under column pruning. Therefore, corrupt
+                     records can be different based on required set of fields. This behavior can
+                     be controlled by ``spark.sql.csv.parser.columnPruning.enabled``
+                     (enabled by default).

                 * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
                   into a field configured by ``columnNameOfCorruptRecord``, and sets other \
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 666a97d..85cd3f0 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -589,7 +589,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * <li>`maxCharsPerColumn` (default `-1`): defines the maximum number of characters allowed
    * for any given value being read. By default, it is -1 meaning unlimited length</li>
    * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
-   * during parsing. It supports the following case-insensitive modes.
+   * during parsing. It supports the following case-insensitive modes. Note that Spark tries
+   * to parse only required columns in CSV under column pruning. Therefore, corrupt records
+   * can be different based on required set of fields. This behavior can be controlled by
+   * `spark.sql.csv.parser.columnPruning.enabled` (enabled by default).
    * <ul>
    * <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a
    * field configured by `columnNameOfCorruptRecord`, and sets other fields to `null`. To keep

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org