This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new f4efcbf  [SPARK-28058][DOC] Add a note to doc of mode of CSV for column pruning
f4efcbf is described below

commit f4efcbf367b23e0e3e85cd3e094641c70eb17463
Author: Liang-Chi Hsieh <vii...@gmail.com>
AuthorDate: Tue Jun 18 13:48:32 2019 +0900

    [SPARK-28058][DOC] Add a note to doc of mode of CSV for column pruning

    ## What changes were proposed in this pull request?

    When using `DROPMALFORMED` mode, corrupted records are not dropped if the
    malformed columns are not read. This behavior is due to CSV parser column
    pruning. The current doc of `DROPMALFORMED` doesn't mention the effect of
    column pruning, so users can be confused when `DROPMALFORMED` mode doesn't
    work as expected. Column pruning also affects other modes. This doc
    improvement adds a note to the doc of `mode` to explain it.

    ## How was this patch tested?

    N/A. This is just a doc change.

    Closes #24894 from viirya/SPARK-28058.

    Authored-by: Liang-Chi Hsieh <vii...@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
    (cherry picked from commit b7bdc3111ec2778d7d54d09ba339d893250aa65d)
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
---
 python/pyspark/sql/readwriter.py                                   | 6 +++++-
 sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 5 ++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index c25426c..ea7cc80 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -417,7 +417,11 @@ class DataFrameReader(OptionUtils):
         :param maxMalformedLogPerPartition: this parameter is no longer used since Spark 2.2.0.
                                             If specified, it is ignored.
         :param mode: allows a mode for dealing with corrupt records during parsing. If None is
-                     set, it uses the default value, ``PERMISSIVE``.
+                     set, it uses the default value, ``PERMISSIVE``. Note that Spark tries to
+                     parse only required columns in CSV under column pruning. Therefore, corrupt
+                     records can be different based on required set of fields. This behavior can
+                     be controlled by ``spark.sql.csv.parser.columnPruning.enabled``
+                     (enabled by default).

                 * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
                   into a field configured by ``columnNameOfCorruptRecord``, and sets other \
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index 666a97d..85cd3f0 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -589,7 +589,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * <li>`maxCharsPerColumn` (default `-1`): defines the maximum number of characters allowed
    * for any given value being read. By default, it is -1 meaning unlimited length</li>
    * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
-   * during parsing. It supports the following case-insensitive modes.
+   * during parsing. It supports the following case-insensitive modes. Note that Spark tries
+   * to parse only required columns in CSV under column pruning. Therefore, corrupt records
+   * can be different based on required set of fields. This behavior can be controlled by
+   * `spark.sql.csv.parser.columnPruning.enabled` (enabled by default).
    * <ul>
    * <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a
    * field configured by `columnNameOfCorruptRecord`, and sets other fields to `null`. To keep

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org