This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new b7bdc31  [SPARK-28058][DOC] Add a note to doc of mode of CSV for column pruning
b7bdc31 is described below

commit b7bdc3111ec2778d7d54d09ba339d893250aa65d
Author: Liang-Chi Hsieh <vii...@gmail.com>
AuthorDate: Tue Jun 18 13:48:32 2019 +0900

    [SPARK-28058][DOC] Add a note to doc of mode of CSV for column pruning

    ## What changes were proposed in this pull request?

    When using `DROPMALFORMED` mode, corrupted records aren't dropped if the malformed
    columns aren't read. This behavior is due to CSV parser column pruning. The current
    doc of `DROPMALFORMED` doesn't mention the effect of column pruning, so users are
    confused when `DROPMALFORMED` mode doesn't work as expected. Column pruning also
    affects other modes. This is a doc improvement that adds a note to the doc of `mode`
    to explain it.

    ## How was this patch tested?

    N/A. This is just a doc change.

    Closes #24894 from viirya/SPARK-28058.

    Authored-by: Liang-Chi Hsieh <vii...@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
---
 python/pyspark/sql/readwriter.py                                   | 6 +++++-
 sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 5 ++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 6413d88..aa5bf63 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -430,7 +430,11 @@ class DataFrameReader(OptionUtils):
         :param maxMalformedLogPerPartition: this parameter is no longer used since Spark 2.2.0.
                                             If specified, it is ignored.
         :param mode: allows a mode for dealing with corrupt records during parsing. If None is
-                     set, it uses the default value, ``PERMISSIVE``.
+                     set, it uses the default value, ``PERMISSIVE``. Note that Spark tries to
+                     parse only required columns in CSV under column pruning. Therefore, corrupt
+                     records can be different based on required set of fields. This behavior can
+                     be controlled by ``spark.sql.csv.parser.columnPruning.enabled``
+                     (enabled by default).

                 * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
                   into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index dfc6d8c..2bf9024 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -627,7 +627,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * <li>`maxCharsPerColumn` (default `-1`): defines the maximum number of characters allowed
    * for any given value being read. By default, it is -1 meaning unlimited length</li>
    * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
-   * during parsing. It supports the following case-insensitive modes.
+   * during parsing. It supports the following case-insensitive modes. Note that Spark tries
+   * to parse only required columns in CSV under column pruning. Therefore, corrupt records
+   * can be different based on required set of fields. This behavior can be controlled by
+   * `spark.sql.csv.parser.columnPruning.enabled` (enabled by default).
    * <ul>
    * <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a
    * field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`.

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
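The effect the added note documents can be illustrated without Spark. The sketch below is plain Python (not Spark's actual CSV parser); the column names, data, and `parse` helper are hypothetical, invented only to show why a record counts as malformed only when a failing column is actually read, which is what column pruning changes:

```python
# A minimal plain-Python sketch of the column-pruning effect described in
# the patched docs: under DROPMALFORMED, a record is only detected (and
# dropped) as malformed if a column that fails to parse is actually read.
from datetime import datetime

csv_rows = [
    "0,2013-111_11 12:13:14",  # second column is a malformed timestamp
    "1,1983-08-04 00:00:00",
]

# Hypothetical schema: column name -> (field index, cast function).
SCHEMA = {
    "id": (0, int),
    "ts": (1, lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")),
}

def parse(rows, required_columns):
    """Parse only the required columns, dropping any row where a required
    column fails to cast (mimicking DROPMALFORMED under pruning)."""
    out = []
    for row in rows:
        fields = row.split(",")
        record = {}
        try:
            for col in required_columns:
                idx, cast = SCHEMA[col]
                record[col] = cast(fields[idx])
        except ValueError:
            continue  # required column was malformed: drop the record
        out.append(record)
    return out

# Both columns required: the malformed row is detected and dropped.
print(len(parse(csv_rows, ["id", "ts"])))  # 1
# Only "id" required (pruning): the bad "ts" value is never parsed,
# so the corrupt record survives.
print(len(parse(csv_rows, ["id"])))        # 2
```

In Spark itself, the patched docs say this pruning can be turned off via `spark.sql.csv.parser.columnPruning.enabled`, making malformed-record detection independent of which fields a query selects.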