Sreenath Chothar created SPARK-22455:
----------------------------------------
             Summary: Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.
                 Key: SPARK-22455
                 URL: https://issues.apache.org/jira/browse/SPARK-22455
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.2.0
            Reporter: Sreenath Chothar

Provide an option to store the exception/bad records, and the reasons they were rejected, in log files when reading data from a file-based data source into a PySpark dataframe. Currently only the following three parse modes are available:

1. PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord.
2. DROPMALFORMED: ignores the whole corrupted record.
3. FAILFAST: throws an exception when it meets corrupted records.

We could use the first option to accumulate the corrupted records and write them out to a log file, as sketched below. But we can't use this option when the input schema is inferred automatically, and if the number of columns to read is large, providing the complete schema by hand plus an additional column for storing the corrupted data is difficult. Instead, "pyspark.sql.DataFrameReader.csv" and the other file-based reader functions could provide an option to redirect the bad records, together with the exception details, to a configured log file path.
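A minimal sketch of the current workaround, assuming a small example schema and hypothetical paths such as /tmp/input.csv: with an explicitly supplied schema that includes the corrupt-record column, PERMISSIVE mode lets the bad rows be collected and written out as a log.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# The full schema must be written out by hand, including the extra
# column that receives the malformed input line.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("/tmp/input.csv"))  # hypothetical input path

# Keep only the rows that failed to parse and log the raw lines.
bad = df.filter(df["_corrupt_record"].isNotNull())
bad.select("_corrupt_record").write.mode("overwrite").text("/tmp/bad_records")
{code}

Hand-writing the schema is exactly the part that becomes impractical with a large number of columns. For comparison, a hypothetical illustration of the proposed option follows; the option name "badRecordsLogPath" is invented here for illustration and does not exist in Spark today:

{code:python}
# Hypothetical sketch of the proposal, not an existing Spark API:
# the reader would write every rejected record plus the parse
# exception to the configured path and return only the clean rows,
# with no hand-written schema required.
clean_df = (spark.read
            .option("inferSchema", "true")
            .option("badRecordsLogPath", "/tmp/bad_records")  # invented option
            .csv("/tmp/input.csv"))
{code}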