[ https://issues.apache.org/jira/browse/SPARK-22455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-22455:
------------------------------
    Flags:   (was: Patch)
   Labels:   (was: patch)
 Priority: Minor  (was: Major)

> Provide an option to store the exception records/files and reasons in log
> files when reading data from a file-based data source.
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22455
>                 URL: https://issues.apache.org/jira/browse/SPARK-22455
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Sreenath Chothar
>            Priority: Minor
>
> Provide an option to store the exception/bad records and the reasons in log
> files when reading data from a file-based data source into a PySpark DataFrame.
> Currently only the following three options are available:
> 1. PERMISSIVE: sets the other fields to null when it meets a corrupted record,
> and puts the malformed string into a field configured by
> columnNameOfCorruptRecord.
> 2. DROPMALFORMED: ignores the whole corrupted record.
> 3. FAILFAST: throws an exception when it meets corrupted records.
> We could use the first option to accumulate the corrupted records and output
> them to a log file, but we can't use this option when the input schema is
> inferred automatically. If the number of columns to read is too large,
> providing the complete schema with an additional column for storing corrupted
> data is difficult. Instead, the "pyspark.sql.DataFrameReader.csv" reader
> function could provide an option to redirect the bad records to a configured
> log file path with the exception details.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
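The three parse modes quoted in the issue can be sketched in plain Python. This is only an illustration of the PERMISSIVE / DROPMALFORMED / FAILFAST semantics, not Spark's implementation; the `read_csv` function and its parameters here are hypothetical, and "malformed" is simplified to mean a wrong field count:

```python
import csv
import io

def read_csv(text, schema_len, mode="PERMISSIVE",
             column_name_of_corrupt_record="_corrupt_record"):
    """Parse CSV text, handling malformed rows per the given mode.

    A row counts as malformed when its field count differs from
    schema_len. Illustrative sketch of the semantics described in
    SPARK-22455, not Spark's actual code.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) == schema_len:
            rows.append(record + [None])  # extra slot: corrupt-record column
        elif mode == "PERMISSIVE":
            # Null out the schema fields; keep the raw record in the
            # column named by column_name_of_corrupt_record.
            rows.append([None] * schema_len + [",".join(record)])
        elif mode == "DROPMALFORMED":
            continue  # silently drop the whole corrupted record
        elif mode == "FAILFAST":
            raise ValueError(f"Malformed record: {record!r}")
    return rows

data = "a,1\nb,2\nbad,row,extra\n"
print(read_csv(data, schema_len=2, mode="DROPMALFORMED"))
# keeps only the two well-formed rows
```

As the issue notes, the PERMISSIVE path is the only one that preserves the bad record, and in real Spark it requires an explicit schema containing the corrupt-record column, which is what becomes impractical with many columns or inferred schemas.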