[jira] [Updated] (SPARK-22455) Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.
[ https://issues.apache.org/jira/browse/SPARK-22455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-22455:
    Labels: bulk-closed (was: )

> Provide an option to store the exception records/files and reasons in log
> files when reading data from a file-based data source.
>
> Key: SPARK-22455
> URL: https://issues.apache.org/jira/browse/SPARK-22455
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 2.2.0
> Reporter: Sreenath Chothar
> Priority: Minor
> Labels: bulk-closed
>
> Provide an option to store the exception/bad records and the reasons in log
> files when reading data from a file-based data source into a PySpark
> DataFrame. Currently only the following three options are available:
> 1. PERMISSIVE: sets other fields to null when it meets a corrupted record,
> and puts the malformed string into a field configured by
> columnNameOfCorruptRecord.
> 2. DROPMALFORMED: ignores the whole corrupted record.
> 3. FAILFAST: throws an exception when it meets corrupted records.
> We could use the first option to accumulate the corrupted records and write
> them to a log file, but we can't use this option when the input schema is
> inferred automatically. If the number of columns to read is very large,
> providing the complete schema with an additional column for storing corrupted
> data is difficult. Instead, the "pyspark.sql.DataFrameReader.csv" reader
> function could provide an option to redirect the bad records to a configured
> log file path together with the exception details.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
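The semantics of the three existing parse modes described in the issue can be sketched in plain Python. This is only an illustration of the behavior, not Spark's implementation; `parse_csv` and its parameters are hypothetical names introduced here, and the real PySpark options are passed via `spark.read.csv(path, mode=..., columnNameOfCorruptRecord=...)`.

```python
import csv
import io

def parse_csv(text, schema_width, mode="PERMISSIVE",
              corrupt_col="_corrupt_record"):
    """Mimic Spark's CSV parse modes for rows whose column count
    does not match the expected schema width (hypothetical helper)."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        if len(raw) == schema_width:
            # Well-formed row: the corrupt-record column stays null.
            rows.append(raw + [None])
        elif mode == "PERMISSIVE":
            # Null out the data fields; keep the malformed line in the
            # column named by columnNameOfCorruptRecord.
            rows.append([None] * schema_width + [",".join(raw)])
        elif mode == "DROPMALFORMED":
            # Silently drop the corrupted record.
            continue
        elif mode == "FAILFAST":
            # Abort on the first corrupted record.
            raise ValueError(f"Malformed record: {raw}")
    return rows

data = "1,a\n2,b,EXTRA\n3,c\n"  # second row has one column too many
```

The issue's point is visible in the PERMISSIVE branch: recovering the bad rows requires the schema (and hence `schema_width`) to be known up front, which is exactly what schema inference does not give you.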
[jira] [Updated] (SPARK-22455) Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.
[ https://issues.apache.org/jira/browse/SPARK-22455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-22455:
    Flags: (was: Patch)
    Labels: (was: patch)
    Priority: Minor (was: Major)

> Provide an option to store the exception records/files and reasons in log
> files when reading data from a file-based data source.
>
> Key: SPARK-22455
> URL: https://issues.apache.org/jira/browse/SPARK-22455
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 2.2.0
> Reporter: Sreenath Chothar
> Priority: Minor

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org