[ 
https://issues.apache.org/jira/browse/SPARK-22455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22455:
------------------------------
       Flags:   (was: Patch)
      Labels:   (was: patch)
    Priority: Minor  (was: Major)

> Provide an option to store the exception records/files and reasons in log 
> files when reading data from a file-based data source.
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22455
>                 URL: https://issues.apache.org/jira/browse/SPARK-22455
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Sreenath Chothar
>            Priority: Minor
>
> Provide an option to store the exception/bad records and reasons in log files 
> when reading data from a file-based data source into a PySpark dataframe. 
> Currently only the following three options are available:
> 1. PERMISSIVE: sets other fields to null when it meets a corrupted record, 
> and puts the malformed string into a field configured by 
> columnNameOfCorruptRecord.
> 2. DROPMALFORMED: ignores corrupted records entirely.
> 3. FAILFAST: throws an exception when it meets corrupted records.
> We could use the first option to accumulate the corrupted records and write 
> them to a log file, but this option cannot be used when the input schema is 
> inferred automatically. When the number of columns to read is large, 
> providing the complete schema plus an additional column for storing corrupted 
> data is cumbersome. Instead, the "pyspark.sql.DataFrameReader.csv" reader 
> functions could provide an option to redirect bad records, along with the 
> exception details, to a configured log file path.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
