[jira] [Updated] (SPARK-22455) Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.

2019-05-20 Thread Hyukjin Kwon (JIRA)


 [ https://issues.apache.org/jira/browse/SPARK-22455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-22455:
-
Labels: bulk-closed  (was: )

> Provide an option to store the exception records/files and reasons in log 
> files when reading data from a file-based data source.
> 
>
> Key: SPARK-22455
> URL: https://issues.apache.org/jira/browse/SPARK-22455
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Sreenath Chothar
>Priority: Minor
>  Labels: bulk-closed
>
> Provide an option to store the exception/bad records, together with the 
> reasons they failed, in log files when reading data from a file-based data 
> source into a PySpark DataFrame. Currently only the following three parse 
> modes are available (a usage sketch follows the list):
> 1. PERMISSIVE: sets other fields to null when it meets a corrupted record, 
> and puts the malformed string into a field configured by 
> columnNameOfCorruptRecord.
> 2. DROPMALFORMED: ignores whole corrupted records.
> 3. FAILFAST: throws an exception when it meets corrupted records.
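> A minimal sketch of mode 1 with the existing PySpark CSV reader; the input 
> path, column names, and schema here are hypothetical stand-ins:
>
>     from pyspark.sql import SparkSession
>     from pyspark.sql.types import IntegerType, StringType, StructField, StructType
>
>     spark = SparkSession.builder.getOrCreate()
>
>     # PERMISSIVE mode only retains malformed rows when the schema is supplied
>     # explicitly and includes a string column to receive the corrupt data.
>     schema = StructType([
>         StructField("id", IntegerType(), True),
>         StructField("name", StringType(), True),
>         StructField("_corrupt_record", StringType(), True),
>     ])
>
>     df = spark.read.csv(
>         "/data/input.csv",  # hypothetical input path
>         schema=schema,
>         mode="PERMISSIVE",
>         columnNameOfCorruptRecord="_corrupt_record",
>     )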
> We could use the first option to accumulate the corrupted records and write 
> them out to a log file, but we can't use that option when the input schema is 
> inferred automatically. And if the number of columns to read is very large, 
> providing the complete schema with an additional column for storing corrupted 
> data is difficult. Instead, the "pyspark.sql.DataFrameReader.csv" reader 
> function could provide an option that redirects the bad records, along with 
> the exception details, to a configured log file path.
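> Until such an option exists, here is a hedged workaround sketch building on 
> the frame above: cache before filtering, since newer Spark versions reject 
> queries that reference only the internal corrupt-record column, then persist 
> the malformed rows for inspection.
>
>     # Cache so that filtering on the corrupt-record column alone is allowed.
>     df.cache()
>     bad = df.filter(df["_corrupt_record"].isNotNull())
>
>     # Write the raw malformed lines to a hypothetical log directory.
>     bad.select("_corrupt_record").write.mode("overwrite").text("/logs/bad_records")
>
> Note that this captures only the malformed text, not the parse exception that 
> caused it, which is precisely the gap this issue asks Spark to fill.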





[jira] [Updated] (SPARK-22455) Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.

2017-11-06 Thread Sean Owen (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-22455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-22455:
--
   Flags:   (was: Patch)
  Labels:   (was: patch)
Priority: Minor  (was: Major)
