[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dale Richardson updated SPARK-6593:
-----------------------------------
    Description: 
When reading a large number of gzip files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but in some circumstances, where you know the loss is acceptable, it would be nice to have the option to skip the corrupted file and continue the job.

  was:
When reading a large number of files from HDFS, e.g. with sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries report an exception then the entire job is canceled. As default behaviour this is probably for the best, but in some circumstances, where you know the loss is acceptable, it would be nice to have the option to skip the corrupted file and continue the job.


> Provide option for HadoopRDD to skip corrupted files
> ----------------------------------------------------
>
>                 Key: SPARK-6593
>                 URL: https://issues.apache.org/jira/browse/SPARK-6593
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.3.0
>            Reporter: Dale Richardson
>            Priority: Minor
>
> When reading a large number of gzip files from HDFS, e.g. with
> sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries
> report an exception then the entire job is canceled. As default behaviour
> this is probably for the best, but in some circumstances, where you know the
> loss is acceptable, it would be nice to have the option to skip the
> corrupted file and continue the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
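A possible driver-side workaround until such an option exists: probe each matched file with a small action so a corrupt gzip stream fails one file at a time rather than failing the combined job, then run the real job over the files that survived. A minimal Scala sketch, assuming the glob from the report; the object name SkipCorruptGzip and the final count are hypothetical illustration, and each file ends up being read twice (once to probe, once in the final job):

    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.{Failure, Success, Try}

    // Hypothetical workaround sketch; not an option built into Spark 1.3.0.
    object SkipCorruptGzip {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("skip-corrupt-gzip"))

        // Expand the same glob the job would use. globStatus returns null
        // when nothing matches, so guard against that.
        val fs = FileSystem.get(sc.hadoopConfiguration)
        val matched = Option(fs.globStatus(new Path("hdfs:///user/cloudera/logs*.gz")))
          .getOrElse(Array.empty[FileStatus])
          .map(_.getPath.toString)

        // Probe each file with count(): a corrupt gzip stream surfaces here
        // as a per-file job failure instead of killing the whole run.
        val readable = matched.filter { path =>
          Try(sc.textFile(path).count()) match {
            case Success(_) => true
            case Failure(e) =>
              System.err.println(s"Skipping corrupted file $path: ${e.getMessage}")
              false
          }
        }

        // textFile accepts a comma-separated list of paths.
        if (readable.nonEmpty) {
          val logs = sc.textFile(readable.mkString(","))
          println(s"Lines in readable files: ${logs.count()}")
        }
        sc.stop()
      }
    }

Note that Spark still applies its normal task retries to each failing probe before the Try gives up, so this approach can be slow when many files are corrupted or very large.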