[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files

2016-09-13 Thread Charles Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1543#comment-1543
 ] 

Charles Pritchard commented on SPARK-6593:
--

Something appears to have changed between 1.5.1 and 2.0: in 2.0 I have files 
that fail with "Unexpected end of input stream", whereas 1.5.1 reads them 
without error. Those same files also trigger exceptions with command-line 
zcat/gzip.
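
One way to confirm that the gzip streams themselves are truncated (matching 
the zcat/gzip failures above) is to drain each file through a plain JVM 
GZIPInputStream and see whether it throws. A minimal sketch, assuming the 
files are reachable on a local filesystem; the helper name gzipIsIntact is 
illustrative, not from this ticket:

import java.io.{BufferedInputStream, FileInputStream, IOException}
import java.util.zip.GZIPInputStream

// Drains the stream fully; a truncated gzip member typically surfaces as an
// EOFException/IOException here, the same condition that makes zcat fail.
def gzipIsIntact(path: String): Boolean = {
  val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)))
  try {
    val buf = new Array[Byte](8192)
    while (in.read(buf) != -1) {} // read to EOF or error
    true
  } catch {
    case _: IOException => false  // covers EOFException and CRC mismatches
  } finally {
    in.close()
  }
}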

> Provide option for HadoopRDD to skip corrupted files
> 
>
> Key: SPARK-6593
> URL: https://issues.apache.org/jira/browse/SPARK-6593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Dale Richardson
>Priority: Minor
>
> When reading a large number of gzip files from HDFS, e.g. with 
> sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries 
> report an exception then the entire job is canceled. As the default 
> behaviour this is probably for the best, but in circumstances where you know 
> it will be OK, it would be nice to have the option to skip the corrupted 
> file and continue the job. 
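
Pending a built-in option, one workaround in the same spirit is to expand the 
glob on the driver, probe each file with a full pass, and union only the 
readable ones. A rough sketch, assuming a spark-shell session with sc in 
scope; the names and the probe-by-count approach are illustrative (and 
expensive, since every good file gets decompressed twice):

import org.apache.hadoop.fs.{FileSystem, Path}

// Expand the glob ourselves so individual bad files can be dropped.
val fs = FileSystem.get(sc.hadoopConfiguration)
val paths = fs.globStatus(new Path("hdfs:///user/cloudera/logs*.gz"))
  .map(_.getPath.toString)

// count() forces complete decompression, so a truncated gzip stream fails
// here (as a SparkException on the driver) instead of mid-job later.
val readable = paths.filter { p =>
  try { sc.textFile(p).count(); true }
  catch { case _: org.apache.spark.SparkException => false }
}

val logs = sc.union(readable.map(p => sc.textFile(p)))

Later Spark releases added configuration that addresses this directly 
(spark.files.ignoreCorruptFiles for core reads and 
spark.sql.files.ignoreCorruptFiles on the SQL side), which makes a probe like 
this unnecessary where available.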






[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files

2015-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396221#comment-14396221
 ] 

Apache Spark commented on SPARK-6593:
-

User 'tigerquoll' has created a pull request for this issue:
https://github.com/apache/spark/pull/5368
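
For readers who don't open the PR, the general shape of such a change is to 
wrap the record reader's iteration so an IOException from the decompressor 
ends the split early instead of failing the task. The following is a sketch 
of that pattern only, under that assumption, not the actual contents of pull 
request 5368:

import java.io.IOException
import org.apache.hadoop.mapred.RecordReader

// Illustrative wrapper: corrupt input truncates the split instead of
// killing the task. Real code would also want to log and count the skips.
def skipOnCorrupt[K, V](reader: RecordReader[K, V], key: K, value: V): Iterator[(K, V)] =
  new Iterator[(K, V)] {
    private var buffered = false
    private var finished = false

    override def hasNext: Boolean = {
      if (!buffered && !finished) {
        buffered =
          try reader.next(key, value)
          catch {
            case _: IOException =>
              // e.g. "Unexpected end of input stream" from a truncated gzip
              false
          }
        finished = !buffered
      }
      buffered
    }

    override def next(): (K, V) = {
      if (!hasNext) throw new NoSuchElementException("end of (possibly truncated) split")
      buffered = false
      (key, value)
    }
  }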

> Provide option for HadoopRDD to skip corrupted files
> 
>
> Key: SPARK-6593
> URL: https://issues.apache.org/jira/browse/SPARK-6593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Dale Richardson
>Priority: Minor
>
> When reading a large number of gzip files from HDFS, e.g. with 
> sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries 
> report an exception then the entire job is canceled. As the default 
> behaviour this is probably for the best, but in circumstances where you know 
> it will be OK, it would be nice to have the option to skip the corrupted 
> file and continue the job. 






[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files

2015-03-29 Thread Dale Richardson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385723#comment-14385723
 ] 

Dale Richardson commented on SPARK-6593:


Changed the title and description to focus more closely on my particular use 
case, which is corrupted gzip files.

> Provide option for HadoopRDD to skip corrupted files
> 
>
> Key: SPARK-6593
> URL: https://issues.apache.org/jira/browse/SPARK-6593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Dale Richardson
>Priority: Minor
>
> When reading a large number of files from HDFS, e.g. with 
> sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries 
> report an exception then the entire job is canceled. As the default 
> behaviour this is probably for the best, but in circumstances where you know 
> it will be OK, it would be nice to have the option to skip the corrupted 
> portion and continue the job. 


