[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1543#comment-1543 ]

Charles Pritchard commented on SPARK-6593:
-------------------------------------------

Something appears to have changed between 1.5.1 and 2.0: I have files that read without error in 1.5.1 but fail in 2.0 with "Unexpected end of input stream". Those files also trigger exceptions with command-line zcat/gzip.

> Provide option for HadoopRDD to skip corrupted files
> -----------------------------------------------------
>
>                 Key: SPARK-6593
>                 URL: https://issues.apache.org/jira/browse/SPARK-6593
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.3.0
>            Reporter: Dale Richardson
>            Priority: Minor
>
> When reading a large number of gzip files from HDFS, e.g. with
> sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries
> report an exception then the entire job is canceled. As default behaviour
> this is probably for the best, but in circumstances where you know it will
> be OK, it would be nice to have the option to skip the corrupted file and
> continue the job.
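For anyone hitting the same error, a minimal diagnostic sketch: read each gzip file in its own job so the corrupt ones can be identified individually instead of aborting the whole glob read. The paths and job setup here are illustrative assumptions, not taken from this ticket.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.{Failure, Success, Try}

object FindCorruptGzipFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("find-corrupt-gz"))

    // Assumed file listing; in practice this could come from the
    // Hadoop FileSystem API or a precomputed manifest.
    val files = Seq(
      "hdfs:///user/cloudera/logs1.gz",
      "hdfs:///user/cloudera/logs2.gz"
    )

    files.foreach { path =>
      // A full count decompresses the entire stream, so a truncated
      // gzip file surfaces here as "Unexpected end of input stream"
      // after the task retries are exhausted.
      Try(sc.textFile(path).count()) match {
        case Success(n)  => println(s"OK      $path ($n lines)")
        case Failure(ex) => println(s"CORRUPT $path: ${ex.getMessage}")
      }
    }
    sc.stop()
  }
}
{code}

Counting every file twice is wasteful, but gzip corruption at the end of the stream is only detected once the whole archive has been decompressed, so a cheaper probe such as take(1) can miss exactly the failure reported above.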
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396221#comment-14396221 ]

Apache Spark commented on SPARK-6593:
--------------------------------------

User 'tigerquoll' has created a pull request for this issue:
https://github.com/apache/spark/pull/5368
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385723#comment-14385723 ]

Dale Richardson commented on SPARK-6593:
-----------------------------------------

Changed the title and description to focus more closely on my particular use case, which is corrupted gzip files.
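Until a built-in option like the one requested here is available (later Spark releases expose a spark.files.ignoreCorruptFiles setting, to my knowledge, but this ticket does not confirm that), one driver-side workaround is to read each file as its own RDD, drop the ones whose probe read fails, and union the survivors. A sketch under those assumptions, with illustrative file names:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.util.Try

object SkipCorruptFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skip-corrupt-gz"))

    // Assumed expansion of the glob into individual paths.
    val files = Seq(
      "hdfs:///user/cloudera/logs1.gz",
      "hdfs:///user/cloudera/logs2.gz",
      "hdfs:///user/cloudera/logs3.gz"
    )

    // Keep only the files that survive a full read. The count
    // decompresses each stream completely, so truncated gzip files
    // fail here rather than partway through the real job.
    val readable: Seq[RDD[String]] = files.flatMap { path =>
      val rdd = sc.textFile(path)
      Try(rdd.count()).toOption.map(_ => rdd)
    }

    val logs = sc.union(readable)
    println(s"Total lines across readable files: ${logs.count()}")
    sc.stop()
  }
}
{code}

The cost is one extra pass over the data; caching the probed RDDs would avoid the double read at the price of memory, which may be acceptable for log-sized inputs.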