Josh Rosen created SPARK-11177:
----------------------------------

             Summary: sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes
                 Key: SPARK-11177
                 URL: https://issues.apache.org/jira/browse/SPARK-11177
             Project: Spark
          Issue Type: Sub-task
            Reporter: Josh Rosen
From a user report:

{quote}
When I upload a series of text files to an S3 directory and one of the files is empty (0 bytes), the sc.wholeTextFiles method stack traces:

java.lang.ArrayIndexOutOfBoundsException: 0
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:506)
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285)
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245)
	at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
{quote}

It looks like this has been a longstanding issue:

* http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html
* https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark
* https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html
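For reference, a minimal sketch of how the report above can be reproduced. The bucket and directory names are placeholders, and it assumes S3 credentials are already configured for the underlying Hadoop filesystem. Note from the stack trace that the exception is raised while computing partitions on the driver, so any action on the RDD (not just collect) will trigger it.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Reproduction sketch for SPARK-11177 (hypothetical bucket/path; assumes
// S3 credentials are configured for the Hadoop filesystem).
object WholeTextFilesEmptyFileRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-11177-repro"))

    // Precondition: s3n://my-bucket/text-dir/ contains several text files,
    // at least one of which is zero bytes.
    val files = sc.wholeTextFiles("s3n://my-bucket/text-dir/")

    // The ArrayIndexOutOfBoundsException is thrown while computing splits
    // (CombineFileInputFormat.getSplits), before any task runs, so any
    // action here fails.
    files.collect().foreach { case (path, content) =>
      println(s"$path -> ${content.length} chars")
    }

    sc.stop()
  }
}
{code}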