Dhruve Ashar created SPARK-24610:
------------------------------------

             Summary: wholeTextFiles broken for small files
                 Key: SPARK-24610
                 URL: https://issues.apache.org/jira/browse/SPARK-24610
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.3.1, 2.2.1
            Reporter: Dhruve Ashar


Spark is unable to read small files using the wholeTextFiles method when
split-size-related configs are specified - either explicitly, or picked up
from config files such as hive-site.xml.

For small files, the maxSplitSize computed by `WholeTextFileInputFormat` is
far smaller than the default or commonly used split sizes of 64/128M, so any
configured minimum split size per node in that range exceeds it and Spark
throws an exception while trying to read the files.
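
The small maximum comes from how the input format derives the split size from
the total length of the input. Below is a minimal, self-contained sketch of
that arithmetic, mirroring what `WholeTextFileInputFormat.setMinPartitions`
appears to do in Spark 2.3.x (the object and variable names are illustrative,
not taken from the Spark source):

{code:java}
// Hedged sketch of the split-size arithmetic believed to cause the failure:
// the computed max split size is the total input length divided by
// minPartitions, which for tiny inputs lands far below any per-node minimum.
object SplitSizeSketch {
  def maxSplitSize(totalLen: Long, minPartitions: Int): Long =
    math.ceil(totalLen * 1.0 / (if (minPartitions == 0) 1 else minPartitions)).toLong

  def main(args: Array[String]): Unit = {
    val totalLen = 9962L         // size of /etc/passwd in the repro below
    val minSizePerNode = 123456L // the configured ...split.minsize.per.node
    val max = maxSplitSize(totalLen, minPartitions = 1) // = 9962
    // CombineFileInputFormat.getSplits throws when minSizePerNode > max:
    println(s"maxSplitSize=$max, violates=${minSizePerNode > max}")
  }
}
{code}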


To reproduce the issue: 
{code:java}
$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --conf "spark.hadoop.mapreduce.input.fileinputformat.split.minsize.per.node=123456"

scala> sc.wholeTextFiles("file:///etc/passwd").count
java.io.IOException: Minimum split size pernode 123456 cannot be larger than maximum split size 9962
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
  ... 48 elided

// For a small file on HDFS
scala> sc.wholeTextFiles("smallFile").count
java.io.IOException: Minimum split size pernode 123456 cannot be larger than maximum split size 15
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
  ... 48 elided
{code}
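
Until this is fixed, one possible workaround, offered as an untested sketch:
zero out the offending minimums on the SparkContext's Hadoop configuration
before calling wholeTextFiles. This assumes CombineFileInputFormat reads these
keys from the job configuration that spark.hadoop.* properties feed into.

{code:java}
// Untested workaround sketch: reset the per-node/per-rack minimums so
// CombineFileInputFormat no longer rejects the tiny computed max split size.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize.per.node", 0L)
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize.per.rack", 0L)

sc.wholeTextFiles("file:///etc/passwd").count
{code}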


