Apache Spark reassigned SPARK-24610:
------------------------------------

    Assignee:     (was: Apache Spark)

> wholeTextFiles broken for small files
> -------------------------------------
>
>                 Key: SPARK-24610
>                 URL: https://issues.apache.org/jira/browse/SPARK-24610
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.1, 2.3.1
>            Reporter: Dhruve Ashar
>            Priority: Minor
>
> Spark is unable to read small files using the wholeTextFiles method when
> split-size-related configs are specified, either explicitly or through other
> config files such as hive-site.xml.
> For small files, the maxSplitSize computed by `WholeTextFileInputFormat` is
> far smaller than the default or commonly used split size of 64/128M, and
> Spark throws an exception while trying to read them.
>
> To reproduce the issue:
> {code:java}
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --conf "spark.hadoop.mapreduce.input.fileinputformat.split.minsize.per.node=123456"
>
> scala> sc.wholeTextFiles("file:///etc/passwd").count
> java.io.IOException: Minimum split size pernode 123456 cannot be larger than maximum split size 9962
>   at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
>   at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
>   ... 48 elided
>
> // For hdfs
> sc.wholeTextFiles("smallFile").count
> java.io.IOException: Minimum split size pernode 123456 cannot be larger than maximum split size 15
>   at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
>   at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
>   [identical intermediate frames as above]
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
>   ... 48 elided{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
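To make the failure mode above concrete: `WholeTextFileInputFormat` derives its split-size cap from the total input size, while Hadoop's `CombineFileInputFormat.getSplits` rejects any configuration where the per-node minimum split size exceeds that cap. Below is a minimal Python sketch of that interaction, not Spark's or Hadoop's actual code; the function names and the choice of `min_partitions` are illustrative assumptions, with only the size check and error message taken from the report.

```python
import math

def max_split_size(total_len: int, min_partitions: int) -> int:
    # Illustrative analogue of WholeTextFileInputFormat.setMinPartitions:
    # the split cap is derived from the total input size, so a small
    # input produces a cap far below the usual 64/128M split sizes.
    return int(math.ceil(total_len / max(min_partitions, 1)))

def check_split_config(min_size_per_node: int, max_size: int) -> None:
    # Illustrative analogue of the guard in
    # CombineFileInputFormat.getSplits that raises the
    # "Minimum split size pernode" IOException quoted in the report.
    if min_size_per_node != 0 and max_size != 0 and min_size_per_node > max_size:
        raise IOError(
            f"Minimum split size pernode {min_size_per_node} cannot be "
            f"larger than maximum split size {max_size}")

# The reported local-file case: a ~10 KB input yields a 9962-byte cap,
# which the configured 123456-byte per-node minimum can never satisfy.
try:
    check_split_config(123456, max_split_size(9962, 1))
except IOError as exc:
    print(exc)
```

Because the cap shrinks with the input rather than staying near a fixed 64/128M default, any fixed `split.minsize.per.node` setting larger than the file itself (here, picked up from the command line or a file like hive-site.xml) trips this check for small inputs.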