Hi, I just found out that we can end up with lots of empty input partitions when reading from Parquet files.
Sample code as follows (imports added for completeness):

    import scala.collection.JavaConverters._
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.{Job, JobID}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.parquet.avro.AvroReadSupport
    import org.apache.parquet.hadoop.ParquetInputFormat
    import org.apache.spark.rdd.NewHadoopRDD

    // Configure a Hadoop job to read Avro records out of Parquet.
    val hconf = sc.hadoopConfiguration
    val job = new Job(hconf)
    FileInputFormat.setInputPaths(job, new Path("path_to_data"))
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[MyAvroType]])

    val rdd = new NewHadoopRDD[Void, MyAvroType](
      sc,
      classOf[ParquetInputFormat[MyAvroType]],
      classOf[Void],
      classOf[MyAvroType],
      job.getConfiguration
    )

    // Print the splits that ParquetInputFormat computes.
    val ctx = rdd.newJobContext(job.getConfiguration, new JobID())
    val inputFormat = new ParquetInputFormat[MyAvroType]()
    inputFormat.getSplits(ctx).asScala.foreach(println)

    // Count the records in each Spark partition.
    val sizes = rdd.mapPartitions { iter => List(iter.size).iterator }.collect().toList
    sizes.foreach(println)

The splits look fine:

    ParquetInputSplit{part: file:/folder/test_file start: 0 end: 33554432 length: 33554432 hosts: [localhost]}
    ParquetInputSplit{part: file:/folder/test_file start: 33554432 end: 67108864 length: 33554432 hosts: [localhost]}
    ParquetInputSplit{part: file:/folder/test_file start: 67108864 end: 100663296 length: 33554432 hosts: [localhost]}
    ParquetInputSplit{part: file:/folder/test_file start: 100663296 end: 106022166 length: 5358870 hosts: [localhost]}

However, the partition sizes are:

    0
    4365522
    0
    0

Essentially, a single partition holds all the records. When reading the same data through spark-sql, everything is fine.

I'm using Spark 1.6.1 and parquet-avro 1.7.0.

Thanks!

--
JU Han
Software Engineer @ Teads.tv
+33 0619608888
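PS: for comparison, this is roughly the spark-sql read that behaves correctly (a minimal sketch; sqlContext is the usual SQLContext from the shell, and the per-partition count mirrors the one above):

    // Read the same folder via the DataFrame API and count records per partition.
    val df = sqlContext.read.parquet("path_to_data")
    df.rdd.mapPartitions(iter => Iterator(iter.size)).collect().foreach(println)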