Hi,

I just noticed that we can end up with many empty input partitions when
reading from Parquet files.

Sample code is as follows:

  // Imports assumed for the snippet below (Spark 1.6 / Hadoop 2 / parquet-avro 1.7.0):
  import scala.collection.JavaConverters._

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.mapreduce.{Job, JobID}
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
  import org.apache.parquet.avro.AvroReadSupport
  import org.apache.parquet.hadoop.ParquetInputFormat
  import org.apache.spark.rdd.NewHadoopRDD

  val hconf = sc.hadoopConfiguration
  val job = Job.getInstance(hconf)

  // Point the job at the Parquet data and use the Avro read support.
  FileInputFormat.setInputPaths(job, new Path("path_to_data"))
  ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[MyAvroType]])

  val rdd = new NewHadoopRDD[Void, MyAvroType](
    sc,
    classOf[ParquetInputFormat[MyAvroType]],
    classOf[Void],
    classOf[MyAvroType],
    job.getConfiguration
  )

  // Print the splits that ParquetInputFormat itself computes.
  val ctx = rdd.newJobContext(job.getConfiguration, new JobID())
  val inputFormat = new ParquetInputFormat[MyAvroType]()
  inputFormat.getSplits(ctx).asScala.foreach(println)

  // Count the records that actually land in each RDD partition.
  val sizes = rdd.mapPartitions(iter => Iterator(iter.size)).collect().toList
  sizes.foreach(println)
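
(For completeness, I'd expect the same behaviour through the standard
sc.newAPIHadoopRDD entry point, since that builds a NewHadoopRDD
internally; untested sketch, reusing the job configuration from above:)

  val rdd2 = sc.newAPIHadoopRDD(
    job.getConfiguration,
    classOf[ParquetInputFormat[MyAvroType]],
    classOf[Void],
    classOf[MyAvroType]
  )
  // Same per-partition count as above, via the public SparkContext API.
  rdd2.mapPartitions(iter => Iterator(iter.size)).collect().foreach(println)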


The splits themselves look fine:

ParquetInputSplit{part: file:/folder/test_file start: 0 end: 33554432 length: 33554432 hosts: [localhost]}
ParquetInputSplit{part: file:/folder/test_file start: 33554432 end: 67108864 length: 33554432 hosts: [localhost]}
ParquetInputSplit{part: file:/folder/test_file start: 67108864 end: 100663296 length: 33554432 hosts: [localhost]}
ParquetInputSplit{part: file:/folder/test_file start: 100663296 end: 106022166 length: 5358870 hosts: [localhost]}

However, the per-partition record counts are:
0
4365522
0
0

Essentially, a single partition ends up with all the records.
When reading the same data through Spark SQL, everything is fine.
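
For comparison, this is roughly the Spark SQL path I mean (minimal
sketch, same "path_to_data" as above, sqlContext being the default
SQLContext):

  val df = sqlContext.read.parquet("path_to_data")
  // Here the per-partition counts look sane (this is the case that works).
  df.rdd.mapPartitions(iter => Iterator(iter.size)).collect().foreach(println)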

I'm using Spark 1.6.1 and parquet-avro 1.7.0.

Thanks!
-- 
*JU Han*

Software Engineer @ Teads.tv

+33 0619608888
