Is there only one RowGroup for this file? You can check this by printing the 
file's metadata using the `meta` command of `parquet-cli`.
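
For reference, here is a minimal sketch (not part of the original reply) of checking the row-group count programmatically with the parquet-hadoop footer API, in case parquet-cli is not at hand; the HDFS path below is a placeholder:

```
// Sketch only: read the Parquet footer and print per-row-group metadata.
// ParquetFileReader / HadoopInputFile come from the parquet-hadoop library.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

val input = HadoopInputFile.fromPath(
  new Path("hdfs://nameservice/path/to/file.parquet"),  // placeholder path
  new Configuration())
val reader = ParquetFileReader.open(input)
try {
  val blocks = reader.getFooter.getBlocks.asScala
  println(s"row groups: ${blocks.size}")
  blocks.zipWithIndex.foreach { case (b, i) =>
    println(s"  row group $i: rows=${b.getRowCount}, compressed bytes=${b.getCompressedSize}")
  }
} finally {
  reader.close()
}
```

If this reports a single row group spanning the whole 1.9 GB, that would explain the skew, since Spark cannot split a Parquet file below the row-group level.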

Yang Jie

From: zhangliyun <kelly...@126.com>
Date: Thursday, March 23, 2023, 15:16
To: Spark Dev List <dev@spark.apache.org>
Subject: please help the problem of big parquet file can not be splitted to read

hi all,
  I want to ask a question about how to split a big Parquet file when Spark 
reads it. I have a Parquet file which is 1.9 GB, and I have set 
spark.sql.files.maxPartitionBytes=128000000; Spark starts 80 tasks 
(80 * 128M ~ 1.9G), but the partitions are not even: one partition reads 1.9 GB 
of data while the others read only about 3 MB each (see the attached picture). 
I have checked the compression codec of the file; it is Snappy, which is 
splittable.

hadoop jar parquet-tools-1.11.2.jar head 
hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet
23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: RecordReader 
initialized will read a total of 368273 records.
23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: at row 0. reading 
next block
23/03/22 20:42:09 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
23/03/22 20:42:09 INFO hadoop.InternalParquetRecordReader: block read in memory 
in 13227 ms. row count = 368273



The Spark code looks like:

  ```
  spark.read
    .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
    .option("mergeSchema", "false")
    .load("xxxx")
  ```
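
For completeness, a minimal sketch (not verbatim from the thread) of the same read with the spark.sql.files.maxPartitionBytes setting mentioned above applied on the session, plus a quick way to see how many read partitions Spark planned; the load path is a placeholder:

  ```
  // Sketch only: apply the partition-size limit described above and count
  // the partitions Spark actually creates for the scan.
  spark.conf.set("spark.sql.files.maxPartitionBytes", "128000000")

  val df = spark.read
    .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
    .option("mergeSchema", "false")
    .load("hdfs://nameservice/path/to/file.parquet")  // placeholder path

  println(df.rdd.getNumPartitions)
  ```

Note that maxPartitionBytes only controls how file byte ranges are grouped into tasks; if the file contains one large row group, only the task whose range covers that row group reads real data, which would match the skew described above.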

Appreciate your help


Best Regards

Kelly Zhang
