Is there only one RowGroup for this file? You can check this by printing the file's metadata using the `meta` command of `parquet-cli`.
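With parquet-cli that is `parquet meta <file>`; with the same parquet-tools jar from your log, the equivalent invocation should be:

```
hadoop jar parquet-tools-1.11.2.jar meta hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet
```

The output lists each row group with its row count and compressed size. A single row group covering all 368273 rows would explain what you are seeing, since a Parquet file can only be split at row-group boundaries.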
Yang Jie

From: zhangliyun <kelly...@126.com>
Date: Thursday, March 23, 2023 at 15:16
To: Spark Dev List <dev@spark.apache.org>
Subject: please help the problem of big parquet file can not be splitted to read

Hi all,

I want to ask a question about how to split a big Parquet file when Spark reads it. I have a Parquet file which is 1.9G, and I have set spark.sql.files.maxPartitionBytes=128000000. This starts 80 tasks (80 * 128M ≈ 1.9G), but the partitions are not even: one partition reads 1.9G of data while the others read only 3M (see attached pic). I have checked the compression codec of the file; it is snappy, which should be splittable.

```
hadoop jar parquet-tools-1.11.2.jar head hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet
23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 368273 records.
23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
23/03/22 20:42:09 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
23/03/22 20:42:09 INFO hadoop.InternalParquetRecordReader: block read in memory in 13227 ms. row count = 368273
```

The Spark code looks like:

```
spark.read
  .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
  .option("mergeSchema", "false")
  .load("xxxx")
```

Appreciate your help.

Best Regards,
Kelly Zhang
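For anyone who wants to run the same check programmatically, a minimal Scala sketch using parquet-hadoop's footer API (ParquetFileReader, which ships with Spark); the path is the one from the thread and would be replaced with your own file:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

object RowGroupInspector {
  def main(args: Array[String]): Unit = {
    // File from the thread; substitute your own path.
    val path = new Path("hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet")
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))
    try {
      val blocks = reader.getFooter.getBlocks.asScala
      println(s"row groups: ${blocks.size}")
      blocks.zipWithIndex.foreach { case (b, i) =>
        // Spark can only split a Parquet file at row-group boundaries,
        // so one huge row group means one task reads nearly everything.
        println(s"row group $i: rows=${b.getRowCount}, compressed bytes=${b.getCompressedSize}")
      }
    } finally {
      reader.close()
    }
  }
}
```

If the footer shows a single ~1.9G row group, spark.sql.files.maxPartitionBytes cannot make the read more even, because it only places split points between row groups; the likely fix would be rewriting the file with a smaller row-group size (the parquet.block.size write option).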