I’m pretty sure a snappy file is not splittable. That’s why you have a single task (and most likely a single core) reading the 1.9 GB snappy file.
Sent from my iPhone

> On 23 Mar 2023, at 07:36, yangjie01 <yangji...@baidu.com> wrote:
>
> Is there only one RowGroup for this file? You can check this by printing the
> file's metadata using the `meta` command of `parquet-cli`.
>
> Yang Jie
>
> From: zhangliyun <kelly...@126.com>
> Date: Thursday, 23 March 2023 at 15:16
> To: Spark Dev List <dev@spark.apache.org>
> Subject: please help: big parquet file cannot be split for reading
>
> hi all
>
> I want to ask a question about how to split a big parquet file when Spark
> reads it. I have a parquet file which is 1.9 GB, and I have set
> spark.sql.files.maxPartitionBytes=128000000. This starts 80 tasks
> (80 * 128 MB ≈ 1.9 GB), but the partitions are not even: one partition reads
> 1.9 GB of data while the others read only 3 MB (see the attached pic). I have
> checked the compression codec of the file; it is snappy, which should be
> splittable.
>
> hadoop jar parquet-tools-1.11.2.jar head hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet
> 23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 368273 records.
> 23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
> 23/03/22 20:42:09 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
> 23/03/22 20:42:09 INFO hadoop.InternalParquetRecordReader: block read in memory in 13227 ms. row count = 368273
>
> The Spark code looks like:
>
> ```
> spark.read
>   .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
>   .option("mergeSchema", "false")
>   .load("xxxx")
> ```
>
> Appreciate your help.
>
> Best Regards
>
> Kelly Zhang
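
For anyone hitting the same issue: besides `parquet meta <file>` from parquet-cli, the row-group check Yang Jie suggests can also be done programmatically. Below is a minimal sketch using the Parquet Java API (parquet-hadoop) from Scala; the object name `RowGroupCheck` is made up for illustration, and the HDFS path is the one from the original post. If the footer shows a single row group spanning nearly the whole file, Spark will still plan ~80 input splits under maxPartitionBytes, but only the split containing that row group's midpoint actually reads the data, which would match the skew described above.

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

import scala.collection.JavaConverters._

object RowGroupCheck {
  def main(args: Array[String]): Unit = {
    // Path taken from the original post; replace with your own file.
    val path = new Path(
      "hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet")
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))
    try {
      // One BlockMetaData entry per row group. A single entry means the
      // whole file is read by one task, no matter what maxPartitionBytes is.
      val rowGroups = reader.getFooter.getBlocks.asScala
      println(s"row groups: ${rowGroups.size}")
      rowGroups.zipWithIndex.foreach { case (rg, i) =>
        println(s"row group $i: rows=${rg.getRowCount}, " +
          s"compressed=${rg.getCompressedSize} B, total=${rg.getTotalByteSize} B")
      }
    } finally {
      reader.close()
    }
  }
}
```

If the check does confirm a single huge row group, no Spark-side split setting can help; one common remedy (not discussed in this thread) is rewriting the file with a smaller row-group size, e.g. via the writer option `parquet.block.size`, so that Spark can parallelize the read across row groups.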