Hi zhangliyun,
Sorry for the late reply. From the meta file you provided (line 1650: "row group
1: RC:1403968 TS:13491534645 OFFSET:4"), there is only one RowGroup in this
file, assuming the meta output is complete, so it is normal for this file to be
read and handled by only one Spark task; the other tasks should just finish
after reading the footer and performing a RowGroup range filter check. You can
consider controlling `parquet.block.size` when writing the parquet file so that
it has multiple RowGroups, which allows it to be read and handled in parallel
by multiple Spark tasks.
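If you rewrite the file, a minimal sketch of what that could look like is below; the paths and the 64 MB value are illustrative assumptions, not taken from your job, and the writer option should be passed through to the Parquet Hadoop configuration (setting `parquet.block.size` on `spark.sparkContext.hadoopConfiguration` should work as well):
```
// Hypothetical rewrite with a smaller Parquet row group size so the file
// ends up with multiple RowGroups. Paths and the 64 MB target are assumptions.
val df = spark.read.parquet("hdfs:///path/to/input.parquet")

df.write
  .option("parquet.block.size", 64L * 1024 * 1024) // target row group size in bytes
  .mode("overwrite")
  .parquet("hdfs:///path/to/output_with_multiple_rowgroups.parquet")
```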
As for why 80 tasks were started to read the file, you can study the logic of
`org.apache.spark.sql.execution.datasources.FilePartition#maxSplitBytes`;
`maxSplitBytes` is not determined solely by `spark.sql.files.maxPartitionBytes`.
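Roughly, and paraphrasing the Spark 3.x logic rather than quoting it exactly, the split size comes out of something like this:
```
// Simplified sketch of FilePartition.maxSplitBytes, not the exact Spark source.
// defaultMaxSplitBytes = spark.sql.files.maxPartitionBytes
// openCostInBytes      = spark.sql.files.openCostInBytes
// minPartitionNum      = spark.sql.files.minPartitionNum (defaults to the session's default parallelism)
def maxSplitBytes(defaultMaxSplitBytes: Long,
                  openCostInBytes: Long,
                  minPartitionNum: Int,
                  fileSizes: Seq[Long]): Long = {
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / minPartitionNum
  // With many cores relative to the total input size, the split size can be
  // much smaller than spark.sql.files.maxPartitionBytes, which yields more tasks.
  Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
}
```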
Yang Jie
From: zhangliyun
Date: Friday, March 24, 2023, 09:09
To: Alfie Davidson
Cc: yangjie01, Spark Dev List
Subject: Re: Re: please help the problem of big parquet file can not be splitted to
read
@Yangjie, the meta file is attached. I used "hadoop jar
parquet-tools-1.11.2.jar meta
hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet"
to get the info.
I am not sure whether this is what you mentioned; I did not find row group info in the
meta file. If my command is wrong, please tell me.
@Alfie Davidson If that is the case, everything is reasonable.
At 2023-03-24 04:36:47, "Alfie Davidson" wrote:
I’m pretty sure a snappy file is not splittable. That’s why you have a single
task (and most likely a single core) reading the 1.9GB snappy file.
On 23 Mar 2023, at 07:36, yangjie01 wrote:
Is there only one RowGroup for this file? You can check this by printing the
file's metadata using the `meta` command of `parquet-cli`.
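If it is easier than the CLI, a sketch like the following, using the parquet-hadoop `ParquetFileReader` API with a placeholder path, should also show the RowGroup count:
```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Print the number of row groups (blocks) in a Parquet file.
// The path below is a placeholder, not the actual file location.
val conf = new Configuration()
val inputFile = HadoopInputFile.fromPath(new Path("hdfs:///path/to/file.parquet"), conf)
val reader = ParquetFileReader.open(inputFile)
try {
  val blocks = reader.getFooter.getBlocks // one BlockMetaData per RowGroup
  println(s"row groups: ${blocks.size()}")
} finally {
  reader.close()
}
```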
Yang Jie
From: zhangliyun
Date: Thursday, March 23, 2023, 15:16
To: Spark Dev List
Subject: please help the problem of big parquet file can not be splitted to read
Hi all,
I want to ask a question about how to split a big parquet file when Spark
reads it. I have a parquet file which is 1.9G. I have set
spark.sql.files.maxPartitionBytes=12800; it starts 80 tasks
(80*128M~1.9G), but the partitions do not seem to be even: one partition reads
1.9G of data while the others read only 3M (see attached pic). I have checked
the compression codec of the file; it is snappy, which should be splittable.
hadoop jar parquet-tools-1.11.2.jar head
hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet
23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: RecordReader
initialized will read a total of 368273 records.
23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: at row 0. reading
next block
23/03/22 20:42:09 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
23/03/22 20:42:09 INFO hadoop.InternalParquetRecordReader: block read in memory
in 13227 ms. row count = 368273
The Spark code is like:
```
spark.read
  .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
  .option("mergeSchema", "false")
  .load("")
```
Appreciate your help
Best Regards
Kelly Zhang