I’m pretty sure a snappy file is not splittable. That’s why you have a single
task (and most likely a single core) reading the 1.9 GB snappy file.

Sent from my iPhone

> On 23 Mar 2023, at 07:36, yangjie01 <yangji...@baidu.com> wrote:
> 
> Is there only one RowGroup for this file? You can check this by printing the 
> file's metadata using the `meta` command of `parquet-cli`.
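> 
> If parquet-cli isn't at hand, a rough equivalent is to read the footer with the
> parquet-mr API, e.g. from a spark-shell on the cluster. This is only a sketch
> (the path below is a shortened placeholder, and it assumes parquet-hadoop is on
> the classpath, as it is in a Spark distribution):
> 
> ```
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
> import org.apache.parquet.hadoop.ParquetFileReader
> import org.apache.parquet.hadoop.util.HadoopInputFile
> import scala.collection.JavaConverters._
> 
> // placeholder path; point it at the 1.9 GB file
> val inputFile = HadoopInputFile.fromPath(
>   new Path("hdfs://.../emb_valid_sel_train_sam_1ep.parquet"), new Configuration())
> val reader = ParquetFileReader.open(inputFile)
> try {
>   // one BlockMetaData per row group; a single entry means the whole file
>   // can only be handled by a single read task
>   val blocks = reader.getFooter.getBlocks.asScala
>   println(s"row groups: ${blocks.size}")
>   blocks.zipWithIndex.foreach { case (b, i) =>
>     println(s"row group $i: rows=${b.getRowCount}, compressed bytes=${b.getCompressedSize}")
>   }
> } finally {
>   reader.close()
> }
> ```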
>  
> Yang Jie
>  
> From: zhangliyun <kelly...@126.com>
> Date: Thursday, March 23, 2023, 15:16
> To: Spark Dev List <dev@spark.apache.org>
> Subject: please help the problem of big parquet file can not be splitted to read
>  
> hi all
>   I want to ask a question about how to split a big Parquet file when Spark
> reads it. I have a Parquet file which is 1.9 GB, and I have set
> spark.sql.files.maxPartitionBytes=128000000; it starts 80 tasks
> (80 * 128 MB ≈ 1.9 GB), but the partitions are not even. One partition reads
> the whole 1.9 GB of data while the others read only about 3 MB (see attached
> pic). I have checked the compression codec of the file; it is snappy, which
> should be splittable.
>  
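> For reference, a minimal sketch of how that setting is applied at runtime (the
> session and path here are placeholders, not the actual job):
> 
> ```
> // spark.sql.files.maxPartitionBytes is a runtime SQL conf,
> // so it can be set on an existing SparkSession
> spark.conf.set("spark.sql.files.maxPartitionBytes", 128000000L)
> 
> // Spark plans roughly one task per 128 MB slice of the file, but Parquet data
> // can only be split across tasks at row-group boundaries, so a file with a
> // single row group is effectively read by one task
> val df = spark.read.parquet("hdfs://.../emb_valid_sel_train_sam_1ep.parquet")
> println(df.rdd.getNumPartitions)
> ```
> 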
> hadoop jar parquet-tools-1.11.2.jar head 
> hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet
> 23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: RecordReader 
> initialized will read a total of 368273 records.
> 23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: at row 0. reading 
> next block
> 23/03/22 20:42:09 INFO compress.CodecPool: Got brand-new decompressor 
> [.snappy]
> 23/03/22 20:42:09 INFO hadoop.InternalParquetRecordReader: block read in 
> memory in 13227 ms. row count = 368273
>  
>   
>  
> the Spark code looks like:
>   ```
>     spark.read
>       .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
>       .option("mergeSchema", "false")
>       .load("xxxx")
>   ```
>  
> Appreciate your help
>  
>  
> Best Regards
>  
> Kelly Zhang
