Re: please help: big parquet file cannot be split for reading

2023-03-25 Thread yangjie01
Hi zhangliyun,

Sorry for the late reply. From the meta file you provided (line 1650: "row group 
1: RC:1403968 TS:13491534645 OFFSET:4"), there is only one RowGroup in this 
file, assuming the meta file is the complete output, so it is normal for this file to be 
read and handled by only one Spark task; the other tasks should finish right after 
reading the footer and performing the RowGroup range filter check. You can 
consider controlling `parquet.block.size` when writing the parquet file so that 
it has multiple RowGroups, allowing it to be read and handled in parallel by 
multiple Spark tasks, as sketched below.
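
For reference, a minimal sketch of rewriting the data with a smaller row-group size by setting `parquet.block.size` on the Hadoop configuration; the 64 MB value, the `df` DataFrame, and the output path are illustrative assumptions, not part of the original thread:

```
// Sketch only: shrink the Parquet row-group size so that one large file
// contains several RowGroups and can be read by several Spark tasks.
// The 64 MB value and the output path are illustrative assumptions.
spark.sparkContext.hadoopConfiguration
  .setInt("parquet.block.size", 64 * 1024 * 1024) // row-group size in bytes

df.write
  .mode("overwrite")
  .parquet("hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep_rewritten.parquet")
```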

As for why 80 tasks were started to read the file, you can study the logic of 
`org.apache.spark.sql.execution.datasources.FilePartition#maxSplitBytes`; 
`maxSplitBytes` is not determined only by `spark.sql.files.maxPartitionBytes`. A rough 
paraphrase of that logic is sketched below.
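
Roughly, in recent Spark 3.x releases the split size is capped by `spark.sql.files.maxPartitionBytes` but also depends on `spark.sql.files.openCostInBytes` and on the total input size divided by the (minimum) partition number. A simplified, non-authoritative paraphrase, not the exact Spark source:

```
// Simplified paraphrase of FilePartition.maxSplitBytes (not the exact Spark source).
// defaultMaxSplitBytes = spark.sql.files.maxPartitionBytes
// openCostInBytes      = spark.sql.files.openCostInBytes
// minPartitionNum      = spark.sql.files.minPartitionNum (or the default parallelism)
def approxMaxSplitBytes(
    defaultMaxSplitBytes: Long,
    openCostInBytes: Long,
    minPartitionNum: Int,
    fileSizes: Seq[Long]): Long = {
  // Every file is charged an extra "open cost" before computing bytes per core.
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / minPartitionNum
  // The split size is the configured cap, unless bytesPerCore
  // (floored at openCostInBytes) is smaller.
  math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
}
```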

Yang Jie

From: zhangliyun
Date: Friday, March 24, 2023 09:09
To: Alfie Davidson
Cc: yangjie01, Spark Dev List
Subject: Re: Re: please help: big parquet file cannot be split for reading

@Yangjie: the meta file is attached. I used "hadoop jar 
parquet-tools-1.11.2.jar meta 
hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet" 
to get the info. I'm not sure this is what you mentioned; I did not find row group 
info in the meta file, so if my command is wrong, please tell me.
@Alfie Davidson: if that is the case, everything is reasonable.

At 2023-03-24 04:36:47, "Alfie Davidson" wrote:
I'm pretty sure a snappy file is not splittable. That's why you have a single 
task (and most likely a single core) reading the 1.9GB snappy file.


On 23 Mar 2023, at 07:36, yangjie01  wrote:
Is there only one RowGroup for this file? You can check this by printing the 
file's metadata using the `meta` command of `parquet-cli`.
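
As an alternative to the CLI, here is a minimal sketch that counts RowGroups programmatically with the parquet-hadoop footer API; it assumes parquet-hadoop and the Hadoop client are on the classpath, and reuses the file path from this thread purely as an example:

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Sketch: open the Parquet footer and count its RowGroups (blocks).
val path = new Path("hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet")
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))
try {
  val blocks = reader.getFooter.getBlocks
  println(s"RowGroups: ${blocks.size()}")
} finally {
  reader.close()
}
```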

Yang Jie

From: zhangliyun
Date: Thursday, March 23, 2023 15:16
To: Spark Dev List
Subject: please help: big parquet file cannot be split for reading

Hi all,
  I want to ask a question about how to split a big parquet file when Spark 
reads it. I have a parquet file which is 1.9G. I have set 
spark.sql.files.maxPartitionBytes=12800; it starts 80 tasks 
(80 * 128M ~ 1.9G), but 
it seems the partitions are not even. One partition reads 1.9G of data while the 
others read only 3M (see the attached pic). I have checked the compression codec of 
the file; it is snappy, which can be split.

hadoop jar parquet-tools-1.11.2.jar head 
hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet
23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: RecordReader 
initialized will read a total of 368273 records.
23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: at row 0. reading 
next block
23/03/22 20:42:09 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
23/03/22 20:42:09 INFO hadoop.InternalParquetRecordReader: block read in memory 
in 13227 ms. row count = 368273



The Spark read code looks like:
  ```
spark.read
  .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
  .option("mergeSchema", "false")
  .load("")
  ```
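
For what it's worth, a small sketch to confirm the skew by counting rows per input partition after the scan; the empty path is left exactly as in the original snippet, and the read uses the conventional short "parquet" format name:

```
import org.apache.spark.sql.functions.spark_partition_id

// Sketch: count how many rows each input partition actually got,
// to confirm that one task is reading almost the whole file.
val df = spark.read
  .option("mergeSchema", "false")
  .format("parquet")
  .load("")  // path left empty as in the original snippet

df.groupBy(spark_partition_id())
  .count()
  .show(100, false)
```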

Appreciate your help


Best Regards

Kelly Zhang

