Hi,

When I use DataFrame's save function in append mode, I find that the sizes of 
the resulting Parquet part files are very different.

part-r-00001 to 00021 were generated the first time the save append function 
was called.
part-r-00022 to 00042 were generated the second time the save append function 
was called.

As you can see, each of part-r-00001 to 00021 is about 200M, while each of 
part-r-00022 to 00042 is about 720M.
But the source data was the same for both calls, which confuses me.

-rw-r--r-- 1 sysplatform sysplatform 2.0K Apr 8 10:01 _common_metadata
-rw-r--r-- 1 sysplatform sysplatform 392K Apr 8 10:01 _metadata
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00001.parquet
-rw-r--r-- 1 sysplatform sysplatform 200M Apr 8 09:44 part-r-00002.parquet
-rw-r--r-- 1 sysplatform sysplatform 200M Apr 8 09:43 part-r-00003.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00004.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00005.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00006.parquet
-rw-r--r-- 1 sysplatform sysplatform 200M Apr 8 09:43 part-r-00007.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00008.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00009.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00010.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00011.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00012.parquet
-rw-r--r-- 1 sysplatform sysplatform 200M Apr 8 09:43 part-r-00013.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00014.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00015.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00016.parquet
-rw-r--r-- 1 sysplatform sysplatform 200M Apr 8 09:43 part-r-00017.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00018.parquet
-rw-r--r-- 1 sysplatform sysplatform 200M Apr 8 09:43 part-r-00019.parquet
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00020.parquet
-rw-r--r-- 1 sysplatform sysplatform 200M Apr 8 09:43 part-r-00021.parquet
-rw-r--r-- 1 sysplatform sysplatform 720M Apr 8 10:01 part-r-00022.parquet
-rw-r--r-- 1 sysplatform sysplatform 723M Apr 8 10:00 part-r-00023.parquet
-rw-r--r-- 1 sysplatform sysplatform 721M Apr 8 10:01 part-r-00024.parquet
-rw-r--r-- 1 sysplatform sysplatform 721M Apr 8 10:00 part-r-00025.parquet
-rw-r--r-- 1 sysplatform sysplatform 717M Apr 8 10:00 part-r-00026.parquet
-rw-r--r-- 1 sysplatform sysplatform 721M Apr 8 10:00 part-r-00027.parquet
-rw-r--r-- 1 sysplatform sysplatform 720M Apr 8 10:01 part-r-00028.parquet
-rw-r--r-- 1 sysplatform sysplatform 725M Apr 8 10:01 part-r-00029.parquet
-rw-r--r-- 1 sysplatform sysplatform 720M Apr 8 10:00 part-r-00030.parquet
-rw-r--r-- 1 sysplatform sysplatform 725M Apr 8 10:01 part-r-00031.parquet
-rw-r--r-- 1 sysplatform sysplatform 724M Apr 8 10:01 part-r-00032.parquet
-rw-r--r-- 1 sysplatform sysplatform 724M Apr 8 10:00 part-r-00033.parquet
-rw-r--r-- 1 sysplatform sysplatform 721M Apr 8 10:01 part-r-00034.parquet
-rw-r--r-- 1 sysplatform sysplatform 721M Apr 8 10:01 part-r-00035.parquet
-rw-r--r-- 1 sysplatform sysplatform 720M Apr 8 10:00 part-r-00036.parquet
-rw-r--r-- 1 sysplatform sysplatform 717M Apr 8 10:00 part-r-00037.parquet
-rw-r--r-- 1 sysplatform sysplatform 724M Apr 8 10:01 part-r-00038.parquet
-rw-r--r-- 1 sysplatform sysplatform 722M Apr 8 10:01 part-r-00039.parquet
-rw-r--r-- 1 sysplatform sysplatform 722M Apr 8 10:00 part-r-00040.parquet
-rw-r--r-- 1 sysplatform sysplatform 721M Apr 8 10:01 part-r-00041.parquet
-rw-r--r-- 1 sysplatform sysplatform 723M Apr 8 10:01 part-r-00042.parquet
-rw-r--r-- 1 sysplatform sysplatform 0 Apr 8 10:01 _SUCCESS
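
To put numbers on the difference, here is a rough Python sketch that totals the part file sizes per append call from an `ls -lh` style listing like the one above. The function names and the assumption of 21 part files per call are only for illustration, and the sample is a four-line excerpt of the listing, not the full output.

```python
import re

# Multipliers for the human-readable sizes printed by `ls -lh`.
UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '199M' or '2.0K' to a number of bytes."""
    match = re.fullmatch(r"([\d.]+)([KMG]?)", text)
    if match is None:
        raise ValueError("unrecognised size: " + text)
    return float(match.group(1)) * UNITS[match.group(2)]


def batch_totals(listing, files_per_batch=21):
    """Sum part file sizes per append call.

    Assumes each call wrote `files_per_batch` consecutively numbered
    part-r-NNNNN.parquet files, starting at 00001.
    """
    totals = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # not a full `ls -l` entry
        match = re.fullmatch(r"part-r-(\d+)\.parquet", fields[-1])
        if match is None:
            continue  # skips _metadata, _common_metadata, _SUCCESS
        batch = (int(match.group(1)) - 1) // files_per_batch
        totals[batch] = totals.get(batch, 0) + parse_size(fields[4])
    return totals


# Excerpt of the listing: two files from each append call.
sample = """\
-rw-r--r-- 1 sysplatform sysplatform 199M Apr 8 09:43 part-r-00001.parquet
-rw-r--r-- 1 sysplatform sysplatform 200M Apr 8 09:44 part-r-00002.parquet
-rw-r--r-- 1 sysplatform sysplatform 720M Apr 8 10:01 part-r-00022.parquet
-rw-r--r-- 1 sysplatform sysplatform 723M Apr 8 10:00 part-r-00023.parquet
"""

totals = batch_totals(sample)
for batch, size in sorted(totals.items()):
    print("append call %d: %.0f MiB" % (batch + 1, size / UNITS["M"]))
```

Running the same function over the full listing would give the total written by each call, which makes the roughly 3.6x gap between the two appends easy to see.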
