[ 
https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Klichuk updated ARROW-7150:
----------------------------------
    Description: 
Having columnar storage format in mind, with gzip compression enabled, I can't 
make sense of how parquet file size is growing in my specific example.

I have not shared the full dataset so far (I would need to create a mock one to share).
{code:python}
> import pandas
> # 1. read 820 rows from a parquet file
> df = pandas.read_parquet('820.parquet')
> # size of 820.parquet is 528K
> len(df)
820
> # 2. write 8200 rows to a parquet file
> df_big = pandas.concat([df] * 10).reset_index(drop=True)
> len(df_big)
8200
> df_big.to_parquet('8200.parquet', compression='gzip')
> # size of 8200.parquet is 33M. Why is it 60 times bigger?
 {code}
  

Compression usually works better on bigger files. How come a 10x increase in 
rows, all of it repeated data, resulted in a roughly 60x growth in file size 
(528K to 33M)? Insane imo.
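
The expectation that repeated data compresses almost for free depends on the compressor actually seeing the repetition. As a minimal, standard-library-only sketch (not a diagnosis of this file): gzip's DEFLATE can only refer back to the previous 32 KiB, so repeating a block larger than that window ten times gives roughly ten times the output, while a small repeated block barely grows. The helper name `growth_ratio` and the block sizes are illustrative assumptions.

```python
import gzip
import os


def growth_ratio(block, copies=10):
    """Compressed size of `copies` repetitions divided by that of one copy."""
    single = len(gzip.compress(block))
    repeated = len(gzip.compress(block * copies))
    return repeated / single


# Random bytes are incompressible on their own, so any savings can only
# come from matching an earlier copy of the block.
small_block = os.urandom(1024)        # fits well inside the 32 KiB window
large_block = os.urandom(100 * 1024)  # much larger than the window

print("small block ratio:", round(growth_ratio(small_block), 2))
print("large block ratio:", round(growth_ratio(large_block), 2))
```

Even in the unfavourable case the growth stays near 10x, so window effects alone would not explain a 60x increase.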

 

I am working on a periodic job that concatenates smaller files into bigger 
ones, and I now doubt whether I need it.

 

I attached 820.parquet to try it out.

  was:
Having columnar storage format in mind, with gzip compression enabled, I can't 
make sense of how parquet file size is growing in my specific example.

I have not shared the full dataset so far (I would need to create a mock one to share).
{code:python}
> import pandas
> df = pandas.read_csv('...')
> len(df)
820
> # 1. write 820 rows to a parquet file
> df.to_parquet('820.parquet', compression='gzip')
> # size of 820.parquet is 6.1M
> # 2. write 8200 rows to a parquet file
> df_big = pandas.concat([df] * 10).reset_index(drop=True)
> len(df_big)
8200
> df_big.to_parquet('8200.parquet', compression='gzip')
> # size of 8200.parquet is 320M.
 {code}
 

 

Compression usually works better on bigger files. How come a 10x increase in 
rows, all of it repeated data, resulted in a roughly 50x growth in file size 
(6.1M to 320M)? Insane imo.

 

I am working on a periodic job that concatenates smaller files into bigger 
ones, and I now doubt whether I need it.


> [Python] Explain parquet file size growth
> -----------------------------------------
>
>                 Key: ARROW-7150
>                 URL: https://issues.apache.org/jira/browse/ARROW-7150
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>    Affects Versions: 0.14.1
>         Environment: Mac OS X. Pyarrow==0.14.1
>            Reporter: Bogdan Klichuk
>            Priority: Major
>         Attachments: 820.parquet
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
