asfimport opened a new issue, #400:
URL: https://github.com/apache/parquet-format/issues/400

   I often need to create tens of millions of small dataframes and save them 
into Parquet files. All these dataframes have the same column and index 
information, and they normally have the same number of rows (around 300).  
   
   Because each dataframe is so small, the Parquet metadata is relatively large, 
and repeating the same metadata tens of millions of times wastes a lot of disk 
space.
   
   Concatenating them into one big Parquet file would save disk space, but it is 
not friendly to parallel processing of each small dataframe. 
   
   
   If I could save one copy of the metadata into a single file, and the 
remaining Parquet files contained only the data, the disk space would be saved 
while staying friendly to parallel processing.
   
   This seems possible by design, but I couldn't find any API supporting it.
   
   **Reporter**: [lei 
yu](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=assassin5615)
   
   <sub>**Note**: *This issue was originally created as 
[PARQUET-2207](https://issues.apache.org/jira/browse/PARQUET-2207). Please see 
the [migration 
documentation](https://issues.apache.org/jira/browse/PARQUET-2502) for further 
details.*</sub>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
