KaixiangLin opened a new issue #12008: https://github.com/apache/arrow/issues/12008
Hello, we are looking for an approach to create a single-chunk table, due to the issue described in [ARROW-11989](https://issues.apache.org/jira/browse/ARROW-11989): a single-chunk table is much faster during indexing. Currently, we write the table by first loading all files, converting them to tables, and then combining the chunks:

```python
import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available

train_datasets = []
for ds_file in all_datasets:
    ds = pa.dataset.dataset(ds_file, format="feather")
    train_datasets.append(ds.to_table())

# Concatenate all tables, then merge their chunks into one contiguous chunk.
combined_table = pa.concat_tables(train_datasets).combine_chunks()
table = combined_table.cast(schema)

with open(output_filename, "wb") as f:
    s = pa.ipc.new_stream(
        f, schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
    )
    # After combine_chunks() the table has a single chunk, hence a single batch.
    batches = table.to_batches()
    s.write_batch(batches[0])
```

However, this approach takes 2x the original dataset size in memory. I wonder if there is a way to write the datasets one by one while still ensuring a single chunk? Thank you!
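For reference, a minimal sketch of the write-one-at-a-time variant the question alludes to, assuming the same user-defined `all_datasets`, `schema`, and `output_filename` as above. It keeps peak memory to roughly one file's table at a time, but each batch written to an IPC stream is read back as its own chunk, which is exactly the tension in the question:

```python
import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available

# Incremental variant: stream each file's batches to disk as they are
# read, so only one file's table is held in memory at a time.
with open(output_filename, "wb") as f:
    writer = pa.ipc.new_stream(
        f, schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
    )
    for ds_file in all_datasets:
        table = pa.dataset.dataset(ds_file, format="feather").to_table()
        for batch in table.cast(schema).to_batches():
            writer.write_batch(batch)
    writer.close()

# Caveat: reading this stream back (e.g. via pa.ipc.open_stream(...).read_all())
# yields one chunk per written batch, so memory stays flat but the result is
# NOT a single-chunk table.
```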
