KaixiangLin opened a new issue #12008: https://github.com/apache/arrow/issues/12008
Hello, we are looking for an approach to create a single-chunk table, due to the issue described in [ARROW-11989](https://issues.apache.org/jira/browse/ARROW-11989): a single-chunk table is much faster during indexing. Currently, we write the table by first loading all files, converting them to tables, and then combining the chunks:

```python
import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available

train_datasets = []
for ds_file in all_datasets:
    ds = pa.dataset.dataset(ds_file, format="feather")
    train_datasets.append(ds.to_table())

# Concatenate all tables, then merge their chunks into one contiguous chunk.
combined_table = pa.concat_tables(train_datasets).combine_chunks()
table = combined_table.cast(schema)

with open(output_filename, "wb") as f:
    s = pa.ipc.new_stream(
        f, schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
    )
    # After combine_chunks() the table has a single chunk, hence a single batch.
    batches = table.to_batches()
    s.write_batch(batches[0])
```

However, this approach takes 2x the original dataset size in memory. I wonder if there is a way to write the datasets one by one while still ensuring a single chunk? Thank you!
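For reference, a minimal sketch of the write-one-at-a-time variant the question alludes to, assuming the same user-defined `all_datasets`, `schema`, and `output_filename` as above. It keeps peak memory to roughly one file's table at a time, but each batch written to an IPC stream is read back as its own chunk, which is exactly the tension in the question:

```python
import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available

# Incremental variant: stream each file's batches to disk as they are
# read, so only one file's table is held in memory at a time.
with open(output_filename, "wb") as f:
    writer = pa.ipc.new_stream(
        f, schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
    )
    for ds_file in all_datasets:
        table = pa.dataset.dataset(ds_file, format="feather").to_table()
        for batch in table.cast(schema).to_batches():
            writer.write_batch(batch)
    writer.close()

# Caveat: reading this stream back (e.g. via pa.ipc.open_stream(...).read_all())
# yields one chunk per written batch, so memory stays flat but the result is
# NOT a single-chunk table.
```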
