Bit of a late reply, but currently it is not possible. In theory it should be possible for a known list of IPC files: one could calculate the total size of every column from the metadata of all the files. This would require reading the metadata (not just the footer chunk at the end, but each of the schemas as well) from every file. Once that is done, the aggregated buffers could be allocated once, ahead of time. Each file would then be read in and its buffers copied to the correct offsets in the aggregated buffers.
I'm not certain whether this would be possible with Parquet files (you would need to know the total uncompressed size, in Arrow format, ahead of time, and I don't know whether that info is in the metadata). Typically the datasets API does not know the number of files ahead of time, and not all formats would support this kind of operation (e.g. CSV has no metadata), but I think it could be an interesting tool specialized for IPC files.

-Weston

On Mon, Dec 20, 2021 at 1:45 PM Kaixiang Lin <lkxcar...@gmail.com> wrote:
>
> Hello,
>
> We are looking for an approach to create a single-chunk table due to the
> issue [here](https://issues.apache.org/jira/browse/ARROW-11989). A single-chunk
> table would be much faster during indexing.
>
> Currently, we write the table by first loading all files, converting them to
> tables, and then combining chunks:
>
> ```python
> train_datasets = []
> for ds_file in all_datasets:
>     ds = pa.dataset.dataset(ds_file, format='feather')
>     train_datasets.append(ds.to_table())
> combined_table = pa.concat_tables(train_datasets).combine_chunks()
>
> schema = combined_table.schema
> batches = combined_table.to_batches()  # a single chunk yields one batch
> with open(args.output + "{}.arrow".format(split), "wb") as f:
>     s = pa.ipc.new_stream(
>         f, schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
>     )
>     s.write_batch(batches[0])
> ```
>
> However, this approach takes 2x the original dataset's size in memory. I wonder
> if there is a way to write the data one by one
> but still ensure a single chunk?
>
> Thank you!
>
> --
>
> Best,
> Kaixiang