Bit of a late reply, but currently it is not possible. In theory it should be possible for a known list of IPC files: one could calculate the total size of every column from the metadata of all the files. This would require reading the metadata (not just the footer chunk at the end, but each of the schemas as well) from every file. Once that is done, the aggregated buffers could be allocated once, ahead of time. Each file would then be read in and its buffers copied to the correct offsets in the aggregated buffers.
I'm not certain whether this would be possible with Parquet files (you would need to know the total uncompressed size, in Arrow format, ahead of time, and I don't know whether that info is in the metadata). Typically the datasets API does not know the number of files ahead of time, and not all formats would support this kind of operation (e.g. CSV has no metadata), but I think it could be an interesting tool specialized for IPC files.

-Weston

On Mon, Dec 20, 2021 at 1:45 PM Kaixiang Lin <lkxcar...@gmail.com> wrote:
>
> Hello,
>
> We are looking for an approach to create a single-chunk table due to the
> issue [here](https://issues.apache.org/jira/browse/ARROW-11989). A single-chunk
> table would be much faster during indexing.
>
> Currently, we write the table by first loading all files, converting them to
> tables, and then combining chunks:
>
> ```python
> train_datasets = []
> for ds_file in all_datasets:
>     ds = pa.dataset.dataset(ds_file, format='feather')
>     train_datasets.append(ds.to_table())
> combined_table = pa.concat_tables(train_datasets).combine_chunks()
>
> schema = combined_table.schema
> batches = combined_table.to_batches()  # a single chunk yields one batch
> with open(args.output + "{}.arrow".format(split), "wb") as f:
>     s = pa.ipc.new_stream(
>         f, schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
>     )
>     s.write_batch(batches[0])
> ```
>
> However, this approach takes 2x the original dataset's size in memory. I wonder
> if there is a way to write the data one by one
> but still ensure a single chunk?
>
> Thank you!
>
> --
>
> Best,
> Kaixiang