If you have used a RecordBatchStreamWriter to serialize your data into a buffer that you're sending to another process, then to get it back into a record batch or table you need to read it with pyarrow.ipc.open_stream(...).read_all(). You can then concatenate the resulting tables with pyarrow.concat_tables (or use Table.from_batches if you have a sequence of record batches).
Hope this helps

On Thu, Jan 21, 2021 at 6:19 AM Jonathan MERCIER <[email protected]> wrote:
>
> Same question but simpler to understand.
>
> Using pyarrow and working with pieces of data per process (multi-process
> as a workaround for the GIL limitation). What is the correct way to handle
> this task?
>
> 1. Each parallel process creates a list of records, stores them
> into a record batch, and returns this batch.
>
> 2. Each parallel process creates an output buffer and stream writer,
> creates a list of records, stores them into a record batch, and writes
> this record batch into the stream writer. The process returns the
> corresponding buffer.
>
> With answer (1) I see how to merge all of those batches, but with
> solution (2), how do I merge all buffers into one once each process has
> returned its buffer?
>
> Thanks
>
> --
> Jonathan MERCIER
> Researcher computational biology
> PhD, Jonathan MERCIER
> Centre National de Recherche en Génomique Humaine (CNRGH)
> Bioinformatics (LBI)
> 2, rue Gaston Crémieux
> 91057 Evry Cedex
> Tel: (33) 1 60 87 34 88
> Email: [email protected]
