Dear all,
library: pyarrow
version: current stable
I am trying to read a file that supports random access and convert its
data into a Parquet dataset using pyarrow.
To do so, I use a process pool executor to process the input data
asynchronously. Each process reads through a stream and returns a
pyarrow.lib.Buffer at the end. How can I merge all those buffers in
order to get one Table?
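For context, here is a minimal sketch of the setup I mean; process_shard
and iter_shards are hypothetical placeholders for my actual worker
function and input-splitting logic:

from concurrent.futures import ProcessPoolExecutor

import pyarrow as pa

def iter_shards(path):
    # Hypothetical: split the random-access input file into
    # independently processable chunks
    ...

def process_shard(shard):
    # Hypothetical worker: writes record batches to an in-memory
    # stream and returns its content as a pyarrow.lib.Buffer
    # (see stream_to_buffer below)
    ...

with ProcessPoolExecutor() as executor:
    buffers = list(executor.map(process_shard, iter_shards("input_file")))
# buffers is a list of pyarrow.lib.Buffer, one per process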
I know how to do it from a single buffer, but not from a collection of
buffers:
- RecordBatchStreamReader(source).read_all()
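The only idea I have is to read each buffer back separately and
concatenate the per-buffer tables, along these lines (I am not sure
this is the intended approach):

import pyarrow as pa

tables = [pa.ipc.RecordBatchStreamReader(buf).read_all() for buf in buffers]
table = pa.concat_tables(tables)

Is concat_tables the right tool here, or is there a way to feed several
buffers to a single reader?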
My current code:
import pyarrow as pa

def stream_to_buffer(iterator, a_schema):
    sink = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(sink, a_schema)
    # Preallocate one list per column and reuse them, so that a batch
    # is written roughly every 1 MB of data
    buffer = ([None] * 70, [None] * 70)
    buffer_index = 0
    for item in iterator:
        try:
            buffer[0][buffer_index] = item.a
            buffer[1][buffer_index] = item.b
            buffer_index += 1
        except IndexError:
            # The lists are full: flush them as a batch, then store
            # the current item at the start of the reused lists
            batch = pa.record_batch(
                [buffer[0][:buffer_index], buffer[1][:buffer_index]],
                schema=a_schema,
            )
            writer.write_batch(batch)
            buffer_index = 0
            buffer[0][buffer_index] = item.a
            buffer[1][buffer_index] = item.b
            buffer_index += 1
    if buffer_index != 0:
        # Flush the remaining, partially filled batch
        batch = pa.record_batch(
            [buffer[0][:buffer_index], buffer[1][:buffer_index]],
            schema=a_schema,
        )
        writer.write_batch(batch)
    writer.close()
    return sink.getvalue()
Thanks for your help.
Best regards,
--
Jonathan MERCIER, PhD
Researcher in computational biology
Bioinformatics (LBI)
2, rue Gaston Crémieux
91057 Evry Cedex
Tel: (33) 1 60 87 83 44
Email: [email protected]