Good morning,
I am experiencing problems with the RecordBatches stored in plasma in a
particular situation.
If I return a RecordBatch as the result of a Python function, I am able to read
only the metadata, while I get an error when reading the columns.
For example, the following code
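import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
    # Connect to the plasma store on the socket named 'test'
    client = plasma.connect('test', "", 0)
    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()
    # Using the batch here, inside the function, works (first half of the output)
    print(batch)
    print(batch.schema)
    print(batch[0])
    return batch

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])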
represents a simple Python example in which a function retrieves the
RecordBatch from the plasma store and then returns it to the caller.
Running the previous example, I get:
<pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
FIELD1: int32
metadata
--------
{}
<pyarrow.lib.Int32Array object at 0x7fd0ebfc0f98>
[
1,
12,
23,
3,
21,
34
]
<pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
FIELD1: int32
metadata
--------
{}
Segmentation fault (core dumped)
If I retrieve and use the data in the same part of the code (as I do inside the
function retrieve1(); it also works when I put everything in the main
program), everything runs without problems.
Also, the problem seems to be specific to the case in which I retrieve the
RecordBatch from the plasma store, since the following (simpler) code:
import pyarrow as pa

def create():
    test1 = [1, 12, 23, 3, 21, 34]
    test1 = pa.array(test1, pa.int32())
    batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
    print(batch)
    print(batch.schema)
    print(batch[0])
    return batch

batch1 = create()
print(batch1)
print(batch1.schema)
print(batch1[0])
prints:
<pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
FIELD1: int32
<pyarrow.lib.Int32Array object at 0x7f5f691fd9a8>
[
1,
12,
23,
3,
21,
34
]
<pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
FIELD1: int32
<pyarrow.lib.Int32Array object at 0x7f5f7e29f318>
[
1,
12,
23,
3,
21,
34
]
Which is what I expect.
Is this issue known or am I doing something wrong when retrieving the
RecordBatch from plasma?
I would also like to point out that this problem was as easy to run into as it
was hard to reproduce. For this reason, there may be other situations in which
the same problem arises that I have not experienced, since I mostly deal with
plasma and have only been using Python so far: the description I gave is not
intended to be complete.
Thank you,
Alberto