Re: [Python] Retrieving a RecordBatch from plasma inside a function

ALBERTO Bocchinfuso Tue, 06 Feb 2018 01:24:50 -0800

Hi,

I’m using Python 3.5.2 and pyarrow 0.8.0.


As the key I pass a string of 20 bytes, of course. I build it differently from
the canonical way because I no longer use Python 2.7 but Python 3, and this
seemed to me to be the right way to create a string of 20 bytes.
The full code is:

import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
    client = plasma.connect('test', "", 0)

    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key,'UTF-8'))

    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()

    print(batch)
    print(batch.schema)
    print(batch[0])

    return batch

client = plasma.connect('test', "", 0)

test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())

batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key,'UTF-8'))
sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
stream_writer.write_batch(batch)
stream_writer.close()

bff = client.create(pid, sink.size())

stream = pa.FixedSizeBufferWriter(bff)
writer = pa.RecordBatchStreamWriter(stream, batch.schema)
writer.write_batch(batch)
client.seal(pid)

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])
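
For readers following the write path in the script above: the batch is serialized twice, once into a sink that only counts bytes (so the plasma buffer can be created with the right size) and once into the fixed-size buffer itself. Below is a minimal standard-library sketch of that two-pass pattern. It does not use pyarrow; CountingSink, FixedBufferWriter, serialize and the byte layout are hypothetical stand-ins for illustration only, not the Arrow stream format or API.

```python
import struct


class CountingSink:
    """Hypothetical stand-in for a mock stream: discards data, keeps the byte count."""

    def __init__(self):
        self.size = 0

    def write(self, data):
        self.size += len(data)
        return len(data)


class FixedBufferWriter:
    """Hypothetical stand-in for a writer over a pre-allocated, fixed-size buffer."""

    def __init__(self, buffer):
        self.view = memoryview(buffer)
        self.pos = 0

    def write(self, data):
        end = self.pos + len(data)
        if end > len(self.view):
            raise ValueError("write exceeds the pre-allocated buffer")
        self.view[self.pos:end] = data
        self.pos = end
        return len(data)


def serialize(values, sink):
    # Illustrative layout (an assumption, not Arrow's): a 4-byte count
    # followed by the values as little-endian int32.
    sink.write(struct.pack("<i", len(values)))
    sink.write(struct.pack("<%di" % len(values), *values))


values = [1, 12, 23, 3, 21, 34]

# Pass 1: measure how many bytes the serialized form needs.
counter = CountingSink()
serialize(values, counter)

# Pass 2: allocate exactly that much and write the same bytes for real.
storage = bytearray(counter.size)
writer = FixedBufferWriter(storage)
serialize(values, writer)
```

The point of the first pass is only to learn the size up front, since the destination buffer cannot grow once it has been created.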

I hope this helps,
thank you

From: Philipp Moritz<mailto:[email protected]>
Sent: Tuesday, 6 February 2018 00:00
To: [email protected]<mailto:[email protected]>
Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a function

Hey Alberto,

Thanks for your message! I'm trying to reproduce it.

Can you attach the code you use to write the batch into the store?

Also can you say which version of Python and Arrow you are using? On my
installation, I get

```

In [5]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-fbec5bb33c33> in <module>()
----> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))

plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()

ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
```

(the canonical way to do this would be
plasma.ObjectID(b"keynumber1keynumber1"))
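
To make the difference between the two spellings concrete, here is a small plain-Python sketch that needs no plasma store. It only shows that bytearray(...) and a b"..." literal hold the same 20 bytes but are different types. Whether ObjectID accepts one and rejects the other depends on the installed version, so treat the final conversion as an assumption suggested by the traceback above rather than a confirmed fix.

```python
key = "keynumber1keynumber1"

as_bytearray = bytearray(key, "UTF-8")  # mutable, the form used in the report
as_bytes = key.encode("utf-8")          # immutable bytes built from the same str
literal = b"keynumber1keynumber1"       # the canonical bytes literal

# All three hold the same 20 bytes.
print(len(as_bytearray), len(as_bytes), len(literal))

# A bytearray compares equal to bytes with the same content,
# but it is not an instance of bytes.
print(as_bytearray == literal)
print(isinstance(as_bytearray, bytes))

# Assumption: if the constructor insists on bytes, converting first
# would sidestep the type difference.
object_id_bytes = bytes(as_bytearray)
```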

Best,
Philipp.

On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <
[email protected]> wrote:

> Good morning,
>
> I am experiencing problems with the RecordBatches stored in plasma in a
> particular situation.
>
> If I return a RecordBatch as result of a python function, I am able to
> read just the metadata, while I get an error when reading the columns.
>
> For example, the following code
> def retrieve1():
>         client = plasma.connect('test', "", 0)
>
>         key = "keynumber1keynumber1"
>         pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>
>         [buff] = client.get_buffers([pid])
>         batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>         return batch
>
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> is a simple Python script in which a function is in charge of
> retrieving the RecordBatch from the plasma store, and then returns it to
> the caller. Running the previous example I get:
> <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> FIELD1: int32
> metadata
> --------
> {}
> <pyarrow.lib.Int32Array object at 0x7fd0ebfc0f98>
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
> <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> FIELD1: int32
> metadata
> --------
> {}
> Segmentation fault (core dumped)
>
>
> If I retrieve and use the data in the same part of the code (as I do
> inside the function retrieve1(), and it also works when I put everything
> in the main program), everything runs without problems.
>
> Also the problem seems to be related to the particular case in which I
> retrieve the RecordBatch from the plasma store, since the following
> (simpler) code:
> def create():
>         test1 = [1, 12, 23, 3, 21, 34]
>         test1 = pa.array(test1, pa.int32())
>
>         batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>         print(batch)
>         print(batch.schema)
>         print(batch[0])
>         return batch
>
> batch1 = create()
> print(batch1)
> print(batch1.schema)
> print(batch1[0])
>
> Prints:
>
> <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> FIELD1: int32
> <pyarrow.lib.Int32Array object at 0x7f5f691fd9a8>
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
> <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> FIELD1: int32
> <pyarrow.lib.Int32Array object at 0x7f5f7e29f318>
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
>
> Which is what I expect.
>
> Is this issue known or am I doing something wrong when retrieving the
> RecordBatch from plasma?
>
> I would also like to point out that this problem was as easy to run into
> as it was hard to reproduce. For this reason, there may be other situations
> in which the same problem arises that I have not experienced, since I mostly
> deal with plasma and have used only Python so far: the description I
> gave is not intended to be complete.
>
> Thank you,
> Alberto
>
