Thanks! I can indeed reproduce this problem. I'm a bit busy right now and
plan to look into it on the weekend.
Here is the preliminary backtrace for everybody interested:
CESS (code=1, address=0x111138158)
frame #0: 0x000000010e6457fc
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
-> 0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
0x10e645800 <+32>: callq 0x10e698170 ; symbol stub for:
PyInt_FromLong
0x10e645805 <+37>: testq %rax, %rax
0x10e645808 <+40>: je 0x10e64580c ; <+44>
(lldb) bt
* thread #1: tid = 0xf1378e, 0x000000010e6457fc
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28,
queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1,
address=0x111138158)
* frame #0: 0x000000010e6457fc
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
frame #1: 0x000000010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*)
+ 133
frame #2: 0x000000010e613b25
lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
frame #3: 0x000000010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
frame #4: 0x000000010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx +
22305
On Tue, Feb 6, 2018 at 1:24 AM, ALBERTO Bocchinfuso <
[email protected]> wrote:
> Hi,
>
> I’m using python 3.5.2 and pyarrow 0.8.0
>
> As key, I put a string of 20 bytes, of course. I’m doing it differently
> from the canonical way since I’m no more using python 2.7, but python 3,
> and this seemed to me to be the right way to create a string of 20 bytes.
> The full code is:
>
> import pyarrow as pa
> import pyarrow.plasma as plasma
>
> def retrieve1():
> client = plasma.connect('test', "", 0)
>
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>
> [buff] = client .get_buffers([pid])
> batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> return batch
>
> client = plasma.connect('test', "", 0)
>
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
>
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> sink = pa.MockOutputStream()
> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> stream_writer.write_batch(batch)
> stream_writer.close()
>
> bff = client.create(pid, sink.size())
>
> stream = pa.FixedSizeBufferWriter(bff)
> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> writer.write_batch(batch)
> client.seal(pid)
>
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> I hope this helps,
> thank you
>
> Da: Philipp Moritz<mailto:[email protected]>
> Inviato: martedì 6 febbraio 2018 00:00
> A: [email protected]<mailto:[email protected]>
> Oggetto: Re: [Python] Retrieving a RecordBatch from plasma inside a
> function
>
> Hey Alberto,
>
> Thanks for your message! I'm trying to reproduce it.
>
> Can you attach the code you use to write the batch into the store?
>
> Also can you say which version of Python and Arrow you are using? On my
> installation, I get
>
> ```
>
> In [*5*]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>
> ------------------------------------------------------------
> ---------------
>
> ValueError Traceback (most recent call last)
>
> <ipython-input-5-fbec5bb33c33> in <module>()
>
> ----> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>
>
> plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()
>
>
> ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
> ```
>
> (the canonical way to do this would be plasma.ObjectID(b
> "keynumber1keynumber1"))
>
> Best,
> Philipp.
>
> On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <
> [email protected]> wrote:
>
> > Good morning,
> >
> > I am experiencing problems with the RecordBatches stored in plasma in a
> > particular situation.
> >
> > If I return a RecordBatch as result of a python function, I am able to
> > read just the metadata, while I get an error when reading the columns.
> >
> > For example, the following code
> > def retrieve1():
> > client = plasma.connect('test', "", 0)
> >
> > key = "keynumber1keynumber1"
> > pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> >
> > [buff] = client .get_buffers([pid])
> > batch = pa.RecordBatchStreamReader(buff).read_next_batch()
> > return batch
> >
> > batch = retrieve1()
> > print(batch)
> > print(batch.schema)
> > print(batch[0])
> >
> > Represents a simple python code in which a function is in charge of
> > retrieving the RecordBatch from the plasma store, and then returns it to
> > the caller. Running the previous example I get:
> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> > FIELD1: int32
> > metadata
> > --------
> > {}
> > <pyarrow.lib.Int32Array object at 0x7fd0ebfc0f98>
> > [
> > 1,
> > 12,
> > 23,
> > 3,
> > 21,
> > 34
> > ]
> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> > FIELD1: int32
> > metadata
> > --------
> > {}
> > Errore di segmentazione (core dump creato)
> >
> >
> > If I retrieve and use the data in the same part of the code (as I do in
> > the function retrieve1(), but it also works when I put everything in the
> > main program.) everything runs without problems.
> >
> > Also the problem seems to be related to the particular case in which I
> > retrieve the RecordBatch from the plasma store, since the following
> > (simpler) code:
> > def create():
> > test1 = [1, 12, 23, 3, 21, 34]
> > test1 = pa.array(test1, pa.int32())
> >
> > batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> > print(batch)
> > print(batch.schema)
> > print(batch[0])
> > return batch
> >
> > batch1 = create()
> > print(batch1)
> > print(batch1.schema)
> > print(batch1[0])
> >
> > Prints:
> >
> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> > FIELD1: int32
> > <pyarrow.lib.Int32Array object at 0x7f5f691fd9a8>
> > [
> > 1,
> > 12,
> > 23,
> > 3,
> > 21,
> > 34
> > ]
> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> > FIELD1: int32
> > <pyarrow.lib.Int32Array object at 0x7f5f7e29f318>
> > [
> > 1,
> > 12,
> > 23,
> > 3,
> > 21,
> > 34
> > ]
> >
> > Which is what I expect.
> >
> > Is this issue known or am I doing something wrong when retrieving the
> > RecordBatch from plasma?
> >
> > Also I would like to pinpoint the fact that this problem was as easy to
> > find as hard to re-create. For this reason, there can be other situations
> > in which the same problem arises that I did not experienced, since I
> mostly
> > deal with plasma and I’ve been using only python so long: the
> description I
> > gave is not intended to be complete.
> >
> > Thank you,
> > Alberto
> >
>
>