Re: [Python] Retrieving a RecordBatch from plasma inside a function

Philipp Moritz Wed, 21 Feb 2018 13:55:43 -0800

I created one here: https://issues.apache.org/jira/browse/ARROW-2195


On Wed, Feb 21, 2018 at 8:11 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> Can we create a JIRA to track this issue?
>
> On Wed, Feb 21, 2018 at 5:04 AM, ALBERTO Bocchinfuso
> <alberto_boc...@hotmail.it> wrote:
> > Hi,
> >
> > Have you had any news on this issue?
> > Do you plan to solve it for the next releases of Arrow, or is there any
> way to avoid the problem?
> >
> > Thanks in advance,
> > Alberto
> > From: Philipp Moritz<mailto:pcmor...@gmail.com>
> > Sent: Friday, 9 February 2018 00:30
> > To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
> > Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a
> function
> >
> > Thanks! I can indeed reproduce this problem. I'm a bit busy right now and
> > plan to look into it on the weekend.
> >
> > Here is the preliminary backtrace for everybody interested:
> >
> > EXC_BAD_ACCESS (code=1, address=0x111138158)
> >
> >     frame #0: 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
> > lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
> > ->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
> >     0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: PyInt_FromLong
> >     0x10e645805 <+37>: testq  %rax, %rax
> >     0x10e645808 <+40>: je     0x10e64580c               ; <+44>
> >
> > (lldb) bt
> > * thread #1: tid = 0xf1378e, 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x111138158)
> >   * frame #0: 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
> >     frame #1: 0x000000010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133
> >     frame #2: 0x000000010e613b25 lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
> >     frame #3: 0x000000010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
> >     frame #4: 0x000000010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
> >
> > On Tue, Feb 6, 2018 at 1:24 AM, ALBERTO Bocchinfuso <
> > alberto_boc...@hotmail.it> wrote:
> >
> >> Hi,
> >>
> >> I’m using Python 3.5.2 and pyarrow 0.8.0.
> >>
> >> The key is, of course, a 20-byte string. I build it differently from the
> >> canonical way because I am no longer using Python 2.7 but Python 3, and
> >> this seemed to me the right way to create a 20-byte string.
> >> The full code is:
> >>
> >> import pyarrow as pa
> >> import pyarrow.plasma as plasma
> >>
> >> def retrieve1():
> >>              client = plasma.connect('test', "", 0)
> >>
> >>              key = "keynumber1keynumber1"
> >>              pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> >>
> >>              [buff] = client.get_buffers([pid])
> >>              batch = pa.RecordBatchStreamReader(buff).read_next_batch()
> >>
> >>              print(batch)
> >>              print(batch.schema)
> >>              print(batch[0])
> >>
> >>              return batch
> >>
> >> client = plasma.connect('test', "", 0)
> >>
> >> test1 = [1, 12, 23, 3, 21, 34]
> >> test1 = pa.array(test1, pa.int32())
> >>
> >> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> >>
> >> key = "keynumber1keynumber1"
> >> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> >> sink = pa.MockOutputStream()
> >> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> >> stream_writer.write_batch(batch)
> >> stream_writer.close()
> >>
> >> bff = client.create(pid, sink.size())
> >>
> >> stream = pa.FixedSizeBufferWriter(bff)
> >> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >> writer.write_batch(batch)
> >> client.seal(pid)
> >>
> >> batch = retrieve1()
> >> print(batch)
> >> print(batch.schema)
> >> print(batch[0])
> >>
> >> I hope this helps,
> >> thank you
> >>
> >> From: Philipp Moritz<mailto:pcmor...@gmail.com>
> >> Sent: Tuesday, 6 February 2018 00:00
> >> To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
> >> Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a
> >> function
> >>
> >> Hey Alberto,
> >>
> >> Thanks for your message! I'm trying to reproduce it.
> >>
> >> Can you attach the code you use to write the batch into the store?
> >>
> >> Also can you say which version of Python and Arrow you are using? On my
> >> installation, I get
> >>
> >> ```
> >> In [5]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
> >> ---------------------------------------------------------------------------
> >> ValueError                                Traceback (most recent call last)
> >> <ipython-input-5-fbec5bb33c33> in <module>()
> >> ----> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
> >>
> >> plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()
> >>
> >> ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
> >> ```
> >>
> >> (the canonical way to do this would be plasma.ObjectID(b
> >> "keynumber1keynumber1"))
> >>
> >> Best,
> >> Philipp.
> >>
> >> On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <
> >> alberto_boc...@hotmail.it> wrote:
> >>
> >> > Good morning,
> >> >
> >> > I am experiencing problems with the RecordBatches stored in plasma in
> a
> >> > particular situation.
> >> >
> >> > If I return a RecordBatch as result of a python function, I am able to
> >> > read just the metadata, while I get an error when reading the columns.
> >> >
> >> > For example, the following code
> >> > def retrieve1():
> >> >         client = plasma.connect('test', "", 0)
> >> >
> >> >         key = "keynumber1keynumber1"
> >> >         pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> >> >
> >> >         [buff] = client.get_buffers([pid])
> >> >         batch = pa.RecordBatchStreamReader(buff).read_next_batch()
> >> >         return batch
> >> >
> >> > batch = retrieve1()
> >> > print(batch)
> >> > print(batch.schema)
> >> > print(batch[0])
> >> >
> >> > is a simple Python script in which a function retrieves the RecordBatch
> >> > from the plasma store and then returns it to the caller. Running it, I
> >> > get:
> >> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> >> > FIELD1: int32
> >> > metadata
> >> > --------
> >> > {}
> >> > <pyarrow.lib.Int32Array object at 0x7fd0ebfc0f98>
> >> > [
> >> >   1,
> >> >   12,
> >> >   23,
> >> >   3,
> >> >   21,
> >> >   34
> >> > ]
> >> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> >> > FIELD1: int32
> >> > metadata
> >> > --------
> >> > {}
> >> > Segmentation fault (core dumped)
> >> >
> >> >
> >> > If I retrieve and use the data in the same part of the code, everything
> >> > runs without problems: this is the case inside the function retrieve1(),
> >> > and also when I put everything in the main program.
> >> >
> >> > Also the problem seems to be related to the particular case in which I
> >> > retrieve the RecordBatch from the plasma store, since the following
> >> > (simpler) code:
> >> > def create():
> >> >         test1 = [1, 12, 23, 3, 21, 34]
> >> >         test1 = pa.array(test1, pa.int32())
> >> >
> >> >         batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> >> >         print(batch)
> >> >         print(batch.schema)
> >> >         print(batch[0])
> >> >         return batch
> >> >
> >> > batch1 = create()
> >> > print(batch1)
> >> > print(batch1.schema)
> >> > print(batch1[0])
> >> >
> >> > Prints:
> >> >
> >> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> >> > FIELD1: int32
> >> > <pyarrow.lib.Int32Array object at 0x7f5f691fd9a8>
> >> > [
> >> >   1,
> >> >   12,
> >> >   23,
> >> >   3,
> >> >   21,
> >> >   34
> >> > ]
> >> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> >> > FIELD1: int32
> >> > <pyarrow.lib.Int32Array object at 0x7f5f7e29f318>
> >> > [
> >> >   1,
> >> >   12,
> >> >   23,
> >> >   3,
> >> >   21,
> >> >   34
> >> > ]
> >> >
> >> > Which is what I expect.
> >> >
> >> > Is this issue known or am I doing something wrong when retrieving the
> >> > RecordBatch from plasma?
> >> >
> >> > I would also like to point out that this problem was as easy to find as
> >> > it was hard to reproduce. There may therefore be other situations in
> >> > which the same problem arises that I have not experienced, since I mostly
> >> > deal with plasma and have only been using Python so far: the description
> >> > I gave is not intended to be complete.
> >> >
> >> > Thank you,
> >> > Alberto
> >> >
> >>
> >>
> >
>
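On the 20-byte key discussed in the thread: the quoted error message says an object ID must be 20 bytes, and the canonical form suggested above is a bytes literal. Below is a minimal sketch, assuming Python 3, of normalizing a str or bytearray key to immutable bytes and checking its length before handing it to plasma.ObjectID. The helper name make_object_id_bytes and the constant OBJECT_ID_SIZE are hypothetical and not part of pyarrow.

```python
# Hypothetical helper, not part of pyarrow: normalize a key to 20 raw bytes.
OBJECT_ID_SIZE = 20  # size taken from the error message quoted in the thread


def make_object_id_bytes(key):
    """Return `key` as an immutable `bytes` value of exactly 20 bytes."""
    if isinstance(key, str):
        raw = key.encode("utf-8")  # text keys are encoded explicitly
    else:
        raw = bytes(key)  # bytes or bytearray are copied into immutable bytes
    if len(raw) != OBJECT_ID_SIZE:
        raise ValueError(
            "expected %d bytes, got %d" % (OBJECT_ID_SIZE, len(raw))
        )
    return raw


raw_id = make_object_id_bytes("keynumber1keynumber1")
```

The resulting value could then be passed as plasma.ObjectID(raw_id), which is the same as the b"keynumber1keynumber1" form suggested in the thread.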

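For anyone trying to find which Python statement triggers a crash like the segmentation fault reported in this thread, the standard-library faulthandler module can dump the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. This is a minimal sketch, assuming Python 3.3 or later. The helper name enable_crash_traceback and the temporary log file are illustrative only and do not come from the thread.

```python
import faulthandler
import tempfile


def enable_crash_traceback(log_file):
    """Dump the Python traceback of all threads to log_file on a fatal signal."""
    # log_file must expose fileno() and stay open while the handler is enabled.
    faulthandler.enable(file=log_file, all_threads=True)
    return faulthandler.is_enabled()


# Illustrative choice: a temporary file. In practice sys.stderr is the usual target.
crash_log = tempfile.TemporaryFile()
enabled = enable_crash_traceback(crash_log)
```

The same effect is available without code changes by running the reproduction script as python -X faulthandler script.py. This only shows where the crash is triggered from Python; it does not replace the native backtrace quoted above.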