Hi all - just pinging this thread given the later discussions on the PR
<https://github.com/apache/arrow/pull/8644>. I am proposing a backwards-
(but not forwards-) compatible change to the spec that strikes out the
line "When serializing Arrow data for interprocess communication, these
alignment and padding requirements are enforced", and I want to gauge the
general reaction to this. The trade-off is roughly ecosystem fragmentation
versus improved performance for a single, albeit common, workflow.

@Micah the fixed-size list would lose metadata like column headers - and
my motivating example is saving IPC-format Arrow files to disk. If you
have to add your own solution to keep track of the metadata separately
from the Arrow file then you don't really gain anything from using Arrow,
and might as well use, e.g., `np.save` on each dtype block in the
BlockManager and keep the metadata in a JSON file.
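
For concreteness, here's roughly what that fixed-size-list packing looks
like in pyarrow - a minimal sketch of my own (not code from the PR),
assuming every column shares one dtype. Note that the round trip recovers
positions, not column names:

```python
import numpy as np
import pyarrow as pa

# Pack a 4x3 float64 "DataFrame" into one fixed_size_list<double, 3>
# column; the individual column names are lost at this point.
data = np.arange(12, dtype=np.float64).reshape(4, 3)
packed = pa.FixedSizeListArray.from_arrays(pa.array(data.ravel()), 3)
table = pa.table({"packed": packed})

# Reading back: view the flat child array as 2D without copying.
flat = table.column("packed").chunk(0).values  # flat float64 child array
restored = flat.to_numpy(zero_copy_only=True).reshape(-1, 3)
assert (restored == data).all()  # the values survive; the names don't
```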

@Joris agree that a non-consolidating block manager is a good solution,
and I am following your progress on 39146
<https://github.com/pandas-dev/pandas/issues/39146> etc.! However, I see
you've discussed performance a lot on the mailing list (archive
<https://mail.python.org/pipermail/pandas-dev/2020-May/001244.html>) - if
an array manager got you faster data loading but slower analysis, it'd be
a poor trade against the block manager's slower data loading and faster
analysis. Your notebook
<https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>
doesn't have any group-by-and-aggregate timings - do you have any yet
that you think are representative?
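
For what it's worth, this toy benchmark is the kind of measurement I have
in mind (the sizes and column names are made up, so treat it as a sketch
rather than a representative workload):

```python
import timeit

import numpy as np
import pandas as pd

# 10M rows spread over 1,000 groups - a crude stand-in for real data.
n = 10_000_000
df = pd.DataFrame({
    "key": np.random.randint(0, 1_000, n),
    "val": np.random.randn(n),
})

t = timeit.timeit(lambda: df.groupby("key")["val"].mean(), number=10)
print(f"group-by-and-aggregate: {t / 10:.3f}s per run")
```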

Thanks -

Nick


On Fri, 13 Nov 2020 at 10:49, Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> As Micah and Wes pointed out on the PR, this alignment/padding is a
> requirement of the format specification. For reference, see here:
>
> https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
>
> That's also the reason I said earlier in this thread that the zero-copy
> conversion to pandas you want to achieve is "basically impossible", as
> far as I know (when not using the split_blocks option).
>
> Micah's suggestion to use a fixed-size list column for each set of
> columns with the same data type could work to achieve this. But then
> of course you need to construct the Arrow table specifically for this
> (and you no longer have a normal table with columns), and you will
> have to write a custom arrow->pandas conversion to get the buffer of
> the list column, view it as a 2D numpy array, and put this in a pandas
> Block.
>
> I know it is not a satisfactory solution *right now*, but longer term,
> I think the best bet for getting this zero-copy conversion (which I
> would also like to see!) is the non-consolidating manager that stores
> 1D arrays under the hood in pandas, which I mentioned earlier in the
> thread.
>
> Joris
>
> On Fri, 13 Nov 2020 at 00:21, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Hi Nicholas,
> > I don't think allowing for the flexibility of non-8-byte-aligned types
> > is a good idea. The specification explicitly calls out the alignment
> > requirements, and allowing writers to output non-aligned values
> > potentially breaks other implementations.
> >
> > I'm not sure of your exact use-case, but another approach to consider
> > is to store the values in a single Arrow column, as either a list or
> > a fixed-size list, and look into doing a zero-copy conversion from
> > that to the corresponding pandas memory (this is hypothetical; again,
> > I don't have enough context on pandas/numpy memory layouts).
> >
> > -Micah
> >
> > On Thu, Nov 12, 2020 at 3:01 PM Nicholas White <n.j.wh...@gmail.com>
> > wrote:
> >
> > > OK, I got everything to work: https://github.com/apache/arrow/pull/8644
> > > (part of ARROW-10573 now) is ready for review. I've updated the test
> > > case to show it is possible to zero-copy a pandas DataFrame! The next
> > > step is to dig into `arrow_to_pandas.cc` to make it work
> > > automagically...
> > >
> > > On Wed, 11 Nov 2020 at 22:52, Nicholas White <n.j.wh...@gmail.com>
> > > wrote:
> > >
> > >> Thanks all, this has been interesting. I've made a patch that
> > >> sort-of does what I want [1] - I hope the test case is clear! I made
> > >> the batch writer use the `alignment` field that was already in the
> > >> `IpcWriteOptions` to align the buffers, instead of fixing their
> > >> alignment at 8. Arrow then writes out the buffers consecutively, so
> > >> you can map them as a 2D memory array like I wanted. There's one
> > >> problem though... the test case thinks the Arrow data is invalid, as
> > >> it can't read the metadata properly (error below). Do you have any
> > >> idea why? I think it's because Arrow puts the metadata at the end of
> > >> the file after the now-unaligned buffers, yet assumes the metadata
> > >> is still 8-byte aligned (which it probably no longer is).
> > >>
> > >> Nick
> > >>
> > >> ```
> > >> pyarrow/ipc.pxi:494: in pyarrow.lib.RecordBatchReader.read_all
> > >>     check_status(self.reader.get().ReadAll(&table))
> > >>
> > >> >   raise ArrowInvalid(message)
> > >> E   pyarrow.lib.ArrowInvalid: Expected to read 117703432 metadata
> > >> bytes, but only read 19
> > >> ```
> > >>
> > >> [1] https://github.com/apache/arrow/pull/8644
> > >>
> > >>