Hi Weston,

When starting with a 2D ndarray, the conversion from numpy to a pandas DataFrame (`pd.DataFrame(arr)`) is actually zero-copy. However, pandas takes a transposed view of the original array (that's why the contiguity flags you printed change), to ensure the columns are the first dimension of the array stored under the hood.
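As a minimal sketch of that (this pokes at `df._mgr`, which is private pandas API, so it is illustrative only and may differ across pandas versions):

````
import numpy as np
import pandas as pd

arr = np.random.randint(10, size=(5, 5))
df = pd.DataFrame(arr)

# The block backing the DataFrame shares memory with `arr`, so the
# DataFrame construction itself did not copy anything.
block_values = df._mgr.blocks[0].values
print(np.shares_memory(arr, block_values))  # True

# But the block is the *transposed* view of `arr`, so it is
# Fortran-ordered instead of C-ordered.
print(block_values.flags['F_CONTIGUOUS'])  # True
print(block_values.flags['C_CONTIGUOUS'])  # False
````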
This is the reason for the oddities you noted; see some more inline comments below.

On Wed, 11 Nov 2020 at 09:31, Weston Pace <weston.p...@gmail.com> wrote:
>
> Nick, it appears converting the ndarray to a dataframe clears the
> contiguous flag even though it doesn't actually change the underlying
> array. At least, this is what I'm seeing with my testing. My guess
> is this is what is causing arrow to do a copy (arrow is indeed doing a
> new allocation here, this is why you see the 64 byte padded
> differences). I am not enough of a pandas expert to provide any
> further guidance but maybe someone else may know what is happening
> here.

This is related to the above: arrow does not create an array zero-copy from a DataFrame column (when the DataFrame was created that way, from the 2D numpy array), because the column data is not contiguous in memory (a numpy array is row major by default, so a single column of it is strided rather than contiguous). Arrow therefore needs to make a copy to get contiguous memory for its array. That is the reason you see the different memory addresses below.

>
> arr = np.random.randint(10, size=(5,5))
> old_address = arr.__array_interface__['data'][0]
> df = pd.DataFrame(arr)
> print(arr.flags)
> pa_arr = pa.array(df[0].values)
> print(df[0].values.flags)
> df_address = df[0].values.__array_interface__['data'][0]
> new_address = pa_arr.buffers()[1].address
> print(f'Old address={old_address}\nDf address={df_address}\nNew address={new_address}')
>
> # C_CONTIGUOUS : True
> # F_CONTIGUOUS : False
> # OWNDATA : True
> # WRITEABLE : True
> # ALIGNED : True
> # WRITEBACKIFCOPY : False
> # UPDATEIFCOPY : False
> #
> # C_CONTIGUOUS : False
> # F_CONTIGUOUS : False
> # OWNDATA : False
> # WRITEABLE : True
> # ALIGNED : True
> # WRITEBACKIFCOPY : False
> # UPDATEIFCOPY : False
> #
> # Old address=2297872094880
> # Df address=2297872094880
> # New address=7932699743552
>
> Conversion from the numpy array directly does seem to perform a zero
> copy operation:

Here, you are actually taking the first *row* of the numpy array, not the first column. That's the reason why in this case it *is* possible to do the conversion zero-copy.

>
> arr = np.random.randint(10, size=(5,5))
> old_address = arr.__array_interface__['data'][0]
> pa_arr = pa.array(arr[0])
> new_address = pa_arr.buffers()[1].address
> print(f'Old address={old_address}\nNew address={new_address}')
>
> # Old address=2297872094096
> # New address=2297872094096
>
> As an even further oddity consider:
>
> arr = np.random.randint(10, size=(5,5))
> old_address = arr.__array_interface__['data'][0]
> df = pd.DataFrame(arr)
> print(f'ndarray address: {old_address}')
> for i in range(5):
>     addr = df[i].values.__array_interface__['data'][0]
>     print(f'DF column {i} : {addr}')
>
> # ndarray address: 2297872094880
> # DF column 0 : 2297872094880
> # DF column 1 : 2297872094884
> # DF column 2 : 2297872094888
> # DF column 3 : 2297872094892
> # DF column 4 : 2297872094896
>
> The pandas "values" seem to be giving very odd addresses (I would
> expect them to be 20 bytes apart not 4).

This is again due to the array stored under the hood in pandas being transposed / the original array being row major: in a row-major layout, each column starts one element (here 4 bytes) after the previous one, while consecutive elements *within* a column are one full row (here 20 bytes) apart. Note that if you create the DataFrame differently (not starting from a 2D numpy array, but e.g. from a dictionary of 1D arrays), this will be different: pandas will then create the backing 2D array natively in column order, and the columns of the DataFrame will use contiguous memory.
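To illustrate that last point, a small sketch (the exact copy behaviour can vary with pandas/pyarrow versions, so treat this as illustrative rather than guaranteed):

````
import numpy as np
import pandas as pd
import pyarrow as pa

# Build the DataFrame from 1D arrays instead of a single 2D array.
# pandas then allocates its backing block in column order, so each
# column is contiguous in memory.
df = pd.DataFrame({i: np.random.randint(10, size=5) for i in range(5)})
col = df[0].values
print(col.flags['C_CONTIGUOUS'])  # True

# Because the column is contiguous, the conversion to Arrow can be
# zero-copy: the Arrow buffer points at the same memory as the column.
pa_arr = pa.array(col)
col_address = col.__array_interface__['data'][0]
print(col_address == pa_arr.buffers()[1].address)  # True -> no copy
````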
Joris

> > On Tue, Nov 10, 2020 at 1:52 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Sorry, I should clarify: I'm not familiar with zero-copy from Pandas to
> > Arrow, so there might be something else going on here. But once an arrow
> > file is written out, buffers will be padded/aligned to 8 bytes.
> >
> > In general, I think relying on exact memory translation from systems that
> > aren't using arrow might require copies.
> >
> > On Tue, Nov 10, 2020 at 3:49 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > >
> > > My question is: why are these addresses not 40 bytes apart from each
> > > other?
> > >> What's in the gaps between the buffers? It's not null bitsets - there's
> > >> only one buffer for each column. Thanks -
> > >
> > > All buffers are padded to at least 8 bytes (and per the spec 64 is
> > > recommended).
> > >
> > > On Tue, Nov 10, 2020 at 3:39 PM Nicholas White <n.j.wh...@gmail.com>
> > > wrote:
> > >
> > >> I've done a bit more digging. This code:
> > >> ````
> > >> df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
> > >> table = pa.Table.from_pandas(df)
> > >> mem = []
> > >> for c in table.columns:
> > >>     buf = c.chunks[0].buffers()[1]
> > >>     mem.append((buf.address, buf.size))
> > >> sorted(mem)
> > >> ````
> > >> ...prints...
> > >> ````
> > >> [(140262915478912, 40),
> > >>  (140262915479232, 40),
> > >>  (140262915479296, 40),
> > >>  (140262915479360, 40),
> > >>  (140262915479424, 40)]
> > >> ````
> > >> My question is: why are these addresses not 40 bytes apart from each
> > >> other?
> > >> What's in the gaps between the buffers? It's not null bitsets - there's
> > >> only one buffer for each column. Thanks -
> > >>
> > >> Nick
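To make Micah's padding point above concrete, here is a small check (illustrative only; the exact alignment depends on the Arrow memory pool in use):

````
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
table = pa.Table.from_pandas(df)

# Each column's data buffer holds only 40 bytes (5 int64 values, as in
# Nick's output), but Arrow's default memory pool hands out
# 64-byte-aligned allocations, so the buffer addresses land on 64-byte
# boundaries rather than being packed 40 bytes apart.
for c in table.columns:
    print(c.chunks[0].buffers()[1].address % 64)  # 0 for every column
````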