Hi Weston,

When starting with a 2D ndarray, the conversion from numpy to a pandas DataFrame (`pd.DataFrame(arr)`) is actually zero-copy. However, pandas takes a transposed view of the original array (that's why the contiguity flags you printed change), to ensure the columns are the first dimension of the array stored under the hood.
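As a minimal sketch of that (this pokes at `df._mgr`, which is private pandas API, so it is illustrative only and may differ across pandas versions):

````
import numpy as np
import pandas as pd

arr = np.random.randint(10, size=(5, 5))
df = pd.DataFrame(arr)

# The block backing the DataFrame shares memory with `arr`, so the
# DataFrame construction itself did not copy anything.
block_values = df._mgr.blocks[0].values
print(np.shares_memory(arr, block_values))  # True

# But the block is the *transposed* view of `arr`, so it is
# Fortran-ordered instead of C-ordered.
print(block_values.flags['F_CONTIGUOUS'])  # True
print(block_values.flags['C_CONTIGUOUS'])  # False
````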
This is the reason for the oddities you noted; see some more inline comments below.

On Wed, 11 Nov 2020 at 09:31, Weston Pace <weston.p...@gmail.com> wrote:
>
> Nick, it appears converting the ndarray to a dataframe clears the
> contiguous flag even though it doesn't actually change the underlying
> array. At least, this is what I'm seeing with my testing. My guess
> is this is what is causing arrow to do a copy (arrow is indeed doing a
> new allocation here, this is why you see the 64 byte padded
> differences). I am not enough of a pandas expert to provide any
> further guidance but maybe someone else may know what is happening
> here.

This is related to the above: arrow does not create an array zero-copy from a DataFrame column (when the DataFrame was created that way, from the 2D numpy array), because the column data is not contiguous in memory (a numpy array is row major by default, so a single column of it is strided rather than contiguous). Arrow therefore needs to make a copy to get contiguous memory for its array. That is the reason you see the different memory addresses below.

>
> arr = np.random.randint(10, size=(5,5))
> old_address = arr.__array_interface__['data'][0]
> df = pd.DataFrame(arr)
> print(arr.flags)
> pa_arr = pa.array(df[0].values)
> print(df[0].values.flags)
> df_address = df[0].values.__array_interface__['data'][0]
> new_address = pa_arr.buffers()[1].address
> print(f'Old address={old_address}\nDf address={df_address}\nNew address={new_address}')
>
> # C_CONTIGUOUS : True
> # F_CONTIGUOUS : False
> # OWNDATA : True
> # WRITEABLE : True
> # ALIGNED : True
> # WRITEBACKIFCOPY : False
> # UPDATEIFCOPY : False
> #
> # C_CONTIGUOUS : False
> # F_CONTIGUOUS : False
> # OWNDATA : False
> # WRITEABLE : True
> # ALIGNED : True
> # WRITEBACKIFCOPY : False
> # UPDATEIFCOPY : False
> #
> # Old address=2297872094880
> # Df address=2297872094880
> # New address=7932699743552
>
> Conversion from the numpy array directly does seem to perform a zero
> copy operation:

Here, you are actually taking the first *row* of the numpy array, not the first column. That's the reason why in this case it *is* possible to do the conversion zero-copy.

>
> arr = np.random.randint(10, size=(5,5))
> old_address = arr.__array_interface__['data'][0]
> pa_arr = pa.array(arr[0])
> new_address = pa_arr.buffers()[1].address
> print(f'Old address={old_address}\nNew address={new_address}')
>
> # Old address=2297872094096
> # New address=2297872094096
>
> As an even further oddity consider:
>
> arr = np.random.randint(10, size=(5,5))
> old_address = arr.__array_interface__['data'][0]
> df = pd.DataFrame(arr)
> print(f'ndarray address: {old_address}')
> for i in range(5):
>     addr = df[i].values.__array_interface__['data'][0]
>     print(f'DF column {i} : {addr}')
>
> # ndarray address: 2297872094880
> # DF column 0 : 2297872094880
> # DF column 1 : 2297872094884
> # DF column 2 : 2297872094888
> # DF column 3 : 2297872094892
> # DF column 4 : 2297872094896
>
> The pandas "values" seem to be giving very odd addresses (I would
> expect them to be 20 bytes apart not 4).

This is again due to the array stored under the hood in pandas being transposed / the original array being row major: in a row-major layout, each column starts one element (here 4 bytes) after the previous one, while consecutive elements *within* a column are one full row (here 20 bytes) apart. Note that if you create the DataFrame differently (not starting from a 2D numpy array, but e.g. from a dictionary of 1D arrays), this will be different: pandas will then create the backing 2D array natively in column order, and the columns of the DataFrame will use contiguous memory.
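To illustrate that last point, a small sketch (the exact copy behaviour can vary with pandas/pyarrow versions, so treat this as illustrative rather than guaranteed):

````
import numpy as np
import pandas as pd
import pyarrow as pa

# Build the DataFrame from 1D arrays instead of a single 2D array.
# pandas then allocates its backing block in column order, so each
# column is contiguous in memory.
df = pd.DataFrame({i: np.random.randint(10, size=5) for i in range(5)})
col = df[0].values
print(col.flags['C_CONTIGUOUS'])  # True

# Because the column is contiguous, the conversion to Arrow can be
# zero-copy: the Arrow buffer points at the same memory as the column.
pa_arr = pa.array(col)
col_address = col.__array_interface__['data'][0]
print(col_address == pa_arr.buffers()[1].address)  # True -> no copy
````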
Joris

> > On Tue, Nov 10, 2020 at 1:52 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Sorry, I should clarify: I'm not familiar with zero-copy from Pandas to
> > Arrow, so there might be something else going on here. But once an arrow
> > file is written out, buffers will be padded/aligned to 8 bytes.
> >
> > In general, I think relying on exact memory translation from systems that
> > aren't using arrow might require copies.
> >
> > On Tue, Nov 10, 2020 at 3:49 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > >
> > > My question is: why are these addresses not 40 bytes apart from each
> > > other?
> > >> What's in the gaps between the buffers? It's not null bitsets - there's
> > >> only one buffer for each column. Thanks -
> > >
> > > All buffers are padded to at least 8 bytes (and per the spec 64 is
> > > recommended).
> > >
> > > On Tue, Nov 10, 2020 at 3:39 PM Nicholas White <n.j.wh...@gmail.com>
> > > wrote:
> > >
> > >> I've done a bit more digging. This code:
> > >> ````
> > >> df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
> > >> table = pa.Table.from_pandas(df)
> > >> mem = []
> > >> for c in table.columns:
> > >>     buf = c.chunks[0].buffers()[1]
> > >>     mem.append((buf.address, buf.size))
> > >> sorted(mem)
> > >> ````
> > >> ...prints...
> > >> ````
> > >> [(140262915478912, 40),
> > >>  (140262915479232, 40),
> > >>  (140262915479296, 40),
> > >>  (140262915479360, 40),
> > >>  (140262915479424, 40)]
> > >> ````
> > >> My question is: why are these addresses not 40 bytes apart from each
> > >> other?
> > >> What's in the gaps between the buffers? It's not null bitsets - there's
> > >> only one buffer for each column. Thanks -
> > >>
> > >> Nick
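To make Micah's padding point above concrete, here is a small check (illustrative only; the exact alignment depends on the Arrow memory pool in use):

````
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
table = pa.Table.from_pandas(df)

# Each column's data buffer holds only 40 bytes (5 int64 values, as in
# Nick's output), but Arrow's default memory pool hands out
# 64-byte-aligned allocations, so the buffer addresses land on 64-byte
# boundaries rather than being packed 40 bytes apart.
for c in table.columns:
    print(c.chunks[0].buffers()[1].address % 64)  # 0 for every column
````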