Re: [DISCUSS] IPC buffer layout for Null type

Wes McKinney Thu, 19 Sep 2019 12:35:50 -0700

OK, my preference, therefore, would be to rebase and merge my patch
without bothering with backwards compatibility code. The situations
where there would be an issue are fairly esoteric.


https://github.com/apache/arrow/pull/5287

On Thu, Sep 19, 2019 at 2:29 PM Antoine Pitrou <[email protected]> wrote:
>
>
> Well, this is an incompatible IPC change, so ideally it should be done
> now, not later.
>
> Regards
>
> Antoine.
>
>
> On Thu, 19 Sep 2019 14:08:37 -0500
> Wes McKinney <[email protected]> wrote:
>
> > I'm concerned about rushing through any patch for this for 0.15.0, but
> > each release with the status quo increases the risk of making changes.
> > Thoughts?
> >
> > On Fri, Sep 6, 2019 at 12:59 PM Wes McKinney <[email protected]> wrote:
> > >
> > > On Fri, Sep 6, 2019 at 12:57 PM Micah Kornfield <[email protected]> 
> > > wrote:
> > > >
> > > > >
> > > > > We can't because the buffer layout is not transmitted -- 
> > > > > implementations
> > > > > make assumptions about what Buffer values correspond to each field. 
> > > > > The
> > > > > only thing we could do to signal the change would be to increase the
> > > > > metadata version from V4 to V5.
> > > >
> > > > If we do this within 0.15.0 we could infer from the padding of messages.
> > > >
> > >
> > > That's true. I'd be OK adding backward compatibility code (that we can
> > > probably remove later) to my patch...
> > >
> > > I'm not sure about the other implementations. I think for non-C++
> > > implementations because they don't have much application code that can
> > > produce Null arrays that they should simply use the no-buffers layout
> > >
> > > > On Fri, Sep 6, 2019 at 10:16 AM Wes McKinney <[email protected]> 
> > > > wrote:
> > > >
> > > > > On Fri, Sep 6, 2019, 12:08 PM Antoine Pitrou <[email protected]> 
> > > > > wrote:
> > > > >
> > > > > >
> > > > > > Null can also come up when converting a column with only NA values 
> > > > > > in a
> > > > > > CSV file.  I don't remember for sure, but I think the same can 
> > > > > > happen
> > > > > > with JSON files as well.
> > > > > >
> > > > > > Can't we accept both forms when reading?  It sounds like it should 
> > > > > > be
> > > > > > reasonably easy.
> > > > > >
> > > > >
> > > > > We can't because the buffer layout is not transmitted -- 
> > > > > implementations
> > > > > make assumptions about what Buffer values correspond to each field. 
> > > > > The
> > > > > only thing we could do to signal the change would be to increase the
> > > > > metadata version from V4 to V5.
> > > > >
> > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > >
> > > > > > Le 06/09/2019 à 17:36, Wes McKinney a écrit :
> > > > > > > hi Micah,
> > > > > > >
> > > > > > > Null wouldn't come up that often in practice. It could happen when
> > > > > > > converting from pandas, for example
> > > > > > >
> > > > > > > In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10,
> > > > > > dtype=object)})
> > > > > > >
> > > > > > > In [9]: t = pa.table(df)
> > > > > > >
> > > > > > > In [10]: t
> > > > > > > Out[10]:
> > > > > > > pyarrow.Table
> > > > > > > col1: null
> > > > > > > metadata
> > > > > > > --------
> > > > > > > {b'pandas': b'{"index_columns": [{"kind": "range", "name": null,
> > > > > > "start": 0, "'
> > > > > > >             b'stop": 10, "step": 1}], "column_indexes": [{"name": 
> > > > > > > null,
> > > > > > "field'
> > > > > > >             b'_name": null, "pandas_type": "unicode", 
> > > > > > > "numpy_type":
> > > > > > "object", '
> > > > > > >             b'"metadata": {"encoding": "UTF-8"}}], "columns": 
> > > > > > > [{"name":
> > > > > > "col1"'
> > > > > > >             b', "field_name": "col1", "pandas_type": "empty",
> > > > > > "numpy_type": "o'
> > > > > > >             b'bject", "metadata": null}], "creator": {"library":
> > > > > > "pyarrow", "v'
> > > > > > >             b'ersion": "0.14.1.dev464+g40d08a751"}, 
> > > > > > > "pandas_version":
> > > > > > "0.24.2"'
> > > > > > >             b'}'}
> > > > > > >
> > > > > > > I'm inclined to make the change without worrying about backwards
> > > > > > > compatibility. If people have been persisting data against the
> > > > > > > recommendations of the project, the remedy is to use an older 
> > > > > > > version
> > > > > > > of the library to read the files and write them to something else
> > > > > > > (like Parquet format) in the meantime.
> > > > > > >
> > > > > > > Obviously come 1.0.0 we'll begin to make compatibility guarantees 
> > > > > > > so
> > > > > > > this will be less of an issue.
> > > > > > >
> > > > > > > - Wes
> > > > > > >
> > > > > > > On Thu, Sep 5, 2019 at 11:14 PM Micah Kornfield 
> > > > > > > <[email protected]
> > > > > >
> > > > > > wrote:
> > > > > > >>
> > > > > > >> Hi Wes and others,
> > > > > > >> I don't have a sense of where Null arrays get created in the 
> > > > > > >> existing
> > > > > > code
> > > > > > >> base?
> > > > > > >>
> > > > > > >> Also, do you think it is worth the effort make this backwards
> > > > > > compatible.
> > > > > > >> We could in theory tie the buffer count to having the 
> > > > > > >> continuation
> > > > > value
> > > > > > >> for alignment.
> > > > > > >>
> > > > > > >> The one area were I'm slightly concerned is we seem to have 
> > > > > > >> users in
> > > > > the
> > > > > > >> wild who are depending on backwards compatibility, and I'm try to
> > > > > better
> > > > > > >> understand the odds that we break them.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Micah
> > > > > > >>
> > > > > > >> On Thu, Sep 5, 2019 at 7:25 AM Wes McKinney <[email protected]>
> > > > > > wrote:
> > > > > > >>
> > > > > > >>> hi folks,
> > > > > > >>>
> > > > > > >>> One of the as-yet-untested (in integration tests) parts of the
> > > > > > >>> columnar specification is the Null layout. In C++ we 
> > > > > > >>> additionally
> > > > > > >>> implemented this by writing two length-0 "placeholder" buffers 
> > > > > > >>> in the
> > > > > > >>> RecordBatch data header, but since the Null layout has no memory
> > > > > > >>> allocated nor any buffers in-memory it may be more proper to 
> > > > > > >>> write no
> > > > > > >>> buffers (since the length of the Null layout is all you need to
> > > > > > >>> reconstruct it). There are 3 implementations of the placeholder
> > > > > > >>> version (C++, Go, JS, maybe also C#) but it never got 
> > > > > > >>> implemented in
> > > > > > >>> Java. While technically this would break old serialized data, I 
> > > > > > >>> would
> > > > > > >>> not expect this to be very frequently occurring in many of the
> > > > > > >>> currently-deployed Arrow applications
> > > > > > >>>
> > > > > > >>> Here is my C++ patch
> > > > > > >>>
> > > > > > >>> https://github.com/apache/arrow/pull/5287
> > > > > > >>>
> > > > > > >>> I'm not sure we need to formalize this with a vote but I'm 
> > > > > > >>> interested
> > > > > > >>> in the community's feedback on how to proceed here.
> > > > > > >>>
> > > > > > >>> - Wes
> > > > > > >>>
> > > > > >
> > > > >
> >
>
>
>

Re: [DISCUSS] IPC buffer layout for Null type

Reply via email to