Re: [DISCUSS] IPC buffer layout for Null type

Wes McKinney Thu, 19 Sep 2019 12:10:16 -0700

I'm concerned about rushing through any patch for this for 0.15.0, but
each release with the status quo increases the risk of making changes.
Thoughts?


On Fri, Sep 6, 2019 at 12:59 PM Wes McKinney <[email protected]> wrote:
>
> On Fri, Sep 6, 2019 at 12:57 PM Micah Kornfield <[email protected]> wrote:
> >
> > >
> > > We can't because the buffer layout is not transmitted -- implementations
> > > make assumptions about what Buffer values correspond to each field. The
> > > only thing we could do to signal the change would be to increase the
> > > metadata version from V4 to V5.
> >
> > If we do this within 0.15.0 we could infer from the padding of messages.
> >
>
> That's true. I'd be OK adding backward compatibility code (that we can
> probably remove later) to my patch...
>
> I'm not sure about the other implementations. I think the non-C++
> implementations should simply use the no-buffers layout, because they
> don't have much application code that can produce Null arrays.
>
> > On Fri, Sep 6, 2019 at 10:16 AM Wes McKinney <[email protected]> wrote:
> >
> > > On Fri, Sep 6, 2019, 12:08 PM Antoine Pitrou <[email protected]> wrote:
> > >
> > > >
> > > > Null can also come up when converting a column with only NA values in a
> > > > CSV file.  I don't remember for sure, but I think the same can happen
> > > > with JSON files as well.
> > > >
> > > > Can't we accept both forms when reading?  It sounds like it should be
> > > > reasonably easy.
> > > >
> > >
> > > We can't because the buffer layout is not transmitted -- implementations
> > > make assumptions about what Buffer values correspond to each field. The
> > > only thing we could do to signal the change would be to increase the
> > > metadata version from V4 to V5.
> > >
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 06/09/2019 à 17:36, Wes McKinney a écrit :
> > > > > hi Micah,
> > > > >
> > > > > Null wouldn't come up that often in practice. It could happen when
> > > > > converting from pandas, for example
> > > > >
> > > > > > In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10, dtype=object)})
> > > > >
> > > > > In [9]: t = pa.table(df)
> > > > >
> > > > > In [10]: t
> > > > > Out[10]:
> > > > > pyarrow.Table
> > > > > col1: null
> > > > > metadata
> > > > > --------
> > > > > > {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
> > > > > >             b'stop": 10, "step": 1}], "column_indexes": [{"name": null, "field'
> > > > > >             b'_name": null, "pandas_type": "unicode", "numpy_type": "object", '
> > > > > >             b'"metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col1"'
> > > > > >             b', "field_name": "col1", "pandas_type": "empty", "numpy_type": "o'
> > > > > >             b'bject", "metadata": null}], "creator": {"library": "pyarrow", "v'
> > > > > >             b'ersion": "0.14.1.dev464+g40d08a751"}, "pandas_version": "0.24.2"'
> > > > > >             b'}'}
> > > > >
> > > > > I'm inclined to make the change without worrying about backwards
> > > > > compatibility. If people have been persisting data against the
> > > > > recommendations of the project, the remedy is to use an older version
> > > > > of the library to read the files and write them to something else
> > > > > (like Parquet format) in the meantime.
> > > > >
> > > > > Obviously come 1.0.0 we'll begin to make compatibility guarantees so
> > > > > this will be less of an issue.
> > > > >
> > > > > - Wes
> > > > >
> > > > > > On Thu, Sep 5, 2019 at 11:14 PM Micah Kornfield <[email protected]> wrote:
> > > > >>
> > > > >> Hi Wes and others,
> > > > > >> I don't have a sense of where Null arrays get created in the existing
> > > > > >> code base.
> > > > > >>
> > > > > >> Also, do you think it is worth the effort to make this backwards
> > > > > >> compatible? We could in theory tie the buffer count to having the
> > > > > >> continuation value for alignment.
> > > > > >>
> > > > > >> The one area where I'm slightly concerned is that we seem to have
> > > > > >> users in the wild who are depending on backwards compatibility, and
> > > > > >> I'm trying to better understand the odds that we break them.
> > > > >>
> > > > >> Thanks,
> > > > >> Micah
> > > > >>
> > > > > >> On Thu, Sep 5, 2019 at 7:25 AM Wes McKinney <[email protected]> wrote:
> > > > >>
> > > > >>> hi folks,
> > > > >>>
> > > > > >>> One of the as-yet-untested (in integration tests) parts of the
> > > > > >>> columnar specification is the Null layout. In C++ we additionally
> > > > > >>> implemented this by writing two length-0 "placeholder" buffers in
> > > > > >>> the RecordBatch data header, but since the Null layout has no
> > > > > >>> memory allocated nor any buffers in-memory it may be more proper
> > > > > >>> to write no buffers (since the length of the Null layout is all
> > > > > >>> you need to reconstruct it). There are 3 implementations of the
> > > > > >>> placeholder version (C++, Go, JS, maybe also C#) but it never got
> > > > > >>> implemented in Java. While technically this would break old
> > > > > >>> serialized data, I would not expect this to be very frequently
> > > > > >>> occurring in many of the currently-deployed Arrow applications.
> > > > >>>
> > > > >>> Here is my C++ patch
> > > > >>>
> > > > >>> https://github.com/apache/arrow/pull/5287
> > > > >>>
> > > > > >>> I'm not sure we need to formalize this with a vote but I'm
> > > > > >>> interested in the community's feedback on how to proceed here.
> > > > >>>
> > > > >>> - Wes
> > > > >>>
> > > >
> > >
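
The point made in the thread, that a reader cannot accept both Null layouts because it assigns buffers to fields by assumed counts, can be illustrated with a toy model. This is only a sketch under stated assumptions: the names PLACEHOLDER_LAYOUT, NO_BUFFERS_LAYOUT, write_buffers and read_buffers are hypothetical, and the flat list of byte strings stands in for the RecordBatch buffer list; it is not how any Arrow implementation is actually written.

```python
# Toy model only: every name here is illustrative, not part of any Arrow API.
# Number of buffers a reader assumes per field type under each convention.
PLACEHOLDER_LAYOUT = {"null": 2, "int32": 2}  # Null written as two length-0 buffers
NO_BUFFERS_LAYOUT = {"null": 0, "int32": 2}   # Null written with no buffers


def write_buffers(columns, layout):
    """Flatten per-column buffers into one list, as a writer would."""
    flat = []
    for type_name, buffers in columns:
        if type_name == "null":
            # A Null column carries no data; emit only the placeholders, if any.
            flat.extend([b""] * layout["null"])
        else:
            flat.extend(buffers)
    return flat


def read_buffers(type_names, flat, layout):
    """Slice the flat list back into per-field buffers using assumed counts."""
    fields, pos = [], 0
    for type_name in type_names:
        count = layout[type_name]
        fields.append((type_name, flat[pos:pos + count]))
        pos += count
    # Second return value: how many buffers were left unconsumed.
    return fields, len(flat) - pos


# One Null column followed by one int32 column (validity buffer + data buffer).
columns = [("null", []), ("int32", [b"\xff", b"\x01\x00\x00\x00"])]
types = [name for name, _ in columns]

old_stream = write_buffers(columns, PLACEHOLDER_LAYOUT)
matched, _ = read_buffers(types, old_stream, PLACEHOLDER_LAYOUT)
mismatched, leftover = read_buffers(types, old_stream, NO_BUFFERS_LAYOUT)
```

In this toy model, a reader that assumes the no-buffers layout hands the two length-0 placeholders to the int32 field and leaves the real int32 buffers unconsumed, which is the kind of silent misassignment that motivates signalling the change out of band.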
